Making a News Aggregator Using Natural Language Processing (NLP)

A Tutorial in Creating an NLP Model

Jordan Tan-say
8 min read · Oct 8, 2021
Source: Pixabay

In this article, I’ll be explaining my process for creating a Natural Language Processing (NLP) news aggregator that will help our computer parse news headlines and filter them into one of two categories: Political or Non-political.*

*As a disclaimer, this project was originally supposed to organize these news headlines into their original classification of at least 13 different categories (ranging from Entertainment to Sports to World Events, etc.). However, due to limitations that we’ll go over at the end of this article, this model only covers categorizing between Political and Non-political, because that’s all that really matters.

But before we get into the news aggregator, I should first explain what Natural Language Processing even is!

Natural Language Processing

According to IBM, “Natural language processing (NLP) refers to the branch of computer science — and more specifically, the branch of artificial intelligence or AI — concerned with giving computers the ability to understand text and spoken words in much the same way human beings can.” Even if we’ve never heard of NLP, it shows up in almost everything we do and can be used in a number of different ways, including:

  • Translation
  • Autocorrect
  • Autocomplete
  • Conversational AI
  • Automated speech/voice recognition
  • And much more!

This project isn’t any different, as we’ll also be training a model to understand text much like we humans do. Now that we know just how important NLP is, we can dive right into the news aggregator!

Making the News Aggregator

Basic Imports

First things first, I started off by including all the general imports I would be using for the project: pandas (our main data analysis library), matplotlib and seaborn (for creating data visualizations), wordcloud (for creating NLP-specific visualizations), sklearn (our main machine learning library), and most importantly NLTK, the Natural Language Toolkit (a popular open-source Python package that provides tools for all the common NLP tasks).
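The exact import cell isn’t shown here, but based on the libraries listed above it likely looked something like this minimal sketch (the specific submodules pulled in are my assumptions):

# Core data analysis and visualization libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# NLP-specific visualization
from wordcloud import WordCloud

# Machine learning utilities from scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Natural Language Toolkit for the common NLP tasks
import nltk
from nltk.corpus import stopwords
nltk.download("stopwords")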

Data Loading and Preprocessing

After imports, I then loaded our training dataset. The dataset I worked with involves real-world text data provided by Kaggle: a corpus of news articles from 2012 to 2018 obtained from HuffPost. It’s with this dataset that I’ll be applying long short-term memory (LSTM) networks and attempting to determine whether each article is political or non-political from its headline.
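The loading step itself isn’t shown, but the HuffPost dataset on Kaggle ships as a JSON-lines file, so a minimal sketch of this step could look like the following (the file name is an assumption):

# Load the HuffPost news dataset from Kaggle (JSON-lines format)
df = pd.read_json("News_Category_Dataset_v2.json", lines=True)

# Peek at the overall size and the per-category article counts
print(df.shape)
print(df["category"].value_counts())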

Kaggle dataset stats

As we can see, our dataset is huge! We have about 100,000 different news articles and 41 different categories, not to mention the number of articles per category is extremely disparate!

Dataset after simplification

To create an unbiased model that won’t take multiple hours to run, we’ll simplify our dataset by evening out every category other than “Politics” and adding them together to create a single “Non-politics” category, which ended up looking like this. Afterwards, I eliminated unneeded information such as authors and content, and tracked only headlines and categories.
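The simplification code isn’t shown, but a rough sketch of collapsing everything outside of politics into a single label and keeping only the columns we care about might look like this (the column names come from the Kaggle dataset; the balancing step is an assumption about how the categories were “evened out”):

# Keep only the headline text and its category
df = df[["headline", "category"]]

# Collapse every category other than POLITICS into a single NON-POLITICS label
df["category"] = df["category"].apply(
    lambda c: "POLITICS" if c == "POLITICS" else "NON-POLITICS")

# Balance the two classes by downsampling the larger one
n_politics = (df["category"] == "POLITICS").sum()
non_politics = df[df["category"] == "NON-POLITICS"].sample(n_politics, random_state=42)
df = pd.concat([df[df["category"] == "POLITICS"], non_politics]).reset_index(drop=True)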

Language Processing for Data Ingestion

The first thing to do now that we have our dataset extracted is to clean as much erroneous information out of our headline data as we can, in order to make our patterns and target labels as explicit and noise-free as possible.

These antipatterns include the following (a cleanup sketch in code follows the list):

  • Variances in syntactic capitalization.
  • Symbols and other non-alphanumeric cases.
  • Non-signal-related digits and numbers.
  • Stop Words: generally noisy terms and tokens.
  • Different inflections of words that could be interpreted differently.
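The cleanup itself isn’t listed line by line, but a minimal NLTK-based sketch that covers the points above (lowercasing, stripping symbols and digits, dropping stop words, and lemmatizing away different inflections) could look like this:

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords")
nltk.download("wordnet")
stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean_headline(text):
    # Lowercase and replace anything that isn't a letter with a space
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    # Drop stop words and reduce each remaining token to its lemma
    tokens = [lemmatizer.lemmatize(tok) for tok in text.split() if tok not in stop_words]
    return " ".join(tokens)

df["headline"] = df["headline"].apply(clean_headline)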

After cleanup, our dataset looked something like this:

Processed dataset in tabular form

As we can see above, our dataset now contains only signal-expressive terms along with their corresponding category.

Before we get to modeling, however, let’s do some more sanity checks and peek into what our dataset actually looks like.

The first thing we’ll do is create a word cloud: a visualization of the most frequent terms occurring in our headlines, which helps us better understand our tokenized distribution. The larger the word, the more common the term.
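Generating a word cloud from the cleaned headlines is only a few lines with the wordcloud package; here’s a quick sketch of how the overall cloud might be built:

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Join every cleaned headline into one long string of tokens
all_text = " ".join(df["headline"])

# Build and display the word cloud for the whole dataset
cloud = WordCloud(width=800, height=400, background_color="white").generate(all_text)
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()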

(left) word cloud describing entire dataset, (right) word clouds describing each category

Looking good! Above we can see the main word cloud describing our dataset, as well as the two word clouds for the non-politics and politics categories. Notice that even in the non-politics category, we still see “trump” right under “photo” as a common word. We should keep that in mind, as it might play a part in hindering the performance of our model: if our model comes across the word “trump”, it might be unsure whether the article belongs in politics or non-politics. If it weren’t for these word clouds, we might not have been able to identify this!

As we move on, we need to get our processed data into a form our machine can understand. As it stands, our machine cannot comprehend what “POLITICS” means, or any other category for that matter; however, we can use a label encoder to encode our target categories, essentially turning our targets into 1s and 0s like this:

Label-encoded dataset in tabular form (0s meaning non-politics and 1s meaning politics)
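A label-encoding step like this is typically a one-liner with scikit-learn’s LabelEncoder; a minimal sketch, assuming that’s the encoder used here:

from sklearn.preprocessing import LabelEncoder

# Map the two category strings to integers
# (alphabetically, NON-POLITICS becomes 0 and POLITICS becomes 1)
encoder = LabelEncoder()
y = encoder.fit_transform(df["category"])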

Superb! Now that our categories are properly encoded and ready for data ingestion, we can do the same for our actual input data: the headlines. For that, we’ll use a tokenizer object provided by NLTK.
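The tokenization code isn’t shown; padded integer sequences like the output below are most commonly produced with a Keras-style Tokenizer and pad_sequences on top of the cleaned text, so here’s a sketch under that assumption (the vocabulary size and sequence length are illustrative values, not the article’s):

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_words, max_len = 5000, 20  # illustrative hyperparameters

# Index each word by frequency and convert headlines to integer sequences
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(df["headline"])
sequences = tokenizer.texts_to_sequences(df["headline"])

# Left-pad every sequence to a fixed length so the model sees uniform inputs
X = pad_sequences(sequences, maxlen=max_len)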

array([[  0,   0,   0, ..., 495, 127,  32],
       [  0,   0,   0, ...,   0,  28,  44],
       [  0,   0,   0, ..., 179, 281,  61],
       ...,
       [  0,   0,   0, ...,   0,  24,   9],
       [  0,   0,   0, ...,  90, 396, 250],
       [  0,   0,   0, ...,   0, 108, 232]], dtype=int32)

We can see from the output above that our data has been properly tokenized: all of our text-based headlines have been transformed into padded, frequency-indexed integer sequences that our machine can train with.

Last but not least, we set up our target labels to be ingestible as well and perform our train-test split to generate the datasets that our model will train and test with.
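The label-preparation step isn’t shown, but since the model is compiled with categorical cross-entropy later on, one-hot encoding the integer labels with Keras’ to_categorical is a reasonable assumption:

from tensorflow.keras.utils import to_categorical

# One-hot encode the integer labels (e.g. 1 -> [0, 1]) for categorical cross-entropy
y = to_categorical(y, num_classes=2)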

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                     train_size=0.7,
                                                     test_size=0.3,
                                                     random_state=42)

Data Ingestion and Predictive Modeling

It’s finally time to set up our model.

We’ll be designing a sequence-based learning model to learn from our vectorized text data, which in this case means using a long short-term memory (LSTM) model, a type of recurrent neural network. This will let us properly ingest our data and retain signal comprehension across each sequence.

Specifically, we’ll use the following layer specifications (sketched in code right after this list):

  • An Embedding layer to properly vectorize our term inputs for signal extraction.
  • A Spatial Dropout layer that performs dropout regularization on the embedded text data to prevent overfitting.
  • Two LSTMs in sequence to extract more internal heuristics. (These layers are initialized with their own dropout regularization.)
  • A Dense (fully connected) layer to produce our output classification.
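Translated into Keras code, a minimal sketch of that layer stack might look like the following (the layer sizes and dropout rates are assumptions, not the article’s exact values; max_words and max_len come from the earlier tokenization sketch):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SpatialDropout1D, LSTM, Dense

model = Sequential([
    # Learn a dense vector representation for each token in the vocabulary
    Embedding(input_dim=max_words, output_dim=64, input_length=max_len),
    # Dropout applied across whole embedding channels to fight overfitting
    SpatialDropout1D(0.2),
    # Two stacked LSTMs, each with its own dropout regularization
    LSTM(64, dropout=0.2, recurrent_dropout=0.2, return_sequences=True),
    LSTM(32, dropout=0.2, recurrent_dropout=0.2),
    # Final dense layer producing one probability per category
    Dense(2, activation="softmax"),
])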

Our model ends up looking like this:

Model summary

Next, we’ll compile our model with an appropriate loss function, optimizer, and accuracy metric.

We’ll be using categorical cross-entropy as our loss function and a variant of Adam optimization that incorporates Nesterov momentum, called Nadam, to get optimal model results. It isn’t that important to understand exactly what categorical cross-entropy or Nadam do, but know that the loss function and the optimizer work together to calculate and reduce loss, thereby improving our model’s accuracy.

# Compile Model with Specified Loss and Optimization Functions
model.compile(loss="categorical_crossentropy",
              optimizer="nadam",
              metrics=["accuracy"])

We’ll also initialize some final training hyperparameters, including our batch size for gradient descent optimization and our total number of epochs.

# Define Batch Size and Epochs as Hyperparameters
batch_size, epochs = 32, 10

Now we’re ready to run and evaluate our model!
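The training call itself only appears as output below, but with the hyperparameters we just defined it would look roughly like this (the validation split is an assumption):

# Train the model, holding out a slice of the training data for validation
history = model.fit(X_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    validation_split=0.1)

# Evaluate on the held-out test set
loss, accuracy = model.evaluate(X_test, y_test)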

Running our model!

Interpreting Results

For extra data visualization, we’ll create a confusion matrix to show how accurate our discrete predictions were.

Model’s confusion matrix

Confusion matrices visually describe how many data entries were correctly predicted, as well as how many were incorrectly predicted, during the model’s testing phase. For instance, the confusion matrix above tells us that our model correctly predicted 6729 out of 7800 non-political articles and 6371 out of 8086 political articles.
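A confusion matrix like this can be computed with scikit-learn and plotted as a heatmap with seaborn; a minimal sketch, assuming the model outputs softmax probabilities and the labels are one-hot encoded:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Convert softmax probabilities and one-hot labels back to class indices
y_pred = np.argmax(model.predict(X_test), axis=1)
y_true = np.argmax(y_test, axis=1)

# Plot the confusion matrix as an annotated heatmap
cm = confusion_matrix(y_true, y_pred)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()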

Our second additional visualization is a classification report, which clearly displays precision, recall, accuracy, and F1 metrics for each label and for our dataset overall.
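The report itself comes straight out of scikit-learn, reusing the predicted and true class indices from the confusion-matrix sketch above:

from sklearn.metrics import classification_report

# Print precision, recall, F1, and accuracy for both classes
print(classification_report(y_true, y_pred,
                            target_names=["non-politics", "politics"]))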

Model’s classification report

Our report shows that our model had 82% accuracy, which means it performed reasonably well at determining both political and non-political headlines, but not perfectly. Going back to our word clouds, we notice that performance might have been hindered by some common words that showed up in both politics and non-politics.

The Challenges

I noted at the beginning of this article that this project was originally supposed to organize these news headlines into their original classification of at least 13 different categories (ranging from Entertainment to Sports to World Events, etc.). However, there were some complications with the inaccuracies produced by that model.

For example, my previous attempt produced a confusion matrix and classification report like the following:

Original Model’s confusion matrix
Original Model’s classification report

It can be seen that the original model (with an insane-looking confusion matrix and an accuracy rating of 37%) was not going to pass as a working news aggregator at all.

Future Plans

Although this current version of the news aggregator was moderately successful, there is a lot more that could be done to make this project better. One suggestion would be extra hyperparameter tuning to create a more accurate, more optimal model (at least 90% accuracy would be excellent). Another future goal is to return to the original intention of this project and create a highly accurate news aggregator that can classify across more than two target categories.

There are countless possibilities for further analyzing these data with the power of data science, but for now this serves as a base for making a news aggregator using Natural Language Processing (NLP)!
