Hello everyone! I hope you are all doing well. In this blog, we’ll implement a basic application of one of the most talked-about areas of Machine Learning: Natural Language Processing (NLP).
Being a Machine Learning enthusiast, I like to explore different ML models, compare their accuracies and find their use-cases. Among them, I found NLP to be the most interesting. If you’ve never worked with NLP before, this article will give you a head start!
You can easily follow the blog with a basic knowledge of Python. By the end of this blog, you’ll be familiar with one of the most basic applications of NLP, after which you can explore further as you wish.
First of all, what is NLP? Natural Language Processing is the ability of a computer to understand, analyze and potentially generate human language.
Here are some of the applications of NLP:
- Information retrieval (Google search)
- Spam filters (such as in Gmail)
- OCR (Google Vision, etc.)
- Intent classification (as in most of the ‘AI’ bots today)
Our dataset will be a collection of 1000 reviews of a restaurant. We’ll use NLP to predict whether a review is positive or negative. This task is called ‘Sentiment Analysis’ (sometimes also ‘opinion mining’) and is used extensively in FinTech, among other fields.
Before we start, you’ll need the following:
- A code editor (Preferably Spyder or Jupyter Notebook)
- The dataset (https://github.com/adityabisoi/Review-predictor/blob/master/Restaurant_Reviews.tsv)
- Basic knowledge of ML (helpful, but not compulsory!)
Training the model
To build the model, we perform the following:
- Importing the dataset
- Cleaning the text
- Creating a ‘Bag of Words’
- Training and classification
Importing the dataset
The dataset is a .tsv (Tab Separated Values) file with two columns: one with the reviews and another with the review class, i.e., positive (1) or negative (0).
We import the dataset with the Pandas library. The delimiter parameter indicates that a tab separates each review from its class. The quoting parameter tells Pandas to ignore double quotes (") in the reviews, i.e., to treat them as ordinary characters rather than as field delimiters, since they may otherwise hinder further processing.
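As a minimal sketch of the import step, the snippet below reads two made-up rows from an in-memory string instead of the real file, so that it runs on its own. The column names `Review` and `Liked` are assumptions for illustration; check the header of the actual file.

```python
import io
import pandas as pd

# Two illustrative rows in the same layout as the dataset: review text,
# a tab, then the label. Column names are assumed for this sketch.
sample_tsv = 'Review\tLiked\nWow... Loved this place.\t1\nThe "crust" is not good.\t0\n'

# delimiter='\t' marks the tab as the column separator.
# quoting=3 (csv.QUOTE_NONE) treats double quotes as ordinary characters.
dataset = pd.read_csv(io.StringIO(sample_tsv), delimiter='\t', quoting=3)

print(dataset.shape)
print(dataset['Review'][1])
```

For the real dataset, the same call would take the file path instead, for example `pd.read_csv('Restaurant_Reviews.tsv', delimiter='\t', quoting=3)`.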
Cleaning the text
We need to pre-process our data by removing uninformative content. For example, we don’t need words such as ‘the,’ ‘and,’ ‘a’ in our text, since they do not help in determining whether the review is good or bad. These words are called stopwords. Next, we apply stemming, which reduces the different forms of a word to a common root (its stem). For example, ‘loved’ and ‘loving’ both become ‘love.’
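The cleaning step can be sketched roughly as follows, assuming NLTK is installed. The helper name `clean_review` and the tiny `STOPWORDS` set are made up for this illustration; in practice you would use a full stopword list, such as the one that ships with NLTK’s stopwords corpus.

```python
import re
from nltk.stem.porter import PorterStemmer

# A tiny hand-picked stopword set, for illustration only.
STOPWORDS = {'the', 'and', 'a', 'an', 'this', 'is', 'was', 'it', 'to', 'of'}

stemmer = PorterStemmer()

def clean_review(text):
    # Replace everything except letters with spaces, lowercase, split into words.
    words = re.sub('[^a-zA-Z]', ' ', text).lower().split()
    # Drop stopwords, then reduce each remaining word to its stem.
    return ' '.join(stemmer.stem(w) for w in words if w not in STOPWORDS)

print(clean_review('Wow... Loved this place.'))
```

Applying `clean_review` to every review in the dataset gives a list of cleaned strings (the corpus) that the next step turns into numbers.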
Creating a ‘Bag of Words’
Next, we apply vectorization to convert the reviews into a numerical format. We create a sparse matrix with one row per review and one column per distinct word in the corpus; each cell holds the number of times that word occurs in that review. We call this the Bag of Words. Our text is now ready for training.
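Here is a small sketch of the Bag of Words step, assuming scikit-learn’s `CountVectorizer` is used for the vectorization. The three-review corpus is made up to keep the matrix small enough to read.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three already-cleaned reviews (made up for this sketch).
corpus = ['wow love place', 'crust not good', 'love food love servic']

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse matrix: reviews x distinct words

print(X.shape)
print(sorted(vectorizer.vocabulary_))  # the words, in column order
print(X.toarray())                     # dense view of the word counts
```

Each row is one review and each column one distinct word, so the third row has a 2 in the ‘love’ column because that word appears twice in the third review.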
Training and classification
The data is split into training and testing sets. Many classification models can be applied to distinguish the reviews. We use Naive Bayes here, which gave comparatively high accuracy on this dataset.
Naive Bayes is a classifier based on Bayes’ theorem of probability. It assumes that the features of a dataset are independent of one another given the class. It can be much faster than other classifiers. You can, of course, try this with other classifiers too.
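The split-and-train step might look like the sketch below. It assumes scikit-learn’s `GaussianNB` as the Naive Bayes variant (the text above does not say which variant is used), and the eight-review corpus, its labels and the 25% test size are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Made-up cleaned reviews with labels: 1 = positive, 0 = negative.
corpus = ['love place', 'great food', 'amaz servic', 'tasti pizza',
          'bad food', 'slow servic', 'terribl place', 'cold pizza']
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# GaussianNB expects a dense array, so convert the sparse Bag of Words.
X = CountVectorizer().fit_transform(corpus).toarray()

# Hold out 25% of the reviews for testing; stratify keeps both classes
# represented in each split, random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=0, stratify=labels)

classifier = GaussianNB()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

print(list(y_pred), y_test)
```

With so few reviews the predictions themselves mean little; the point is only the shape of the workflow: vectorize, split, fit, predict.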
To see the results of our work, we build a confusion matrix, which shows the number of correct predictions, as well as false positives and false negatives.
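As a sketch of how to read such a matrix, the snippet below uses scikit-learn’s `confusion_matrix` on six made-up labels and predictions (not results from the restaurant dataset). For 0/1 labels, rows are the true classes and columns the predicted classes, so the layout is `[[TN, FP], [FN, TP]]`.

```python
from sklearn.metrics import confusion_matrix

# Made-up true labels and predictions, for illustration only.
y_true = [0, 0, 1, 1, 1, 0]
y_hat = [0, 1, 1, 0, 1, 0]

cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()

# Accuracy = correct predictions / all predictions.
accuracy = (tn + tp) / cm.sum()

print(cm)
print('false positives:', fp, 'false negatives:', fn)
print('accuracy:', round(accuracy, 2))
```

The diagonal of the matrix holds the correct predictions, so accuracy is the diagonal sum divided by the total number of test reviews.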
So, from the matrix, we see that our model has an accuracy of 77%, with 42 false positives and 12 false negatives. Although the accuracy may seem low, it is reasonable for a dataset of only 1000 reviews. With more reviews to train on, the accuracy of the model is likely to improve.
So, here we are at the end. Cheers for following along this far! This is one of the most basic implementations of NLP, and there’s a lot more to explore out there. Although the steps above may seem intimidating at first, you can create exciting applications once you get into it.
Here is the repo for the complete code: https://github.com/adityabisoi/Review-predictor. Feel free to reach out to me if you have any questions.
To learn more, you can check out the following links: