Sentiment Analysis of Reviews Using NBC

Darshan Patel
5 min read · Dec 1, 2021

What is Naive Bayes classification?

It is a classification technique based on Bayes' theorem that predicts the class of unseen data. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.

P(c|x) = P(x|c) · P(c) / P(x)
  • P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes).
  • P(c) is the prior probability of class.
  • P(x|c) is the likelihood which is the probability of predictor given class.
  • P(x) is the prior probability of predictor.

Using Bayes' theorem, we can find the probability of A happening given that B has occurred. Here, B is the evidence and A is the hypothesis. The assumption made here is that the predictors/features are independent; that is, the presence of one particular feature does not affect any other. Hence it is called "naive".
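To make the independence assumption concrete, here is a toy sketch (my illustration, not the assignment code; the likelihood values are made up): the classifier adds the log prior to the log likelihood of each word and picks the class with the higher score.

import math

# Toy naive Bayes scoring: log P(c) plus the sum of log P(word | c),
# which is only valid under the "naive" independence assumption.
def log_score(words, prior, likelihood):
    total = math.log(prior)
    for word in words:
        total += math.log(likelihood.get(word, 1e-6))  # small floor for unseen words
    return total

# hypothetical per-class word likelihoods, just to show the comparison
pos = {"great": 0.05, "movie": 0.04}
neg = {"great": 0.01, "movie": 0.04}
words = ["great", "movie"]
print("positive" if log_score(words, 0.5, pos) > log_score(words, 0.5, neg) else "negative")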

Assignment details

I have used the Sentiment Labelled Sentences dataset from Kaggle as part of my assignment. If you want to try it out, you can find the dataset easily through this link: DATA

I have not used any libraries in this task, so it was a little difficult at the start to understand how to do it. Kaggle has lots of code books; we can learn from that code and try to improve on the accuracy of the previous code book.

Code

(Code screenshot from the reference code book.)

Data cleaning and pre-processing the datasets

(Code screenshot from the reference code book.)

Since we have used two different datasets, we need to remove the unwanted numbers and symbols that appear in the reviews. After finishing the cleaning part, we merge both datasets.

I have used two datasets because NBC works well on large datasets, so I combined both into one large dataset. We will compare my accuracy with the accuracy of the old source code.
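A rough sketch of the cleaning-and-merging step, assuming each file stores tab-separated "sentence<TAB>label" lines as in the Sentiment Labelled Sentences dataset (the file names are placeholders for whichever two files you use):

def load_and_clean(path):
    rows = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            sentence, label = line.strip().rsplit("\t", 1)
            # keep letters and spaces only, dropping numbers and symbols
            cleaned = "".join(ch.lower() for ch in sentence if ch.isalpha() or ch.isspace())
            rows.append((cleaned.split(), int(label)))
    return rows

# merge the two cleaned datasets into one large dataset
data = load_and_clean("amazon_cells_labelled.txt") + load_and_clean("yelp_labelled.txt")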

Splitting the data into train, dev, and test datasets

I have split the data into a train dataset, a dev dataset, and a test dataset. The train dataset takes 60% of the data, the dev dataset takes 20% (data unseen during training), and the test dataset takes the remaining 20% of unseen data.
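One way to get the 60/20/20 split, continuing from the merged data list in the sketch above (shuffling first so the two source datasets are mixed):

import random

random.seed(0)  # fixed seed so the split is reproducible
random.shuffle(data)
n = len(data)
train_data = data[: int(0.6 * n)]             # 60% for training
dev_data = data[int(0.6 * n) : int(0.8 * n)]  # 20% for the dev set
test_data = data[int(0.8 * n) :]              # remaining 20% for the test set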

Build a vocabulary

The vocabulary is divided into three parts (all words, positive words, negative words), and each entry holds three counts. For example, the entry for a particular word looks like {"Happy": [10, 14, 24]}, where 10 is the positive occurrence count, 14 is the negative occurrence count, and 24 is the overall occurrence count; likewise for every other word.
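Continuing from train_data, the vocabulary described above can be built as a plain dictionary mapping each word to its [positive, negative, overall] counts:

vocabulary = {}
for words, label in train_data:
    for word in words:
        counts = vocabulary.setdefault(word, [0, 0, 0])
        idx = 0 if label == 1 else 1  # index 0 = positive, index 1 = negative
        counts[idx] += 1
        counts[2] += 1                # index 2 = overall occurrences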

Probability of occurrence and conditional probability of the word

The goal here is to find the probability of a particular word and the conditional probability of that word. We compute it with the formula p(word) = (number of documents containing the word) / (number of all documents): whenever we find the word we want in a sentence, we increment its count by 1.
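As a sketch, the document-frequency version of p(word) described above looks like this (each sentence counts at most once, since the cleaned sentences are word lists):

def p_word(word, dataset):
    # fraction of sentences (documents) that contain the word
    hits = sum(1 for words, _ in dataset if word in words)
    return hits / len(dataset)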

Calculate 5-fold accuracy using the dev dataset

K-fold cross-validation is a method that splits the dataset into k sections, or folds. Here we use k = 5, so we divide the data into 5 folds. In the first iteration, the first fold tests the model while the other folds train it; in the second iteration, the second fold is kept for testing and the rest for training; and likewise until all 5 folds have served for both training and testing. We split the folds equally: each of the five folds holds 70 sentences, along with their sentiment labels.
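A sketch of that 5-fold loop on the dev set (continuing from dev_data): each fold takes one turn as the held-out fold while the other four are used for training.

k = 5
fold_size = len(dev_data) // k
folds = [dev_data[i * fold_size : (i + 1) * fold_size] for i in range(k)]
for i in range(k):
    held_out = folds[i]
    training = [row for j, fold in enumerate(folds) if j != i for row in fold]
    # ...rebuild the vocabulary from `training`, classify `held_out`,
    # and record that fold's accuracy...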

We will apply smoothing to this model to achieve extra accuracy

I have used the Laplace estimate technique for smoothing, which uses alpha as the smoothing parameter. If alpha is 0, there is no smoothing, so any word never seen with a class gets probability 0, which wipes out the whole product. Calculating accuracy with smoothing depends on four arguments: the smoothing parameter, the sentence, the label counts, and the vocabulary. The formula used is (Vocabulary[word][label] + alpha) / (label_count[label] + alpha * len(Vocabulary)), and it gives both the positive and the negative possibility.
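The smoothed estimate from that formula, as a sketch: label_index is 0 for positive and 1 for negative, and I am assuming label_count holds the total word count per class (it can be derived from the vocabulary built earlier).

def smoothed_prob(word, label_index, alpha, vocabulary, label_count):
    count = vocabulary.get(word, [0, 0, 0])[label_index]
    return (count + alpha) / (label_count[label_index] + alpha * len(vocabulary))

# label_count derived from the vocabulary: total word occurrences per class
label_count = [sum(c[0] for c in vocabulary.values()),
               sum(c[1] for c in vocabulary.values())]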

As you can see, my code gives a dev accuracy without smoothing of 37%,

and an accuracy with smoothing of 65%.

We will compare my accuracy with the reference code

As you can see, the reference accuracy without smoothing is 52%, but

the reference accuracy with smoothing is 57%.

So overall I have improved the accuracy by 8 to 9%.

Contribution

Datasets: I have used two different datasets and merged them after cleaning. I did this because it produces one large dataset, which helps increase my accuracy.

Splitting datasets: I have split my data as 60% train, 20% dev, and 20% test. I tried different ratios for the dev and test datasets, but I was losing accuracy: I tried 10%/10% and 15%/15% as the dev/test ratios. For this model, I think the 60%/20%/20% split gives the best accuracy.

Difficulty in solving: I found implementing k-fold validation difficult, but I now understand how k-fold works, and I will try to do better next time.

Splitting the dataset for better accuracy was a challenging part.

Building the vocabulary was really tough and hard to understand, because we are not allowed to use libraries and this type of coding is new to me, so I think that part is my weak point.

The bar graph clearly shows that my accuracy is higher than the reference accuracy.

Link for my Kaggle source book: https://www.kaggle.com/darshanpatel3738/patel03

References

https://www.kaggle.com/yaswanthjk/sentiment-analysis
https://towardsdatascience.com/tf-term-frequency-idf-inverse-document-frequency-from-scratch-in-python-6c2b61b78558
https://www.geeksforgeeks.org/pandas-dataframe-iterrows-function-in-python/
https://www.kaggle.com/vijay420/sentiment-analysis
https://www.analyticsvidhya.com/blog/2017/09/naive-bayes-explained/
https://towardsdatascience.com/naive-bayes-classifier-81d512f50a7c
