A Beginner’s Guide to Naive Bayes
Naive Bayes is a technique for building classifiers. It relies on the assumption that the features are conditionally independent of one another given the class. In practice the features are rarely fully independent, which is why the method is called naive, yet it still tends to give good results. Naive Bayes can be very useful, especially for text classification: it is a relatively simple method, and it benefits from having a lot of data fed into it.
In this blog post, I am going to use Naive Bayes to classify movie reviews as either positive or negative. The dataset was obtained from Kaggle (https://www.kaggle.com/praveenkotha2/end-to-end-text-processing-for-beginners).
Let’s jump right into it!
I started by importing a few libraries.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from bs4 import BeautifulSoup
from nltk.corpus import wordnet
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
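If this is your first time using NLTK's stopword list and WordNet data, you may also need to download them once:
import nltk
nltk.download('stopwords')
nltk.download('wordnet')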
Next, I imported the training and test sets and put them into dataframes.
train = pd.read_csv('clean_train_sample.csv', index_col=0)
test = pd.read_csv('clean_test_sample.csv', index_col=0)
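A quick peek at the data shows one review per row, with the review text in a 'Review' column and its sentiment in a 'Label' column (these are the column names used throughout the rest of the post):
print(train.shape)
train.head()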
After looking at a few reviews from my training set, I built a function called clean_text to prepare the data for modelling. During the cleaning process, I removed the HTML tags, punctuation and stopwords, and converted the words to their root forms, to minimize the number of words used for model training.
def clean_text(text):
    # lowercase everything
    lower = text.lower()
    # remove html tags
    soup = BeautifulSoup(lower, 'lxml')
    nohtml = soup.get_text()
    # remove punctuation
    nopunc = [char for char in nohtml if char not in string.punctuation]
    nopunc = ''.join(nopunc)
    # remove stopwords
    nostop = [word for word in nopunc.split() if word not in stopwords.words('english')]
    # lemmatization: converting each word into its root form
    root = [WordNetLemmatizer().lemmatize(word) for word in nostop]
    # keep each word only once per review
    return list(set(root))
The clean_text function returns a list of tokens, i.e. the individual word strings extracted from the text.
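To see what this looks like, here is a made-up review run through the function (the exact output order may vary, since converting to a set does not preserve order):
sample = 'This movie was <br /> absolutely wonderful, the actors were amazing!'
print(clean_text(sample))
# something like: ['movie', 'absolutely', 'wonderful', 'actor', 'amazing']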
I then created my bag of words (bow). This technique turns the tokens into the features of interest: each column represents a word and each row represents one review. The bow was created using CountVectorizer, which returns the counts of words per document; in other words, each column reflects the number of times a given word shows up in the review. I also passed an ngram_range, with the intention of letting the bow include anything from single words up to three-word sequences, to add more context around each word (worth noting, though, that when a custom analyzer function is supplied, scikit-learn ignores ngram_range and only uses the tokens the analyzer returns).
ngram_min = 1
ngram_max = 3
bow = CountVectorizer(analyzer=clean_text, ngram_range=(ngram_min, ngram_max)).fit_transform(train['Review'])
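As a quick sanity check, printing the shape of the resulting sparse matrix shows one row per review and one column per token in the learned vocabulary:
print(bow.shape)  # (number of reviews, number of unique tokens)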
To normalize the bow, I used term frequency inverse document frequency (TF-IDF). This weighting helps ensure that longer reviews do not end up with stronger relationships to the target than shorter ones, and that very common words contribute less, since they stop being a distinguishing factor between reviews.
tfidf = TfidfTransformer().fit_transform(bow)
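For reference, the default weighting TfidfTransformer applies is tf-idf(word, review) = tf(word, review) × idf(word), where idf(word) = ln((1 + number of reviews) / (1 + number of reviews containing the word)) + 1, and each review's vector is then rescaled to unit length. Rare words therefore get a boost, while words that show up everywhere get discounted.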
For model training, I used MultinomialNB. This classifier is appropriate for discrete features such as word counts. It applies Bayes' theorem under the naive independence assumption for each feature. To give a little insight into how Bayes' theorem works here, let's use the positive class as an example:
First, it estimates the probability of the review being positive:
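P(positive) = (number of positive reviews) / (total number of reviews)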
Second, it estimates the probability of the word being in the review given that the review is positive:
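P(word | positive) = (number of times the word appears in positive reviews) / (total number of words in positive reviews)
(MultinomialNB also adds a small smoothing term to these counts, so that a word it has never seen in a positive review does not force the probability to zero.)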
Finally, it estimates the probability that the review is positive given the words it contains:
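P(positive | review) ∝ P(positive) × P(word₁ | positive) × P(word₂ | positive) × … × P(wordₙ | positive)
Thanks to the naive independence assumption, the per-word probabilities are simply multiplied together, and the class (positive or negative) with the larger result becomes the prediction.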
pipeline = Pipeline([
    ('bow', CountVectorizer(analyzer=clean_text, ngram_range=(ngram_min, ngram_max))),
    ('tfidf', TfidfTransformer()),
    ('classifier', MultinomialNB()),
])

pipeline.fit(train['Review'], train['Label'])
y_pred = pipeline.predict(test['Review'])
print(confusion_matrix(test['Label'], y_pred))
print(classification_report(test['Label'], y_pred))
The pipeline above manages all of the previous steps, so that they do not have to be repeated over and over. It acts as a single learner, which let me fit it on the training set and then run predictions on the test set.
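Once fitted, the pipeline can also score a brand-new review directly (the review below is made up, purely to illustrate the call):
print(pipeline.predict(['What a fantastic film, I enjoyed every minute of it']))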
I hope this is helpful!