Naive Bayes Algorithm

Naive Bayes – a Not so Naive Algorithm

The Naive Bayes algorithm is called naive because it makes a very strong assumption about the data: that the features are independent of each other, while in reality they may be dependent in some way. In other words, it assumes that the presence of one feature in a class is completely unrelated to the presence of every other feature. When this assumption of independence holds, Naive Bayes performs extremely well, often better than other models.
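
To see what the independence assumption buys us, here is a toy posterior calculation with made-up numbers (the per-word likelihoods and class priors below are illustrative assumptions, not values taken from the SMS data used later): the likelihood of a message given a class is simply the product of the individual word likelihoods.

# Hypothetical per-word likelihoods and class priors (illustrative values only)
p_words_given_spam <- c(free = 0.20, win = 0.15, meeting = 0.01)
p_words_given_ham  <- c(free = 0.02, win = 0.01, meeting = 0.10)
p_spam <- 0.13
p_ham  <- 0.87

# Naive Bayes: multiply the per-word likelihoods (the independence assumption)
# and weight by the class prior, then normalize to get a posterior probability
score_spam <- p_spam * prod(p_words_given_spam)
score_ham  <- p_ham  * prod(p_words_given_ham)
score_spam / (score_spam + score_ham)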

Naive Bayes can also be used with continuous features, but it is better suited to categorical variables, especially in cases where all the features are categorical.
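
For continuous predictors, e1071's naiveBayes() assumes each feature follows a normal distribution within each class. A minimal sketch on R's built-in iris data (unrelated to the SMS example below, shown only to illustrate the continuous case):

library(e1071)

# All four iris predictors are continuous; naiveBayes() fits a Gaussian
# per feature and per class under the hood
model_iris <- naiveBayes(Species ~ ., data = iris)
predict(model_iris, head(iris))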

Implementing Naive Bayes in R

R provides a Naive Bayes training function in the ‘e1071’ package. There are also implementations available through the ‘caret’ and ‘mlr’ packages.
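
As a rough sketch of the caret route (caret's "nb" method wraps the klaR package, so this assumes klaR is installed; the iris data is used here only as a placeholder):

library(caret)

# train() with method = "nb" fits a naive Bayes model and handles
# resampling for us; here a 5-fold cross-validation
fit_nb <- train(Species ~ ., data = iris, method = "nb",
                trControl = trainControl(method = "cv", number = 5))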

Improving performance

The main ways to improve the results obtained with Naive Bayes are to add more features or more data, and to tune the Laplace smoothing parameter (as shown in Step 5 below). Switching between different implementations very rarely offers an improvement.

Technical implementation

I would like to divide the technical implementation into two parts: working with the data in order to create a completely new dataset in a new format, and creating and analyzing the model.

I. Working with the data

## Step 1. Get data
library(readr)   # provides read_csv()
url_data <- "https://resources.oreilly.com/examples/9781784393908/raw/ac9fe41596dd42fc3877cfa8ed410dd346c43548/Machine%20Learning%20with%20R,%20Second%20Edition_Code/Chapter%2004/sms_spam.csv"
sms_raw <- read_csv(url(url_data))

## Step 2. Analyzing data
## 2.1 Converting the type variable into a factor
sms_raw$type <- factor(sms_raw$type)

## 2.2 Checking the type variable more carefully
str(sms_raw$type)
##  Factor w/ 2 levels "ham","spam": 1 1 1 2 2 1 1 1 2 1 ...

## 2.3 Creating the Volatile Corpus
library(tm)   # text mining: corpus, cleaning and document-term matrix
sms_corpus <- VCorpus(VectorSource(sms_raw$text))

## 2.4 Cleaning up the corpus using tm_map()
corpus_clean <- tm_map(sms_corpus, content_transformer(tolower))  # lower-case all text
corpus_clean <- tm_map(corpus_clean, removeNumbers)               # remove digits
corpus_clean <- tm_map(corpus_clean, removeWords, stopwords())    # remove stop words
corpus_clean <- tm_map(corpus_clean, removePunctuation)           # remove punctuation
corpus_clean <- tm_map(corpus_clean, stripWhitespace)             # collapse extra whitespace
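
To verify the cleaning did what we expect, we can compare a few original messages with their cleaned counterparts (a quick check, not part of the pipeline itself):

# Original vs. cleaned versions of the first three messages
lapply(sms_corpus[1:3], as.character)
lapply(corpus_clean[1:3], as.character)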

## 2.5 Creating a document-term sparse matrix
sms_dtm <- DocumentTermMatrix(corpus_clean)
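
Printing the dimensions of the matrix, or inspecting a small corner of it, gives a feel for how sparse this representation is (most entries are zero):

# Rows are messages, columns are distinct terms
dim(sms_dtm)
inspect(sms_dtm[1:5, 1:5])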

## 2.6 Creating training and test datasets
sms_raw_train <- sms_raw[1:4169, ]
sms_raw_test  <- sms_raw[4170:5559, ]

sms_dtm_train <- sms_dtm[1:4169, ]
sms_dtm_test  <- sms_dtm[4170:5559, ]

sms_corpus_train <- corpus_clean[1:4169]
sms_corpus_test  <- corpus_clean[4170:5559]

## 2.7 Checking that the proportion of spam is similar
prop.table(table(sms_raw_train$type))
prop.table(table(sms_raw_test$type))
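
This sequential 75/25 split relies on the SMS messages already being stored in random order. If that were not the case, the split could be drawn at random instead; a minimal sketch in base R (an alternative, not part of the original workflow):

# Hypothetical random split keeping the same 4,169 / 1,390 sizes
set.seed(123)
train_idx <- sample(nrow(sms_raw), size = 4169)
sms_raw_train_alt <- sms_raw[train_idx, ]
sms_raw_test_alt  <- sms_raw[-train_idx, ]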

## 2.8 Identifying frequent terms (those appearing in at least 5 messages)
sms_freq_words <- findFreqTerms(sms_dtm_train, 5)

sms_dtm_freq_train <- sms_dtm_train[ , sms_freq_words]
sms_dtm_freq_test <- sms_dtm_test[ , sms_freq_words]

## 2.9 Converting counts into "Yes"/"No" indicators
convert_counts <- function(x) {
  ifelse(x > 0, "Yes", "No")
}
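
A quick sanity check of the helper on a small vector of counts:

convert_counts(c(0, 1, 3, 0))
## [1] "No"  "Yes" "Yes" "No"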

sms_train <- apply(sms_dtm_freq_train, MARGIN = 2,
                   convert_counts)
sms_test <- apply(sms_dtm_freq_test, MARGIN = 2,
                  convert_counts)

II. Creating and analyzing the model

## Step 3: Creating the model
library(e1071)   # provides naiveBayes()
sms_train_labels <- sms_raw[1:4169, ]$type
sms_test_labels <- sms_raw[4170:5559, ]$type
sms_classifier <- naiveBayes(sms_train, sms_train_labels)
sms_test_pred <- predict(sms_classifier, sms_test)
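
Before evaluating against the test labels, it can be useful to peek inside the fitted object and at the predicted posterior probabilities (both are part of e1071's naiveBayes interface):

# Class counts learned from the training labels
sms_classifier$apriori

# Posterior probabilities instead of hard "ham"/"spam" predictions
head(predict(sms_classifier, sms_test, type = "raw"))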

## Step 4: Evaluating the model
library(gmodels)   # provides CrossTable()
CrossTable(sms_test_pred, sms_test_labels,
           prop.chisq = FALSE, prop.t = FALSE,
           dnn = c('predicted', 'actual'))

#   Cell Contents
#   |-------------------------|
#   |                       N |
#   |           N / Row Total |
#   |           N / Col Total |
#   |-------------------------|
#   
#   
#   Total Observations in Table:  1390 
# 
#                | actual 
#      predicted |       ham |      spam | Row Total | 
#   -------------|-----------|-----------|-----------|
#            ham |      1203 |        32 |      1235 | 
#                |     0.974 |     0.026 |     0.888 | 
#                |     0.997 |     0.175 |           | 
#   -------------|-----------|-----------|-----------|
#           spam |         4 |       151 |       155 | 
#                |     0.026 |     0.974 |     0.112 | 
#                |     0.003 |     0.825 |           | 
#   -------------|-----------|-----------|-----------|
#   Column Total |      1207 |       183 |      1390 | 
#                |     0.868 |     0.132 |           | 
#   -------------|-----------|-----------|-----------|
# 
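
The same information can be summarized as a single accuracy figure; given the counts in the table above, it should come out at roughly (1203 + 151) / 1390, i.e. about 0.974:

# Overall accuracy: proportion of test messages classified correctly
mean(sms_test_pred == sms_test_labels)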

Conclusion

Well, at this point I think the first question to answer is: is there room for improvement? In this case the answer is yes. Let's analyze the results of the model.

Looking at the cross table, we can see that only 4 + 32 = 36 of the 1,390 SMS messages were incorrectly classified (2.6 percent). Among the errors, 4 of the 1,207 ham messages were misidentified as spam, and 32 of the 183 spam messages were incorrectly labeled as valid messages.

Even though the number of valid messages misidentified as spam is low (4 out of 1,390 is roughly a 0.3 percent error rate), we can still try to improve the model in order to achieve better performance.

III. Improving the model

## Step 5: Improving the model
## laplace = 1 adds one pseudo-count to each feature value within each class,
## so a word never seen in one class no longer yields a zero probability
sms_classifier2 <- naiveBayes(sms_train, sms_train_labels,
                              laplace = 1)

sms_test_pred2 <- predict(sms_classifier2, sms_test)
CrossTable(sms_test_pred2, sms_test_labels,
           prop.chisq = FALSE, prop.t = FALSE, prop.r = FALSE,
           dnn = c('predicted', 'actual'))
#   Cell Contents
#   |-------------------------|
#   |                       N |
#   |           N / Col Total |
#   |-------------------------|
#   
#   
#   Total Observations in Table:  1390 
# 
# 
#                | actual 
#      predicted |       ham |      spam | Row Total | 
#   -------------|-----------|-----------|-----------|
#            ham |      1204 |        31 |      1235 | 
#                |     0.998 |     0.169 |           | 
#   -------------|-----------|-----------|-----------|
#           spam |         3 |       152 |       155 | 
#                |     0.002 |     0.831 |           | 
#   -------------|-----------|-----------|-----------|
#   Column Total |      1207 |       183 |      1390 | 
#                |     0.868 |     0.132 |           | 
#   -------------|-----------|-----------|-----------|
#

Conclusions:

Adding the Laplace estimator reduced the false positives (ham messages erroneously classified as spam) from 4 to 3 and the false negatives from 32 to 31. Even though this looks like a very small change, it is a 25 percent reduction in false positives, which are the more costly errors for a spam filter because they hide legitimate messages from the user.
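
The reason the Laplace estimator helps is that, without it, a word that never appeared in one class during training gets a conditional probability of exactly zero and vetoes the whole product of likelihoods. A toy illustration with made-up counts (not taken from the SMS data):

# Hypothetical counts: a word that never appeared in the spam training messages
word_in_spam <- 0
spam_total   <- 1000

# Without smoothing the estimate is exactly 0, wiping out the entire product
word_in_spam / spam_total

# With laplace = 1, one pseudo-count is added to each of the two outcomes
# ("Yes" and "No"), so the estimate is small but never exactly zero
(word_in_spam + 1) / (spam_total + 2)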

Sources:

  • Naive Bayes from R Bloggers
  • A simple example of Naive Bayes using another dataset:
    • https://rpubs.com/riazakhan94/naive_bayes_classifier_e1071
  • https://stackoverflow.com/questions/10059594/a-simple-explanation-of-naive-bayes-classification
  • A more in-depth treatment: https://eight2late.wordpress.com/2015/11/06/a-gentle-introduction-to-naive-bayes-classification-using-r/
