Without dictionaries there is no sentiment analysis

To perform a sentiment analysis, all we need is a dictionary and a text. In plain words, the idea is: pick a word from the text and check whether it is in the dictionary; if it is, the dictionary tells us whether the word is positive or negative, and how strongly, by adding or subtracting points.
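The lookup-and-sum idea can be sketched in a few lines of Python. This is only a minimal sketch, not the tidytext pipeline: the lexicon entries and their scores are made up for illustration, and `tokenize` and `sentiment_score` are hypothetical helper names.

```python
import re

# Hypothetical toy lexicon: word -> score (negative values mean negative sentiment).
TOY_LEXICON = {"good": 3, "love": 3, "hit": -1, "suicide": -2}

def tokenize(text):
    """Lowercase the text and split it into words made of Unicode letters."""
    return re.findall(r"[^\W\d_]+", text.lower())

def sentiment_score(text, lexicon):
    """Add up the score of every token found in the lexicon; unknown words count as 0."""
    return sum(lexicon.get(token, 0) for token in tokenize(text))

# "good" contributes +3 and "hit" contributes -1; every other word is ignored.
print(sentiment_score("A good knight does not hit his squire", TOY_LEXICON))  # prints 2
```

Words missing from the dictionary contribute nothing, which is why the completeness of the dictionary matters so much.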

Based on that, what is the most important variable in a sentiment analysis? The dictionary, of course! What matters is how complete it is and the value assigned to each word: the negative value of the word “suicide” is not the same as that of the word “hit”.

Ideas to validate

Do dictionaries with a greater number of words have better performance? Are more accurate results obtained when the dictionary and the text to be analyzed are in the same language, or is it preferable to consider a larger dictionary and the translated text?

Analysis

1. Texts

Let’s analyze the book Don Quixote of La Mancha, which has 381,104 words, with 52 chapters in the first part and 74 chapters in the second part. For a sentiment analysis like the one posed here, it is convenient to format the text like this:

chapter_n | part | chapter_n_o | chapter_title | chapter_text
--- | --- | --- | --- | ---
1 | 1 | 1 | Capítulo primero. Que trata de la condición y ejercicio del famoso hidalgo don Quijote de la Mancha | En un lugar de la Mancha, de cuyo nombre no quiero acordarme, no ha mucho tiempo que vivía un hidalgo de los de lanza en astillero, adarga antigua, rocín flaco y galgo corredor...
2 | 1 | 2 | Capítulo II. Que trata de la primera salida que de su tierra hizo el ingenioso don Quijote | Hechas, pues, estas prevenciones, no quiso aguardar más tiempo a poner en efeto su pensamiento, apretándole a ello la falta que él pensaba que hacía en el mundo su tardanza....
3 | 1 | 3 | Capítulo III. Donde se cuenta la graciosa manera que tuvo don Quijote en armarse caballero | Y así, fatigado deste pensamiento, abrevió su venteril y limitada cena; la cual acabada, llamó al ventero, y, encerrándose con él en la caballeriza, se hincó de rodillas ante él, diciéndole:
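One way to hold rows in this format is sketched below in Python. It rests on an assumption about the columns: `chapter_n` is read as a running chapter number across the whole work and `chapter_n_o` as the chapter number within its part. The `Chapter` class and `build_chapters` helper are hypothetical names, and the sample rows are shortened placeholders.

```python
from dataclasses import dataclass

@dataclass
class Chapter:
    chapter_n: int       # assumed: running chapter number across both parts
    part: int            # 1 or 2
    chapter_n_o: int     # assumed: chapter number within its part
    chapter_title: str
    chapter_text: str

def build_chapters(raw_chapters):
    """Turn (part, chapter_n_o, title, text) tuples, given in reading order, into Chapter rows."""
    return [
        Chapter(n, part, n_o, title, text)
        for n, (part, n_o, title, text) in enumerate(raw_chapters, start=1)
    ]

# Shortened placeholder rows, not the full text of the book.
raw = [
    (1, 1, "Capítulo primero", "En un lugar de la Mancha..."),
    (1, 2, "Capítulo II", "Hechas, pues, estas prevenciones..."),
    (2, 1, "Capítulo primero", "..."),
]
chapters = build_chapters(raw)
print(chapters[2].chapter_n, chapters[2].part, chapters[2].chapter_n_o)  # prints 3 2 1
```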

2. Dictionaries

The tidytext package provides three general-purpose lexicons for English: AFINN, bing and nrc.

All three of these lexicons are based on unigrams, i.e., single words. These lexicons contain many English words and the words are assigned scores for positive/negative sentiment, and also possibly emotions like joy, anger, sadness, and so forth. The nrc lexicon categorizes words in a binary fashion (“yes”/“no”) into categories of positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. The bing lexicon categorizes words in a binary fashion into positive and negative categories. The AFINN lexicon assigns words with a score that runs between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment. All of this information is tabulated in the sentiments dataset, and tidytext provides a function get_sentiments() to get specific sentiment lexicons without the columns that are not used in that lexicon.
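The three lexicon shapes described above can be pictured with miniature stand-ins. The following Python sketch is illustrative only: the entries are invented rather than taken from the real lexicons, and `word_scores` is a hypothetical helper that reduces each shape to a comparable number.

```python
# Hypothetical miniature lexicons; every entry is invented for illustration.
AFINN_LIKE = {"love": 3, "sad": -2}                   # word -> integer score between -5 and 5
BING_LIKE = {"love": "positive", "sad": "negative"}   # word -> one binary label
NRC_LIKE = {                                          # word -> set of categories
    "love": {"positive", "joy"},
    "sad": {"negative", "sadness"},
}

def word_scores(word):
    """Return the contribution of one word under each toy lexicon."""
    nrc_categories = NRC_LIKE.get(word, set())
    return {
        # AFINN-style: the graded score is used directly.
        "afinn": AFINN_LIKE.get(word, 0),
        # bing-style: a label becomes +1 or -1.
        "bing": {"positive": 1, "negative": -1}.get(BING_LIKE.get(word), 0),
        # nrc-style: only the positive/negative categories count toward polarity.
        "nrc": ("positive" in nrc_categories) - ("negative" in nrc_categories),
    }

print(word_scores("love"))  # prints {'afinn': 3, 'bing': 1, 'nrc': 1}
```

Because a graded lexicon can add several points per word while a binary one adds at most one, totals computed with different lexicons live on different scales even when they agree on direction.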

We can also use a lexicon for Spanish.

A comparison between the four dictionaries shows us the number of words that compose them:

Lexicon | Words
--- | ---
AFINN | 2,477
Bing et al. | 6,800
NRC | 14,182
URL Lexicon | 1,347
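Size alone does not say how much of a given text a dictionary can actually score. One possible complementary measure is coverage, the share of tokens that appear in the lexicon. The Python sketch below assumes the text is already tokenized; `coverage` is a hypothetical helper and the word sets are toy data, not the real lexicons.

```python
def coverage(tokens, lexicon_words):
    """Fraction of tokens that the lexicon is able to score at all."""
    if not tokens:
        return 0.0
    matched = sum(1 for token in tokens if token in lexicon_words)
    return matched / len(tokens)

tokens = "the good knight and his sad horse".split()
small_lexicon = {"good"}                    # toy data
large_lexicon = {"good", "sad", "knight"}   # toy data

# The larger toy lexicon matches 3 of the 7 tokens, the smaller one only 1.
print(coverage(tokens, small_lexicon) < coverage(tokens, large_lexicon))  # prints True
```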

3. Visualization

We will compare how the dictionaries perform across the whole work, chapter by chapter.
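A chapter-by-chapter comparison can be sketched as one net score per chapter and per lexicon. This Python sketch assumes word-to-score lexicons and uses invented toy data; `tokenize` and `chapter_sentiment` are hypothetical names, and it is not the code behind the original charts.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into words made of Unicode letters."""
    return re.findall(r"[^\W\d_]+", text.lower())

def chapter_sentiment(chapters, lexicons):
    """chapters: list of (chapter_n, text); lexicons: {name: {word: score}}.

    Returns {lexicon name: [(chapter_n, net score), ...]} in chapter order.
    """
    result = {}
    for name, lexicon in lexicons.items():
        result[name] = [
            (chapter_n, sum(lexicon.get(token, 0) for token in tokenize(text)))
            for chapter_n, text in chapters
        ]
    return result

# Toy chapters and toy lexicons, invented for illustration.
chapters = [(1, "a good and noble knight"), (2, "a sad and bad defeat")]
lexicons = {
    "toy_a": {"good": 3, "sad": -2, "bad": -3},
    "toy_b": {"good": 1, "noble": 1, "bad": -1},
}
print(chapter_sentiment(chapters, lexicons))
# prints {'toy_a': [(1, 3), (2, -5)], 'toy_b': [(1, 2), (2, -1)]}
```

Each list of (chapter, score) pairs can then be plotted as one line or bar series per lexicon.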

3. Analysis of the result

The four lexicons give results with fairly similar relative trajectories through the chapters.
In general, the dips and peaks fall in very similar places, but the absolute values differ significantly.

  • NRC, Bing et al. and AFINN have very similar relative trajectories.
  • The Spanish sentiment lexicon shares some dips and peaks with the AFINN lexicon.
  • The NRC lexicon appears to find more positive sentiment than the AFINN lexicon.
  • Bing et al. finds more negative sentiment.

4. Top words per feeling according to each dictionary

A second analysis that we can perform is to rank, for each dictionary, the words that contribute most to negative and positive sentiment.
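Such a ranking can be sketched by counting how often each dictionary word occurs in the text and splitting the counts by polarity. The Python sketch below ranks by frequency, which is one reasonable choice rather than necessarily the one used for the original charts; `top_words` is a hypothetical helper and the lexicon is toy data.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into words made of Unicode letters."""
    return re.findall(r"[^\W\d_]+", text.lower())

def top_words(text, lexicon, n=10):
    """Return the n most frequent positive and negative lexicon words in the text."""
    counts = Counter(token for token in tokenize(text) if token in lexicon)
    positive = Counter({word: c for word, c in counts.items() if lexicon[word] > 0})
    negative = Counter({word: c for word, c in counts.items() if lexicon[word] < 0})
    return {"positive": positive.most_common(n), "negative": negative.most_common(n)}

toy_lexicon = {"good": 1, "love": 1, "bad": -1, "sad": -1}  # toy data
ranking = top_words("good good love bad bad bad sad", toy_lexicon, n=2)
print(ranking["positive"])  # prints [('good', 2), ('love', 1)]
print(ranking["negative"])  # prints [('bad', 3), ('sad', 1)]
```

A ranking like this also makes questionable entries easy to spot, since a very frequent word with an unexpected polarity rises straight to the top.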

5. Analysis of the use of different dictionaries

Comparing the top 10 positive and negative words, we find:

  • Some words are common to the dictionaries: good, gran, bueno, god, love.
  • Translation errors: when working with a translated work, some words are assigned a feeling that they do not have. For example, the NRC dictionary treats the word “Don” as an extraordinary gift or skill and hence gives it a positive value, but in the Old Spanish of this book (“Don Quijote de la Mancha” was written more than 400 years ago), “Don” is a courtesy title for gentlemen.

Conclusions

Based on the previous analysis, the dictionaries with the most words perform better: they obtain more accurate results and detect more sentiment. Even when the work is analyzed in its original language, if the selected dictionary contains fewer terms, the result obtained is imprecise.