Sentiment Analysis Project in R

In the last post, we talked about getting data from Twitter. In this post, we will talk about sentiment analysis on text, either from the tweets we just downloaded into our data set or from full text documents. If you are not familiar with R, check out this post to begin.

What is sentiment analysis?

As the name implies, sentiment analysis is the computational task of automatically determining what feelings a writer is expressing in text. Sentiment is often framed as a binary distinction (positive vs. negative), but it can also be more fine-grained, like identifying the specific emotion an author is expressing (such as fear, joy or anger).

Sentiment analysis is used for many applications, especially in business intelligence. Some examples of applications for sentiment analysis include:

  • Analyzing the social media discussion around a certain topic
  • Evaluating survey responses
  • Determining whether product reviews are positive or negative

Since it is an automatic process based on rules or statistics, it is not perfect, and as with any automatic analysis of language, you will have errors in your results. It also cannot tell you why a writer is feeling a certain way. However, it can be useful for quickly summarizing some qualities of a text, especially if you have so much text that a human reader cannot analyze all of it.


How does it work?

There are many ways to do sentiment analysis, but many approaches share the same general idea:

  1. Create or find a list of words associated with strongly positive or negative sentiment.
  2. Count the number of positive and negative words in the text.
  3. Analyze the mix of positive to negative words. Many positive words and few negative words indicates positive sentiment, while many negative words and few positive words indicates negative sentiment.

Such a word list is called a lexicon. While you can often use a lexicon that already exists, if your text discusses a specific topic you may need to add to or modify it.

“Sick” is an example of a word that can have positive or negative sentiment depending on what it’s used to refer to. If you’re discussing a pet store that sells a lot of sick animals, the sentiment is probably negative. On the other hand, if you’re talking about a skateboarding instructor who taught you how to do a lot of sick flips, the sentiment is probably very positive.
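To make those three steps concrete before we touch any real data, here is a minimal sketch in base R; the tiny word lists and the example sentence are made up purely for illustration:

# step 1: a toy lexicon (real lexicons contain thousands of words)
positive_words <- c("good", "great", "happy", "love")
negative_words <- c("bad", "sad", "awful", "hate")

# an example sentence to score
text  <- "the food was great but the service was awful and the music was bad"
words <- unlist(strsplit(tolower(text), " "))

# step 2: count the positive & negative words
n_positive <- sum(words %in% positive_words) # 1 ("great")
n_negative <- sum(words %in% negative_words) # 2 ("awful", "bad")

# step 3: compare the counts; here 1 - 2 = -1, so the sentence leans negative
n_positive - n_negative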

Let’s begin

For this tutorial, we’re going to analyze the sentiment of State of the Union addresses, the speech given by the President of the United States to a joint session of Congress every year. I’m interested in seeing how sentiment has changed over time, from 1989 to 2017, and whether different presidents tend to have more negative or more positive sentiment.

First, let’s load in the libraries we’ll use and our data.

You can get the dataset here.

# load in the libraries we'll need
library(tidyverse)
library(tidytext)
library(glue)
library(stringr)

# get a list of the files in the input directory
files <- list.files("../input")
Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages ---------------------------------------------------
filter(): dplyr, stats
lag():    dplyr, stats

Attaching package: ‘glue’

The following object is masked from ‘package:dplyr’:

    collapse

Let’s start with the first file. The first thing we need to do is tokenize it, or break it into individual words.

# stick together the path to the file & 1st file name
fileName <- glue("../input/{files[1]}")
# get rid of any sneaky trailing spaces
fileName <- trimws(fileName)

# read in the file
fileText <- read_file(fileName)
# remove any dollar signs ($ is a special character in regular expressions, so it's escaped)
fileText <- gsub("\\$", "", fileText)

# tokenize: one row per word
tokens <- tibble(text = fileText) %>% unnest_tokens(word, text)
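If you want to sanity-check the tokenizer, you can peek at the result; by default unnest_tokens() also lowercases each word:

# quick sanity check: one (lowercased) word per row
head(tokens)

# the most frequent tokens are usually function words like "the" and "of"
tokens %>% count(word, sort = TRUE) %>% head()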

Now that we have a list of tokens, we need to compare them against a list of words with either positive or negative sentiment.

A list of words associated with a specific sentiment is usually called a “sentiment lexicon”.

Because we’re using the tidytext package, we actually already have some of these lists available. I’m going to be using the “bing” list, which was developed by Bing Liu and co-authors.
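If you’re curious what’s in this lexicon, you can inspect it before joining; it’s just a two-column data frame of words and their sentiment labels:

# peek at the bing lexicon: one row per word, labeled "positive" or "negative"
get_sentiments("bing") %>% head()

# how many words of each polarity it contains
get_sentiments("bing") %>% count(sentiment)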

# get the sentiment from the first text: 
tokens %>%
  inner_join(get_sentiments("bing")) %>% # pull out only sentiment words
  count(sentiment) %>% # count the # of positive & negative words
  spread(sentiment, n, fill = 0) %>% # make data wide rather than narrow
  mutate(sentiment = positive - negative) # # of positive words - # of negative words
Joining, by = "word"
negative positive sentiment
     117      240       123

So this text has 117 negative polarity words and 240 positive polarity words. This means that there are 123 more positive than negative words in this text.

Now that we know how to get the sentiment for a given text, let’s write a function to do this more quickly and easily, and then apply that function to every text in our dataset.

# write a function that takes the name of a file and returns the # of positive
# sentiment words, negative sentiment words, and the difference
GetSentiment <- function(file){
    # build the path to the file
    fileName <- glue("../input/{file}")
    # get rid of any sneaky trailing spaces
    fileName <- trimws(fileName)

    # read in the file
    fileText <- read_file(fileName)
    # remove any dollar signs ($ is a special character in regular expressions, so it's escaped)
    fileText <- gsub("\\$", "", fileText)

    # tokenize
    tokens <- tibble(text = fileText) %>% unnest_tokens(word, text)

    # get the sentiment counts for this text
    sentiment <- tokens %>%
      inner_join(get_sentiments("bing")) %>% # pull out only sentiment words
      count(sentiment) %>% # count the # of positive & negative words
      spread(sentiment, n, fill = 0) %>% # make data wide rather than narrow
      mutate(sentiment = positive - negative) %>% # # of positive words - # of negative words
      mutate(file = file) %>% # add the name of our file
      mutate(year = as.numeric(str_match(file, "\\d{4}"))) %>% # add the year
      mutate(president = str_match(file, "(.*?)_")[2]) # add the president

    # return our sentiment data frame
    return(sentiment)
}

# test: should return
# negative	positive	sentiment	file	year	president
# 117	240	123	Bush_1989.txt	1989	Bush
GetSentiment(files[1])
Joining, by = "word"
negative positive sentiment file          year president
     117      240       123 Bush_1989.txt 1989 Bush

Now, let’s apply our function over every file in our dataset. We’ll also need to make sure we can tell the difference between the two presidents named “Bush”: George H. W. Bush (Bush Sr.) and George W. Bush.

# empty tibble to put our output in
sentiments <- tibble()

# get the sentiments for each file in our dataset
for(i in files){
    sentiments <- rbind(sentiments, GetSentiment(i))
}

# disambiguate Bush Sr. and George W. Bush
# correct the president in the applicable rows
bushSr <- sentiments %>% 
  filter(president == "Bush") %>% # get rows where the president is named "Bush"...
  filter(year < 2000) %>% # ...and the year is before 2000
  mutate(president = "Bush Sr.") # and change "Bush" to "Bush Sr."

# remove the incorrect rows
sentiments <- anti_join(sentiments, sentiments[sentiments$president == "Bush" & sentiments$year < 2000, ])

# add the corrected rows back to the data frame
sentiments <- full_join(sentiments, bushSr)

# summarize the sentiment measures
summary(sentiments)
Joining, by = "word"
Joining, by = "word"
Joining, by = "word"
Joining, by = "word"
Joining, by = "word"
Joining, by = "word"
Joining, by = "word"
Joining, by = "word"
Joining, by = "word"
Joining, by = "word"
Joining, by = "word"
Joining, by = "word"
Joining, by = "word"
Joining, by = "word"
Joining, by = "word"
Joining, by = "word"
Joining, by = "word"
Joining, by = "word"
Joining, by = "word"
Joining, by = "word"
Joining, by = "word"
Joining, by = "word"
Joining, by = "word"
Joining, by = "word"
Joining, by = "word"
Joining, by = "word"
Joining, by = "word"
Joining, by = "word"
Joining, by = "word"
Joining, by = c("negative", "positive", "sentiment", "file", "year", "president")
Joining, by = c("negative", "positive", "sentiment", "file", "year", "president")
    negative        positive       sentiment         file          
 Min.   : 49.0   Min.   :169.0   Min.   : 43.0   Length:29         
 1st Qu.:122.0   1st Qu.:240.0   1st Qu.:102.0   Class :character  
 Median :144.0   Median :289.0   Median :140.0   Mode  :character  
 Mean   :143.5   Mean   :280.4   Mean   :136.9                     
 3rd Qu.:161.0   3rd Qu.:310.0   3rd Qu.:171.0                     
 Max.   :223.0   Max.   :438.0   Max.   :277.0                     
      year       president        
 Min.   :1989   Length:29         
 1st Qu.:1996   Class :character  
 Median :2003   Mode  :character  
 Mean   :2003                     
 3rd Qu.:2010                     
 Max.   :2017                     

It looks like every State of the Union address in this dataset has an overall positive sentiment (according to this measure). This isn’t very surprising: most text, especially formal text, tends to have a positive skew.

Let’s plot our sentiment analysis scores to see if we can notice any other patterns. Has sentiment changed over time? What about between presidents?

# plot of sentiment over time & automatically choose a method to model the change
ggplot(sentiments, aes(x = as.numeric(year), y = sentiment)) + 
  geom_point(aes(color = president))+ # add points to our plot, color-coded by president
  geom_smooth(method = "auto") # pick a method & fit a model
`geom_smooth()` using method = 'loess' and formula 'y ~ x'

While it looks like there haven’t been any strong trends over time, the plot does suggest that presidents from the Democratic party (Clinton and Obama) tend toward slightly more positive sentiment than presidents from the Republican party (Bush Sr., Bush and Trump). Let’s look at individual presidents and see if that pattern holds:

# plot of sentiment by president
ggplot(sentiments, aes(x = president, y = sentiment, color = president)) + 
  geom_boxplot() # draw a boxplot for each president

It looks like this is a pretty strong pattern. Let’s directly compare the two parties to see if there’s a reliable difference between them. We’ll need to manually label which presidents were Democratic and which were Republican, and then test whether there’s a difference in their sentiment scores.

# is the difference between parties significant?
# get the democratic presidents & add party affiliation
democrats <- sentiments %>%
    filter(president %in% c("Clinton", "Obama")) %>%
    mutate(party = "D")

# get the republican presidents & add party affiliation
republicans <- sentiments %>%
    filter(president != "Clinton" & president != "Obama") %>%
    mutate(party = "R")

# join both
byParty <- full_join(democrats, republicans)

# t-test: is the difference between the parties significant?
t.test(democrats$sentiment, republicans$sentiment)

# plot sentiment by party
ggplot(byParty, aes(x = party, y = sentiment, color = party)) + geom_boxplot() + geom_point()
Joining, by = c("negative", "positive", "sentiment", "file", "year", "president", "party")
	Welch Two Sample t-test

data:  democrats$sentiment and republicans$sentiment
t = 2.2442, df = 16.606, p-value = 0.03876
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  2.256549 75.320374
sample estimates:
mean of x mean of y 
 149.2500  110.4615 

So it looks like there is a reliable difference in the sentiment of the State of the Union addresses given by Democratic and Republican presidents, at least from 1989 to 2017.

There are a couple of things to keep in mind with this analysis, though:

  • We didn’t correct for the length of the documents. It could be that the State of the Union addresses from Democratic presidents have more positive words because they are longer, rather than because they are more positive (one possible fix is sketched at the end of this post).
  • We’re using a general-purpose list of words rather than one specifically designed for analyzing political language. Furthermore, we only used one sentiment analysis lexicon.

With these two things in mind, I’ve written a couple of exercises for you. You can fork this notebook and continue working in the code chunks below.
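As a head start on the first exercise, here is one possible (not the only) way to correct for document length: express the net sentiment as a share of all sentiment words in each address. The column names total and sentiment_ratio are just illustrative choices I’m introducing here, not part of the dataset:

# one rough length correction: divide net sentiment by the number of
# sentiment words in each address (a proxy for document length, since
# our data frame doesn't store total word counts)
sentiments_normalized <- sentiments %>%
  mutate(total = positive + negative,
         sentiment_ratio = (positive - negative) / total)

# re-draw the boxplot with the normalized scores
ggplot(sentiments_normalized, aes(x = president, y = sentiment_ratio, color = president)) +
  geom_boxplot()

For the second point, get_sentiments() can also load other lexicons, such as “afinn” (numeric scores) and “nrc” (emotion categories plus positive/negative), though the counting pipeline above would need small changes to handle AFINN’s numeric scores.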