Exploring the replies to a question on Twitter using Python and R

One of the most social phenomena on Twitter is the open question that anonymous users pose to others.

It is pretty common to see simple questions like "What do you prefer: white wine or red wine?" or "What do you choose for your holidays: sea or mountain?" pass through our timeline almost every day, and in most cases they attract only a few answers, perfectly countable at a glance.

However, a very few of those questions go viral, and due to the volume of answers they need an extra effort to determine the winning answer. In this article I will examine how to determine the preferred choice based on the replies from other users.


Our case study

A few days ago, @sisifrabeta, one of the countless users on Twitter, asked their followers which house they preferred.


General considerations 

One of the greatest strengths of a network like Twitter is the possibility of throwing a question out into the open and having countless strangers answer it. There are two ways to answer a specific question: with a poll, which leaves the responsibility of counting the votes to Twitter, or with plain replies, which leaves that responsibility to the users. Of course, we are analyzing the second case, or this post would have no reason to exist 🙂

While the majority of the comments do answer the question, the choice of words is broad and varied, as is natural in any language. For instance, someone may write "uno" while someone else writes "primero"; it is therefore necessary to normalize the answers into a single form in order to effectively answer the question, in this case: which house do the followers prefer?
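
For example, here is a minimal sketch in R of this kind of normalization, using made-up replies (the actual mapping applied to this dataset appears in step 3 below):

library(stringr)

# Hypothetical replies written in different ways
replies_example <- c("La uno, sin duda", "Primera!", "Me quedo con la segunda")

# Named vector: the names are the patterns, the values are the replacements
mapping <- c("uno|primera" = "1", "dos|segunda" = "2")

str_replace_all(tolower(replies_example), mapping)
# "la 1, sin duda"  "1!"  "me quedo con la 2"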

One important caveat: although tweets stay forever in the account of whoever wrote them, the window to retrieve them from Python or R through the standard search API is limited to roughly the last few days. For that reason, and to make the following steps easier to follow, I am sharing the database of the tweets analyzed in this article.


Answer

The house they prefer the most is number 1. Indeed, the author of this article, like the majority, prefers this house over the rest.


How to solve this riddle?

Just four steps to shed light on our question.

  1. Capture all the responses to the original tweet using Python.
  2. Create a dataframe using pandas.
  3. Clean and merge the replies to count the votes for each option using R.
  4. Plot a plain graph showing the winner using R.


The code


1. Capture all responses to the original tweet using Python.

import tweepy
import pandas as pd

# Fill in your own Twitter API credentials
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth)

# Update these for whatever tweet you want to process replies to
name = 'sisifrabeta'
target_tweet_id = "1254778611971747843"

# Search recent tweets addressed to the author and keep only the direct replies
# (note: in Tweepy v4 this method is called api.search_tweets instead of api.search)
replies = []
for tweet in tweepy.Cursor(api.search, q='to:' + name, result_type='recent', timeout=999999).items(1000):
    if hasattr(tweet, 'in_reply_to_status_id_str'):
        if tweet.in_reply_to_status_id_str == target_tweet_id:
            replies.append(tweet)

tweet_pd = pd.DataFrame(replies)  # a tidier dataframe is built in step 2 below


2. Creating a dataframe using pandas

def tweetstoDataFrame(tweets):

    DataSet = pd.DataFrame()

    DataSet['tweetID'] = [tweet.id for tweet in tweets]
    DataSet['tweetText'] = [tweet.text for tweet in tweets]
    DataSet['userLocation'] = [tweet.user.location for tweet in tweets]  # used in step 3
    # ... add any other columns you need following the same pattern

    return DataSet

# Pass the list of replies to the function above to create a DataFrame
DataSet = tweetstoDataFrame(replies)

# Export the dataset to CSV so it can be read from R
DataSet.to_csv('tweets.csv')


Our dataframe


3. Cleaning and merging to count the votes for each option using R

library(dplyr)     # pipes, select, group_by, summarize, bind_rows, anti_join
library(tidytext)  # stop_words, unnest_tokens
library(ggplot2)   # for the chart in step 4

dat_tweets <- read.csv("tweets.csv", header = TRUE)

# Count the replies by user location (uses the userLocation column from step 2)
dat_tweets_loc <- dat_tweets %>% 
  select(userLocation) %>% 
  group_by(userLocation) %>%
  summarize(count = n())

# Clean and unify all the tweets: map every way of naming an option to a single label
# (the names of the vector are the patterns, the values are the replacements)
a <- c('1', '2', '3', '4', '')
names(a) <- c('uno|primera', 'dos|segunda', 'tres|tercera', 'cuatro|cuarta', '@sisifrabeta')
dat_tweets$tweetTexttidy <- stringr::str_replace_all(dat_tweets$tweetText, a)

# Add Spanish stop words to the default English ones
custom_stop_words <- bind_rows(stop_words,
                               tibble(word = tm::stopwords("spanish"),
                                      lexicon = "custom"))

# Tokenize the cleaned text and drop the stop words
tweets <- dat_tweets %>%
  select(tweetTexttidy) %>%
  unnest_tokens(word, tweetTexttidy)
tweets <- tweets %>%
  anti_join(custom_stop_words)
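
Before plotting, it can help to peek at the raw counts for the four options. A quick check (assuming the mapping above has turned every valid reply into one of the tokens "1" to "4"):

# Sanity check: how many votes did each of the four options receive?
tweets %>%
  count(word, sort = TRUE) %>%
  filter(word %in% c('1', '2', '3', '4'))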


4. Final outcome: a plain graph showing the winner using R. 

# A bar chart of the most frequent words (i.e. options) found in the replies
tweets %>%
  count(word, sort = TRUE) %>%
  top_n(4) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n, fill = as.factor(n))) +
  geom_col()
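
If you prefer a slightly more polished chart, the same pipeline can be extended; here is one possible variation (the axis labels are my own choice, not from the original post):

tweets %>%
  count(word, sort = TRUE) %>%
  top_n(4) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n, fill = as.factor(n))) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  labs(x = "Option", y = "Number of replies")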