Archive for the ‘Scraping Twitter’ Category

Twitter Data Scraping and Insights Using Excel

June 7th, 2012
by Tim Resnik

This is a follow up to a post I did a few weeks ago about using Excel to scrape key user data from Twitter. In that post I set up a spreadsheet that was tracking 20 of the most followed people in the SEO industry. After collecting that data on a daily basis over the last few weeks, I am going to use this post to demonstrate some basic analysis that can be done. It is not meant to be an Excel tutorial. For one of those, I recommend you check out Distilled’s guide on Excel for SEO, and of course Richard Baxter has a wealth of Excel tips and tutorials on his blog.

Step 1: Format Your Data

Here’s the spreadsheet that goes along with this post - Twitter Scrape Analysis – Excel

I’ve been collecting Twitter data on these 20 SEOs for about 30 days. That’s about 600 lines of data in my Excel spreadsheet. Pretty modest from a data analysis perspective, but we still need to make sure that things are formatted kindly. There are really only two things we’ll be doing with the data 1) adding a few columns for daily trending, i.e. number of followers gained or lost per day, and 2) creating a pivot table. In order to add the trending we are going to some data sorting, so I suggest that you put your data into a table. (Select all the data and CRTL + L).

Step 2: Basic Analysis Inline with your Data

The spreadsheet of our data is attached, but for quick review, our columns look like this:

data shot of twitter handles


There are two columns I added next for trending: one for calculating the daily followers gained or lost and a column to count the number of Tweets broadcast. You could do a similar column for following and listed. Next, I perform a quick and dirty formula to calculate the daily trend. There is probably a more elegant solution, but I found this to work, so I went with it:

  • Sort, by handle and then by date (oldest to newest)
Excel Sort
  • Formula: IF(Table2[[#This Row],[handle]]=A1,Table2[[#This Row],[followers]]-C1,”start”). This function checks to see if the two numbers you are calculating belong to the same handle. Then does basic subtraction. This will give you the difference between the row you are looking at and the day before.
  • Clean-up: this is an important step. Copy the entire table and paste values in order to overwrite the formulas with the values. If you don’t, the numbers will change when sorting. Next, sort the follower gain/loss column and delete all the fields that say “start”. We want to make sure we only have number in this column so the pivot tables and graphs translate properly when we are building our visualizations.

Step 3: Visualize

Let’s make the data more meaningful and present that data in a simple way that could bubble up some insights. There are a hundred different tools and visualization methods that you could choose, but I am going to keep it very simple and use a pivot table for this example. The main benefit of a pivot table is to summarize like data points, so in this case we want to see basic trending information by each twitter handle that we are tracking.

In the data tab select the whole table, and under the “insert” ribbon tab click on ‘Pivot’ table. A new Pivot table tab will be generated with the field selector open. In the field selector drag handles into the row labels field and followers. Then, add follower gain/loss and daily tweets to the values field. Make sure to select “value field setting” and then select “average” for each. Excel defaults to SUM, which would provide the sum of all the entries in the data table for the handle.

Excel Pivot Table Field Settings

To make it’s more readable sort by followers by clicking under more sort options next to “Row Labels”. Then select “Descending (Z to A) by:” “Average of Followers”:

Excel Pivot Sort


Next, add some conditional formatting to each row so some of the outliers pop out. You have to format each row separately or it will use the largest number, in this case dannysullivan’s followers as the baseline for all fields.

Excel Conditional Formatting

The keen observer has probably noticed that using followers gain/loss as a measuring stick to judge one Twitter account from another is not an apples to apples comparison. Followers beget more followers and Danny Sullivan and Matt Cutts have the advantage in this group. What is more telling is to normalize the data and compare the ratio of average follower gained to the sum of total followers. This can be done by selecting a single field of the pivot table under ‘Average of follower gain/loss’ column and then selecting options (in the ribbon)>formulas>calculated fields. Looks something like the below. Simply double-click in ‘follower gain/loss’, divide that by ‘followers’ and hit OK.

Excel Calculated Field

Make sure to format the new column into percentage, or it will just round off to zero, and that’s not very useful insight. Your end product should look something like the below and will provide you some insights into which handles are the outliers of your group.

twitter trending visualization

In this case there are a few things that stand out, including the fact that Aaron Wall has been losing followers at the fastest rate. He also has not Tweeted in nearly a month. Is his lack of engagement with his followers causing unfollows? It could be, but correlation is not causation as our friends at SEOmoz remind us. A deeper dive into the data may help us reveal something… Ah-ha! In looking at the raw data we see that aaronwall got dropped by 284 people in an 8 day period  (note: he no longer uses this account to Tweet, rather SEObook, but the decrease in followers was dramatically higher over this 8 day period than compared to the other days in this data set). In doing a little digging, a blog posted by AaronWall on SEObook entitled “Educating the Market: Is outing & writing polarizing drivel hate baiting or a service to the community” caused a bit of an uproar. Could the post on SEObook have had a negative impact on the @AaronWall Twitter profile? A similar analysis could be done by a publisher on a journalist who wrote a controversial piece of content. However, to really understand what’s going on we need to explore tone and sentiment. I’ll be mashing this data with open source sentiment analysis tools in future blogs. Stay tuned…

Enhanced by Zemanta

Mining Twitter Data with Excel and other tips for Twitter Analytics

April 23rd, 2012
by Tim Resnik

We work with several publishing clients that manage a multitude of Twitter accounts. We frequently run across the problem of spending too much time compiling a large cross section of key Twitter stats for multiple accounts. For example, we have a client that has over 20 Twitter accounts for unique brands unique brands. Trying to collect and analyze follower and status update counts is a tedious manual process, and not recommended.

There are two methods to accomplish data scraping from Twitter:

1)      Develop a web application using the Twitter API

2)      Use a mashup of tools mixed with Excel or Google Docs.

Building a web application to perform this scraping function has obvious financial and time barriers, but may be worth the effort or licensing fee in the long run.

Using Google Docs and ImportXML can handle this function nicely and has been covered thoroughly (I mean thoroughly) by Distilled in their ImportXML guide . Also, John Doherty makes good use of the API and Google Docs and has created a link prospecting tool.  What I want to cover in this blog is how to collect basic Twitter following/follower stats across several accounts using Excel and Niels Bosma’s badass plugin: SEO Tools for Excel. Since the SEO Tools plugin is doing all the heavy lifting in this example, Niels should really get all the credit.

Step 1: Install SEO Tools for Excel

There are a lot of other sweet features, but for this example we are just going to use the scraping function:

 SEO Tools for Excel

Step 2: Build a list of Twitter handles that you would like to collect via the Twitter API

First you need to figure out who you are interested in grabbing the data for. This could be your competition, your employees, different products within your company, punk bands, or whatever group of Tweeters you want to analyze. You can use services like  WeFollow and Listorious to generate ideas. Compile the list and put the handles in the first column of your spreadsheet (don’t include a preceding @). With this particular XML feed here are some of the relevant things you can pull:

  • ID: the numerical unique identifier for the account. This is a handy key to have because it will never change while the screen name can be changed by the owner of the account
  • Name: the name of the person who registered the account
  • Location: the geo-location of where the account was created
  • Description: the user-created description for the account that shows up right under the screen name on Twitter.
  • Profile img URL: the location of the profile image for that account
  • Followers count: number of followers that this account has
  • Friends count: number of accounts that this account is following
  • Created at: the date and time that this account was created
  • Status count: number of tweets since the account was created
  • Listed count: the number of times that this account has been included in another accounts lists
Or, checkout all the gory details

Here is what my column headers look like:

 Excel Column Headers

You can download the example XLS file here. The formulas are all pre-populated so all you need to do is enter the Twitter names (30 max due to API rate limiting) that you want the information for in the “Twitter Handle” column. If you want to build it out yourself, below are the details.

Step 3: Construct the API Request URL (sheet 2)

We must create the proper URL to be able to pull this information from Twitter. In this case we are using the Twitter REST API resource: GET users/show. This is one of a handful of API calls that does not require user authentication.

There are two required components and one optional to the URL.


  • Request URL:


  • Screen name: the Twitter handle you would like information for

So, it will end up looking like this:

Optional: include entities. This API requests returns the latest tweet for the user info requested. If you would like to include information about the tweet such as user mentions, hashtags or associated URLs, then also include the following to the end of the URL: &include_entities=true. The final URL would look like this:

We now have to construct the URL. In sheet 2 of the spreadsheet you can see that I am simply concatenating two fields: the Twitter handle from sheet 1 and the URL from the current sheet.

Step 4: Write Your Formulas and Populate the Array

Nothing too fancy here thanks to SEO Tools for Excel. There is a pre-built function that does all the heavy lifting. You can select the XPath (reads XML) or JSON option with the Twitter API. In this example we will use XML.

The XPath function has two inputs: the URL to call the XML file, which we constructed above and is in sheet 2, and the instructions for selecting the right information in the file (the proper node). Here is a very basic write up of how to use XPath to select nodes within an XML file.


Once you have your formula written for each column, fill in the rest of your array and wait a bit as it fills in the data. The Twitter REST API is limited to 150 requests per hour and each cell in the array you have just created is an API call. In this example we have 5 columns and 25 Twitter handles so filling out this array once will be just 25 short of the hourly cap.

Step 5: Some Basic Analysis

Now that we have an array of data for a group of Twitter users we can do some very basic analysis. In follow up blogs, I will collect the data over time and do a little deeper look at trending analysis.

In sheet 3 you will find a few graphs based on this data. It is important to note that the data in sheet 1 is formatted as text because Excel is keying off of the formula. This makes analysis difficult on that tab itself. I copied and pasted the values to sheet 3(Copy>Paste Special>Values) and then converted the text to numbers.

I wanted to compare the number of followers juxtaposed with the average number of tweets per day. Twitter returns the date as (Day of week, Month, Day, Time Stamp and Year). In order to format the date so Excel can understand it you must pull the right information out of the ‘Date Created’ :

Format Dates in Excel

Excel still can’t understand it because it is reading the formula instead of what is being visually shown in the cell. In order to get it to work, use Copy>Paste Special>Values to overwrite the formula, go to Format Cells>Date> format as 04/23/2012.Create a new column called Avg. Tweets per day. You now need to figure out how many days have passed from the time the account was created and divide into that the total number of tweets.

Average Tweets per Day

Next, you can select the data in the Followers and Avg. Tweets per Day columns to generate your graph. I won’t go over the details on chart formatting here, but here is the end result:

followers to tweets graph


Follow up post: Twitter Data Scraping and Insights Using Excel