We work with several publishing clients that manage a multitude of Twitter accounts. We frequently run into the problem of spending too much time compiling a large cross section of key Twitter stats for multiple accounts. For example, we have a client that has over 20 Twitter accounts for unique brands. Collecting and analyzing follower and status update counts by hand is tedious, and not recommended.
There are two methods to accomplish data scraping from Twitter:
1) Develop a web application using the Twitter API
2) Use a mashup of tools mixed with Excel or Google Docs.
Building a web application to perform this scraping function has obvious financial and time barriers, but may be worth the effort or licensing fee in the long run.
Using Google Docs and ImportXML can handle this function nicely and has been covered thoroughly (I mean thoroughly) by Distilled in their ImportXML guide. Also, John Doherty makes good use of the API and Google Docs and has created a link prospecting tool. What I want to cover in this post is how to collect basic Twitter following/follower stats across several accounts using Excel and Niels Bosma’s badass plugin: SEO Tools for Excel. Since the SEO Tools plugin is doing all the heavy lifting in this example, Niels should really get all the credit.
Step 1: Install SEO Tools for Excel.
There are a lot of other sweet features, but for this example we are just going to use the scraping function:
Step 2: Build a list of Twitter handles whose data you would like to collect via the Twitter API
First you need to figure out who you are interested in grabbing the data for. This could be your competition, your employees, different products within your company, punk bands, or whatever group of Tweeters you want to analyze. You can use services like WeFollow and Listorious to generate ideas. Compile the list and put the handles in the first column of your spreadsheet (don’t include a preceding @). With this particular XML feed here are some of the relevant things you can pull:
- ID: the numerical unique identifier for the account. This is a handy key to have because it will never change while the screen name can be changed by the owner of the account
- Name: the display name set on the account
- Location: the user-defined location entered in the account profile
- Description: the user-created description for the account that shows up right under the screen name on Twitter.
- Profile img URL: the location of the profile image for that account
- Followers count: number of followers that this account has
- Friends count: number of accounts that this account is following
- Created at: the date and time that this account was created
- Status count: number of tweets since the account was created
- Listed count: the number of times that this account has been included in other accounts’ lists
Here is what my column headers look like:
You can download the example XLS file here. The formulas are all pre-populated so all you need to do is enter the Twitter names (30 max due to API rate limiting) that you want the information for in the “Twitter Handle” column. If you want to build it out yourself, below are the details.
Step 3: Construct the API Request URL (sheet 2)
We must create the proper URL to be able to pull this information from Twitter. In this case we are using the Twitter REST API resource: GET users/show. This is one of a handful of API calls that does not require user authentication.
The URL has two required components and one optional component.
- Request URL: https://api.twitter.com/1/users/show.xml?screen_name=
- Screen name: the Twitter handle you would like information for
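So, it will end up looking like this (with [handle] standing in for the Twitter handle):
https://api.twitter.com/1/users/show.xml?screen_name=[handle]
Optional: include entities. This API request returns the latest tweet along with the requested user info. If you would like to include information about the tweet such as user mentions, hashtags or associated URLs, then also add the following to the end of the URL: &include_entities=true. The final URL would look like this:
https://api.twitter.com/1/users/show.xml?screen_name=[handle]&include_entities=true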
We now have to construct the URL. In sheet 2 of the spreadsheet you can see that I am simply concatenating two fields: the Twitter handle from sheet 1 and the URL from the current sheet.
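If you would rather see the concatenation outside of Excel, here is a minimal Python sketch of the same idea. The function name build_request_url and the sample handles are made up for illustration; only the endpoint string and the include_entities parameter come from this post. It only builds the URL strings and does not call the API.

```python
from urllib.parse import urlencode

# Endpoint string taken from the post; everything else here is illustrative.
BASE_URL = "https://api.twitter.com/1/users/show.xml"


def build_request_url(handle, include_entities=False):
    """Return the users/show request URL for one Twitter handle."""
    # Drop a leading @ so the handle matches the spreadsheet convention.
    params = {"screen_name": handle.lstrip("@")}
    if include_entities:
        params["include_entities"] = "true"
    return BASE_URL + "?" + urlencode(params)


# Hypothetical handles standing in for column A of sheet 1.
handles = ["example_handle", "@another_example"]
urls = [build_request_url(h) for h in handles]
for url in urls:
    print(url)
```

Using urlencode rather than plain string concatenation is a small design choice: it escapes any unusual characters for you, which the spreadsheet version does not.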
Step 4: Write Your Formulas and Populate the Array
Nothing too fancy here thanks to SEO Tools for Excel. There is a pre-built function that does all the heavy lifting. You can select the XPath (reads XML) or JSON option with the Twitter API. In this example we will use XML.
The XPath function has two inputs: the URL to call the XML file, which we constructed above and is in sheet 2, and the instructions for selecting the right information in the file (the proper node). Here is a very basic write up of how to use XPath to select nodes within an XML file.
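To make the node selection concrete, here is a rough Python sketch that does the same kind of lookup with the standard library instead of the plugin. The XML below is a hand-written stand-in with invented values, not a real API response, and the helper name select_fields is my own. I am assuming the response wraps everything in a single user element and nests the latest tweet in a status element.

```python
import xml.etree.ElementTree as ET

# Hand-written sample with invented values; a real response has more fields.
SAMPLE_XML = """<user>
  <id>12345</id>
  <name>Example Account</name>
  <screen_name>example_handle</screen_name>
  <followers_count>150</followers_count>
  <friends_count>75</friends_count>
  <created_at>Mon Apr 23 12:00:00 +0000 2012</created_at>
  <statuses_count>250</statuses_count>
  <listed_count>3</listed_count>
  <status>
    <id>99999</id>
    <created_at>Thu May 03 09:30:00 +0000 2012</created_at>
  </status>
</user>"""


def select_fields(xml_text, fields):
    """Return {field: text} for direct children of the root element."""
    root = ET.fromstring(xml_text)
    # "./name" matches only direct children of the root, so the id and
    # created_at nested inside <status> are not picked up by mistake.
    return {field: root.findtext("./" + field) for field in fields}


stats = select_fields(
    SAMPLE_XML,
    ["id", "followers_count", "friends_count", "created_at", "statuses_count"],
)
print(stats)
```

Every value comes back as text, which mirrors what happens in the spreadsheet: you still have to convert the counts to numbers before doing any math on them.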
Once you have your formula written for each column, fill in the rest of your array and wait a bit as it fills in the data. The Twitter REST API is limited to 150 requests per hour and each cell in the array you have just created is an API call. In this example we have 5 columns and 25 Twitter handles, so filling out this array once uses 125 calls, just 25 short of the hourly cap.
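The rate-limit math is simple enough to sanity-check in a few lines of Python. This is just the arithmetic from the paragraph above; the function names are mine, and the 150-per-hour figure is the one quoted in this post.

```python
# Hourly cap quoted in the post for the Twitter REST API.
HOURLY_LIMIT = 150


def calls_needed(num_handles, num_columns):
    """Each cell in the array is one API call."""
    return num_handles * num_columns


def max_handles(num_columns, limit=HOURLY_LIMIT):
    """Largest number of handles that fits under the hourly cap."""
    return limit // num_columns


used = calls_needed(25, 5)
print(used, HOURLY_LIMIT - used, max_handles(5))
```

With 5 columns the cap works out to 30 handles per hour, which is where the 30-handle limit on the example spreadsheet comes from. Add a sixth column and the ceiling drops to 25.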
Step 5: Some Basic Analysis
Now that we have an array of data for a group of Twitter users we can do some very basic analysis. In follow-up posts, I will collect the data over time and take a deeper look at trends.
In sheet 3 you will find a few graphs based on this data. It is important to note that the data in sheet 1 is formatted as text because Excel is keying off of the formula. This makes analysis difficult on that tab itself. I copied and pasted the values to sheet 3 (Copy>Paste Special>Values) and then converted the text to numbers.
I wanted to compare the number of followers with the average number of tweets per day. Twitter returns the date as (Day of week, Month, Day, Time Stamp and Year). In order to format the date so Excel can understand it you must pull the right information out of the ‘Date Created’ column:
Excel still can’t understand it because it is reading the formula instead of the value displayed in the cell. To get it to work, use Copy>Paste Special>Values to overwrite the formula, then go to Format Cells>Date and format as 04/23/2012. Create a new column called Avg. Tweets per Day. You now need to figure out how many days have passed since the account was created and divide the total number of tweets by that number of days.
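For comparison, here is how the same date cleanup and average might look in Python. Treat it as a sketch: I am assuming the created-at value looks like "Mon Apr 23 12:00:00 +0000 2012" (day of week, month, day, time, UTC offset, year), and the function names and sample numbers are invented for the example.

```python
from datetime import datetime, timezone

# Assumed layout of the created-at text: "Mon Apr 23 12:00:00 +0000 2012".
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_created_at(text):
    """Turn the created-at text into a timezone-aware datetime."""
    return datetime.strptime(text, CREATED_AT_FORMAT)


def avg_tweets_per_day(status_count, created_at, as_of):
    """Total tweets divided by whole days since the account was created."""
    days = (as_of - created_at).days
    # Treat accounts younger than one day as one day old to avoid dividing by zero.
    return status_count / max(days, 1)


# Invented sample: an account created 10 days before the as-of date.
created = parse_created_at("Mon Apr 23 12:00:00 +0000 2012")
as_of = datetime(2012, 5, 3, 12, 0, 0, tzinfo=timezone.utc)
print(avg_tweets_per_day(250, created, as_of))
```

Passing the as-of date in explicitly, rather than always using today, makes it easier to recompute the average for a snapshot you collected earlier.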
Next, you can select the data in the Followers and Avg. Tweets per Day columns to generate your graph. I won’t go over the details on chart formatting here, but here is the end result: