Tuesday, November 8, 2011

Digging deeper with the Twitter API: iPhone 4S vs. Galaxy Nexus

We found some interesting data when we dug below the surface of the iPhone 4S vs. Galaxy Nexus debate using the Twitter Search API.

In today’s world there is a vast quantity of data available online that can be used for research, market analysis, and competitive intelligence. While “Big Data” can be a problem for those who produce it, store it, and compile it, it is highly beneficial for those of us who are looking for answers.

Some of that data is fortunately available to be queried online, and, in particular, there is a vast quantity of data on social media interactions out there.

TweetsQueryingSearchAPI

In this article we will explore how to use the Twitter Search API from MapForce, Altova’s data mapping/conversion/integration tool, to aggregate data on recent user submissions (“tweets”) on two highly popular topics – the Apple “iPhone 4S” vs. the “Galaxy Nexus” as the latest hot Android phone – and extract some statistical data about the users engaged in those discussions.

One of the benefits of this abundance of data available to us today is that we can query it in interesting ways and extract new meaning from it. While there are undoubtedly many existing services that already provide trends over Twitter topics (e.g., Trendistic), those services only offer very simple trends and do not allow us to query any deeper.

But all of the underlying data is available for grabs if you are just willing to learn a tiny bit about web service APIs and how to use them to extract XML data for further processing. As a starting point, let's use the Twitter Search API to query the stream of recent tweets for the last 100 postings that are about the “Galaxy Nexus”. The Usage Guidelines for Twitter Search tell us that using both words in a query will result in the use of the default operator, which is AND, so we are going to search for posts that contain “Galaxy AND Nexus”. So let’s try that and request the most recent 100 items:

http://search.twitter.com/search.atom?q=galaxy+nexus&rpp=100

If you follow this link, you will get a second window with a lot of raw XML data that is formatted according to the Atom Syndication Format specifications. Alternatively, you could request the data in JSON format, if you wanted to directly process it via JavaScript code by hand, but we will use the XML-based Atom format so that we can easily analyze the data and extract the information we want.

Viewing the above search result in a browser is not very user-friendly, so we can take a quick peek at the XML data in our favorite XML Editor using the Open from URL function:

TweetsAtomGrid

As you can see, the data for each entry includes a language code, so for this example we will extract data from this Twitter feed as well as from a second search result on the “iPhone 4S” and combine them into one intermediate XML file for further analysis.

Extracting XML data is really easy in MapForce: using the “Insert XML File” option to drop in an XML source, we can again specify the same URL as before. If needed, MapForce will automatically create an XML Schema for the supplied data so we can visualize it and extract information from it:

TweetAtomMapping

In our mapping we have dropped in two sources on the left side – one using a query string to search for “Galaxy Nexus” and the other to search for “iPhone 4S” – and on the right side we have dropped in a simple XML Schema that will allow us to aggregate our data and analyze it more conveniently going forward. In this case the mapping between the two sides is straight-forward as we are only extracting basic information about the user, the date, and the language of the tweet, but in other applications the mapping could be more complicated and include functions as well as queries to other data sources, databases, or web services…

Previewing the resulting XML data can be done directly inside MapForce using the output tab, and this is what we see as a result of our data transformation:

TweetsRawData

Now we can easily use the reporting capabilities of StyleVision to group this data by language within each topic and count the number of posts in each language. We can then report this data in the form of pie charts, which produces the following interesting results:

TweetsByLanguage

Obviously, this data is highly dependent on the date of execution and time of day, as well as the particular announcements happening about these products, so the numbers will fluctuate quite a bit, but it can be used as a nice monitoring for seeing different language-specific trends. And once this has been set up, the report can be refreshed easily with the click of a button to get a snapshot at that point in time. For more long-term analysis it would of course be necessary to modify the mapping a bit to query more than 100 recent tweets.

In this article we have used Twitter’s Search API as one example data source and only looked at language as one unique data point, but there are many more interesting sources of data available online today, and this approach can be used on all of them in a similar fashion.

If you want to experiment with other data sources and other kinds of information that you want to extract, we invite you to try for yourself. A free 30-day evaluation version of MapForce is available, and there are no limits on how you can use the other features of Altova’s data mapping and conversion tool for data processing tasks that go beyond analyzing social media trends…

No comments: