I managed to get my hands on an alpha preview of Datasift – the tool that I’ve been raving about since I first saw a preview of the service. Essentially, it’s like a Yahoo Pipes for social media, allowing you to see a plethora of data as it happens in real time on the web, and to curate, organise and consume that content more easily. From my initial play around with the service I’ve been gradually getting to grips with some of the things that it can do, and I am really impressed with the Twitter feature set alone. The power of this web application warrants a full review – these guys are going to be a big deal for programmers, as the layer that they’ve provided takes information extraction from the real-time web to a whole different level.
Who’s it for
Anyone who has an interest in mining the firehose of information that is flowing across the web in real time. Twitter Search doesn’t cut it. Facebook search doesn’t cut it. Google simply doesn’t drill down far enough, or provide enough filtering of the information to make it really useful. That’s where Datasift comes in. Essentially it’s an API giving access to all that’s going on right now on the web. Say hello to a whole new social API.
Streaming my life away
Datasift works around the concept of streams: flows of data which can be consumed via their API. Two ways are provided to access the information which you’ve extracted – standard HTTP streaming, and WebSocket streaming. It’s great to see them implementing WebSockets – a technology which without a doubt will become more important.
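To give a feel for what consuming a stream looks like on the client side, here is a minimal Python sketch. It assumes, purely for illustration, that each item arrives as one JSON object per line; I haven't confirmed the actual wire format, and the field names ("source", "content") and the sample data are made up. No real Datasift endpoint is called: an in-memory buffer stands in for the HTTP response.

```python
import io
import json
from typing import Iterable, Iterator


def read_interactions(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed item per non-empty line of a streamed response."""
    for line in lines:
        line = line.strip()
        if not line:
            # Long-lived streams often send blank keep-alive lines; skip them.
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            # A malformed or partial line should not kill the whole stream.
            continue


# Stand-in for a real HTTP response body (hypothetical sample data).
fake_stream = io.StringIO(
    '{"source": "twitter", "content": "Loving the new release"}\n'
    '\n'
    'not json\n'
    '{"source": "digg", "content": "Top story of the day"}\n'
)

items = list(read_interactions(fake_stream))
for item in items:
    print(item["source"], "->", item["content"])
```

Because the function only needs something it can iterate over line by line, the same loop would work over a real streaming response or a file of captured data.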
Authentication for the service follows Twitter’s original programming model. No authentication is needed, but rate limits for consumption are opened up when you do provide relevant credentials. I imagine this is probably to facilitate more rapid growth of the service, and that somewhere down the line it will be discontinued.
Datasift have created their own programming language (Filtered Stream Definition Language – or FSDL for short) to extract data from their database, which is easy for anyone with a grasp of basic logic to use. Logical ANDs, NOTs and ORs, and even regular expressions, can help you combine rules to bring back the information which makes up your ‘stream’. The basic building blocks of a query are provided to you, in much the same way as Yahoo Pipes works. The following is a quick overview of the way information is organised behind the scenes.
Author: A datasift user, along with associated properties of that user.
Interaction: Data values common to all of the below sources.
Twitter: Mines the data stream from within Twitter. A number of data properties exist (see above image for a quick overview)
Link: Performs actions on links from within all the sources
Sixapart: Mines the data stream from within SixApart (LiveJournal entries and blogging information)
Feed: Allows you to access feed information which already exists within Datasift, and use it in a new query.
Buzz: Provides information from within Google Buzz.
MySpace: Mines the data stream from within MySpace.
WordPress: Mines the data stream from within the WordPress network of blogs
Digg: Mines the data stream from within the Digg network.
Salience: This is where it gets interesting. Datasift have provided a sentiment analysis layer on top of their search API. This means you can determine whether a particular piece of information is negative or positive, making it easier to decide whether the information extracted needs to be acted on. If, for example, you were using the service as part of your reputation management solution, your processes could be much more automated.
TweetMeme: As TweetMeme is one of the services which Datasift originally created, it stands to reason that it made its way into the data services of the product. As you would expect, queries can be performed on the number of tweets which a particular URL has received, along with a snippet of data from that URL. This source will undoubtedly make things even easier when trying to mine data for popular pages.
Peerindex / Klout / Infochimps: Peerindex, Klout and Infochimps all provide data on a particular user’s influence score within social networks. This allows you to return data filtered by the social influence of users, again resulting in popular material.
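To show how rules built from ANDs, ORs, NOTs and regular expressions can sit alongside fields like sentiment and influence, here is a small sketch in plain Python. It is not FSDL syntax and not Datasift’s implementation; the helper names, field names ("content", "influence", "sentiment") and sample data are all hypothetical and only illustrate the logic.

```python
import re
from typing import Callable

Rule = Callable[[dict], bool]


def contains(field: str, word: str) -> Rule:
    """Case-insensitive substring test on one field."""
    return lambda item: word.lower() in str(item.get(field, "")).lower()


def matches(field: str, pattern: str) -> Rule:
    """Regular-expression test on one field."""
    regex = re.compile(pattern, re.IGNORECASE)
    return lambda item: regex.search(str(item.get(field, ""))) is not None


def at_least(field: str, minimum: float) -> Rule:
    """Numeric threshold, e.g. a minimum influence score."""
    return lambda item: item.get(field, 0) >= minimum


def all_of(*rules: Rule) -> Rule:  # logical AND
    return lambda item: all(rule(item) for rule in rules)


def any_of(*rules: Rule) -> Rule:  # logical OR
    return lambda item: any(rule(item) for rule in rules)


def negate(rule: Rule) -> Rule:  # logical NOT
    return lambda item: not rule(item)


# (mentions "datasift" OR matches a regex) AND NOT spam AND influence >= 40
stream_rule = all_of(
    any_of(contains("content", "datasift"), matches("content", r"\bfire ?hose\b")),
    negate(contains("content", "spam")),
    at_least("influence", 40),
)

# Hypothetical sample items; the field names are assumptions.
sample = [
    {"content": "Trying out Datasift today", "influence": 55, "sentiment": 3},
    {"content": "Datasift spam spam spam", "influence": 80, "sentiment": -1},
    {"content": "Drinking from the fire hose", "influence": 42, "sentiment": -4},
    {"content": "Datasift looks neat", "influence": 10, "sentiment": 2},
]

kept = [item for item in sample if stream_rule(item)]
# Negative items are the ones a reputation-management process might flag.
needs_attention = [item for item in kept if item["sentiment"] < 0]

for item in kept:
    print(item["content"])
```

Here the first and third sample items pass the combined rule, and only the third would be flagged as needing attention because of its negative sentiment score.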
First Look. The interface
The interface within Datasift is easy to find your way around, and I was able to get up and running with a new stream quickly. A number of examples are available to get you on your way, and with more and more alpha testers joining every day, it is a trivial process to see the work that others have done, and branch streams to suit your own needs. I created a simple stream in about 15 minutes, after looking around the documentation for a bit. Each one of these streams can then be consumed however you wish via their API. The following screens show just what is going on at each stage of the process. You can see the editor for FSDL in screenshot four.
One small thing to note is that once you’ve created a particular definition for the query you’d like to run, it takes a while for Datasift to actually build it. Presumably a process is running in the background to collate the data you have requested, and return it to you in due course. That doesn’t, however, make the service any less useful or interesting. If anything, it forces you to make sure your query and request are logically correct before you kick off the build process.
What can you build?
I can see this being used to create bespoke social search engines which fire queries into Datasift, and return a wealth of data. In much the same way as Google Co-op works for ordinary queries, Datasift is likely to spawn interesting social engines, with its underlying engine being used as the data source. A PHP client library has already been provided on GitHub, for those of you who want to get your hands dirty with the API, and get building. For those who don’t, you can still see what’s going on inside the existing interface.
To date, a number of streams have been created that you can inherit from. The following is a selection of the streams available at the time of writing:
Job Search – Search for resumes
Premier League Scores – search for social mentions of football teams
Bargains, Deals and special offers – searches for special offers across the social web.
Some of my own thoughts on what is possible include the following:
GeoInfluencers – Finding influencers in a particular geo region on Twitter. This would be particularly useful to work out who to follow in your region.
Future predictors – Positive or negative salience around a topical subject – e.g. X Factor or an election candidate.
Popular URLs with Negative Sentiment – URLs recently hot on Digg or TweetMeme which have negative sentiment.
Who’s talking? – A collection of users in a particular area, tweeting a particular topic or URL.
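As a rough illustration of the popular-URLs-with-negative-sentiment idea above, here is a Python sketch of the post-processing you might do once the items have come back from a stream. The item shape ("url", "sentiment"), the threshold and the sample data are assumptions for the sake of the example, not anything Datasift prescribes.

```python
from collections import defaultdict


def negative_popular_urls(items, min_mentions=2):
    """Return (url, mentions, mean sentiment) for well-shared, negatively received URLs."""
    scores = defaultdict(list)
    for item in items:
        scores[item["url"]].append(item["sentiment"])

    results = []
    for url, values in scores.items():
        mean = sum(values) / len(values)
        # Keep URLs that are both mentioned often enough and negative on average.
        if len(values) >= min_mentions and mean < 0:
            results.append((url, len(values), mean))
    # Most-mentioned URLs first.
    return sorted(results, key=lambda row: row[1], reverse=True)


# Hypothetical mentions gathered from a stream.
mentions = [
    {"url": "http://example.com/a", "sentiment": -2},
    {"url": "http://example.com/a", "sentiment": -4},
    {"url": "http://example.com/a", "sentiment": 1},
    {"url": "http://example.com/b", "sentiment": 3},
    {"url": "http://example.com/b", "sentiment": 2},
    {"url": "http://example.com/c", "sentiment": -5},
]

report = negative_popular_urls(mentions)
for url, count, mean in report:
    print(f"{url}: {count} mentions, mean sentiment {mean:.2f}")
```

With this sample data only the first URL is reported: the second is positive on average, and the third is negative but mentioned just once.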
This is just scratching the surface of the possibilities. With full regular expression support, and the already substantial list of data sources, I’ve no doubt we’ll see some creativity with the service. All in all, it’s a great web-based software product that will undoubtedly grow, and become a useful tool to have in a developer’s kitbag – particularly if he or she is working with the social web, or performing any kind of data mining at present. Well done guys. I for one am excited about the possibilities.