Hey Google where are my real time protocols?
This video, published by Google Webmaster Help on YouTube, raises an interesting question: how does Google know who owns a piece of content, and which copy is the canonical source? It’s a problem that many content publishers suffer from. You publish a new, unique piece of content, and the people who syndicate you end up getting a lump of the traffic and revenue that you’ve worked so hard to generate.
Google are getting better at working this out, and Matt Cutts explains how you can get Googlebot to index your copy first. Strategies such as delaying your feed and using the canonical tag can help Google rank the owner of the content first. In the video, he also mentions the use of “pings” to alert Googlebot to come and get it.
Which got me thinking. Why not just let us provide it direct to them?
Currently, Googlebot has to go out, crawl the web and come back. It is a slow and unwieldy process, and it has seen Google suffer at the hands of Twitter and Facebook, who are getting the information fed to them daily. Google is pulling the information, rather than asking webmasters to push it. That push model lies at the heart of real-time, and is the driving force behind the social efforts of Google+: to build a network that, as a byproduct of activity, provides real-time insight into what is happening on the web.
With the development of PubSubHubbub some time ago, I thought that somewhere along the way we were going to get our wishes granted. Feed my site information in real time to Google? Yeah, I’d bite at that. A real-time sitemap protocol would solve so many issues for both parties. Firstly, you could get the information to Google so there would be no confusion over who owns it, and secondly it would appear in the index much faster, and help to power real-time search. Who wouldn’t want elements of their site that change rapidly to be a part of that service? Even with the adoption of RSS, there doesn’t seem to be a concerted effort aimed at webmasters who publish throughout the day, particularly if you fall outside the “news” or “blog” remit.
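To make the push idea a little more concrete, here is a rough sketch of the “publish” ping that PubSubHubbub describes: the publisher sends a form-encoded POST to a hub naming the feed that has changed, and the hub then fetches that feed and passes the update on to its subscribers. This is only an illustration of the protocol, not something Google offers as a real-time sitemap. The hub address, the feed URL and the function names are placeholders of my own, and you would need to point it at a hub your feed actually declares.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_publish_ping(hub_url: str, feed_url: str) -> Request:
    """Build (but do not send) a PubSubHubbub-style publish notification.

    The form fields hub.mode and hub.url come from the PubSubHubbub spec;
    hub_url and feed_url are whatever the publisher supplies.
    """
    body = urlencode({"hub.mode": "publish", "hub.url": feed_url}).encode("ascii")
    return Request(
        hub_url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def send_publish_ping(hub_url: str, feed_url: str, timeout: float = 10.0) -> int:
    """Send the ping and return the HTTP status code reported by the hub."""
    with urlopen(build_publish_ping(hub_url, feed_url), timeout=timeout) as response:
        return response.status


# Hypothetical URLs, used only to show the shape of the request.
ping = build_publish_ping("https://hub.example.com/", "https://example.com/feed.xml")
print(ping.get_method(), ping.full_url)
print(ping.data.decode("ascii"))
```

The request is built separately from the sending step so you can inspect exactly what would go over the wire before pointing it at a live hub. A publisher would call `send_publish_ping` from whatever hook fires when new content goes live.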
Having dropped Twitter from real-time search, Google are betting that with increased usage of Google+, and the roll out of social features across multiple services, they will soon have enough data to power real-time once again, without paying the hefty licensing fee to Twitter. Will brands and webmasters be expected to implement Google+ pages to play ball in the real-time world?
Maybe. Maybe not.
My guess is that the forthcoming Google+ API is going to provide some very interesting technical concepts to facilitate real-time search for the web, without sole reliance on the manual use of Google+. A way for publishers to get information even quicker (via some sort of conceptual real-time sitemap / private RSS feed) into Google’s real-time index might well be one of them.
There are a ton of verticals that Google have historically powered with user-provided feeds: Google News, Google Base, Google Code Search and more. Taking that concept and applying it to real-time search as well could provide interesting results, and give publishers additional control over the information that appears in the real-time index when it gets up and running again.
