Having taken a peek at Google’s robots.txt file, I uncovered a pretty interesting URL. (It’s a biggish download, if you are wondering why your browser is struggling.)
Turns out the entire list of Google profiles that Google has on its books (which form the basis of Google +1) is publicly available.
Just out of curiosity I thought I’d do some basic analysis of just how many profiles they are exposing back to Googlebot in this way – if for no other reason than to see how far off Facebook’s numbers they are. There’s no guarantee that this represents the entire dataset of Google profiles, but with Google generally choosing to do this sort of thing by the book, and indeed encouraging webmasters to utilise the Sitemap format to make their own lives easier, my guess is that it may well be the full selection of URLs.
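For anyone wanting to poke at a robots.txt file in the same way, here is a minimal sketch of pulling out the `Sitemap:` entries. It is in Python using the standard library's `urllib.robotparser` (the `site_maps()` method needs Python 3.8 or later). The sample file contents and the example.com URLs are made up for illustration and are not Google's actual entries.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Sitemap: https://example.com/profiles-sitemap.xml
Sitemap: https://example.com/other-sitemap.xml
"""


def sitemap_urls(robots_txt):
    """Return the Sitemap URLs declared in a robots.txt document."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap lines are present.
    return parser.site_maps() or []


if __name__ == "__main__":
    for url in sitemap_urls(SAMPLE_ROBOTS_TXT):
        print(url)
```

In practice you would download the robots.txt text first and pass it in; the parsing step stays the same.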
I fired up MySQL, created a couple of tables to hold the data, and managed to find a great little Java library that parses sitemaps. I wrote a very quick and dirty program to grab all of the URLs and add them to the database. At the time of writing it looks like Google have 35 million (35,437,869 to be exact) profiles in their database. That is a small slice (about 7%) of Facebook’s reported 500 million profiles.
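The grab-and-store step can be sketched roughly as follows. This is not the original program: it uses Python rather than Java, the standard library's `xml.etree.ElementTree` instead of a dedicated sitemap library, and an in-memory SQLite database standing in for MySQL. The table name `profile_urls` and the sample URLs are assumptions made up for the example.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

# Hypothetical sitemap contents, for illustration only.
SAMPLE_URLSET = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/profiles/alice</loc></url>
  <url><loc>https://example.com/profiles/bob</loc></url>
</urlset>
"""


def extract_locs(xml_text):
    """Return (kind, locations) for a sitemap document.

    kind is "sitemapindex" when the document lists child sitemaps,
    or "urlset" when it lists page URLs.
    """
    root = ET.fromstring(xml_text)
    kind = root.tag.replace(SITEMAP_NS, "")
    locs = [el.text.strip() for el in root.iter(SITEMAP_NS + "loc") if el.text]
    return kind, locs


def store_urls(conn, urls):
    """Insert URLs into the (hypothetical) profile_urls table, skipping duplicates."""
    conn.execute("CREATE TABLE IF NOT EXISTS profile_urls (url TEXT PRIMARY KEY)")
    conn.executemany(
        "INSERT OR IGNORE INTO profile_urls (url) VALUES (?)",
        [(u,) for u in urls],
    )
    conn.commit()


def count_urls(conn):
    """Return the number of distinct URLs stored so far."""
    return conn.execute("SELECT COUNT(*) FROM profile_urls").fetchone()[0]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    kind, locs = extract_locs(SAMPLE_URLSET)
    store_urls(conn, locs)
    print(kind, count_urls(conn))
```

For a real sitemap index you would call `extract_locs` on the index first, then fetch each child sitemap it lists and feed the resulting URLs to `store_urls`. The primary key on `url` means re-running the import does not inflate the final count.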
I may at some stage grab some of the data off these profiles (name, geo-coordinates) to create a basic location-based people search engine for Google profiles, something Google clearly have the capability of doing themselves, but haven’t managed to implement properly or release as yet. You can see how useless their current profile search results are from the link; I couldn’t even find myself, let alone others in my social circle. Let me know if you’d find something like this useful, or if you’d be interested in a dump of the SQL file.