Like it or not, there’s no getting away from the fact that SEO is at least in part a technical pursuit. No matter how successful or popular your content is, if you haven’t built on top of solid technical best practices, Googlebot will hiccup and cough its way around your site like an asthmatic chimney sweep. There are lots of things developers should know when building a successful site, from a firm understanding of how Google works out what a page is about, to more complicated aspects such as status codes, redirects and site structure. This is the absolute minimum that a developer should know and understand when building a site that will perform well in Google.
Rule 1 of web development is understanding which type of HTTP request to use and when. Think Vitamin have done a pretty good job of explaining the RFC specification. When it comes to web crawlers, however, it is important to note that robots such as Googlebot typically won’t perform POST requests. That makes sense, because POST requests are not ‘safe’ in the HTTP sense: they can change state on the server. Many developers fail to realise that Google won’t type random queries into your search form to find the information in your database.
If, for example, you run some kind of search on your site, or rely on your users submitting a form to find information, then you need to make your results pages accessible elsewhere on your site, either through a sitemap or through a good internal linking structure. More on sitemaps later in this article.
Content discovery, and getting your content into Google before others, is an important part of search strategy. There are two advantages to this: firstly, the fastest result gets the traffic, particularly on breaking news, and secondly, you are more likely to rank higher if Google can determine you are the original author. One of the ways they determine this is by looking at what time a particular piece of content hits the web. First published = typically the original source, and the URL which deserves the traffic. See this video from Google.
From a technical perspective there are a couple of things you can do to increase the likelihood of Googlebot finding your content first. RSS feeds are a real-time way for Google to grab fresh content without the technical overhead of a full crawl, and Google definitely uses the tag below to find the location of your feed if you have one. Even a partial feed can help.
<link rel="alternate" type="application/rss+xml" href="rss.php" />
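If you want to check that your templates really do advertise a feed this way, a rough sketch like the one below may help. It uses only the Python standard library. The names `FeedLinkFinder` and `find_feeds` are my own, and the example.com URL in the usage note is a placeholder, so treat it as an illustration rather than a description of how Google discovers feeds.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FeedLinkFinder(HTMLParser):
    """Collects href values of <link rel="alternate"> tags that advertise RSS or Atom feeds."""

    FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        mime = (attr.get("type") or "").lower()
        href = attr.get("href")
        if "alternate" in rel and mime in self.FEED_TYPES and href:
            # Resolve relative hrefs such as "rss.php" against the page URL.
            self.feeds.append(urljoin(self.base_url, href))


def find_feeds(html, base_url):
    """Return the absolute URLs of all feeds advertised in the given HTML."""
    finder = FeedLinkFinder(base_url)
    finder.feed(html)
    return finder.feeds
```

Calling `find_feeds(page_html, "http://example.com/blog/")` on a page containing the tag above should return the feed location as an absolute URL. An empty list suggests the tag is missing from that template.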
Another way to get your content indexed faster in Google, in particular for blogs, is the XML-RPC ping mechanism used by blog update services. The Google Blog Search RPC server (http://blogsearch.google.com/ping/RPC2) can be pinged via an XML-RPC POST to tell it when you’ve updated content. Social web publishing (e.g. tweets and shares) and using third party websites to announce your content to the world can all help speed up the discovery of your content. In some cases, utilising third party APIs and automating some of this work can be a clever move.
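As a rough idea of what such a ping looks like on the wire, here is a minimal sketch using Python’s built-in `xmlrpc.client`. It assumes the endpoint accepts the conventional `weblogUpdates.ping` method with a site name and a site URL. That method name is my assumption and isn’t spelled out above, so check the service’s own documentation before relying on it.

```python
import xmlrpc.client

# Endpoint quoted in the article; whether it accepts this method is an assumption.
PING_ENDPOINT = "http://blogsearch.google.com/ping/RPC2"


def build_ping_payload(site_name, site_url):
    """Return the XML body of a weblogUpdates.ping call without sending it."""
    return xmlrpc.client.dumps((site_name, site_url), methodname="weblogUpdates.ping")


def send_ping(site_name, site_url, endpoint=PING_ENDPOINT):
    """Send the ping. This makes a network request, so call it only after publishing."""
    server = xmlrpc.client.ServerProxy(endpoint)
    return server.weblogUpdates.ping(site_name, site_url)
```

Building the payload separately from sending it makes it easy to inspect the XML your publishing hook would post, without touching the network.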
Sitemaps are available for a variety of content types. Google have publicly stated that you should use both HTML and XML sitemaps if you can, as they are good for users and search engines.
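For the XML flavour, a bare-bones generator can be sketched in a few lines of Python. This version emits only the `<loc>` element for each URL, under the sitemaps.org namespace. The function name `build_sitemap` is my own, and a real site would probably feed it from a database query rather than a hard-coded list.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return sitemap XML (as a string) listing the given absolute URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        # ElementTree escapes characters such as "&" in the URL for us.
        ET.SubElement(entry, "loc").text = url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body
```

The same list of URLs could just as easily drive an HTML sitemap page, which keeps the two in sync.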
Links are a pretty essential part of how search engines determine ranking. To massively simplify, a large part of the Google algorithm ranks the pages which have the most relevant links from third party websites. If your chosen webpage is trying to rank for, say, ‘amazing motorbikes’, and you receive thousands of links from motorbike enthusiast websites around the web (or, better still, from large sites such as Harley Davidson or Yamaha) with the link text ‘amazing motorbikes’, chances are you’ll feature high up (if not first) in the results for that term.
The important thing is that the same applies to internal links: your website navigation, and the text you use to link between pages, are massively important.
Link Text and Structure
A common mistake I’ve seen developers make is linking through to articles via an image without providing any link text or alt text. Consider the following:
1) The wrong approach – no additional text relevant to the article passed ‘through’ the <a> tag
<div class="article"> <a href="article.htm"><img src="article-image.jpg" /></a> <span>Amazing Motorbike 1</span> </div>
2) The better approach – still no link text, but the alt text on your image will pass some keyword relevance through to article.htm.
<div> <a href="article.htm"><img src="article-image.jpg" alt="Amazing Motorbike 1" /></a> <span>Amazing Motorbike 1</span> </div>
3) The best approach – using additional attributes in your HTML to provide value to users and search engines, and moving the span inside the link, which makes it clickable and passes exact context through to the linked page.
<div> <a href="article.htm" title="Article on amazing motorbikes"> <img src="article-image.jpg" alt="Description of the motorbike, red stripes with black spots etc." title="Amazing Motorbike Picture" /> <span>Amazing Motorbike 1</span> </a> </div>
There is another important thing developers really need to know about links. If you have two links on a page pointing to the same destination, the link text will be taken from the first link and the second one ignored (the first link being the first one found in the document, top to bottom).
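To spot image-only links and duplicate links in your own templates, something like the following sketch could be run over rendered pages. It records the text (including image alt text) of the first link to each destination, following the first-link rule as described above. How a real crawler weighs these signals is not something this code can tell you, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class AnchorTextAudit(HTMLParser):
    """Records, for each href, the readable text of the first link pointing to it."""

    def __init__(self):
        super().__init__()
        self.first_text = {}  # href -> text of the first link pointing there
        self._href = None     # href of the <a> currently open, if any
        self._parts = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "a" and attr.get("href"):
            self._href, self._parts = attr["href"], []
        elif tag == "img" and self._href is not None and attr.get("alt"):
            # Alt text stands in for link text when the link wraps an image.
            self._parts.append(attr["alt"])

    def handle_data(self, data):
        if self._href is not None and data.strip():
            self._parts.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Keep only the first link seen for each destination.
            self.first_text.setdefault(self._href, " ".join(self._parts))
            self._href = None


def audit_links(html):
    """Return a dict mapping each href to the text of the first link to it."""
    parser = AnchorTextAudit()
    parser.feed(html)
    return parser.first_text
```

Any destination that maps to an empty string is being linked first by an image with no alt text, which is the ‘wrong approach’ from the examples above.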
Links which are nofollowed carry no link juice. In other words, they don’t count towards the linked page’s PageRank / Google ranking. You should use nofollow when linking to unknown websites (for example, if you allow your users to link to third party websites and you don’t trust their judgement). Another reason would be when you use your affiliate code to link to a website, e.g.
<a href="http://websitesellinggoods.com/?affiliate=me" rel="nofollow">Buy me</a>
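If your users can post links, one way to apply this consistently is to nofollow everything except hosts you explicitly trust. The helper below is a minimal sketch of that idea. The function name and the notion of a `trusted_hosts` set are my own assumptions, not part of any particular framework.

```python
from html import escape
from urllib.parse import urlparse


def render_link(url, text, trusted_hosts=frozenset()):
    """Return an <a> tag, adding rel="nofollow" unless the host is explicitly trusted."""
    host = (urlparse(url).hostname or "").lower()
    rel = "" if host in trusted_hosts else ' rel="nofollow"'
    # Escape both the URL and the text so user input cannot break out of the tag.
    return f'<a href="{escape(url, quote=True)}"{rel}>{escape(text)}</a>'
```

Defaulting to nofollow means a forgotten entry in the trusted list costs you a little link juice rather than vouching for a site you have never seen.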
There are a number of structural HTML elements that are noteworthy for SEO. Best practice? Put keywords in your heading tags and use them to separate the article into sections, starting with the most important keyword and moving on to longer tail ones. For example:
<h1>150 Amazing Motorbikes from around the Web</h1>
<h2>Yamaha R1 Motorbike</h2>
<h3>Selection of Yamaha R1 Pictures</h3>
Proper heading tags help define relevance for organic keywords in the search engines, and a range of headings segments the page so that engines can determine its ‘sections’ and the main topic being discussed.
Title tags shouldn’t be forgotten about. They are another way to include your page’s keywords, and should be written for humans to increase click through.
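A quick way to review how a page is segmented is to pull out its heading outline and read it on its own. The sketch below does just that with the standard library. It only lists the headings, and judging whether the keywords are well chosen is still down to you. The names are hypothetical.

```python
from html.parser import HTMLParser


class HeadingOutline(HTMLParser):
    """Collects (level, text) pairs for h1-h6 headings in document order."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._level = None  # level of the heading currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self._level, self._text = int(tag[1]), []

    def handle_data(self, data):
        if self._level is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            self.headings.append((self._level, "".join(self._text).strip()))
            self._level = None


def heading_outline(html):
    """Return the page's headings as a list of (level, text) tuples."""
    parser = HeadingOutline()
    parser.feed(html)
    return parser.headings
```

Run against the three headings above, it returns the levels 1, 2 and 3 with their text, which makes it easy to see at a glance whether the important keyword leads and the longer tail ones follow.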
If you’ve an image based site with no idea how to generate organic traffic to it – SEO for images is worth an article on its own.
A more recent development from Google is the announcement surrounding their page layout algorithm, which basically means you should take ad position into account within your site design. If you’ve a shed load of ads above the fold (yes, A/B-testing-shy designers, that so-called imaginary fold that doesn’t exist and doesn’t affect ad click-through rates) and most of your content beneath it, you could face an algorithmic penalty.
When Google finds two identical pages on a site, it typically discards one of them and decides which one should make its way into the index. When users find multiple pages with the same content, there is no way you can predict which one they will share across the social web, or which one they will link to.
This splits link equity between pages with similar but not identical URLs. Sometimes the consequence is that pages which people have linked to, and which should be passing link juice from around the web, aren’t even in the search index in the first place.
A much better scenario is to have one URL that people always share or link to. If this isn’t technically possible then you can consolidate all of that link equity with the rel=canonical tag. Placed on a duplicate, it basically says ‘Hey Google, the URL in this tag is the master page; any links that I get should really be attributed to it’. Again, Matt Cutts explains things in a video, and there’s a comprehensive Google support article on the topic. Finally, duplicate content can also affect your crawl budget, which is discussed later in this article.
Duplicate content often occurs on sites which pass multiple parameters through to the same page, modifying the outcome slightly but keeping the main content the same. An example I often use is an e-commerce store with multiple sizes available but the same product description. See this previous article under ‘Use canonical tag where appropriate’.
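For the parameter case, the canonical URL can often be computed rather than hand-written. Here is a small sketch that drops parameters which only change presentation and then emits the tag. The parameter names in `IGNORED_PARAMS` are made up for illustration, so substitute whatever your own application uses.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameters that change presentation but not the main content.
IGNORED_PARAMS = frozenset({"size", "colour", "sort"})


def canonical_url(url, ignored=IGNORED_PARAMS):
    """Drop ignored query parameters and the fragment so variants collapse to one URL."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in ignored
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def canonical_tag(url):
    """Return the <link rel="canonical"> element to place in the page's <head>."""
    return f'<link rel="canonical" href="{escape(canonical_url(url), quote=True)}" />'
```

Every size variant of a product page would then carry the same tag, pointing at the one URL you want people’s links credited to.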
Many web applications make use of paging to navigate through search results. Often this will take the form of a list of links numerically ordered, or next and previous buttons.
Sometimes, you will accidentally create a duplicate content scenario as a result. For example:
May be the exact same page. Again, rel=canonical or a 301 redirect on the latter can help solve this problem. Google suggest that developers use other rel values in their markup, particularly for next and previous navigational elements. You can read more about that over here.
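As an illustration of how those rel values might be generated, the sketch below builds the `<link>` tags for one page of a paginated listing. It assumes a hypothetical URL scheme where page 1 lives at the bare URL and later pages use a `?page=N` parameter. Your own paging URLs will differ, so treat the scheme as a placeholder.

```python
def pagination_links(base_url, page, last_page):
    """Return <link> tags for one page of a paginated listing; pages start at 1."""

    def url_for(n):
        # Page 1 lives at the bare URL so "?page=1" never becomes a second address for it.
        return base_url if n == 1 else f"{base_url}?page={n}"

    tags = [f'<link rel="canonical" href="{url_for(page)}" />']
    if page > 1:
        tags.append(f'<link rel="prev" href="{url_for(page - 1)}" />')
    if page < last_page:
        tags.append(f'<link rel="next" href="{url_for(page + 1)}" />')
    return tags
```

Routing every page number through one `url_for` helper is the design choice that matters here: the navigation links, the canonical tag and any redirects all agree on a single address per page.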
Having keywords in your URLs and ensuring that they don’t change is an important step to signal to Google what a page is all about. Friendly URLs are easy for users to remember, and relevant to your content topic. You should take care to separate words in your URLs with dashes rather than underscores.
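Generating the keyword part of the URL from the page title is usually a one-function job. This is a minimal sketch of a ‘slugify’ helper that lowercases the title and separates words with dashes. It simply drops non-ASCII characters, which is a simplification you may not want for non-English content.

```python
import re
import unicodedata


def slugify(title):
    """Turn a title into a lowercase, dash-separated URL segment."""
    # Decompose accented characters, then drop anything outside ASCII.
    ascii_text = unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode("ascii")
    # Collapse each run of spaces, underscores or punctuation into a single dash.
    return re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")
```

Store the slug when the page is first published rather than recomputing it on every request, so that editing the title later doesn’t silently change the URL.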
There are a number of URL rewriting solutions out there for developers who want to focus more on SEO, including mod_rewrite for Apache, .NET rewriting and mod_perl. Frameworks based around MVC principles typically have URL routing built in out of the box, and it’s just a matter of configuring it.
Status Codes and Redirects
If you are developing your own rewrite engine, it’s important to understand the different types of redirect and status codes that you can set programmatically. Not all are equal, and setting response header status codes is a technical pursuit that will make a massive difference to your SEO.
200 – OK – The request has succeeded. Everything is fine, and the document will be returned to visitor or crawler.
404 – Not Found – Typically when Google finds this status code being sent back from a page, it removes that page from its index. It’s entirely possible to get things hugely wrong and return a 404 when you should be returning a 200 OK, resulting in no search traffic.
410 – Gone – Google treat 410 gone as a more permanent removal of a URL – so if you have a cast iron guarantee it will never return then it may be more appropriate than a 404. This may affect the cached copy of a page Google holds for you.
301 – Moved Permanently – a permanent redirect will discard the original URL in the index, and replace it with the destination. A 301 can indicate which URL you want to treat as the preferred canonical URL. 301 redirects don’t pass all PageRank, so you can expect a slight dip in rankings for a page which has been redirected.
302 – Moved Temporarily – a temporary redirect will bring the user to the correct destination, but should the search engines determine that the original URL is preferred, this is the one which will be kept in the index. Matt Cutts explains in depth here. Microsoft devs, take note: 302 is what Response.Redirect typically returns.
500 – Internal Server Error – a 500 status code being returned indicates an error or problem on the server. You should log these when they occur, especially as bots can sometimes do strange things with your application, or not pass variables between pages as you would expect. Knowing when this occurs, and seeing that the user agent was Googlebot, can be a proactive way to avoid needing an SEO review or consultancy.
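To make the differences concrete, here is a framework-agnostic sketch of the decision a rewrite engine has to make for each request path. The URL tables are entirely hypothetical, and a real application would load them from configuration or a database. The point is only that each situation maps to a deliberate status code rather than a blanket 200 or 404.

```python
from http import HTTPStatus

# Hypothetical URL tables, for illustration only.
PAGES = {"/amazing-motorbikes": "<h1>Amazing Motorbikes</h1>"}
MOVED_PERMANENTLY = {"/motorbikes.php": "/amazing-motorbikes"}
MOVED_TEMPORARILY = {"/sale": "/amazing-motorbikes"}
GONE = {"/discontinued-model"}


def respond(path):
    """Return a (status, headers, body) tuple for a request path."""
    if path in PAGES:
        return HTTPStatus.OK, {"Content-Type": "text/html"}, PAGES[path]
    if path in MOVED_PERMANENTLY:
        # Old URL replaced for good: point crawlers at the new address.
        return HTTPStatus.MOVED_PERMANENTLY, {"Location": MOVED_PERMANENTLY[path]}, ""
    if path in MOVED_TEMPORARILY:
        # Short-lived detour: the original URL is expected to come back.
        return HTTPStatus.FOUND, {"Location": MOVED_TEMPORARILY[path]}, ""
    if path in GONE:
        return HTTPStatus.GONE, {}, ""
    return HTTPStatus.NOT_FOUND, {}, ""
```

Keeping the decision in one pure function like this also makes it easy to unit test, which is a cheap way of catching the ‘404 when it should be 200’ mistake before Googlebot does.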
When crawling, Google visit websites in descending PageRank order. This means that if your site has a high PageRank, chances are it will be visited and crawled before other, smaller sites on the web. That’s the first thing you need to know. The second is that the number of pages that Google crawls is roughly proportional to your PageRank. It is much easier to have all of your pages fully indexed if you’ve got a site which has lots of external links pointing at it, and thus a higher PageRank. Site architecture helps, but it’s not the only thing that determines whether you get all your pages into Google.
Having duplicate content on your page can affect how many other relevant pages are indexed according to this Eric Enge interview with Matt Cutts – which is a great read for anyone interested in SEO.
Your crawl quota can also suffer when Google gets confused and wastes crawl budget on pages which don’t matter. This SEOMoz article is a fantastic resource for analysing your server logs, and for deeply understanding from a technical perspective what Googlebot is crawling and how it is moving around your site.
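If you’d like a starting point for that kind of log analysis, the sketch below counts Googlebot requests per path and status code. It assumes your server writes the common ‘combined’ access log format and simply trusts the user agent string, which anyone can fake. Both are simplifying assumptions, and the names are my own.

```python
import re
from collections import Counter

# Matches the request, status and user agent fields of a "combined" format log line.
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_hits(lines):
    """Count requests per (path, status) whose user agent mentions Googlebot."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and "googlebot" in match["agent"].lower():
            hits[(match["path"], int(match["status"]))] += 1
    return hits
```

Sorting the result with `hits.most_common()` shows where the crawl budget is actually going, and filtering for statuses of 500 and above surfaces the errors only Googlebot is triggering.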
As it’s a bit of a read, I’ve also created this companion slideshare, which summarises the details found in this article. Feel free to share and distribute with your own dev teams.