OK, now that I’ve answered the question of canonical URLs, let’s get back to that pesky duplicate content issue. Hacking your .htaccess file solved one problem, but what if you have different URLs that all point to the same content? Something like this:
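For instance, hypothetical URLs like these (the example.com domain and page names are placeholders) could all serve the exact same page:

```text
http://www.example.com/about
http://www.example.com/index.php?page=about
http://www.example.com/recent-pages/about
```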
Let’s assume all of these pages have the same content on them, because they are, well, the same page. You only want Google and Yahoo to see one of them, so that their crawlers don’t have issues with duplicate content. You have two tools which can assist you (besides defining other more complex .htaccess file rules and 301 redirects) – the robots.txt file and the sitemap.xml file.
You can use the robots.txt file to tell search engines which URLs not to index, and you can use a sitemap.xml file to tell search engines which URLs are the most appropriate. For example, to fix the above problem with duplicate URL content, you can add the following to robots.txt:
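A minimal sketch, assuming the duplicate URL patterns from the example above (`/index.php?page` and `/recent-pages/`):

```text
User-agent: *
Disallow: /index.php?page
Disallow: /recent-pages/
```

The `User-agent: *` line applies the rules to all crawlers; you could also target Googlebot or Slurp individually with separate `User-agent` blocks.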
This will tell search engine crawlers that any URL which starts with /index.php?page or /recent-pages/ should not be indexed. This will block not only the duplicate URLs above, but every other URL matching those patterns. So this will also block:
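For instance, hypothetical URLs like these would also be excluded:

```text
http://www.example.com/index.php?page=contact
http://www.example.com/recent-pages/contact
http://www.example.com/recent-pages/archive/
```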
Next we want to define which page Googlebot and Slurp (the Google and Yahoo spiders, respectively) should index, and we can do this by creating a sitemap.xml file for our website. There’s a bit more to it than this, but for a quick example here’s what you would add:
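Here’s a minimal sitemap.xml sketch following the sitemaps.org protocol, reusing the hypothetical example.com URL from before:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/about</loc>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
```

Only the `<loc>` element is required for each URL; `<changefreq>` and `<priority>` are optional hints defined by the protocol.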
Some sitemap files can become extremely large, especially if you have tens of thousands of web pages. In those cases you would generate the file programmatically and break the sitemap XML into several smaller files. But they all serve the same purpose: helping search engine spiders know which pages you want them to index. You can read more about how Google treats sitemap.xml files on the Google Webmaster Tools site.
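When you split a large sitemap, the sitemaps.org protocol provides a sitemap index file that points to the smaller files. A sketch, with hypothetical file names:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://www.example.com/sitemap-1.xml</loc>
  </sitemap>
  <sitemap>
    <loc>http://www.example.com/sitemap-2.xml</loc>
  </sitemap>
</sitemapindex>
```

The protocol caps each individual sitemap file at 50,000 URLs, so the index approach is how very large sites stay within the limit.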
But please keep in mind: just because you submitted a sitemap.xml file doesn’t mean Google is going to index all your web pages!
Sitemap XML files are not the be-all and end-all of spidering or indexing in Google! They only help the Google spider know which pages you want indexed; if your site’s internal link structure is poorly defined and not every page is linked internally, a sitemap.xml file is worthless to you.