
SEO Issues with Duplicate Content: Htaccess, Robots and URLs – Part 2

6 Oct, 2008

This is Part 2 of a series on SEO and duplicate content issues. In the first part I discussed using your Apache .htaccess file with 301 redirects on Linux servers to fix canonical URL problems.
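As a quick refresher, a canonical-host rule of that kind looks something like this (just a sketch, assuming Apache with mod_rewrite enabled and that you prefer the www version of yourdomain.com):

RewriteEngine On
# 301-redirect any non-www request to the www hostname
RewriteCond %{HTTP_HOST} ^yourdomain\.com$ [NC]
RewriteRule ^(.*)$ http://www.yourdomain.com/$1 [R=301,L]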

OK, now that I’ve answered the question of canonical URLs, let’s get back to that pesky duplicate content issue. Hacking your .htaccess file solved one problem, but what if you have different URLs that all point to the same content? Something like this:

http://www.yourdomain.com/my-page.html
http://www.yourdomain.com/index.php?page=my-page
http://www.yourdomain.com/recent-pages/my-page.html

Let’s assume all of these pages have the same content on them, because they are, well, the same page. You only want Google and Yahoo to see one of them, so that their crawlers don’t have issues with duplicate content. You have two tools that can assist you (besides defining other, more complex .htaccess rules and 301 redirects) – the robots.txt file and the sitemap.xml file.
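For a handful of pages you could solve this with a few more mod_rewrite rules, 301-redirecting each duplicate URL to the canonical page. A rough sketch, assuming /my-page.html is the version you want indexed:

RewriteEngine On
# redirect /index.php?page=my-page to the canonical page, dropping the query string
RewriteCond %{QUERY_STRING} ^page=my-page$
RewriteRule ^index\.php$ /my-page.html? [R=301,L]
# redirect the /recent-pages/ copy to the canonical page as well
RewriteRule ^recent-pages/my-page\.html$ /my-page.html [R=301,L]

That approach gets unwieldy quickly as the number of pages grows, which is where robots.txt and sitemap.xml come in.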

You can use the robots.txt file to tell search engine crawlers which URLs to stay away from, and you can use a sitemap.xml file to tell them which URL is the preferred version of each page. For example, to fix the duplicate URL problem above, you can add the following to robots.txt:

# applies to all crawlers
User-agent: *
Disallow: /index.php?page*
Disallow: /recent-pages/*

This tells search engine crawlers that any URL starting with /index.php?page or /recent-pages/ should not be crawled. It will block not only the duplicate URLs above, but every other URL that matches those patterns, so it will also block:

http://www.yourdomain.com/index.php?page=my-page2
http://www.yourdomain.com/recent-pages/new-page.html

Next we want to define which page Googlebot and Slurp (the Google and Yahoo spiders, respectively) should index, and we can do this by creating a sitemap.xml file for our website. There’s a bit more to it than this, but for a quick example here’s what you would add:

<url>
<loc>http://www.yourdomain.com/my-page.html</loc>
<changefreq>weekly</changefreq>
<priority>0.5</priority>
</url>
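To be valid, that <url> entry has to sit inside the standard sitemap wrapper. A complete minimal file listing just this one page would look something like this:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://www.yourdomain.com/my-page.html</loc>
<changefreq>weekly</changefreq>
<priority>0.5</priority>
</url>
</urlset>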

Some sitemap files can become extremely large, especially if you have tens of thousands of web pages. In those cases you would generate the file programmatically and break it into several smaller sitemaps, tied together by a sitemap index file (an example follows below). They all serve the same purpose, which is to help search engine spiders know which pages you want them to index. You can read more about how Google treats sitemap.xml files on the Google Webmaster Tools site.
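The sitemap index is itself just a small XML file that points to each of the child sitemaps. A rough sketch, with made-up file names:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>http://www.yourdomain.com/sitemap-pages.xml</loc>
</sitemap>
<sitemap>
<loc>http://www.yourdomain.com/sitemap-posts.xml</loc>
</sitemap>
</sitemapindex>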

You can also view my sitemap.xml file, which is automatically generated by the Google Sitemap Generator Plugin for WordPress by Arne Brachhold.

But please keep in mind: just because you submitted a sitemap.xml file doesn’t mean Google is going to index all your web pages!

Sitemap XML files are not the be-all and end-all of crawling or indexing in Google! They only help the Google spider know which pages you want indexed; if your site’s internal link structure is not properly defined and not every page is linked to internally, having a sitemap.xml file is worthless to you.



4 Responses

  1. Hi Barry, this is indeed a real resource for all bloggers. You may wish to see my post as well on duplicate content penalties for Blogger. I found that using the ‘recent comments’ widget in Blogger causes duplication of content. Also, the monthly archives in Blogger are not disallowed from the search index using a robots.txt directive. Good to see you self-hosted people taking advantage of robots.txt. Blogger doesn’t provide that option.

    Lenin

  2. silicon loop says:

    Very helpful post, but I have a question that remains unanswered. How about the situation where you have a site that’s built on a CMS, and you want to put a blog at one of the ‘directories’ of the site? Say you want a WordPress blog there.

    You want it to look like:
    mysite.com/blog

    But because of the limitations of the CMS you have to install the blog elsewhere, say at blog.mysite.com.

    Now how do you handle URL rewriting in this case, so that the blog appears to be at mysite.com when it’s really on a subdomain? And how do you prevent the subdomain (blog.mysite.com) from being indexed?

    Thanks.

  3. Barry Wise says:

    silicon: Matt Cutts of Google has stated that there is essentially no difference between a subdirectory and a subdomain on a site. In Google’s eyes,
    mysite.com/blog = blog.mysite.com

  4. Thanks for this, Barry. Need to brush up my skills regarding robots.txt.

Comments are closed.
