
Code Samples, Search Engine Optimization

SEO Issues with Duplicate Content: Htaccess, Robots and URLs – Part 1

4 Oct, 2008

There are a lot of important SEO issues to talk about here, so I am breaking this blog post into two parts. First, let me start by explaining why this blog post came about. I recently encountered a problem with a client who was on a shared hosting platform with really bad tech support (you know the type – GoDaddy, 1&1, HostGator, etc.). The problem was that the site’s home page was answering on too many URLs. For example, the following URLs all delivered the same content:

http://yourdomain.com
http://www.yourdomain.com
http://yourdomain.com/index.php
http://www.yourdomain.com/index.php

For a human visitor, typing any of these URLs into their browser would not present any problems.  In fact, humans don’t care how they get to a website, as long as they get there.  But search engines like Google, Yahoo and Microsoft Live do care – and sometimes can have quite a problem with the above situation. Here’s why:

You’re spreading all your potential incoming PageRank power across four different URLs, rather than just one – that’s bad SEO. Your website gets PageRank from other websites – the more sites that link to you, the more important Google thinks your website is. And if more important people link to you (those which themselves have a high PageRank), the better your PageRank. So if your incoming links are pointing to the above four different URLs, each one gets its own PageRank, instead of combining it all into one more powerful PageRank.
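
To see why that split matters, it helps to look at the simplified PageRank formula from Brin and Page’s original paper (the four-URL scenario is just my illustration here – Google’s real ranking math is far more involved):

PR(A) = (1 - d) + d * ( PR(T_1)/C(T_1) + ... + PR(T_n)/C(T_n) )

Here T_1 through T_n are the pages linking to URL A, C(T_i) is the number of outgoing links on page T_i, and d is a damping factor (commonly quoted as 0.85). If those n inbound links are spread evenly across four variants of the same home page, each variant’s sum only runs over roughly n/4 terms, so no single URL ever builds up the score that one canonical URL would.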

(Image: 2 Copy Cats – no one likes duplicate content)

Search engines, especially Google, hate duplicate content. Duplicate content is best defined as the same content which resides on more than one URL. That URL can be on your own site, or it can be on another website.

Google’s official stance on why they don’t want duplicate content is something like “users typically want to see a diverse cross-section of unique content when they do searches.”  But I think the reason Google really hates it so much is because of problems with MFA (Made for Adsense), or screen-scraper websites.  You’ve seen these unethical sites utilize blackhat SEO – they steal, or screen-scrape, content from other websites and post it on their own with the intent of stealing search engine traffic from the original content source.  As a result, they hope the visitors will click on their Google Adsense ads so they can make a few $.

So what does Google do with duplicate content?  It takes the first couple of results which it deems are either the most original, the most relevant, or the most authoritative, and displays them when users do a query.  The rest of them it lumps into a duplicate content black hole, where they are rarely seen or heard from again.

If you do a Google query on content which you know has been duplicated, you’ll see something similar to the following:

"In order to show you the most relevant results, we have omitted some entries very similar to the 7 already displayed. If you like, you can repeat the search with the omitted results included."

Yahoo will give you almost the exact same message:

"In order to show you the most relevant results, we have omitted some entries very similar to the ones already displayed. If you like, you can repeat the search with the omitted results included."

So what does this have to do with SEO and URLs? Well, you don’t want Google’s or Yahoo’s search engine crawlers to have any issues with any of your content, so why not make it easier on them by telling them exactly which URL you want them to index? You can do this with a number of tools.

First, use an .htaccess file (on Apache web servers) to clearly define your canonical URL, and use 301 redirects to make sure your web server answers on it. Canonical URLs are a difficult SEO subject for the amateur webmaster or blogger to understand, so I will try to break it down in a little more detail than in my previous post, SEO case study on 301 redirects. Here’s how Google’s Matt Cutts describes it:

Canonicalization is the process of picking the best url when there are several choices, and it usually refers to home pages.

So let’s say you want to make http://www.yourdomain.com THE canonical URL for your home page (and all the other pages on your site). You need to make sure anyone who types in http://yourdomain.com gets redirected to http://www.yourdomain.com, and any links pointing to http://yourdomain.com get redirected to http://www.yourdomain.com. You can add these lines to your .htaccess file:

Options +FollowSymLinks
RewriteEngine on
# Send any request for the non-www hostname to the www (canonical) hostname with a 301
RewriteCond %{HTTP_HOST} ^yourdomain\.com [NC]
RewriteRule ^(.*)$ http://www.yourdomain.com/$1 [R=301,L]
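
The rules above only take care of the www versus non-www hostnames; they don’t touch the /index.php variants from the example at the top of this post. As a rough sketch (assuming your home page really is served by index.php – adjust the filename to match your own setup), something like the following should collapse those as well:

# Hypothetical addition, not part of the snippet above: collapse /index.php onto the canonical home page
# The THE_REQUEST condition only matches when the browser explicitly asked for /index.php,
# so Apache's internal DirectoryIndex lookup for / does not cause a redirect loop
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /index\.php [NC]
RewriteRule ^index\.php$ http://www.yourdomain.com/ [R=301,L]

With both sets of rules in place, requesting any of the four URLs from the beginning of this article should land you on http://www.yourdomain.com.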

If you’re using WordPress, or a number of other Content Management Systems out there, this has already been done for you.

Now when search engines crawl your site they will always see your web pages on only your canonical URL, and will index them appropriately.

Read the next article in this two-part series, where I discuss how you can fix duplicate content issues with other tools, such as the robots.txt file and a sitemap.xml file.



2 Responses

  1. Google Webmaster Tools has a feature where it will show duplicate title and description tags.

  2. Thanks for a comprehensive article, Barry. I also recommend webmasters implement canonical URLs as best practice in basic SEO.

    Also, I’m working on a related tool that you may find interesting. It allows you to create 301 redirects automatically.

    http://www.301redirector.com

    It’s aimed at larger sites with many pages where creating redirects manually would take too much time.

Comments are closed.


by Barry Wise | time to read: 3 min