Eliminate Duplicate Content Using robots.txt

In my last post I put together a quick SEO checklist, and I’ve been meaning to expand on some of those points. One issue in particular I’ve had to deal with recently is duplicate content.

I’ve been using the SEO in Firefox extension to check various stats about my site from time to time. It gives you a quick snapshot of your site including, but not limited to, PageRank, Alexa rank, the number of cached pages, and, what caught my eye yesterday, how many of those pages are in the supplemental index. It was showing that I had 680 pages in the index, and all of them were supplemental.

As it turns out, when I was doing the redesign of this site I forgot that each tag you create with Ultimate Tag Warrior generates its own page containing the full content of every post with that tag. All of these tag pages had been indexed by Google and must have been triggering a duplicate content flag.

So the steps I needed to take were:

  • Prevent Google from indexing these pages in the future.
  • Remove the offending pages from the Google index.

To prevent Google from indexing these pages I went to my robots.txt file.

Here is what mine looks like:

User-agent: *
Disallow: /*/feed/
Disallow: /*/feed/rss/
Disallow: /*/trackback/
Disallow: /wp-
Disallow: /feed/
Disallow: /trackback/
Disallow: /tag/
Sitemap: http://seoandtips.com/sitemap.xml

Here is a quick rundown of what each of those lines does:

  • The first line (User-agent: *) specifies which web crawlers the following directives apply to. In this case the * means they apply to all web crawlers.
  • Each of the “Disallow:” lines tells web crawlers not to index the directory specified, as well as any of its subdirectories. I’ve added /tag/ to prevent those pages from being indexed.
  • The last option, “Sitemap:”, is a newer directive. All major search engines now support autodiscovery of sitemaps. You can automatically regenerate an updated sitemap after each post you make by using the Google Sitemap Generator for WordPress.
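If you want to sanity-check rules like these before a crawler does, Python’s standard library ships a robots.txt parser. Here’s a quick sketch using the rules above (the test URLs are made up for illustration; note that the stdlib parser doesn’t implement Googlebot-style `*` wildcards, so lines like /*/feed/ only match literally, but plain prefixes like /tag/ behave as expected):

```python
from urllib.robotparser import RobotFileParser

# The robots.txt rules from above, parsed locally so no network fetch is needed.
lines = """User-agent: *
Disallow: /*/feed/
Disallow: /*/feed/rss/
Disallow: /*/trackback/
Disallow: /wp-
Disallow: /feed/
Disallow: /trackback/
Disallow: /tag/
""".splitlines()

rp = RobotFileParser()
rp.parse(lines)

# Tag archive pages are blocked for all crawlers...
print(rp.can_fetch("*", "http://seoandtips.com/tag/seo/"))    # False
# ...and so is anything under /wp-...
print(rp.can_fetch("*", "http://seoandtips.com/wp-admin/"))   # False
# ...while ordinary post pages remain crawlable.
print(rp.can_fetch("*", "http://seoandtips.com/some-post/"))  # True
```

This only tells you what a well-behaved crawler should do; it doesn’t remove anything already sitting in Google’s index.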

Now that my robots.txt file has been updated to disallow the /tag/ directory, those pages will be removed from the index the next time Google crawls my site, and hopefully that will address the supplemental problem.

Google also provides a facility in its Webmaster Tools to request expedited removal of pages from the index, but you should be careful when using this because you can potentially remove your whole site from the index.
