Duplicate content can be a troublesome issue. Internal duplicated content can stop a site from ranking. Quite often content management systems, especially Magento, churn out large quantities of pages on different URLs.
In this simple example, three product pages are crawled and indexed by the search engines.
This causes a couple of key problems.
- The pages compete against each other. With no clear direction, the search engines do not know which page to rank and the normal result is that all the pages plateau. This results in low rankings across the board. No matter how advanced Google’s algorithm gets, it still suffers from confusion when faced with multiple pages containing the same content. It needs simplicity and clear instructions.
- Whether Google penalises or devalues a site for having duplicate content is up for debate. Matt Cutts tweeted that duplicate content is not an issue. However, my own tests suggest otherwise. My guess is that there is a threshold, based on the ratio between a website’s authority and its percentage of externally duplicated content, that if passed allows a site a free ride past its duplication. There has been a lot written about Buffer’s move to syndicating their content.
Take this example: I worked with a site that sold a pretty generic product, office supplies. The website lived off a manufacturer’s feed, which populated its basic structure and product descriptions. We could populate the blog and the category-level content, both of which were bare.
Because the website received the majority of its content through the manufacturer’s feed, it was riddled with duplicate content (many other sites received the same feed). We filled the category pages with a huge chunk of relevant, quality content and canonicalised the product pages back to the category level. Suddenly the site became unstuck and started to gain additional traffic.
I am sure that duplication does not affect large authoritative sites but for the smaller guys, I’m not so sure.
- 1 How to Find Duplicate Content
- 2 How to fix the duplicate content
- 3 Internal
- 4 External
How to Find Duplicate Content
If a site is not ranking, here is my process of finding potential internal or external duplicate content.
- Run the site through Deep Crawl
- Run the site through Screaming Frog
- Run the site through Copyscape
- Run the site through Siteliner
- Manually search for duplicate content
- Check Google’s Search Console (formerly Webmaster Tools)
1, Deep Crawl
Deep Crawl is one of my favorite tools. For anyone who is technically minded it is a must and a massive time-saver.
The Deep Crawl site analysis takes the longest to process, so it is a good idea to start the crawl running first.
Log in to Deep Crawl. If a project has already been set up, you can simply rerun the crawl.
If not, you will need to set up a new project.
Just walk through the process of adding the XML sitemap and Analytics integration and you are good to go.
You will be faced with a wealth of information. For the purposes of this post we are going to concentrate on:
- Duplicate Pages
- Duplicate Titles
- Duplicate Descriptions
Click into each one and export.
2, Screaming Frog
The free version of Screaming Frog processes 500 URLs, so it is a really powerful tool to have in your arsenal if you own a small to medium website. If the site has more URLs than this you will need the paid version, which is £99 yearly.
To look for potential duplicate content using Screaming Frog, just enter the URL into the search box and click ‘Start’.
Screaming Frog will give you a huge amount of information but for this task we are only interested in:
- Page Titles
- Meta Descriptions
Go to the ‘Page Titles’ tab and then use the filter setting to just select ‘Duplicate Titles’.
If Screaming Frog finds pages with duplicate titles, it means either that you need to do some housekeeping and edit your page titles so they are all unique and target each page’s main keyword, or that your site is producing duplicate HTML content. Export the crawl and look at all the pages.
Repeat the same process with ‘Meta Descriptions’ just to be on the safe side.
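Once the crawl has been exported, the duplicate check can also be scripted. The following is a minimal sketch rather than a feature of Screaming Frog itself: it assumes the export is a CSV with an `Address` column for the URL and a `Title 1` column for the page title (check the header row of your own export, as the names may differ), and the sample rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def find_duplicates(rows, url_col="Address", field_col="Title 1"):
    """Group URLs by a field value and keep only values shared by 2+ URLs.

    The column names are assumptions about the export format.
    """
    groups = defaultdict(list)
    for row in rows:
        value = (row.get(field_col) or "").strip().lower()
        if value:  # ignore pages that have no title at all
            groups[value].append(row[url_col])
    return {value: urls for value, urls in groups.items() if len(urls) > 1}


# Hypothetical export, inlined so the sketch runs without a file on disk.
SAMPLE_EXPORT = """Address,Title 1
https://www.domain.com/product,Blue Widget
https://www.domain.com/cat1/product,Blue Widget
https://www.domain.com/about,About Us
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_EXPORT)))
duplicates = find_duplicates(rows)
for title, urls in duplicates.items():
    print(title, "->", urls)
```

Passing a different `field_col` (for example, whichever column holds the meta description in your export) repeats the same check for descriptions.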
3, Copyscape
Copyscape is a tool that checks a website’s content for duplication across the web. While this tool is not perfect, it is a good place to start. Enter a URL into the search bar and Copyscape will check the contents of the page for external duplication.
If you have a large site the premium version allows you to bulk check up to 10,000 pages.
4, Siteliner
Siteliner is a tool that checks a site for internal duplicate content. Just put in your URL and it will report your internal duplication and a few other useful bits such as page speed.
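To get a feel for what "internal duplication" means in numbers, two blocks of page text can be compared directly. The sketch below is not how Siteliner works internally (that is not documented here); it is one common approach, assuming that overlap between three-word sequences ("shingles") is a reasonable proxy for duplicated copy. The sample sentences are invented.

```python
import re


def shingles(text, size=3):
    """Return the set of overlapping word sequences ("shingles") in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def similarity(text_a, text_b, size=3):
    """Jaccard similarity: shared shingles divided by all distinct shingles."""
    a, b = shingles(text_a, size), shingles(text_b, size)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical page copy for illustration.
blue = "This sturdy widget is ideal for the office and comes in blue."
black = "This sturdy widget is ideal for the office and comes in black."
other = "Our blog covers stationery news and tips for small businesses."

print(round(similarity(blue, black), 2))  # near-identical descriptions score high
print(round(similarity(blue, other), 2))  # unrelated copy scores low
```

A score close to 1 suggests two pages share most of their text; where to draw the line is a judgment call for your own site.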
5, Manual Search
The final stage is to use Google to search for duplicated content. Even though this is the final stage, whatever you do, do not skip it. This is often the most important, even if it is a little hit and miss. There is no clear methodical process for this step – it involves skimming through the results and looking for search results that do not fit.
5.1. Paragraph content
Take a sentence or two from the homepage and search for it in Google. Use quotation marks around the sentence. This will tell Google to ONLY return pages with that exact string within the text content.
Repeat this step across main categories and product pages. It is not necessary to do all the pages, just spot check until you are happy.
This will show whether the CMS is duplicating pages, whether the content has been scraped, or whether it has been duplicated elsewhere, e.g. a manufacturer’s product description.
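If you are spot checking a lot of pages, building the quoted searches by hand gets tedious. Here is a small sketch that turns a sentence into an exact-match Google search URL you can open in a browser. The function name is made up for this example, and it only builds the link; it does not fetch or parse any results.

```python
from urllib.parse import quote_plus


def exact_match_query(sentence):
    """Build a Google search URL for the sentence wrapped in quotation marks."""
    # Collapse whitespace and drop any quotes already in the text so the
    # whole sentence ends up inside a single pair of quotation marks.
    cleaned = " ".join(sentence.split()).replace('"', "")
    return "https://www.google.com/search?q=" + quote_plus(f'"{cleaned}"')


print(exact_match_query("Suddenly the site became unstuck and started to gain additional traffic."))
```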
5.2. Site search
Perform a site: search in Google, for example site:domain.com.
This search will give you the number of pages indexed in Google for your chosen site. You should be able to make a rough estimate of the number of pages that the site should have indexed. Compare this to the number of pages indexed according to the site: search. This is just a rough estimate, but if the numbers differ massively you will need to go through the indexed pages one by one.
Go to the last SERP page in Google and look for:
If you see this warning there is definitely an issue with duplicate content. Click the last page and then back to the end result and look for the differences in the two SERPs.
There is a long list of potential issues, but a few of the most common are:
You should also look for the different variations of the domain, such as HTTPS and HTTP, and the www. subdomain and non-www. versions. To do this, scroll through the site: search results and look for any variations. If you find any, there may be an issue with your redirects; every variation should redirect to the correct version using a 301 permanent redirect.
As query strings perform a variety of functions, if you find query strings indexed you will need to check further. What is the function of the query string? Does it sort the content of the page? Does it filter the content of the page? You will need to delve deeper and look at the original page the query string belongs to, to see if the string is causing duplication.
It is advisable to look for consistency in a URL’s trailing slashes. It may not seem like a big deal, but the following two pages are classed as different pages:
It is reasonable to assume that the trailing slash in this case is a by-product of the CMS and therefore the content on the pages will be an exact duplication.
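The variations described above (protocol, www., query strings and trailing slashes) can be grouped automatically once you have a list of indexed URLs. The sketch below is one way of doing that: it reduces each URL to a comparison key and reports keys that more than one URL maps to. The URLs are hypothetical, and ignoring the query string entirely is a deliberate simplification; as noted above, each query string still needs checking by hand to see what it actually does.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def normalise(url):
    """Reduce a URL to a key: no scheme, no www., no trailing slash, no query."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/") or "/"
    return host + path


def group_variants(urls):
    """Return only the keys that more than one indexed URL collapses to."""
    groups = defaultdict(list)
    for url in urls:
        groups[normalise(url)].append(url)
    return {key: found for key, found in groups.items() if len(found) > 1}


# Hypothetical list of indexed URLs.
indexed = [
    "http://domain.com/product",
    "https://www.domain.com/product",
    "https://www.domain.com/product/",
    "https://www.domain.com/product?sort=price",
    "https://www.domain.com/contact",
]
variants = group_variants(indexed)
for key, found in variants.items():
    print(key, "->", len(found), "indexed variations")
```

Any key with more than one URL against it is a candidate for a redirect or a canonical tag.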
6, Google’s Search Console
Google’s Search Console, formerly Google’s Webmaster Tools, provides some data that can be useful for finding internal duplication. Pay close attention to the information that it gives you. If Google is advising you to fix issues with your site, it is probably worth doing.
Log in to your Search Console and select the correct view. In the left-hand navigation, click ‘Search appearance’ and then ‘HTML improvements’.
Look for ‘Duplicated titles’ and ‘Duplicated Meta descriptions’, sift through the lists and see if any of the content is duplicated.
How to fix the duplicate content
Internal
Internal duplicate content can be a fairly easy fix. Once you have identified the offending pages you have a number of options:
Rewriting the content is probably the least desirable option. Only use this option if you have used the same content on different pages. An example of this is an ecommerce store that has different variations of a product and uses the same product description across all the variations.
For example, you have a store that sells widgets. The store sells 5 different coloured widgets: blue, black, white, yellow and brown. Each product page has 600 words of content, but because the only variation is the colour, the same product description is across all five pages.
Let’s say in this example the keywords:
- Blue Widget
- Black Widget
- White Widget
- Yellow Widget
- Brown Widget
All have extremely high search volumes and it would be beneficial to rank well for all these terms and not just the main target term ‘Widgets’.
In this example, each page would need a unique, well-written product description that describes the product and its uses in an original way.
Canonicals seem to confuse a fair few people; however, they are my go-to method of removing duplicate content. The concept is simple. Say you have two pages that have duplicate content, like the example below.
Ideally, you want your URLs to contain the category structure as this gives the website a structure. This means, in this case:
www.domain.com/product contains the DUPLICATE CONTENT
www.domain.com/cat1/product contains the ORIGINAL CONTENT
In the above example, you would add the canonical tag to the page that contains the duplicate content and point it to the page that contains the original content. This code tells the search engines to remove the duplicate content from their index and pass the value on to the page with the original content.
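For reference, the canonical tag is a `<link rel="canonical">` element placed in the `<head>` of the duplicate page. How you add it depends on your CMS, so the helper below is only a sketch of the markup, using the example URLs from above with an assumed `https://` prefix.

```python
from html import escape


def canonical_tag(original_url):
    """Return the tag to place in the <head> of the DUPLICATE page."""
    return f'<link rel="canonical" href="{escape(original_url, quote=True)}">'


# The duplicate page www.domain.com/product would carry this tag,
# pointing at the page that holds the original content.
print(canonical_tag("https://www.domain.com/cat1/product"))
# prints: <link rel="canonical" href="https://www.domain.com/cat1/product">
```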
Simply adding the NOINDEX tag to the <head> section of the duplicated pages can be an easy and straightforward way of removing the duplication from the search engines’ index.
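The tag in question is a robots meta tag, `<meta name="robots" content="noindex">`. After adding it, it is worth confirming that it actually appears in the page source. The sketch below checks a block of HTML for a noindex directive using Python's built-in parser; the class name and sample page are invented for illustration, and it only looks at the generic `robots` meta tag, not crawler-specific tags or HTTP headers.

```python
from html.parser import HTMLParser


class NoindexCheck(HTMLParser):
    """Record whether a robots meta tag containing 'noindex' is present."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            directives = (attrs.get("content") or "").lower().split(",")
            if "noindex" in [d.strip() for d in directives]:
                self.noindex = True


# Hypothetical duplicate page.
PAGE = """<html><head>
<title>Blue Widget</title>
<meta name="robots" content="noindex, follow">
</head><body><p>Duplicate product description.</p></body></html>"""

check = NoindexCheck()
check.feed(PAGE)
print(check.noindex)  # True for the sample page above
```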
Redirecting the duplicate content page to the original content page is another option open to you. This option does come with its downsides; for instance, it may leave a large number of redirects in place, which can easily get out of control. If the CMS or category structure changes, this could result in a complete mess. However, it will result in fewer pages being crawled by the search engines and will free up crawl budget for the rest of the site.
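One way to stop redirects getting out of control is to keep them in a single map and flatten any chains, so every old URL points straight at its final destination. The sketch below assumes the redirects are held as a simple source-to-target dictionary (the paths are hypothetical); how that map is then turned into actual 301 rules depends on your server or CMS.

```python
def flatten_redirects(redirects):
    """Point every source URL straight at its final destination.

    Raises ValueError if the map contains a redirect loop.
    """
    flattened = {}
    for source in redirects:
        seen = {source}
        target = redirects[source]
        # Follow the chain until we reach a URL that is not itself redirected.
        while target in redirects:
            if target in seen:
                raise ValueError(f"Redirect loop involving {target}")
            seen.add(target)
            target = redirects[target]
        flattened[source] = target
    return flattened


# Hypothetical map: /old-product was redirected before the category
# structure was introduced, creating a two-step chain.
redirects = {
    "/old-product": "/product",
    "/product": "/cat1/product",
}
flat = flatten_redirects(redirects)
for source, target in flat.items():
    print(source, "->", target)
```

Running this whenever the category structure changes keeps each redirect to a single hop and flags loops before they go live.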
External
As mentioned previously, there is some debate over whether duplicate content is an issue.
Take the extreme example at the start: if you own a site that pulls its product descriptions and text content from an external source, it might be worth filling your category pages with unique, well-written content and then using the canonical tag to point from your product pages to the category pages.
Alternatively if you have two pages that have duplication, you can add the canonical tag to the page that you do not want to be indexed.
Contact the website owner
One option is to contact the website owner and ask them politely to remove the content. I have tried this method a few times with 0% success, but for the time it takes to write an email, it might be worth giving it a go.
Rewriting your content is a last resort. The real critical factors are:
1, how big your site is
2, how many resources you have
In most cases it is not cost effective to rewrite thousands of pages of content.