Understanding SEO issues related to Duplicate Content
Duplicate content is very common on the web. It comes in the form of content syndication, mirror sites, content scraping, quoting, content reuse, data-fed sites and so on.
Duplicate content poses a real challenge for search engines.
There are essentially 2 types of duplicate content - duplicate documents and query-specific duplicate documents.
Duplicate Documents
Duplicate documents are web pages that are the same or almost the same. The duplication here operates on the whole page, unlike query-specific duplicates, which share only parts of otherwise different pages.
Query-Specific Duplicate Documents
Sometimes documents can be quite different on the whole, but share common paragraphs - for example, a syndicated article republished inside otherwise unique pages, or the same product description pulled into many different sites.
In essence, when two documents are different on the whole but are detected to share common parts, they are called query-specific duplicate documents. Why query-specific? Because this duplicate detection is based on the user-supplied query (more on this later).
At the moment all major search engines are good at detecting whole duplicate documents, but Google is the only one that has mastered filtering out documents that merely share duplicate parts (query-specific duplicates).
This currently leaves Yahoo and MSN open to content spam (for example, SERPs dominated by the same data-fed sites). I expect Yahoo and MSN to follow Google's aggressive duplicate content filtering, so the future ain't bright for any form of duplicated content.
I will focus this article on Google's patents, because Google is the most open about its algorithms and currently has the best duplicate content filtering technology.
Before I continue, let's restate the major difference between the two: duplicate documents are essentially the same or almost the same pages, while query-specific duplicate documents can be quite different but share duplicate parts.
Detecting Duplicate Content
Google detects the different types of duplicate content at the different stages of operation of the search engine.
Whole duplicate pages are detected after a page is crawled. Google generates fingerprints that are compared to the fingerprints of the other pages in the repository. At this "after crawling / before indexing" phase, whole duplicate content pages are discovered and labeled.
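To make the fingerprinting idea concrete, here is a minimal sketch in Python. Google's actual fingerprinting method is not published; the word-shingling, MD5 hashing, and the 0.9 similarity threshold below are all my own illustrative assumptions.

```python
import hashlib

def shingle_fingerprints(text, k=5):
    """Hash every k-word shingle of a page's visible text.

    A toy stand-in for a search engine's page fingerprints; the
    shingle size and hash choice are illustrative assumptions.
    """
    words = text.lower().split()
    shingles = {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}
    return {hashlib.md5(s.encode()).hexdigest() for s in shingles}

def near_duplicate(text_a, text_b, threshold=0.9):
    """Treat two pages as duplicates when their fingerprint sets
    overlap heavily (Jaccard similarity above the threshold)."""
    fa, fb = shingle_fingerprints(text_a), shingle_fingerprints(text_b)
    if not fa or not fb:
        return False
    return len(fa & fb) / len(fa | fb) >= threshold
```

Because the shingles span the whole page, two pages that differ only in a few words still produce heavily overlapping fingerprint sets, which is why this catches "almost the same" pages, not just byte-identical copies.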
Detecting duplicate parts of pages is a tricky business because the number of combinations of page snippets to compare is astronomical. Google detects query-specific duplicate pages at query time.
Here's an overview:
A visitor places a search query. Google generates the top 1000 most relevant pages according to its current algorithm.
For every page in this final set of 1000 pages, Google fetches the raw html from the data repository (not the index) and strips it of tags and possibly stop words. Google then scans every page in this candidate set for the query terms and pulls the parts of each page with the most keywords in them.
Finally, Google compares these query-specific snippets with the snippets pulled for the other pages in the final set and if there is a match (exact or close), Google will not show the page with the lower relevancy score.
Within this final set of 1000 pages, Google tries to filter out pages that offer the same content related to the query. If query-specific duplicate pages are detected, Google shows only the page with the highest relevancy score. Basically, all query-specific duplicates fight for one place and only one page gets it. All other pages are omitted from the results (note: omitted does not mean penalized!).
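The query-time steps above (strip the tags and stop words, pull the keyword-densest snippet, keep only the top-scoring page per distinct snippet) can be sketched like this. Everything here - the stop word list, the 10-word window, and exact-match snippet comparison - is an illustrative assumption; the real system presumably detects close matches too, not just exact ones.

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is"}  # illustrative subset

def strip_tags(html):
    """Crude tag removal, standing in for the page-cleaning step."""
    return re.sub(r"<[^>]+>", " ", html)

def query_snippet(html, query, window=10):
    """Pull the window of words densest in query terms -- a rough
    analogue of query-biased snippet extraction (window size is an
    illustrative assumption)."""
    words = [w for w in strip_tags(html).lower().split() if w not in STOP_WORDS]
    terms = set(query.lower().split())
    best, best_hits = "", -1
    for i in range(max(1, len(words) - window + 1)):
        chunk = words[i:i + window]
        hits = sum(1 for w in chunk if w in terms)
        if hits > best_hits:
            best, best_hits = " ".join(chunk), hits
    return best

def filter_query_duplicates(ranked_pages, query):
    """Keep only the highest-ranked page per distinct snippet.

    ranked_pages: list of (score, html) pairs. Pages whose
    query-specific snippet matches one already kept are omitted
    (omitted, not penalized)."""
    seen, kept = set(), []
    for score, html in sorted(ranked_pages, key=lambda p: -p[0]):
        snip = query_snippet(html, query)
        if snip not in seen:
            seen.add(snip)
            kept.append((score, html))
    return kept
```

Note how the dedup key depends on the query: two pages that share only one paragraph collide when the query terms land in that paragraph, but survive side by side for queries that hit their unique parts.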
Let me give an example. You have an affiliate data fed widget store. You pull widget product information from an xml feed. This same feed is used by thousands of other affiliates.
Your site shows the widget descriptions taken from the feed, which are essentially the same as all other affiliate sites. The product description is not unique, but your text around it (your navigation and other texts) is unique.
Google crawls your widget site. After crawling the pages, Google does not detect your pages as duplicates of other pages (thanks to your unique navigation and surrounding text).
At query time a potential customer issues a query related to widgets. This query uses keywords that are found in your duplicate product descriptions. Your page gets in the final 1000 top ranked pages. However, let's say 100 other affiliates get in the top 1000.
Now, Google scans all pages for query-related snippets. Because the query uses keywords that appear heavily in the data-fed descriptions, the generated snippets for your store and the 100 other affiliates are the same. Google will show only the top ranked page, and all the rest will be omitted from the results.
If you want to rank such a site, your pages need to outrank all the other affiliates' pages.
This final query-specific document detection is what currently separates Google from Yahoo and MSN in terms of duplicate content removal from the SERPs.
Google also uses the final query-specific dup detection phase to generate the actual snippets of text shown within the SERPs.
As you see from the above example, it is very difficult to rank high when the target keywords are used within duplicate parts of your pages.
I want to make a point very clear. Google does not penalize duplicate content. Google simply shows only one page and filters the other duplicates.
Let's go back to the 2 types of duplicate content and their implications.
Whole duplicate documents are detected between the crawling and the indexing phase. The purpose of this detection is to discover and label duplicate pages before they reach the index.
Query-specific duplicates are detected at query time. The purpose of their detection is to filter redundant pages from the results (only the highest-scoring page is shown) and to generate the snippets displayed in the SERPs.
My Duplicate Content Recommendations
Here's a real example.
Search Google for "diprolene af" without the quotes (that is a cream sold at many duplicate-content online pharmacies). Instead of showing 1000 results, Google (at the moment) shows 89.
There is a text at the bottom of the last page that says: "In order to show you the most relevant results, we have omitted some entries very similar to the 89 already displayed. If you like, you can repeat the search with the omitted results included." Click the link to search with the omitted results included, and you will see 966 (at the moment) results (they are not 1000 because other pages are filtered - probably pages from the same domain as other listed pages).
In this extreme example, Google lists about 9% of all the final 1000 page candidates.
I think we will also see duplicate link detection in the future (detecting and devaluing duplicate anchor text and link descriptions for incoming links), but that is a topic for a future article.
Remember the golden rule of duplicate content: make it unique or hide it from search engines.
... and read the damn patents.