Thin and duplicate content: What news SEOs need to know

In issue 31, Shelby looks at content Google does not reward: Thin and duplicate content. We look at how best to optimize duplicates (think: wire stories), and how to avoid thin content.

Sep 20, 2021

Happy Monday, friends! Shelby is in the driver’s seat this week, taking you through some more SEO for news fun. It has been a very busy time outside of this newsletter, but I’m happy to be sipping some tea talking shop with you all.

Today, we’re going to go a bit deeper into one of the topics we explored in our technical SEO 101 – thin and duplicate content.

This will give you a better understanding of how to deal with pages on your site that, to you, may not need additional optimizing, but for the sake of Google’s ranking criteria, need a bit of help. We’ll discuss how to do this effectively without keyword stuffing or ruining your rankings.

A laptop computer sits on a table with an empty notebook and cup of coffee. — Photo by Nick Morrison on Unsplash

So, let’s talk about thin and duplicate content in depth.

In this issue:

What is thin or duplicate content?
Thin content solutions
Duplicate content solutions

THE 101

What is thin content?

Thin content is content that Google deems to have little or no value (it’s the opposite of E.A.T. content – which Google wants more of). Content should not be posted for the sake of posting. Stories should add value to the reader’s life, whether that is a personal opinion, originally reported advice or never-before-seen photos.

Thin content in news could show up as a podcast landing page, the subscription or donation page or even an article that just has a series of links on it.

Before Google weighed value this way, site owners would attempt to improve their site’s rankings by creating pages with many words, but little value. With Google leaning heavily into E.A.T. content (expertise, authority and trustworthiness), it’s more important to provide valuable information to the reader and not just run a new story for the sake of publishing.

What is duplicate content?

Duplicate content is a type of thin content. It is provided more than once on the internet in the same or very similar way. This can become tricky for search engines as they do not know which version to include or exclude from their indices or which URL to rank. This is why you may sometimes find the same wire or article on different sites all ranking for the same keyword.

Duplicate content can impact your rankings. If a crawler finds the same story on four different sites, it will use a series of ranking factors (such as the URL, links to the story, headline, etc.) to decide which to provide users, and which to hide.

News SEOs are probably thinking of one thing: wires. Wire services provide news reports for media outlets that may not be able to get the story right away, but want to provide information to their readers. Wires are syndicated by news organizations such as Reuters, Associated Press, Canadian Press, The New York Times, etc., and provide copy that needs little to no editing. The story can be published multiple times by any news organization that has an agreement with the service.

Wires are a major contributor to duplicate content. Many, many, many news organizations have access to wires and will publish them as is, sometimes with the exact same headline. This can cause multiple versions of the same page to exist on the internet (like during the Kabul airport attack).

We’ll talk about how to optimize wire copy more in the future, but it is important to remember that even if a wire is perfect from an editorial standpoint, it still needs to be optimized for your audience.

Attention news folks: Duplicate content and thin content are particularly big concerns for news organizations because we publish stories on a regular basis. It is very possible we accidentally publish the same story twice or have another affiliate site pick up the story and publish it. This is why it’s so important to understand the concepts.

🔗 Read more: What is thin content? And how do I fix it?

THE KNOW HOW

How to fix thin content on your news site

Avoiding thin content is your best approach. Quite simply, every form of information on your site should provide value to your audience. If it doesn’t, the story shouldn’t exist.

Take an inventory

To fix thin content, you want to start by taking stock of all pages that may be considered “thin.” To do this, go through the pages on your site – both navigational pages (about, contact, subscribe) and the news pieces you feel should be ranking – and make note of their substance.

Consider the following:

Does this page provide fewer than 200 words of text?
Does this page give the reader what they need to know? In simple terms: Is the page useful?
Can the page be easily read by humans and robots alike?
Does this page exist anywhere else? If so, should this page or the duplicate be the true version?

If the page provides fewer than 200 words of text, does not give the reader what they need, is not easily readable and/or exists elsewhere, you have a piece of thin content.
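If your inventory is long, the word-count question is the easiest one to script. Here is a minimal Python sketch, assuming you already have each page's HTML saved as a string. The function names (`count_words`, `flag_thin_pages`) and the sample URLs are made up for illustration, and the 200-word threshold is simply the guideline from the checklist above, not an official Google number. The other questions (Is the page useful? Does it exist elsewhere?) still need a human.

```python
from html.parser import HTMLParser

WORD_THRESHOLD = 200  # the 200-word guideline from the checklist above


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def count_words(html):
    """Rough word count: whitespace-separated tokens in the visible text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return len(" ".join(parser.chunks).split())


def flag_thin_pages(pages, threshold=WORD_THRESHOLD):
    """pages maps URL -> HTML string. Returns the URLs under the threshold."""
    return [url for url, html in pages.items() if count_words(html) < threshold]


# Hypothetical example pages
pages = {
    "/podcast": "<h1>Our podcast</h1><script>var x = 1;</script><p>Listen now.</p>",
    "/story": "<p>" + "word " * 250 + "</p>",
}
thin = flag_thin_pages(pages)
```

In this example only `/podcast` lands in `thin`. Treat the output as a to-do list for a manual review, not a verdict on the page.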

According to Google, these are the ways to deal with thin content:

Create a useful, information-rich site and write pages that clearly and accurately describe your content;
Use appropriate keywords where appropriate. Don’t keyword stuff;
Make your pages crawlable (i.e., key content that does not depend on JavaScript to appear, pages that load reliably, etc.).

Using these, let's correct the pages that need fixing.

Audit the cached version of your page for technical issues

Look at the version of your pages Google has stored. This is the cached version and may give you clues as to why Google is not indexing your page.

In the address bar, write `cache:` followed by the page you’re looking at. This will show you the most recent version of your page that Google stored.

While looking at this version, check if everything loads properly. If something doesn’t, it means Google couldn’t render those elements and did not index them as part of the page. This could be why the page is not ranking.

Often, this issue is created because of one of the following:

JavaScript running and blocking a crawler from seeing everything;
Extremely long load times;
Too many elements on the page.

Use the cached version of the page to identify if you can see the issue. Work with your site team or developers to correct the issues.
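Before you bring in developers, you can run a rough first check yourself: is the text you care about actually in the page's raw HTML, or does it only appear after JavaScript runs? The Python sketch below is an approximation under stated assumptions, not a picture of how Google renders pages. It assumes you have the raw HTML as a string, and `phrases_missing_from_html` is a hypothetical helper name.

```python
import re
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collects text outside <script> and <style> blocks."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def phrases_missing_from_html(html, phrases):
    """Returns the phrases that do not appear in the raw HTML's visible text."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    # Normalize whitespace and case so formatting differences don't matter
    text = re.sub(r"\s+", " ", " ".join(parser.parts)).lower()
    return [phrase for phrase in phrases if phrase.lower() not in text]


# Hypothetical page: the headline is in the HTML, the body is injected by a script
raw_html = (
    "<html><body><h1>City council vote</h1>"
    '<div id="app"></div>'
    '<script>render("Full story text")</script>'
    "</body></html>"
)
missing = phrases_missing_from_html(raw_html, ["City council vote", "Full story text"])
```

Anything that shows up in `missing` only exists once a script runs, which makes it a good item to raise with your site team.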

Find the page’s duplicates

If the page has duplicates elsewhere – whether on your site or other sites – identify them and make note.

You can use this handy dandy tool called Copyscape to find duplicate versions of your stories on other sites. These are more than likely published without permission or as direct duplicates.

Lucky for us, Google has become smarter at recognizing that republishing sites or sites created for the purpose of spam links are not what readers want. However, it is still good to know which sites are republishing your stories.
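If you want a quick sense of how similar two versions of a story are (say, your edited wire and the original copy), one common text-comparison technique is to compare overlapping word sequences, often called shingles. The Python sketch below is a generic illustration, not how Copyscape or Google measures duplication, and the five-word shingle size is an arbitrary assumption.

```python
import re


def shingles(text, size=5):
    """Returns the set of overlapping word sequences of the given length."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard_similarity(text_a, text_b, size=5):
    """Shared shingles divided by all shingles: 0.0 is no overlap, 1.0 is identical."""
    a, b = shingles(text_a, size), shingles(text_b, size)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical wire copy and a lightly extended version of it
wire = "Officials said the airport would reopen on Tuesday after crews cleared the runway"
edited = wire + " and inspected the terminal"
score = jaccard_similarity(wire, edited)
```

A score close to 1.0 suggests the two versions are nearly word-for-word copies. A lower score suggests the edit added something of its own.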

If you or your site are republishing pieces on sites like LinkedIn or Medium, there are steps you can take to avoid creating duplicate content. Ensure each republished page has a canonical URL set to the original version. If you use a wire service, sometimes they will have a clause in their contract that does not allow you to have full sharing capabilities of a story. If that’s the case, also ensure your robots directive is set to noindex.
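To confirm those settings are actually on the page, you can read them straight out of the HTML. The canonical URL lives in a `<link rel="canonical">` tag and the robots directive in a `<meta name="robots">` tag. This is a minimal Python sketch, assuming you have the page's HTML as a string. `read_signals` and the example.com URL are hypothetical, and the sketch ignores other places these directives can be set (such as HTTP headers).

```python
from html.parser import HTMLParser


class _HeadSignals(HTMLParser):
    """Picks up the canonical URL and robots directives from the page's tags."""

    def __init__(self):
        super().__init__()
        self.canonical = None
        self.robots = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and "canonical" in (attrs.get("rel") or "").lower().split():
            self.canonical = attrs.get("href")
        elif tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            self.robots.extend(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def read_signals(html):
    """Returns the canonical URL (or None) and whether noindex is set."""
    parser = _HeadSignals()
    parser.feed(html)
    parser.close()
    return {"canonical": parser.canonical, "noindex": "noindex" in parser.robots}


# Hypothetical republished wire story
page = (
    '<head><link rel="canonical" href="https://example.com/original-story">'
    '<meta name="robots" content="noindex, follow"></head>'
)
signals = read_signals(page)
```

For a republished piece, `signals["canonical"]` should point to the original version. For wire copy with sharing restrictions, `signals["noindex"]` should be `True`.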

Add some text

It seems so simple, but it’s true. If your page has fewer than 200 words, you will absolutely need to add some form of additional text to the page.

Two hundred words may not seem like a ton, but when you’re describing your podcast or a newsletter, it may seem counterintuitive to write 300-plus words about the host.

A few suggestions for good user information:

Be descriptive of what folks can expect from this particular product;
Include the names of those who will host, write or compile the product (if it makes sense);
Include any FAQ questions readers may have (you can find these through AnswerThePublic or on Google when you search the product, service or topic).

Best practices:

Audit audit audit. Always look for what’s happening on your site and how you can improve it.
Ensure thin content does not exist unless absolutely necessary.
When in doubt, add some valuable assets to a page to improve the user experience.

The bottom line: Value is subjective, but Google’s ranking signals are all around E.A.T and quality content – the crux of journalism. Make your value shine with all pieces you publish, and don’t publish for the sake of publishing.

FUN + GAMES

SEO quiz

Which of the following is not a search engine?

SearchTeam
DuckDuckGo
Shodan
Doogle

RECOMMENDED READING

More information on how Google is generating title tags
A new SEO traffic tool and how to beat the Google rewrites
Bubble links are showing up in U.S. SERPs
Mozilla is testing Bing as its default search engine for 1% of the population
Google is updating podcast requirements

Have something you’d like us to discuss? Send us a note on Twitter (Jessie or Shelby) or to our email: seoforjournalism@gmail.com.

(Don’t forget to bookmark our glossary.)

FUN + GAMES

The answer: 4. Doogle is not a search engine, but these unique ones have some fun names.

Written by Jessie Willms and Shelby Blackley

WTF is SEO?
