What is thin or duplicate content?
Thin content refers to pages with negligible value, while duplicate content refers to pages that are, well, duplicated. We cover how to find both — plus how to fix and avoid this content in news SEO
Hello, and welcome back. Jessie here, back from a week of dogsitting. That included plenty of time in the dog park (for both of us), some pumpkin treats (just the pup) and losing our mind any time a neighbour dared to use the building’s elevator (only one of us, but you can guess who!).
This week: Thin and duplicate content, AKA, content we want to avoid. Thin content refers to pages with negligible value, while duplicate content refers to web pages that are too similar to others on the site. We’ll look at common news culprits, and what you can do to avoid having pages with limited value.
Thanks to everyone who attended our fall 2023 community call! We’ll be back in the new year with another!
Join our community of more than 1,450 news SEOs on Slack to chat any time.
Let’s get it.
In this issue:
What is thin content?
What is duplicate content?
What’s to be done about thin and duplicate content.
What is thin content?
Thin content refers to webpages that have little to no actual authentic information. They are low in substance, lack depth and expertise, possibly because they are auto-generated, AI-written or full of unoriginal scraped content.
In traditional SEO, these are often doorway pages: Low-quality affiliate pages or URLs that serve a ton of ads and pop-ups. Thin content can also include duplicate content (more on this later).
Thin content is one reason a website could suffer from a manual action, a form of penalty from Google that can result in your page — or site — not ranking.
Why it matters: Google wants to reward content that’s helpful and written by experts. Thin content is the opposite of what we want to create: pages with high-quality, useful information that clearly serve the reader's needs, demonstrate strong E.E.A.T signals and build topic authority.
Thin content should be avoided. It does not satisfy a reader's need, provide helpful information or fulfill a search intent. It can also contribute to keyword cannibalization — what happens when multiple pages on the same site compete against each other for traffic on a keyword, to the detriment of the overall performance.
Thin content is not good for readers. If it’s unhelpful, it’s likely to result in higher bounce rates, and it won’t generate backlinks or build topic authority.
Thin pages across a site can also bring down the overall authority.
We consulted Barry Adams, SEO consultant and author of the newsletter SEO for Google News, to help compile a list of common culprits in the news. They include:
Tag or category pages (with very few stories attached);
Author pages for writers with a limited archive and information;
Pages with only a video embed (no text content or transcript);
Article pages for interactives or interactive graphics (think: charts only, no accompanying text);
Photo gallery pages that lack an introduction or have no text/captions;
Data pages, such as sports scores or stock pages.
Note: These types of pages aren’t universally a problem. They might have limited text, but Google understands thin content within context.
For example: A score or stock page doesn't necessarily need a ton of text because the purpose of the page is not that of a news story. Google knows to take what it needs and move on.
For example: When possible, the content of an interactive should be readable in the HTML of the page. If it can’t be, include a 300ish-word introduction describing the story or interactive.
Don’t pad pages with unnecessary text to avoid the designation of thin. If the page fulfills a clear search intent, it should be fine.
Breaking news side note
Speed in breaking news matters. And while it would be great to publish immediately with a robust story, that’s often not possible. For breaking news that is highly searched, Google understands that the story is fluid. Publish a short (150 words) story, then update with more context and information.
Include a line that says the page will be updated (“This is a developing story. Please check back for updates.”).
Get the file online, then focus on providing helpful updates. Try to avoid publishing without a plan for updates. You want a solid file or live blog as quickly as possible.
If that's not possible, consider whether it's worth sacrificing the quick hit for a better piece of content that targets another branch of the storyline.
How to find thin content
To find thin content on your site, use an SEO tool like Botify, which will flag URLs that could be considered thin. Screaming Frog will scrape your site and sort the URLs by word count (short content is often an indicator of thin content).
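For a rough sense of what a word-count pass looks like, here is a minimal Python sketch. It is not how Botify or Screaming Frog work internally. The function names, the sample input shape (a mapping of URL to raw HTML) and the 150-word threshold are all assumptions for illustration, and a low count is only a prompt for a manual review, not a verdict.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def word_count(html):
    """Count whitespace-separated words in the visible text of a page."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return len(" ".join(parser.chunks).split())


def flag_short_pages(pages, min_words=150):
    """Return (url, word_count) pairs under min_words, shortest first.

    `pages` maps URL -> raw HTML. The threshold is an arbitrary assumption.
    """
    counts = [(url, word_count(html)) for url, html in pages.items()]
    return sorted((c for c in counts if c[1] < min_words), key=lambda c: c[1])
```

Score pages, stock pages and other intentionally short templates will show up in a list like this too, so treat the output as a starting point for the manual audit rather than a delete list.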
Manually auditing the URLs on your site is possible. Review each page for substance, tracking the pages that are thin on helpful information.
Finally, use Google Search Console to find pages with minimal traffic, and check whether any are subject to a manual action.
If a URL is underperforming based on your expectations, it might be because it’s thin. NBC’s Andrew Coco says to assess how your page compares to those in the top positions in SERPs. Google is rewarding that content for a reason — what does it have that your page is missing?
Analyze your potentially thin content and ask: Is this helpful? Does it demonstrate E.E.A.T and contribute to my publication’s topic authority? If the answer is no, lay out ways to improve the page.
How to fix thin content
Work to expand, merge, delete, redirect or rewrite the page.
Use keyword research to uncover additional questions you can answer and expand the content accordingly. Focus on using the story to demonstrate E.E.A.T and build topic authority by showing strong expertise.
If you have several thin pages on a shared topic, consider merging into a single resource (and redirect the unhelpful pages).
Redirect older thin pages to more useful resources.
Delete pages that provide little to no additional value.
If none of the above applies, consider blocking the page from crawling in robots.txt or noindex it. Some thin pages are needed for business reasons, but Google doesn’t need to spend time on them.
For thin tag pages: Tag pages should have at least five to 10 pieces of content linked from them. If not, update or remove the topic page. If a tag page is needed for another reason — but won’t be substantial — consider blocking it from crawling. Ensure topic pages have helpful, original title tags, meta descriptions and enough stories attached to the page. Here’s a guide to auditing tag pages.
Don’t be precious with thin content. Eliminate or update pages to put your site in line with your overall SEO best practices. Enhance thin content and improve your overall search authority.
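The tag page rule of thumb above (at least five stories per tag) is easy to check if you can export a list of stories and their tags from your CMS. The sketch below assumes a hypothetical export shaped as a mapping of story URL to a list of tags. The function names and the default threshold of five are illustrative assumptions.

```python
from collections import Counter


def count_stories_per_tag(stories):
    """Count how many distinct stories carry each tag.

    `stories` maps story URL -> list of tag names (a hypothetical CMS export).
    """
    counts = Counter()
    for tags in stories.values():
        counts.update(set(tags))  # set() so a repeated tag counts once per story
    return counts


def thin_tag_pages(stories, min_stories=5):
    """Return tags with fewer than min_stories stories, sorted by name."""
    counts = count_stories_per_tag(stories)
    return sorted(tag for tag, n in counts.items() if n < min_stories)
```

Tags that land on this list are candidates to update, remove or block from crawling, depending on whether the page is needed for another reason.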
What is duplicate content?
Duplicate content can harm your overall search performance. These pages confuse search engines about which pages to index and rank, eat up crawl budget and can be penalized for resembling plagiarism or being spammy.
Common duplicate content culprits:
Syndication: Content syndication refers to republishing content — most often, an exact copy — on one or more other websites.
If you’re republishing content from elsewhere on your site, it’s best practice to canonicalize and link back to the original.
For publications that allow their content to be republished (for example, ProPublica has a “Republish” button), ensure the reposted story is being canonicalized back to you.
Refer to NewzDash’s guide to syndicated content for more details.
For example, Yahoo! is a site that commonly syndicates other publications’ content.
Daily recurring news files: A series or feature published daily, with limited variation in the content. Since these are so similar, if they are poorly optimized, they could be flagged as duplicate content.
For example, Mashable’s daily Wordle file has the same basement every day. However, the file has a clear, unique headline and lede, plus — obviously — clues that are unique to that day’s Wordle. While highly templated, it’s still varied and fresh — not to mention, it serves a clear reader need (not losing your streak!).
For a daily recurring news feature, ensure the headline and content are unique in some way. Keep a consistent headline format, but include the date so it’s unique from the previous version. This will also help readers who want to look through the archive for a specific date.
Articles that create multiple versions when tagged to multiple category or topic pages, or internal links with parameters;
Region variants: Content that is the same information, but is located at multiple URLs or on multiple domains to cater to the local audience (e.g., content found on www.yourwebsite.com/us-news/ is also found on a second URL or domain).
To fix this issue, Barry Adams suggests picking a main domain for the story — the one with the strongest topic authority — and canonicalizing the other.
Print-friendly pages: These can create duplicate content. Block these URLs with robots.txt or a robots meta tag.
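If you block print-friendly URLs in robots.txt, it’s worth confirming the rule matches what you expect. Here is a small sketch using Python’s standard-library robots.txt parser. The `/print/` path and the sample URLs are hypothetical, and the standard-library parser uses simple prefix matching, so it won’t reflect every pattern Google supports.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: block everything under /print/ for all crawlers.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
"""


def is_crawlable(robots_txt, url, user_agent="Googlebot"):
    """Return True if the given robots.txt text allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://www.yourwebsite.com/print/story-1",
        "https://www.yourwebsite.com/news/story-1",
    ):
        print(url, "crawlable:", is_crawlable(ROBOTS_TXT, url))
```

Keep in mind that robots.txt controls crawling while a robots meta tag controls indexing, and a crawler can only see a meta tag on a page it is allowed to crawl. Pick one approach per URL pattern.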
How to find duplicate content
SEO tools like Botify and Screaming Frog can generate reports of all duplicate URLs. Or try looking at the title tags or H1s (headlines) of the pages: If there are two or more pages with the same headline, title tag or URL convention, it stands to reason they are duplicate pages.
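The headline comparison can be sketched in a few lines of Python. This assumes you already have a crawl export of URLs and their headlines or title tags; the mapping shape and function names are assumptions for illustration. It only catches exact matches after lowercasing and whitespace cleanup, so near-duplicates will still need a dedicated tool or a manual look.

```python
from collections import defaultdict


def normalize(headline):
    """Lowercase and collapse whitespace so trivial differences don't hide matches."""
    return " ".join(headline.lower().split())


def find_duplicate_headlines(pages):
    """Group URLs that share the same normalized headline.

    `pages` maps URL -> headline (or title tag). Returns only groups
    with two or more URLs.
    """
    groups = defaultdict(list)
    for url, headline in pages.items():
        groups[normalize(headline)].append(url)
    return {h: sorted(urls) for h, urls in groups.items() if len(urls) > 1}
```

Each group in the result is a set of URLs to compare side by side before deciding which version to keep, canonicalize or redirect.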
How to avoid duplicate content
Write unique content. Use canonical tags, block in robots.txt or noindex pages that don’t need to be in search to prevent duplicate content issues. For syndicated content, canonicalize the URL and consider a noindex tag, too.
For duplicate pages where the content is literally duplicated or is very, very similar, pick the stronger version and redirect the other file. If the page you redirect has some useful information, integrate it into the page you keep. It is better to have one really great piece of content than two mediocre files.
Here are Google’s canonicalization best practices, where the search engine explains in depth how to properly implement canonicals.
Canonical tags have been ignored by Google in the past. If syndication is a serious concern for your site, consider implementing a noindex tag across all syndicated versions.
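To spot-check a syndicated copy of one of your stories, you can look for two things in the page’s HTML: a canonical link pointing back to your original, and a robots meta tag with noindex. The sketch below does that with Python’s standard library. The class and function names and the example URLs are hypothetical, and it only reads tags in the HTML, not HTTP headers.

```python
from html.parser import HTMLParser


class HeadSignals(HTMLParser):
    """Record the canonical URL and whether a robots meta tag says noindex."""

    def __init__(self):
        super().__init__()
        self.canonical = None
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        attrs = {name: (value or "") for name, value in attrs}
        if tag == "link" and "canonical" in attrs.get("rel", "").lower().split():
            self.canonical = attrs.get("href") or None
        elif tag == "meta" and attrs.get("name", "").lower() == "robots":
            directives = [d.strip() for d in attrs.get("content", "").lower().split(",")]
            if "noindex" in directives:
                self.noindex = True


def check_syndicated_copy(html, original_url):
    """Report whether a syndicated page canonicalizes to original_url and is noindexed."""
    signals = HeadSignals()
    signals.feed(html)
    signals.close()
    return {
        "canonical_points_to_original": signals.canonical == original_url,
        "noindex": signals.noindex,
    }
```

The comparison is an exact string match, so a trailing slash or tracking parameter on the partner’s canonical will show up as a mismatch worth a second look.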
At the article level, consider all on-page SEO elements. Ensure that on-page elements (the headline, URL, meta description, introduction, lede and images) are unique.
Unique URL: Create a unique URL daily. If your URLs are date-based, you'll still want to differentiate the keyword portion from other similar files. Include the date (march-3), number of the puzzle (wordle-385) or another word to help vary the URL.
Unique headline/title tag: For daily news features (Today’s horoscope, daily markets stories, etc.), consider including the date in the headline.
Unique introduction or lede: Say something specific to that file. It will communicate to habitual readers and Google that it’s a fresh page.
Unique deck/subtitle: Find something original to say. This could be the date, an interesting fact about the daily puzzle or a stock of interest. Don’t waste the opportunity to tell Google and your readers what this page is about.
Change up the images: Attempt to avoid the same stock or wire photo every day. At minimum, write fresh captions and alt text for the image (especially if it’s used often).
Pro tip: Canva is free. Is it possible for your photo/graphics desk to create a set of composites or photo illustrations to cycle through? For example, a unique image with the day’s date and a different colour background for each month.
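If your CMS lets you template slugs and headlines, the unique URL and headline advice above can be automated. Here is a minimal sketch in Python. The function names, the slug pattern and the headline template are assumptions for illustration, and the puzzle number paired with the date in the usage example is made up, so adapt everything to your own URL conventions.

```python
from datetime import date

# Hardcoded so the output does not depend on the system locale.
MONTHS = (
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
)


def daily_slug(keyword, day, number=None):
    """Build a slug like 'wordle-385-march-3' from a keyword, date and optional number."""
    parts = [keyword.lower().replace(" ", "-")]
    if number is not None:
        parts.append(str(number))
    parts.append(f"{MONTHS[day.month - 1].lower()}-{day.day}")
    return "-".join(parts)


def daily_headline(template, day):
    """Fill a '{date}' placeholder in a headline template with a readable date."""
    readable = f"{MONTHS[day.month - 1]} {day.day}, {day.year}"
    return template.format(date=readable)


if __name__ == "__main__":
    today = date(2023, 3, 3)  # example date; the puzzle number below is hypothetical
    print(daily_slug("wordle", today, number=385))
    print(daily_headline("Wordle today: Hints for {date}", today))
```

A template only takes care of the URL and the date in the headline. The lede, deck, captions and alt text still need a human to say something specific to that day’s file.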
All of the above is easier said than done. Daily deadlines, limited resources and other newsroom demands are real limitations to your ability to make all content stand out. But if you’re putting in effort every day to create the content, also put in the effort to maximize its visibility.
The bottom line: Thin and duplicate content should be avoided. Make every page worthy of publication (and useful for your overall SEO strategy) by focusing on providing value, fulfilling search intent and writing custom on-page treats for every page.
#SPONSORED - The Classifieds
NewzDash: Revolutionize your workflow with SEO tips, AI recommendations, ranking alerts, content gap analysis and more. Elevate your content with NewzDash! 🚀
Get your company in front of more than 7,750 writers, editors and digital marketers working in news and publishing. Sponsor the WTF is SEO? newsletter!
🤖Google news and updates:
🛑 Now available: Google-Extended, a control for publishers to block Google from using your content to train their LLMs.
🤔Search Engine Journal: Google’s September 2023 Helpful Content Update has finished rolling out. Highlights include changing guidance on AI and third-party content hosting. (Related: The impact on travel blogs from Dan Taylor.)
✏️ Google: 50 training sessions for news publishers.
💬 Craig Harkins explains how to communicate Google updates to executives.
🧪SEMRush shared more than twenty examples of SEO A/B tests.
💾 Chris Moran spotted The Mirror’s use of AI in the production of content.
❓Attention pals: Send this TikTok to every parent, cousin and friend who asks, “What does an audience editor even do?”