Who Should Block AI Bots?
Way back in August 2023, OpenAI revealed their web crawler, GPTBot, and in doing so allowed website owners to block its access in robots.txt, much as one might block Googlebot from certain sensitive or unhelpful sections of a website. A huge number of websites (as many as 48% in some segments) quickly took them up on this, according to studies by Originality.ai and the Reuters Institute at Oxford University. Google followed shortly after with “Google-Extended,” a separate robots.txt control that lets sites specifically block Google’s AI tools from some or all pages.
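For reference, a site-wide block of both is only a few lines in robots.txt. This is a minimal sketch, using the user-agent tokens that OpenAI and Google document; adapt it to your own setup:

# Block OpenAI's GPTBot from the whole site
User-agent: GPTBot
Disallow: /

# Opt the whole site out of Google's AI training via the Google-Extended token
User-agent: Google-Extended
Disallow: /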
There’s been some debate since. The discussion is ongoing at Moz, at our parent company, Ziff Davis, and in the SEO industry at large about how best to use this newfound ability to deny access to (some) AI tools and, indeed, whether it even has any practical impact. With the limited information available right now, particularly regarding the future path of this data and these tools, I don’t think there is a confident one-size-fits-all answer. In this post, I want to lay out the arguments, beliefs, premises, and business contexts that might lead you to block these bots, or not.
Firstly, though — does it actually make a difference what you do?
“They already have all my content anyway”
Perhaps. OpenAI have used various data sources in the past, only recently releasing (disclosing?) their own crawler. For example, Common Crawl data made up a huge chunk of the training data for GPT-3, and Common Crawl’s crawler, CCBot, is not the same as GPTBot. Few websites block CCBot, which (among the few people who have heard of it) is considered a fairly light touch in terms of server demands, with potentially wide-reaching benefits well beyond training AI models. In addition, if you block the newer AI-specific bots now, you are not deleting any content that they have already collected from your site.
As such, at worst, you are only slowing their access to new content that you publish. You may nonetheless believe that this new content has some sort of unique value, especially when it is at its freshest. (It’s no coincidence that it’s news sites that are currently tending to block.)
However, it may well be duplicated on scraper sites elsewhere. I suspect that the more sophisticated models do incorporate some kind of authority signal (possibly links!), so the scraper site may not be as trusted as your own site or recrawled as liberally or regularly. I cannot confirm this, though.
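Incidentally, if Common Crawl is part of your calculation, note that a GPTBot rule does nothing to CCBot; Common Crawl’s crawler has its own user-agent token and would need its own entry, along the lines of:

# Block Common Crawl's crawler (separate from GPTBot and Google-Extended)
User-agent: CCBot
Disallow: /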
“They don’t need my content”
You may believe that even if you act as part of a larger movement in your industry, AI bots will ultimately be able to produce content on the topics your site addresses just as well as you do, even without input from your newly published or recently updated pages.
If so, I would firstly suggest that this may be a rather damning indictment of the value proposition of any content-centric site, likely with severe implications for its ongoing viability in SEO, regardless of any future AI developments or lack thereof.
Nonetheless, it may be true — some industries suffer from a huge number of sites publishing ultimately very similar content. Some industries are also very unlikely to see widespread blocking. These are factors you will have to consider as part of your decision.
The case for unblocking
I will make three arguments for leaving your website totally unblocked for AI bots:
Present-day traffic acquisition
Present-day brand exposure
Future developments in LLM-powered tools
Present-day traffic acquisition
I recently saw Wil Reynolds speak at SMX Munich, and in his talk, he made the powerful case for ChatGPT (or similar tools) as a significant present-day or near-future acquisition channel. He’s outlined a similar argument in this blog post, along with a methodology for getting a feel for how affected your business might be. I recommend you check it out. (He's also speaking at MozCon in June.)
This will definitely vary from one business to the next. That said, my present experience is that:
ChatGPT is not used primarily as a search engine but instead as an assistant, especially for content creation, translation, and coding
ChatGPT usage is stable or declining, and a tiny fraction of Google’s usage
Bing’s market share increased less than 1% since integrating ChatGPT, so it doesn’t seem like users found Bing’s similar functionality to be a game changer
I’ve made the case elsewhere that I don’t think generative AI is a like-for-like replacement for search. It’s a different tool with different uses. But you should assess this for your business.
In the case of “Google-Extended,” we also have to consider whether we think this affects Google Search as an acquisition channel. For now, Google says not, a claim which some people are understandably skeptical about. Either way, this may change rapidly if and when Google introduces generative AI search features.
Present-day brand exposure
Also at SMX Munich, I saw Rand Fishkin (also speaking at MozCon in June) make the case that digital marketers get too hung up on attributing the bottom of the funnel, which is increasingly difficult, and should instead take a leaf out of the book of pre-web marketers who valued impressions, footfall, and similar “vanity metrics.” I agree! In fact, I wrote about it back in 2019. That post has since fallen victim to a slightly questionable site migration, but I’ve reuploaded it here.
On a similar basis, maybe I should care not only about whether ChatGPT (or indeed other LLM outputs, such as AI-written content) drives me traffic, but also simply about whether it mentions my brand and products, preferably in the same way I would.
If I prevent these models from accessing the pages where I talk about my products, and if I also subscribe to the argument above that preventing access does meaningfully affect what content the models can ingest, then I am making it less likely that they will be mentioned in an accurate way, or indeed at all.
This could be particularly impactful in a case where I launch a new product or rebrand — anything new will be ingested only via external sources, which may be less than positive, or, again, inaccurate.
Future developments in LLM-powered tools
What if we accept that current tools built on generative AI are not major acquisition channels? Will that always be the case? What if I block GPTBot now, and then in a year or two, OpenAI launches a search engine built on the index it has built?
Perhaps at that point, one might make a swift U-turn. Will it be swift enough? Often, these models are not exactly at Google’s level when it comes to quickly ingesting new content. Presumably, though, in order to be a competitive search engine, they would have to be? Or would they use Bing’s index and crawler? One might also make the argument that these models could use the (originality of?) content itself as an authority signal, as opposed to (for example) links, user signals, or branded search volume. Personally, I find that impractical and, as such, unlikely, but it’s all a big unknown, and that’s the point with all this — the uncertainty itself is not an attractive proposition.
On top of that, a search engine is only one (more likely) possibility — a few years ago, we would not have imagined ChatGPT would be as impactful as it has been.
The case for blocking
What, then, might motivate you to block these AI bots? Similarly to the case for unblocking, I think there are arguments here relating to present-day pragmatism and possible future developments.
Present-day content moats
Stalling for time
Future developments
Present-day content moats
The biggest threat that OpenAI’s models pose to Google and SEO today is not as a Google competitor but as a scalable content creation tool. This is highly disruptive in search, both making Google’s job harder and cannibalizing the traffic that might otherwise go to existing content.
If you are writing unique content containing new and interesting information, you potentially devalue that content by allowing it to contribute towards AI-written articles, videos, and tools elsewhere. Do you want new competitors built (partly) on your new and future content?
That said, as mentioned above — it may be scraped elsewhere, reposted, and then ingested by AI bots nonetheless. Then again, the AI systems may not trust these scraper sites as much as they would yours for a variety of reasons.
So, what are you achieving? Perhaps a slight degradation in the quality and freshness of content generated on topics relevant to your site. Maybe a reduced chance that your site will be directly cited, which you may consider a bad thing (if these tools are prospective acquisition channels) and/or a good thing (if you’re concerned about misrepresentation).
You may also appreciate the ethical or even legal precedent set by not giving permission for your content to be re-used in this way. This looks to be a rather different implied contract to that offered by search engines, which send far more traffic back to the sites they crawl and mostly do not rewrite content without attribution. It’s also a different contract to that posed by tools like Moz and our own bots; again, we will not repurpose your content.
Meaningfully affecting the quality of response offered by these tools would, in many cases, require collective action — not just one site blocking, but a plurality or even large majority of sites blocking. However, that does seem to be occurring in some verticals.
Stalling for time
Speaking of legal precedents — there are various ongoing legal cases involving OpenAI right now, perhaps most notably the New York Times lawsuit. This is a huge threat to many current applications of this technology, and for OpenAI as a company, it may be existential. Some publishers may feel that blocking now will delay the threats they perceive for long enough to see robust legal (and commercial) frameworks introduced.
Future developments
Just as it’s possible that in the future, we will see more acquisition channels powered by these AI bots, it’s also possible that we will see more threats powered in the same way. Deepfakes of your brand? Copycat products? These developments seem like a better fit for the strengths of this technology as it is now, and most brands would do what they can to sabotage the quality of such creations.
The case for a partial block
Can you have it both ways?
This is robots.txt, after all — as SEOs, we know it’s possible to be very tailored in what you might leave open or closed.
What if you want the benefits — brand exposure, your product being mentioned, and up-to-date information included in responses? But you don’t want the risks — contributing to content competitors, being misquoted, or reducing the unique value of your site.
Of course, it isn’t quite that simple. But your best bet, in this case, might be to leave open product sections of your site but close content sections (such as the Moz blog) and, of course, the same internal/logged-in pages that you probably block to Googlebot.
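As a rough sketch of what that might look like in robots.txt (the directory paths here are hypothetical, and this assumes the bots honor Allow and Disallow rules as OpenAI and Google document):

# Hypothetical layout: product pages stay open; blog and logged-in areas are closed
User-agent: GPTBot
Allow: /products/
Disallow: /blog/
Disallow: /account/

User-agent: Google-Extended
Allow: /products/
Disallow: /blog/
Disallow: /account/

The Allow lines are technically redundant (anything not disallowed is crawlable by default), but they make the intent explicit.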
This isn’t perfect — you are still leaving yourself open to some of both the risk of being left out of the conversation and the risk of fuelling AI-powered competitors.
Is it hypocritical to block AI bots whilst also leveraging generative AI?
Ironically, maybe less so than it was before these bots were blockable. In an ideal world, you might want to be able to use LLMs in your work whilst knowing that authors and creators who did not want their content used this way had been able to opt out.
The reality is that at this point, you’re still working mostly(?) with information ingested before August 2023, but that will change over time.
It depends
Ultimately, this is going to come down to a combination of what you think the future holds and what is most important and most impactful to your business.
You should not block AI bots if you believe that:
AI chatbots are, or will become, a notable acquisition channel (or some future LLM-based technology will).
And
These models will be more likely to reference your company in their responses if their training set or index includes the content you publish between now and then.
Or
It is better that the AI models have the latest information about your business, so even if they compete with your site as a source of information about your product, they are less likely to misrepresent you in the process.
You should block AI bots if you believe that:
You increase the threat posed by AI chatbots, competing content, or competing tools by allowing AI models to crawl your latest content.
Or
Your choice to block alongside a similar decision from your peers will reduce the viability of LLM-powered content or tools in your sector, buying you time until legal and commercial safeguards are in place.
In many cases, this will come down not only to your beliefs about how this technology will evolve but also to the specifics of the business involved. For example:
Are any products powered by these AI models currently an acquisition channel for you?
Is it more important for you to get mentioned (e.g., a startup) or to defend exclusive content (e.g., a news site)?
What is Moz doing?
Most Ziff Davis media brands (of which Moz is one) are publishing businesses, and like most other publishing businesses, they have blocked AI bots. Moz is considering a more nuanced approach, but this is still a topic of discussion!
What are you doing?
Personally, I’m surprised this issue is not more discussed, given the huge volume of sites choosing to block. What are your thoughts? Send us a tweet/thread/email/pigeon, and let us know!