By Tom Cranstoun
Let me share something fascinating: The Web is experiencing its most significant transformation since its beginning in the 1990s, shifting from human-centric design to a "robot-first" approach where AI systems are becoming primary consumers of web content.
While early web protocols helped manage human access across devices and restrict access by robots like search engine crawlers, today's websites actively court robot engagement for improved user experiences, automation, and to feed algorithms and language models. However, this shift brings challenges – from AI manipulation concerns to questions about data privacy and algorithmic bias.
Remember when "mobile-first" was the hot trend in web development? Well, get ready for "AI-first," or with all the confusion around AI as a term, let’s just call it what it really is: Robots-first.
New standards and protocols are emerging to help website owners manage AI interactions while preserving the web's core mission of democratized knowledge sharing. This article explores how these changes are reshaping web development strategy. It's all about:
Start planning for AI-ready content
Consider implementing markdown versions of your pages
Think about your AI interaction strategy
Prepare for increased AI traffic
Stay informed about evolving standards
How we got here
When Tim Berners-Lee invented the World Wide Web, human readers were his focus. The web was built around human interaction - reading pages that looked much the same across devices, following hypertext links, and sharing information between people.
The first automated web visitors were humble crawlers from early search engines like WebCrawler and AltaVista. To manage these digital visitors, websites relied on a simple but effective solution: robots.txt. This plain text file acted as a digital gatekeeper, letting site owners tell crawlers which areas to index and which to avoid. Need to keep an under-construction site or an employee portal private? Robots.txt provided an elegant solution - a kind of "Do Not Enter" sign for the robots of the late 1990s.
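For readers who never had to write one, a robots.txt entry from that era might have looked like this (the paths are illustrative):

```text
# robots.txt - a polite request, not an enforced rule
User-agent: *
Disallow: /under-construction/
Disallow: /intranet/
```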
This gentlemen's agreement between humans and robots worked well in the web's early days. But then Google transformed everything. Their PageRank algorithm showed that machines could do more than just catalog pages - they could understand relationships between content and use those links to assess quality. Search engine optimization (SEO) emerged as websites began optimizing for Google's increasingly sophisticated crawlers. Picture the early 2000s: companies hired SEO experts who stuffed keywords into hidden text, built elaborate link farms, and created doorway pages - all to climb Google's rankings.
Social media brought the next wave of automation. Twitter bots, Facebook's news feed algorithms, and recommendation engines started actively shaping how humans experienced the web. The relationship between humans and machines online grew more complex - bots weren't just reading content to appear in search results; they were influencing what humans saw and how they interacted.
The old social contract based on robots.txt is showing its age, and as others have reported (see The Verge: The text file that runs the internet), "the basic social contract of the web is falling apart."
Today, the Robots have improved
We are entering the age of Large Language Models and AI agents that don't just read or rank content - they comprehend it at a near-human level. When machines can understand content as well as humans can, we need new frameworks for managing their access to, and interaction with, web content.
Just as the early internet saw businesses scrambling to establish their first websites, we're now witnessing a similar rush to optimize for artificial intelligence. But this time, the stakes are much higher. Instead of just trying to rank higher in search results, organizations are now racing to shape how AI systems understand their brand and content.
The numbers tell a story: according to Vercel's analysis of AI crawler traffic (linked at the end of this article), AI crawlers like GPTBot and Claude now generate about 28% of Googlebot's traffic volume. This is a wake-up call. These AI systems aren't simply indexing content; they're learning from it, understanding it, and using it to generate new insights. While Google's infrastructure handles modern web architectures effectively, many AI crawlers struggle with JavaScript rendering and face high error rates - creating a fascinating technical challenge for organizations invested in headless content management systems.
If you are technically minded, you might recognize the resurgence of server-side rendering, but with greater consequences. This pattern isn't new: businesses once prioritized server-rendered, structured content for search engines until Google's advancements and its ability to execute JavaScript relaxed those requirements. However, the emergence of AI crawlers has reintroduced the need for specialized content formatting. Businesses now face a crucial decision: either remodel their infrastructure to cater to AI consumption or risk being overlooked by digital intelligence.
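To make the infrastructure question concrete, here is a minimal sketch (an illustration, not a recommended architecture) of how a site might route known AI crawlers to a pre-rendered snapshot instead of the JavaScript-heavy client-side app they cannot execute. The user-agent tokens and the renderStaticHtml helper are assumptions made for this example; check each vendor's documentation for the tokens it actually sends.

```typescript
import express, { Request, Response, NextFunction } from "express";

const app = express();

// Illustrative crawler user-agent tokens; verify against each vendor's docs.
const AI_CRAWLERS = [/GPTBot/i, /ClaudeBot/i, /PerplexityBot/i];

// Placeholder for a real SSR or prerender pipeline (Next.js, Astro, a
// prerender cache, ...). Here it simply returns a minimal static page.
function renderStaticHtml(path: string): string {
  return `<!doctype html><html><body><main><h1>Pre-rendered view of ${path}</h1></main></body></html>`;
}

app.use((req: Request, res: Response, next: NextFunction) => {
  const ua = req.get("user-agent") ?? "";
  if (AI_CRAWLERS.some((pattern) => pattern.test(ua))) {
    // Crawlers that cannot run JavaScript get server-rendered HTML.
    res.type("html").send(renderStaticHtml(req.path));
    return;
  }
  next(); // Everyone else gets the regular client-side application.
});

app.listen(3000);
```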
Market leaders aren't waiting to see how this plays out. With $13 billion already invested in generative AI search technology in 2023, a clear divide is emerging between early adopters and traditional operators. Some companies are going beyond simple optimization - they're creating sophisticated "AI honeypots" and "synthetic content networks" designed to influence how AI systems learn and understand their industry.
Why Your Website Needs to Speak AI's Language
The introduction of Large Language Models (LLMs) to the web requires the creation of tools and standards that allow these models to smoothly interact with websites and software, particularly those launched after the model's training data cutoff.
When a new restaurant opens in your city, it takes time for review sites and local guides to catch up. But if that restaurant provides a clear description of what it offers and how it operates, visitors can understand it immediately. Websites need something that serves the same purpose for AI agents: a way of saying "here is what this site does and how best to understand its content" without waiting for the next AI training cycle.
Vendors like Squarespace have already taken action, but rather than a set of different, proprietary ways to solve this, Jeremy Howard, former President and Chief Scientist of Kaggle, the largest AI and ML community, proposes a new industry standard: llms.txt.
llms.txt offers a structured format that lets LLMs comprehend and navigate websites and APIs in real time. While similar in simplicity to robots.txt, the proposed standard is designed specifically for the AI era, offering more nuanced guidance than the binary crawler permissions of its predecessor. AI companies like Anthropic are adopting the framework, which can be implemented alongside existing web standards.
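To make the format tangible, here is a minimal sketch of an llms.txt file along the lines of the proposal; the site, URLs, and descriptions are invented for illustration:

```markdown
# Example Bistro

> A neighbourhood restaurant in Copenhagen. We publish our menu, opening hours
> and ordering documentation for both human visitors and AI agents.

## Docs

- [Menu](https://example-bistro.dk/menu.md): Current dishes and prices
- [Ordering API](https://example-bistro.dk/api.md): How to place an order programmatically

## Optional

- [Our story](https://example-bistro.dk/about.md): Background on the restaurant
```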
Let’s look a bit closer at the potential value of llms.txt
Enhanced Website and API Interaction: llms.txt offers a richer and more dynamic representation of a website's content and functionality compared to traditional sitemap.xml files. This allows content creators to provide additional context and metadata that guide LLMs in accessing and utilizing the website's resources.
Bridging the Training Data Gap: LLMs are typically trained on a static snapshot of the web, which means they may not be aware of new websites, features, or changes to existing ones. llms.txt acts as a bridge, providing up-to-date information and instructions that enable LLMs to interact with these changes.
Facilitating AI: By adopting llms.txt, website developers and API providers can create a more AI-friendly web, where LLMs can easily discover, access, and utilize information and services. This has the potential to unlock new possibilities for automation, AI agents with actions, and personalized user experiences.
Filling in the gaps: when you have a headless website built with React, Vue, or similar technologies, llms.txt and any accompanying markdown files may help AI crawlers understand your content.
A markdown representation enables AI to understand dynamic pages by providing clean, machine-readable content. This matters because most AI crawlers cannot execute the JavaScript behind headless pages, and current web pages are often cluttered with ads, teasers, and analytics that confuse AI. The markdown representation tells the AI what is actually happening on the page, making it easier to process and understand.
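One pattern discussed around the llms.txt proposal is to publish a markdown twin of each page, for example at the same URL with .md appended. The file below is a purely illustrative sketch of what such a twin might contain:

```markdown
# Menu

Seasonal dishes, served daily. Prices in DKK.

- Sourdough starter plate: 65
- Smørrebrød of the day: 85
- Dessert of the week: 55

Opening hours: Tuesday to Sunday, 12:00 to 22:00.
```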
Schema markup, including JSON-LD, enhances content fragments for AI and for search engines like Google, while also improving the user experience through extracted fragments (such as rich results) that provide greater value to both humans and machines.
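As a small example, an article page might embed a JSON-LD block like the following; the vocabulary comes from schema.org and the values are placeholders:

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "The Dawn of the Robot-First Web",
  "author": { "@type": "Person", "name": "Tom Cranstoun" },
  "description": "How AI crawlers are changing the way websites are built and consumed."
}
</script>
```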
We're not just changing how machines find our content - we're influencing how they understand our world. For businesses, the question isn't whether to join this AI gold rush, but how to do it responsibly while maintaining authentic value for human visitors.
The infancy of generative AI search adoption can be attributed to several factors. These include the need for further technological advancements, user hesitation due to unfamiliarity with the technology, and the time it takes for businesses and individuals to adapt to new search paradigms. AI changes rapidly, faster than most of us can keep up with, and as these barriers drop, more users will switch to AI search. Will you be ready for that move?
But here's the catch: it's built on the same honor system as robots.txt, which is already showing cracks. Just like its predecessor, llms.txt relies on AI crawlers and agents choosing to play nice – and with billions of dollars at stake in the AI race, that's a big ask.
The Dawn of the Robot-First Web
The concept of a Robot-First Web represents a shift in how we design and interact with the World Wide Web. It's a reimagining of the web's architecture. This shift signifies a move from a model of exclusion to one of inclusion, where information is proactively shared with automated systems.
By embracing a Robot-First approach, websites can optimize their content for machine learning algorithms, enabling more effective and nuanced indexing, search, and analysis. This has far-reaching implications for how we access and utilize information online, potentially leading to more personalized, relevant, and efficient user experiences.
Just as mobile-first design transformed web development, the Robot-First Web represents a crucial inflection point. Unlike the mobile revolution, which changed how content is presented, this transformation affects how the web is fundamentally consumed and understood.
Still, it’s a double-edged sword with several opportunities such as:
Enhanced AI comprehension and search relevance
Improved automated interactions
More efficient and personalized user experiences
Future-proofing web presence
Potential for innovative AI-driven services
And then there are also quite a few challenges:
Implementation complexity and maintenance burden
Risk of misuse and manipulation
Privacy and security considerations
Potential digital divide between AI-ready and AI-unfriendly content
Balance between machine optimization and human experience
Look Beyond Technical Specifications
While the proposed new standard, llms.txt, provides a framework for the transition to robots-first, it represents just one piece of a larger transformation.
The core challenge is rethinking web content strategy for an era where machines are becoming equal consumers alongside humans.
My recommendation is that you consider these three practical next steps:
1. Audit Current Content
Evaluate AI readability
Assess current technical infrastructure
Identify key content for AI optimization
2. Strategic Planning
Evaluate early adoption trade-offs
Develop AI interaction strategy
Consider ethical implications
3. Technical Implementation
Experiment with markdown versions of key content
Monitor llms.txt evolution
Implement structured data and schema markup
Maintain balance between human and AI accessibility
The Future is Hybrid
The web is evolving into a hybrid human-AI environment, but this shouldn't mean sacrificing human experience for machine readability. Success lies in serving both audiences effectively while maintaining content integrity and user value.
The question isn't whether to prepare for this change, but how to do so thoughtfully and systematically. As we shape the Robot-First Web, our collective approach will determine its impact on the future of digital interaction.
Ready to get started? The future of the web is being written in Markdown, JSON-LD, and llms.txt, one file at a time.
The key question isn't if, but when, to make your content AI-ready. Early adopters of AI-compatible web development will gain a significant edge.
Learn more about the robot-first web
Vercel has written an excellent piece on The rise of the AI crawler. For more details on llms.txt, see the piece by Towards Data Science on LLMs.txt Explained, and you can also browse the llms.txt directory. You can also take a closer look at Anthropic's implementation of llms.txt.
I've also written a comprehensive guide to creating an llms.txt file. For more on AI, I wrote a piece on Selecting an AI model that works for you in May 2024.
The conversation naturally continues at our conferences and groups. Coming up first is CMS Kickoff 25, held in St. Pete, Florida on January 14 - 15, where you can also meet Tom in person alongside several other digital leaders: analysts, vendors, agencies, and practitioners. Tom will also be at CMS Summit in Frankfurt in May.
Local CMS Experts meetings around North America and Europe are also coming up, and we continue our ambitious and curious learning program, including regular member calls, so that we can collaborate and build a better Web together.