
Googlebot File Limits for Crawling – Impact on Crawl Budget and Site Performance

Many organisations are now grappling with the implications of Google’s explicit documentation of file size limits for its crawling infrastructure. While these limits aren’t new, their formal acknowledgement necessitates a re-evaluation of technical SEO strategies, particularly for sites with complex architectures or extensive content. Understanding these constraints is crucial for optimising crawl budget and ensuring that critical content is indexed effectively, thereby safeguarding organic visibility and revenue. This article breaks down the implications for senior decision makers.

What changed in Google’s Googlebot file limit documentation

In early 2024, Google updated its documentation to explicitly outline file size limits for its crawling infrastructure.[1] This wasn’t a change in crawler behaviour – these limits have existed for years – but rather a clarification of existing constraints that many site owners had encountered without formal documentation.[2]

The update consolidated information about the 15MB general crawler limit and the 2MB limit for HTML files processed by Googlebot for Search.[1] Previously, these limits were mentioned sporadically across various Google resources, leading to confusion about which constraints applied in which contexts. The documentation now provides clear guidance on how different file types are treated and what happens when content exceeds these thresholds.

Many senior decision makers have been making infrastructure investments without understanding these fundamental constraints. Sites with extensive financial guides, interactive calculators, or legislative content have been unknowingly pushing against these limits, potentially compromising their search visibility. The documentation update provides the clarity needed to make informed technical decisions.

The timing of this clarification is particularly relevant as websites become increasingly complex. Single-page applications, JavaScript-heavy interfaces, and embedded data objects have become standard practice, often without consideration for how these architectural choices affect crawlability. Understanding these limits allows you to assess whether your current technical approach aligns with how search engines actually process your content.

With the clarification of Googlebot’s file size limits now established, it’s crucial to delve into the specifics of these constraints and their implications for your website’s crawlability and indexability.

Understanding Googlebot’s file size limits: the complete breakdown

The file size constraints operate at two distinct levels, each serving different purposes within Google’s crawling infrastructure. The 15MB limit applies to Google’s general crawling and fetching systems – the broader infrastructure that supports various Google services beyond just Search. When any Google crawler encounters a file, it will fetch up to 15MB of that resource before truncating the download.[3]

For HTML files specifically indexed by Googlebot for Search, a stricter 2MB limit applies.[3] This is the threshold that directly impacts your search visibility. Content beyond this point may not be processed for ranking purposes, regardless of whether it was successfully fetched. This distinction between fetching and indexing is where many technical SEO strategies falter – teams optimise for download speed without considering whether their critical content falls within the indexable range.

PDF files receive different treatment, with a 64MB limit reflecting their nature as self-contained documents.[3] This higher threshold acknowledges that PDFs often serve as comprehensive resources – annual reports, technical specifications, regulatory filings – where extensive content is expected and appropriate.

These limits apply to uncompressed file sizes. Your server may deliver a 500KB compressed HTML file, but if it expands to 2.5MB when decompressed, you’ve exceeded the indexing limit. This catches many teams off guard, particularly those who’ve invested heavily in compression without auditing their actual HTML payload.

The technical reasoning behind different limits reflects how Google processes various content types. HTML requires parsing, rendering, and link extraction – computationally expensive operations that become increasingly costly with larger files. PDFs, whilst also requiring text extraction, follow more predictable processing patterns that scale better with file size.

Now that we’ve clarified the specific file size limits, it’s important to understand how Googlebot processes web pages and how these limits are enforced during the crawling and rendering phases.

How Googlebot crawls and renders web pages

Googlebot operates through a two-stage process that determines what content ultimately appears in search results. During the crawling phase, the bot downloads your HTML and adds discovered URLs to its queue. This is where the 15MB general limit applies – if your initial HTML exceeds this threshold, the download may be truncated, potentially missing links to important sections of your site.

The rendering phase follows, where Googlebot executes JavaScript, applies CSS, and constructs the final page state. For JavaScript-heavy sites, this distinction becomes critical. Your initial HTML might be lean, but after JavaScript execution, the rendered output could balloon beyond the 2MB indexing limit. This is particularly relevant for React, Vue, or Angular applications where much of the content is generated client-side.

Google primarily uses its smartphone crawler for indexing, reflecting the mobile-first approach that now dominates search.[4] Both smartphone and desktop variants respect the same file size constraints, but the mobile crawler’s behaviour should guide your optimisation priorities. If your mobile experience generates larger HTML payloads than desktop – perhaps through different JavaScript bundles or responsive content – you’re optimising for the wrong target.

With two distinct limits in play, it’s easy to conflate them. Let’s clarify the difference between the 15MB and 2MB thresholds and what each means for your SEO efforts.

The 15MB versus 2MB confusion: what you need to know

The existence of two different limits creates understandable confusion, but the distinction is straightforward once you understand the context. The 15MB threshold governs Google’s general crawling infrastructure – the systems that fetch content across all Google services. Think of this as the absolute maximum that any Google crawler will attempt to download.

The 2MB limit specifically constrains what Googlebot for Search will process for indexing purposes.[3] This is the number that matters for your search visibility. Whilst Google’s crawlers might fetch up to 15MB of your HTML, only the first 2MB receives full consideration for ranking. Content, structured data, or internal links beyond this point may not influence your search performance.

For SEO purposes, focus on the 2MB threshold. Ensuring your HTML stays under 15MB is good practice for general crawlability, but the 2MB limit directly determines whether your critical content reaches the index. This means prioritising above-the-fold content, ensuring key schema markup appears early in your HTML, and placing important internal links within the indexable range.

Having clarified the distinction between the 15MB and 2MB limits, let’s examine the file size limits for specific file types and address some special cases that may affect your website.

File type specific limits and special cases

PDF files receive the 64MB limit because they typically function as complete documents rather than navigational pages.[3] Financial reports, technical manuals, and regulatory filings often require this additional capacity. This higher limit shouldn’t encourage complacency – a 60MB PDF still creates poor user experience and may face partial indexing if the most relevant content appears late in the document.

Images, videos, and external resources don’t count towards your HTML file size. These assets are fetched separately and subject to their own constraints.[5] This means a page with dozens of high-resolution images won’t breach the 2MB limit, provided those images are referenced via standard img tags rather than embedded as data URIs.

Data URIs and inline resources do contribute to HTML file size. When you embed an image as a base64-encoded string within your HTML, that encoded data counts towards your 2MB budget. Similarly, inline CSS and JavaScript increase your HTML payload. This is where well-intentioned performance optimisations can backfire – embedding small assets to reduce HTTP requests may push you over the indexing threshold.
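
As a back-of-envelope illustration, the short TypeScript sketch below estimates how much HTML budget an inlined image consumes; it relies only on standard base64 arithmetic, and the 100 KB figure is an arbitrary example.

```typescript
// Back-of-envelope sketch: how much HTML budget a base64 data URI consumes.
// Base64 emits 4 output characters for every 3 input bytes.
function dataUriSize(binaryBytes: number, mimeType = "image/png"): number {
  const base64Chars = 4 * Math.ceil(binaryBytes / 3);
  return `data:${mimeType};base64,`.length + base64Chars;
}

// A 100 KB icon inlined as a data URI costs roughly 133 KB of HTML markup,
// about a third more than the binary file it replaces.
console.log(dataUriSize(100 * 1024)); // ≈ 136,558 characters
```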

Embedded JSON-LD structured data also counts towards your HTML size. Sites implementing extensive schema markup, particularly e-commerce platforms with detailed product schemas, can inadvertently consume significant portions of their 2MB budget. This doesn’t mean avoiding structured data – it means implementing it efficiently and considering whether all schema properties genuinely add value.
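
One way to keep schema payloads visible during development is to measure them before they reach a template. The TypeScript sketch below does this in Node (Buffer is a Node global); the product object is purely illustrative, not a schema recommendation.

```typescript
// Sketch: measure how many bytes a JSON-LD block adds to the HTML payload.
const productSchema = {
  "@context": "https://schema.org",
  "@type": "Product",
  name: "Example product",
  offers: { "@type": "Offer", price: "19.99", priceCurrency: "GBP" },
};

const jsonLd = `<script type="application/ld+json">${JSON.stringify(productSchema)}</script>`;

// Byte length is what counts against the uncompressed HTML limit; multiply by the
// number of schema blocks on a listing page to gauge the cumulative cost.
console.log(`JSON-LD adds ${Buffer.byteLength(jsonLd, "utf8")} bytes`);
```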

Now that you understand the file size limits and their nuances, let’s explore how to check if your pages are exceeding these limits and identify potential issues.

How to check if your pages exceed Googlebot file limits

Chrome DevTools provides the most accessible method for checking file sizes. Open DevTools, navigate to the Network tab, and reload your page. The initial HTML document appears as the first entry. Enable “Big request rows” (or hover over the Size column) to see both the transferred size (compressed) and the resource size (uncompressed). Focus on the uncompressed figure – this is what Googlebot evaluates against the 2MB limit.[5]

For enterprise-scale auditing, Screaming Frog SEO Spider offers comprehensive site-wide analysis. Configure it to crawl your site and examine the Size column, which displays uncompressed HTML file sizes. You can filter results to identify pages approaching or exceeding thresholds, then export this data for prioritisation. This approach scales to sites with thousands of pages, providing the visibility needed for informed decision-making.

Google Search Console’s URL Inspection tool offers indirect evidence of file size issues. Whilst it doesn’t report file sizes directly, errors like “Page partially rendered” or indexing warnings may indicate that content exceeded processing limits. Use this alongside direct measurement tools to build a complete picture of your crawlability.

For ongoing monitoring, implement automated checks within your development workflow. Configure your crawling tool to run regular audits and alert you when pages breach predefined thresholds. This prevents file size issues from reaching production, where they can impact search visibility before you’re aware of the problem.
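
A minimal sketch of such a check is shown below in TypeScript, assuming Node 18+ for the global fetch API; the URL is a placeholder and the threshold mirrors the 2MB indexing limit discussed above.

```typescript
// Minimal automated page-size check (assumes Node 18+ for global fetch).
const LIMIT_BYTES = 2 * 1024 * 1024;

async function checkHtmlSize(url: string): Promise<void> {
  const response = await fetch(url);
  const html = await response.text(); // fetch decompresses gzip/brotli transparently
  const bytes = Buffer.byteLength(html, "utf8");

  console.log(`${url}: ${(bytes / 1024).toFixed(0)} KB uncompressed`);
  if (bytes > LIMIT_BYTES) {
    console.warn(`${url} exceeds the 2 MB indexing threshold`);
  }
}

checkHtmlSize("https://www.example.com/").catch(console.error);
```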

While checking your file sizes is essential, it’s equally important to understand when these limits actually pose a risk to your website’s performance. Let’s explore the scenarios where file size limits can have a tangible impact.

When file size limits actually impact your website

Most websites operate comfortably below the 2MB threshold. Standard corporate sites, blogs, and even many e-commerce product pages rarely approach this limit. The risk emerges in specific scenarios where content density or technical architecture creates unusually large HTML payloads.

Financial services sites publishing comprehensive guides face genuine risk. A detailed retirement planning resource with embedded calculators, comparison tables, and extensive explanatory text can easily exceed 2MB, particularly if the content isn’t carefully structured. If your key differentiating information appears in the latter sections of such guides, it may not reach the index.

Single-page applications built with modern JavaScript frameworks warrant particular attention. These applications often bundle significant JavaScript code within the initial HTML payload. A React application with poor code-splitting might deliver 3MB of JavaScript in the initial load, immediately exceeding the indexing limit before any actual content is considered.

Legislative or regulatory content presents similar challenges. Sites hosting complete legal documents, compliance guidelines, or policy frameworks often create extremely long pages. Whilst this serves user needs – providing complete information in one location – it can compromise search visibility if critical sections fall beyond the indexable range.

E-commerce sites with extensive faceted navigation sometimes embed large JSON objects containing product data for client-side filtering. This approach improves user experience but can consume your entire 2MB budget before the actual product content is considered. The result is a site that functions well for users who arrive directly but struggles to attract organic traffic because key content isn’t indexed.

Now that we’ve identified the scenarios where file size limits matter most, let’s examine the common culprits behind bloated HTML files.

Common causes of bloated HTML files

Inline JavaScript and CSS represent the most frequent culprits. Developers often embed scripts directly in HTML for convenience or to reduce HTTP requests, not realising the cumulative impact on file size. A few kilobytes of inline code per component quickly accumulates across a complex page, consuming budget that should be reserved for actual content.

Embedded data objects, particularly in e-commerce contexts, can dramatically inflate HTML size. Sites that embed entire product catalogues as JSON for client-side rendering may include megabytes of data that users never see. This approach prioritises a specific technical architecture over crawlability, often without conscious recognition of the trade-off.

Excessive inline SVG usage contributes to bloat, particularly on pages with numerous icons or decorative elements. Whilst SVGs offer scalability and styling flexibility, embedding them directly in HTML means every page load includes the complete SVG markup. Referencing external SVG files or using sprite sheets provides the same visual result with significantly reduced HTML payload.

Unminified code remains surprisingly common, even on production sites. Whitespace, comments, and verbose variable names serve important purposes during development but offer no value in production. The cumulative impact of unminified HTML, CSS, and JavaScript can add hundreds of kilobytes to your file size – budget that could be allocated to actual content.

Having identified the common causes of bloated HTML files, let’s explore actionable strategies for optimising pages that approach the 2MB limit and ensuring they remain crawlable and indexable.

Optimising pages that approach the 2MB limit

Externalising scripts and styles delivers immediate impact with minimal implementation complexity. Move inline CSS to external stylesheets and inline JavaScript to separate files. This not only reduces HTML file size but enables browser caching, improving performance for returning visitors. For most sites, this single change can recover hundreds of kilobytes of indexable space.

Code splitting transforms how JavaScript-heavy applications approach the file size challenge. Rather than delivering your entire application bundle in the initial load, split code into smaller chunks loaded on demand. Modern build tools like Webpack and Rollup make this relatively straightforward, allowing you to defer non-critical functionality until after the initial render. This keeps your initial HTML lean whilst maintaining full functionality.
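
A minimal TypeScript sketch of the pattern is shown below; the "./charting" module and renderCharts function are hypothetical stand-ins for any heavy, non-critical feature.

```typescript
// Route-level code splitting via dynamic import(): bundlers such as Webpack,
// Rollup, and Vite emit the imported module as a separate chunk, keeping it
// out of the initial payload.
async function openDashboard(): Promise<void> {
  // The heavy charting code is only fetched when the user actually needs it.
  const { renderCharts } = await import("./charting");
  renderCharts(document.getElementById("dashboard")!);
}

document.getElementById("dashboard-tab")?.addEventListener("click", openDashboard);
```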

Lazy loading defers the loading of below-the-fold content until users scroll to it. The loading="lazy" attribute on images and iframes provides native browser support for this pattern. For more complex scenarios, the Intersection Observer API enables sophisticated lazy loading strategies that balance user experience with initial payload size.
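
Where the native attribute isn’t enough, a typical Intersection Observer approach looks roughly like the TypeScript sketch below, assuming each deferred image carries a data-src attribute holding its real URL.

```typescript
// Lazy load below-the-fold images once they enter the viewport.
const lazyObserver = new IntersectionObserver((entries, observer) => {
  for (const entry of entries) {
    if (!entry.isIntersecting) continue;
    const img = entry.target as HTMLImageElement;
    img.src = img.dataset.src ?? ""; // swap in the real source on first view
    observer.unobserve(img);         // each image only needs loading once
  }
});

document.querySelectorAll<HTMLImageElement>("img[data-src]").forEach((img) => lazyObserver.observe(img));
```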

Audit your data URIs and replace them with external references. Whilst embedding small images as base64 strings can reduce HTTP requests, the encoded data is significantly larger than the original image and counts towards your HTML budget. External image references don’t contribute to HTML file size and benefit from browser caching.

For genuinely long-form content, consider breaking pages into logical sections. Rather than publishing a 5,000-word guide as a single page, create a hub page linking to detailed subsections. This improves crawlability, enhances internal linking structure, and ensures each page stays well within file size limits. Users can still access complete information, but the architecture better aligns with how search engines process content.

With strategies for optimising individual pages in hand, it’s worth stepping back to see how file size affects Google’s crawl budget and overall site performance.

Understanding Google crawl budget and file size implications

Crawl budget represents the number of URLs Googlebot will crawl on your site within a given timeframe. It’s determined by your server’s capacity to handle requests and Google’s assessment of your content’s value and update frequency. Whilst Google doesn’t publish specific crawl budget allocations, the constraint is real and impacts how quickly new or updated content reaches the index.

File size directly affects crawl efficiency. Larger files consume more time to download and process, reducing the number of pages Googlebot can crawl within its allocated budget. If your average page size is 1.5MB versus 500KB, Googlebot can crawl three times as many pages in the same timeframe with the smaller files. For large sites, this difference determines whether important content is discovered and indexed promptly.

The relationship between file size and server response time compounds this effect. Slow servers delivering large files create a particularly inefficient crawling scenario. Googlebot spends more time waiting for each response and then more time processing larger payloads. Optimising both server performance and file size maximises the value extracted from your crawl budget.

Crawl budget optimisation matters most for large e-commerce sites with extensive product catalogues, news sites publishing high volumes of time-sensitive content, and sites with complex faceted navigation creating numerous URL variations. For these sites, efficient crawling directly impacts revenue – products that aren’t indexed can’t be found, and news that’s indexed slowly misses its relevance window.

While crawl budget is a critical consideration, it’s important to remember the specific case of PDF files and the 64MB exception that applies to them.

PDF files and the 64MB exception

The 64MB limit for PDFs reflects their role as comprehensive, self-contained documents.[3] Annual reports, technical specifications, and research papers legitimately require this capacity. This higher limit shouldn’t encourage publishing unwieldy documents that compromise user experience.

PDFs are rarely the optimal format for SEO purposes. HTML provides superior control over on-page optimisation, structured data implementation, and internal linking. PDFs should be reserved for content that users genuinely need to download – resources they’ll reference offline or print. If your content is primarily consumed online, HTML almost always serves users and search engines better.

When PDFs are appropriate, optimise them properly. Ensure text is selectable rather than scanned images. Compress images within the PDF to reduce file size. Add metadata including title, author, and description to help search engines understand the content. Use descriptive filenames that include relevant terms without keyword stuffing.

Link to PDFs from relevant HTML pages to aid discovery. A PDF sitting in your file system without incoming links may never be crawled. Create HTML landing pages that describe the PDF content and provide context, then link to the PDF as a downloadable resource. This approach serves both users and search engines more effectively than relying on direct PDF indexing.

Beyond PDFs, it’s essential to consider how JavaScript rendering interacts with crawl limits, particularly for modern web applications.

JavaScript rendering and crawl limit interactions

File size limits apply to both initial HTML and the resources fetched during rendering. For JavaScript-heavy sites, this creates a two-stage challenge. Your initial HTML must stay under 2MB, and the JavaScript files loaded during rendering must also respect size constraints. A lean initial HTML that loads a 5MB JavaScript bundle hasn’t solved the problem – it’s simply moved it.

The two-wave indexing process for JavaScript sites means content only visible after JavaScript execution may not be indexed immediately.[6] Googlebot first indexes the raw HTML, then returns to render and index the JavaScript-generated content. Critical content should be present in the initial HTML, even in minimal form, to ensure first-wave indexing. Relying entirely on JavaScript rendering delays indexing and risks incomplete content processing.

Single-page applications built with React, Vue, or Angular require particular attention to these constraints. These frameworks often generate large initial JavaScript bundles that can exceed file size limits. Server-side rendering addresses this by generating complete HTML on the server, ensuring content is immediately visible to crawlers, though it introduces architectural complexity and may not be necessary for every application.
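
As an illustration only, a minimal server-rendered route might look like the TypeScript sketch below, assuming an Express server and React’s renderToString; ProductPage is a hypothetical component, and other frameworks offer equivalent APIs.

```typescript
// Minimal server-side rendering sketch: the product content is present in the HTML
// that crawlers fetch, rather than appearing only after client-side JavaScript runs.
import express from "express";
import { createElement } from "react";
import { renderToString } from "react-dom/server";
import { ProductPage } from "./ProductPage"; // hypothetical component

const app = express();

app.get("/products/:id", (req, res) => {
  const markup = renderToString(createElement(ProductPage, { id: req.params.id }));
  res.send(`<!doctype html><html><body><div id="root">${markup}</div></body></html>`);
});

app.listen(3000);
```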

For sites that must use client-side rendering, aggressive code splitting becomes essential. Load only the JavaScript required for initial render, deferring additional functionality until after the page is interactive. This keeps both HTML and initial JavaScript payloads within acceptable limits whilst maintaining full application functionality.

To effectively manage file sizes at scale, particularly for enterprise sites, it’s crucial to implement robust monitoring and maintenance strategies.

Monitoring and maintaining compliance at scale

Enterprise sites require automated monitoring systems that provide continuous visibility into file sizes across thousands of pages. Manual audits don’t scale and can’t catch issues introduced by ongoing development. Implement monitoring solutions that crawl your site regularly, track file size trends, and alert you to violations before they impact search visibility.

Configure automated alerts for file size thresholds. When a page exceeds 1.8MB – leaving a safety margin below the 2MB limit – your team should receive immediate notification. This enables proactive remediation rather than discovering issues through declining search performance. Set different alert thresholds for different page types, recognising that product pages and content pages have different typical sizes.

Integrate file size checks into your CI/CD pipeline to prevent issues reaching production. Automated tests should verify that newly deployed pages respect file size budgets, failing the deployment if violations are detected. This shifts quality control left in your development process, catching problems when they’re cheapest to fix.

Establish file size budgets for development teams, providing clear guidelines for maximum allowable sizes for different page types. A product page might have a 1.5MB budget, whilst a category page receives 1MB. These budgets should be documented, monitored, and enforced through automated tooling. Regular training ensures developers understand why these constraints exist and how their architectural decisions impact crawlability.
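
One way to wire such budgets into a pipeline step is sketched below in TypeScript (Node 18+ assumed for fetch); the page types, URLs, and budget figures are placeholders to replace with your own, and the non-zero exit code fails the build.

```typescript
// Illustrative CI budget check: fail the pipeline when a page breaches its budget.
const budgets = [
  { type: "product", url: "https://www.example.com/products/sample", maxBytes: 1.5 * 1024 * 1024 },
  { type: "category", url: "https://www.example.com/category/sample", maxBytes: 1.0 * 1024 * 1024 },
];

async function run(): Promise<void> {
  for (const { type, url, maxBytes } of budgets) {
    const html = await (await fetch(url)).text();
    const bytes = Buffer.byteLength(html, "utf8");
    if (bytes > maxBytes) {
      console.error(`${type} page ${url}: ${bytes} bytes exceeds its ${maxBytes} byte budget`);
      process.exitCode = 1; // non-zero exit fails the pipeline step
    }
  }
}

run();
```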

With a solid understanding of monitoring and maintenance, it’s time to address some common myths and misconceptions surrounding Googlebot file limits.

Common myths and misconceptions about Googlebot file limits

The most persistent myth is that these limits represent new restrictions requiring immediate action. These constraints have existed for years – Google simply clarified existing behaviour in its documentation.[2] If your site has been performing well in search, you’re likely already compliant. Panic-driven infrastructure changes based on misunderstanding this update waste resources and introduce unnecessary risk.

Another misconception is that images and videos count towards the HTML file size limit. They don’t – these assets are fetched separately – yet the belief leads to misguided optimisation efforts, with teams compressing images that have no bearing on the relevant constraint. Focus optimisation efforts on the actual HTML payload: inline scripts, embedded data, and the markup itself.

Pages exceeding 2MB won’t automatically fail to rank. If your critical content, structured data, and internal links appear within the first 2MB, you may see no impact. The issue arises when important elements fall beyond the indexable range. A 3MB page with all key content in the first megabyte faces less risk than a 2.1MB page with critical information at the end.

Not every site requires immediate optimisation. Run an audit to understand your current state before investing in changes. Most sites operate comfortably below limits and can address file size as part of routine maintenance rather than emergency remediation. Focus resources on pages that actually approach or exceed thresholds.

As search technology continues to evolve, it’s crucial to future-proof your website by anticipating changes and adapting your SEO strategies accordingly.

Future-proofing for evolving search technologies

Search technology continues evolving towards more efficient content processing. AI-powered search experiences and generative results, driven by large language models, require search engines to process larger volumes of content more quickly. This trend suggests that file size efficiency will become increasingly important, even if the specific limits don’t change.

Lean, well-structured HTML is more easily processed by AI systems that power features like AI Overviews. These systems need to extract meaning and context from your content quickly. Bloated HTML with excessive markup and embedded data creates unnecessary processing overhead, potentially reducing your content’s inclusion in AI-generated results.

Sustainable technical SEO practices focus on fundamentals that remain relevant regardless of specific algorithm changes. Prioritise Core Web Vitals, implement structured data efficiently, and create content that serves user needs. These practices improve crawlability, enhance user experience, and position your site to adapt as search technology evolves. For a deeper dive into how these technical SEO best practices aid AI in parsing your content effectively, explore Answer Engine Optimisation (AEO).

Monitor developments in search technology and adjust your approach accordingly. Google’s documentation updates, industry research, and performance data from your own site provide signals about where to focus optimisation efforts. Maintain flexibility in your technical architecture to accommodate future changes without requiring complete rebuilds.

To put these insights into action, let’s outline a comprehensive plan for implementing file size optimisation on your website.

Action plan: implementing file size optimisation

Begin with a comprehensive audit using Screaming Frog or similar enterprise crawling tools. Identify pages exceeding 1.8MB – providing a safety margin below the 2MB limit – and prioritise based on traffic, conversion value, and strategic importance. This data-driven approach ensures you address the pages that actually matter to your business.

Implement quick wins first: externalise inline scripts and styles, remove unnecessary whitespace, and audit data URIs. These changes typically require minimal development effort but can recover significant indexable space. For most sites, this phase can be completed within two to three weeks.

Develop a longer-term optimisation strategy for complex issues. JavaScript-heavy applications may require code splitting implementation, which involves more substantial architectural changes. E-commerce sites with embedded product data might need to redesign how they handle client-side filtering. These initiatives typically span several months but deliver sustainable improvements.

Establish ongoing monitoring and governance. File size optimisation isn’t a one-time project – it requires continuous attention as your site evolves. Implement automated monitoring, set clear budgets for development teams, and integrate checks into your deployment pipeline. This prevents regression and ensures new content respects crawlability constraints.

Addressing Googlebot’s file size limits requires a strategic approach that aligns with your business objectives and technical capabilities. Ignoring these constraints can lead to reduced organic visibility and lost revenue opportunities. Request a technical SEO consultation with Wilson Cooke to assess your specific risks and develop a tailored optimisation strategy that delivers measurable results.

References

  1. searchenginejournal.com
  2. nikki-pilkington.com
  3. seroundtable.com
  4. techwyse.com
  5. google.com
  6. google.com
February 9th, 2026
Dan Nation
Head of SEO