Google’s Crawling December Series Summarized with NotebookLM

I know you came for this, so here it is.
Context:
Google recently introduced a new series of posts about crawling called Crawling December. The series had four posts in total, covering the why and how of Googlebot crawling, HTTP caching, faceted navigation, and CDNs.

It covers best practices around all these topics. For example, you can use the URL Inspection tool in Google Search Console (GSC) to check your page, use a CDN to save money and resources, use rel="canonical" for faceted navigation, and more.
Reading all four articles can take a bit of time. So, inspired by John Mueller and Andrew Optimisey, I got NotebookLM to make this podcast episode.
I copy-pasted all four links into NotebookLM and clicked "Generate" in the Studio to generate a ~16-minute podcast episode.
Key Takeaways
If you haven't guessed already, this whole thing is made by AI, from the script to the voices. What I find interesting is that I ended up listening to all of it, and how clear the hosts sound. It's terrifying and cool at the same time. Here are some of the key takeaways and main points:
- When using faceted navigation, use rel="canonical" to consolidate signals to a main page, and apply rel="nofollow" on filter links consistently to discourage crawling.
- Google recommends using ETag for caching, as it's more robust than Last-Modified, but ideally, both should be implemented (see the sketch after this list).
- When content changes significantly, update the ETag or Last-Modified value to trigger a cache refresh for clients like Googlebot.
- Blocking critical resources with robots.txt negatively impacts page rendering and search ranking.
- The Web Rendering Service (WRS) caches resources for up to 30 days, reducing crawl budget consumption.
- Site owners can influence the crawl budget by minimizing resources, using cache-busting parameters cautiously, and strategically hosting resources.
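
To make the caching takeaways concrete, here is a minimal Python sketch of a server that sends both validators and answers revalidation requests with 304 Not Modified. This is my own illustration, not code from Google's posts; the body, ETag, and date are hypothetical:

```python
# Minimal sketch: a server that sets ETag/Last-Modified and answers
# conditional requests with 304 Not Modified. In practice your web
# server or CDN usually handles this for you.
import hashlib
from http.server import BaseHTTPRequestHandler, HTTPServer

BODY = b"<html><body>Hello, crawler!</body></html>"
ETAG = '"' + hashlib.sha256(BODY).hexdigest()[:16] + '"'  # changes only when BODY changes
LAST_MODIFIED = "Mon, 02 Dec 2024 00:00:00 GMT"  # hypothetical date

class CachingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Revalidation: if the client already has this version, answer 304 with no body.
        if self.headers.get("If-None-Match") == ETAG:
            self.send_response(304)
            self.end_headers()
            return
        self.send_response(200)
        self.send_header("ETag", ETAG)
        self.send_header("Last-Modified", LAST_MODIFIED)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(BODY)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), CachingHandler).serve_forever()
```

Because the ETag is derived from the content, it changes exactly when the content changes, which is the cache-refresh trigger the takeaways describe.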
Transcription of the podcast
Here’s the transcription of the episode if you’re interested:
Speaker 2
(0:00) Ho, ho, ho. (0:01) Happy holidays, everyone. (0:03) I know it’s Christmas, but Google decided to give us SEOs an early present this year.
Speaker 1
(0:09) Oh, yeah. (0:10) What’s that?
Speaker 2
(0:10) Their new Crawling December series. (0:13) It’s like an advent calendar, but with SEO knowledge bombs instead of chocolate.
Speaker 1
(0:17) All right, I’m intrigued. (0:19) So what are we diving into today?
Speaker 2
(0:21) Well, for today’s deep dive, we’re going to unwrap four juicy topics from Google’s Crawling December series.
Speaker 1
(0:27) Hit me with it.
Speaker 2
(0:27) Basics of crawling, HTTP caching, faceted navigation, and, of course, CDNs.
Speaker 1
(0:33) Nice. (0:34) A little something for everyone, no matter where you’re at on your SEO journey.
Speaker 2
(0:37) Exactly. (0:37) We’re all about spreading holiday cheer and knowledge. (0:39) So let’s kick things off with the most fundamental topic, the basics of crawling.
Speaker 1
(0:44) Back to basics, I like it. (0:46) So Google put out this post, Crawling December: The how and why of Googlebot crawling. (0:51) It’s a good reminder of how it all works.
Speaker 2
(0:53) Yeah, sometimes it’s good to get a refresher, especially when it’s straight from Google. (0:56) Anything surprising in there?
Speaker 1
(0:58) Well, they laid out the crawling process really clearly. (1:00) Googlebot fetches a URL, handles redirects or errors, and sends the content off to be indexed.
Speaker 2
(1:06) Sounds pretty straightforward.
Speaker 1
(1:07) Right. (1:08) But there’s actually a lot more going on behind the scenes.
Speaker 2
(1:10) Oh, I bet. (1:11) Like what?
Speaker 1
(1:12) For one, a lot of people don’t realize that Googlebot doesn’t just download the HTML of a page.
Speaker 2
(1:17) It doesn’t.
Speaker 1
(1:18) Nope. (1:18) It downloads all the resources needed to render that page like a browser would.
Speaker 2
(1:23) Oh, wow. (1:23) So like JavaScript, CSS, images, videos.
Speaker 1
(1:28) The whole shebang.
Speaker 2
(1:29) Yep, like a digital hoarder.
Speaker 1
(1:30) I love that. (1:32) Googlebot hoarding all the digital treasures. (1:35) But seriously, though, downloading all those resources must have a huge impact on the crawl budget, right?
Speaker 2
(1:40) Oh, absolutely, especially for sites with tons of images or videos.
Speaker 1
(1:44) Yeah, you only have so much crawl budget to work with.
Speaker 2
(1:46) Exactly. (1:47) And that’s why Google actually revealed a little secret in their post. (1:50) Ooh, what is it? (1:51) Their web rendering service, or WRS, actually caches those resources, JavaScript and CSS, for up to 30 days.
Speaker 1
(1:59) 30 days? (2:00) That’s way longer than typical HTTP caching.
Speaker 2
(2:03) Exactly. (2:03) And they do that to save your precious crawl budget.
Speaker 1
(2:06) That makes sense. (2:07) Imagine if Googlebot had to re-download all those resources every single time. (2:10) What a waste.
Speaker 2
(2:11) I know, right? (2:12) OK, that’s a good tip. (2:13) But what about images and videos? (2:16) Those must be a crawl budget drain, too.
Speaker 1
(2:18) They totally are, which brings us to one of the most important things you can do as an SEO.
Speaker 2
(2:24) Let me guess. (2:25) Manage your crawl budget.
Speaker 1
(2:27) Bingo! (2:28) You got it. (2:29) It’s like holiday spending. (2:30) You don’t want to blow it all on decorations and have nothing left for gifts.
Speaker 2
(2:34) I love that analogy. (2:35) So what are some practical tips for managing a crawl budget like a pro?
Speaker 1
(2:39) Well, first and foremost, be mindful of the resources you’re using. (2:43) Only use what’s absolutely necessary. (2:45) Fewer resources, less crawl budget spent.
Speaker 2
(2:48) OK, less is more. (2:49) Got it.
Speaker 1
(2:50) Yep. (2:50) Second, watch out for cache-busting parameters.
Speaker 2
(2:52) Cache-busting parameters.
Speaker 1
(2:54) Yes, yeah. (2:55) They’re these random strings that developers sometimes add to resource URLs to force browsers to download the latest version.
Speaker 2
(3:01) Oh, I’ve seen those before.
Speaker 1
(3:02) Well, every time those parameters change, even if the content hasn’t, Googlebot has to re-download the entire thing.
Speaker 2
(3:08) So it’s like rewrapping the same gift over and over again. (3:10) Kind of pointless.
Speaker 1
(3:11) Exactly. (3:12) It’s wasteful and unnecessary. (3:14) And lastly, if you have a site with tons of media, you might consider hosting those resources on a separate hostname, like a CDN.
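
Quick aside from me on those cache-busting parameters: the usual alternative is content fingerprinting, where the URL embeds a hash of the file and therefore only changes when the file itself does. A rough Python sketch, with a hypothetical file and URL layout:

```python
# Sketch: content-based fingerprinting instead of random cache-busters.
# The hash (and thus the URL) changes only when the file's bytes change,
# so a crawler never has to re-download an unchanged resource.
import hashlib
from pathlib import Path

def fingerprinted_url(path: str) -> str:
    digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()[:10]
    stem, _, ext = path.rpartition(".")
    return f"/static/{stem}.{digest}.{ext}"  # e.g. /static/app.3f2a9c41de.js

# Usage (assumes a local app.js exists):
# print(fingerprinted_url("app.js"))
```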
Speaker 2
(3:22) Ah, the trusty CDN. (3:24) So basically, you’re shifting the crawl budget burden away from your main site. (3:28) Smart.
Speaker 1
(3:28) You got it. (3:29) But keep in mind that using a separate hostname can sometimes affect page performance because the browser has to establish multiple connections. (3:36) There’s always a trade-off.
Speaker 2
(3:37) Right, got to balance those Googlebot wins with good user experience. (3:41) Speaking of balancing, there’s something else you mentioned I wanted to ask about. (3:44) Disallowing resources in robots.txt, is that a good idea for the crawl budget?
Speaker 1
(3:48) Oh, definitely not. (3:50) Don’t even think about it.
Speaker 2
(3:51) Why is that?
Speaker 1
(3:51) Because if you block Googlebot from accessing those resources, it can’t render the page properly.
Speaker 2
(3:56) So you want to make sure Googlebot can see everything it needs to, right? (3:59) No robot baking disasters.
Speaker 1
(4:01) Precisely.
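
A quick way to check this on your own site (my addition, using Python's standard library and a hypothetical domain) is to ask urllib.robotparser whether Googlebot is allowed to fetch your critical resources:

```python
# Sketch: verify that robots.txt isn't blocking Googlebot from critical resources.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")  # hypothetical domain
rp.read()

critical_resources = [
    "https://www.example.com/assets/main.css",  # hypothetical paths
    "https://www.example.com/assets/app.js",
]
for url in critical_resources:
    if not rp.can_fetch("Googlebot", url):
        print(f"Blocked for Googlebot, rendering may break: {url}")
```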
Speaker 2
(4:02) All right, so we’ve covered the basics of crawling. (4:05) What’s next on our Crawling December adventure?
Speaker 1
(4:07) Let’s talk about HTTP caching, the magical land of site speed and resource savings.
Speaker 2
(4:13) OK, I am always up for that. (4:15) Everyone loves a fast website.
Speaker 1
(4:16) Right. (4:17) Google published a post, Crawling December: HTTP caching. (4:22) And they start with a bit of a reality check.
Speaker 2
(4:25) Oh, what’s that?
Speaker 1
(4:27) Apparently, only a tiny percentage of Google’s requests are actually served from the cache.
Speaker 2
(4:32) Really? (4:33) That’s surprising. (4:34) I figured Google would be all about caching. (4:36) They love speed and efficiency.
Speaker 1
(4:38) You’d think so, right? (4:39) But a lot of sites don’t implement caching at all, or they do it wrong or inconsistently. (4:44) So Google can’t always rely on it.
Speaker 2
(4:46) That makes sense.
Speaker 1
(4:47) Plus, content on the web is always changing, so that limits caching effectiveness, too.
Speaker 2
(4:52) So how do we make sure we’re doing caching right to make both Google and our users happy?
Speaker 1
(4:57) The key is to use those HTTP caching headers, specifically ETag and Last-Modified.
Speaker 2
(5:03) OK, those sound familiar. (5:04) What do they do again?
Speaker 1
(5:05) They act like timestamps, telling Googlebot if a resource has been updated since its last visit. (5:09) It’s like a little note saying, hey, this file hasn’t changed. (5:12) No need to download it again.
Speaker 2
(5:14) So we’re basically saving Googlebot time and effort, which helps our crawl budget.
Speaker 1
(5:18) Exactly. (5:19) Plus, it makes your site faster for users, too.
Speaker 2
(5:22) Win-win. (5:23) But which header is better, ETag or Last-Modified?
Speaker 1
(5:26) ETag is generally preferred because it’s more accurate, but using both is best practice.
Speaker 2
(5:31) So cover all your bases.
Speaker 1
(5:32) Exactly. (5:33) That way, if the ETag matches, your server can just send a 304 Not Modified code, which tells Googlebot, hey, you already have the latest version.
Speaker 2
(5:41) So a nothing-to-see-here signal. (5:43) I like it. (5:43) Saves bandwidth and resources for everyone.
Speaker 1
(5:46) Precisely.
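
To show that exchange from the crawler's side, here is a minimal sketch of a conditional request against a hypothetical URL, roughly what a well-behaved revisit looks like; again my illustration, not Google's code:

```python
# Sketch: a conditional GET, roughly what a polite crawler sends on a revisit.
import requests

url = "https://www.example.com/style.css"  # hypothetical resource
first = requests.get(url)

# Send the validators back on the next visit; 304 means "your cached copy is fine".
revisit = requests.get(url, headers={
    "If-None-Match": first.headers.get("ETag", ""),
    "If-Modified-Since": first.headers.get("Last-Modified", ""),
})
print(revisit.status_code)  # 304 if unchanged, 200 with a fresh body otherwise
```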
Speaker 2
(5:47) All right. (5:48) So efficient caching keeps everyone happy. (5:50) Now, on to a topic that can be a bit of a headache for SEOs, especially in e-commerce.
Speaker 1
(5:55) Ah, you must be talking about faceted navigation.
Speaker 2
(5:58) The one and only. (5:59) It’s great for users, but it can also be a tangled web of URLs for SEO.
Speaker 1
(6:04) Yeah, definitely a double-edged sword. (6:06) Luckily, Google addressed this in their Crawling December: faceted navigation post.
Speaker 2
(6:11) Good, did they have any helpful advice?
Speaker 1
(6:12) They did. (6:13) The good news is Google is getting much better at recognizing and handling faceted navigation.
Speaker 2
(6:18) Whew, that’s a relief.
Speaker 1
(6:20) Right. (6:20) They understand that filters create pages with very similar content, and they’re getting better at not indexing every single variation.
Speaker 2
(6:27) So we don’t have to completely freak out about faceted nav anymore.
Speaker 1
(6:30) Not exactly. (6:31) There are still some precautions you should take.
Speaker 2
(6:33) OK, like what?
Speaker 1
(6:34) First, make sure those faceted URLs are crawlable and indexable. (6:39) Don’t block them in robots.txt unless there’s a really good reason.
Speaker 2
(6:42) Makes sense. (6:44) Google needs to see them, even if they aren’t super unique.
Speaker 1
(6:46) Exactly. (6:47) And to help Google know which version of a page is the preferred one, use canonicalization.
Speaker 2
(6:52) Right, like a signpost for Google.
Speaker 1
(6:54) Yep. (6:55) Hey, Google, this is the main page. (6:57) Focus your attention here.
Speaker 2
(6:58) All right, so we’re guiding Google through the faceted navigation maze. (7:01) Anything else?
Speaker 1
(7:02) Well, while Google’s getting better with complex faceting schemes, it’s still best to keep yours streamlined.
Speaker 2
(7:08) Keep it simple. (7:09) Got it. (7:09) So faceted navigation doesn’t have to be a nightmare. (7:12) A few smart strategies, and everyone’s happy.
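
If you want to spot-check your own faceted URLs, here is a small sketch of mine, with a hypothetical shop URL, that fetches a filtered page and pulls out the canonical target Google is being pointed at:

```python
# Sketch: fetch a faceted URL and extract its rel="canonical" target.
import re
import requests

faceted_url = "https://shop.example.com/shoes?color=red&size=9"  # hypothetical
html = requests.get(faceted_url).text

# Naive regex for illustration; a real check should use an HTML parser.
match = re.search(r'<link[^>]*rel=["\']canonical["\'][^>]*href=["\']([^"\']+)', html)
print(match.group(1) if match else "No canonical found")
```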
Speaker 1
(7:15) Exactly. (7:15) Now, ready to tackle our final topic for today, the one and only, CDN.
Speaker 2
(7:21) Oh, yeah. (7:22) CDNs, making websites load faster than a reindeer on Christmas Eve, always a popular topic in SEO. (7:29) What did Google have to say about CDNs and crawling?
Speaker 1
(7:31) Well, in their Crawling December: CDNs and crawling post, they highlighted how CDNs can be a powerful ally in your SEO efforts.
Speaker 2
(7:39) I can see that. (7:40) Speed is king.
Speaker 1
(7:41) Right, and Google knows that. (7:43) They often give higher crawl rates to CDN IP addresses because they trust CDNs can handle it.
Speaker 2
(7:48) Makes sense. (7:49) CDNs have servers everywhere, so they can definitely handle a few extra visits from Googlebot.
Speaker 1
(7:53) Precisely. (7:54) But there is one potential pitfall you need to be aware of, the cold cache issue.
Speaker 2
(7:59) Cold cache.
Speaker 1
(8:00) Yeah. (8:01) On that first crawl, the CDN’s cache is empty, meaning it hasn’t stored a copy of the content yet.
Speaker 2
(8:08) So Googlebot comes knocking, and the CDN has to go fetch the content from the origin server.
Speaker 1
(8:13) You got it.
Speaker 2
(8:14) Which can slow things down, especially for big websites.
Speaker 1
(8:16) Exactly. (8:17) Like, if you’re launching a massive site, say a stock photo library with millions of images, that initial crawl could be a real strain on your server.
Speaker 2
(8:25) Millions of images. (8:27) Oof. (8:27) OK, so definitely something to be mindful of, that cold cache issue.
Speaker 1
(8:31) For sure, especially for large-scale launches. (8:33) Now, another thing to consider with CDNs is how hosting resources on a separate hostname can affect rendering.
Speaker 2
(8:39) Oh, right. (8:40) We talked about that earlier. (8:41) Good for crawl budget, but can potentially slow down page loading for users.
Speaker 1
(8:45) Yeah, exactly. (8:46) It’s because the browser has to establish multiple connections, which adds overhead.
Speaker 2
(8:49) Right, so it’s a trade-off. (8:51) Balance those Googlebot gains with a good user experience.
Speaker 1
(8:54) Always. (8:55) That’s where testing and optimization come in. (8:57) No one-size-fits-all answer here. (8:59) You have to experiment.
Speaker 2
(9:01) Got it. (9:02) CDN hosting is awesome, but approach it strategically. (9:06) Now, before we move on, you mentioned earlier that CDNs can sometimes be overprotective. (9:11) What did you mean by that?
Speaker 1
(9:12) Oh, yeah. (9:13) Well, you see, CDNs are packed with security features, like web application firewalls, or WAFs.
Speaker 2
(9:19) To block malicious traffic.
Speaker 1
(9:20) Right. (9:21) But sometimes, those WAFs can be a little too enthusiastic, and they accidentally block Googlebot.
Speaker 2
(9:26) No, that defeats the whole purpose.
Speaker 1
(9:28) I know, right? (9:29) So you have to be careful with that. (9:30) The most common types of blocks are hard blocks, with errors like 5xx status codes or timeouts, signaling the site’s unavailable.
Speaker 2
(9:38) OK, those sound bad.
Speaker 1
(9:39) Then there are soft blocks. (9:41) That’s where the CDN returns a 200 status code, but includes an error message in the content.
Speaker 2
(9:47) Wait, a 200 status code with an error message? (9:50) That sounds super confusing for Google.
Speaker 1
(9:52) It is. (9:53) Google might think those errors are actual content, leading to duplicate content issues or indexing problems.
Speaker 2
(9:58) Yikes. (9:59) So what do we do?
Speaker 1
(10:00) Talk to your CDN provider and make sure they understand that Googlebot needs access.
Speaker 2
(10:04) Good point. (10:05) Didn’t Google publish its crawler IP addresses, too, so you can whitelist those?
Speaker 1
(10:08) Absolutely. (10:09) And if you ever think Googlebot’s being blocked, you can use the URL inspection tool in Search Console to see what Googlebot sees.
Speaker 2
(10:15) Ah, that’s super helpful. (10:17) So CDNs, mostly awesome, but watch out for those security settings.
Speaker 1
(10:21) Exactly. (10:22) Be proactive.
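
As a small illustration of that soft-block pitfall (my sketch, with a hypothetical URL and illustrative phrases): a page can return 200 while the body is really an error or challenge page, so checking the status code alone isn't enough:

```python
# Sketch: detect a possible "soft block", a 200 response whose body is an error page.
import requests

resp = requests.get("https://www.example.com/some-page")  # hypothetical
suspicious_phrases = ["access denied", "are you a human", "rate limited"]  # illustrative

if resp.status_code == 200 and any(p in resp.text.lower() for p in suspicious_phrases):
    print("Looks like a soft block: 200 OK but the body is an error/challenge page")
```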
Speaker 2
(10:22) OK, so to recap what we’ve talked about so far. (10:25) Crawl budget is like a holiday budget. (10:27) Spend it wisely. (10:28) HTTP caching is like leaving a nothing-to-see-here note for Googlebot. (10:33) Faceted navigation needs a guiding hand, a.k.a. canonicalization. (10:37) And CDNs are speedy, but watch out for that cold cache and overly aggressive security settings.
Speaker 1
(10:42) Perfect summary. (10:43) But how does all this translate into actual SEO success?
Speaker 2
(10:46) Yeah, what’s the big picture here?
Speaker 1
(10:48) All these factors, crawling, caching, faceted navigation, CDNs, they all contribute to making your website visible and accessible to Google.
Speaker 2
(10:59) Right, because if Google can’t find your content, or if it takes forever to load, all your SEO efforts are for nothing.
Speaker 1
(11:05) Exactly. (11:06) That’s why these seemingly technical aspects are so important. (11:09) They might not be as glamorous as keyword research or content creation, but they’re the foundation of good SEO.
Speaker 2
(11:16) Well said. (11:17) So we’ve covered a lot of ground today, but I think our listeners are eager to hear more.
Speaker 1
(11:21) Of course. (11:22) Shall we move on to some of the nuances of how CDNs can sometimes block Googlebot?
Speaker 2
(11:26) Let’s do it. (11:27) Let’s uncover those hidden CDN pitfalls.
Speaker 1
(11:30) So you know how we were just talking about CDNs and their security features sometimes being a little overprotective?
Speaker 2
(11:35) Yeah, those sneaky WAFs accidentally blocking Googlebot.
Speaker 1
(11:38) Yeah, well, it’s not just the security settings we have to watch out for. (11:41) Sometimes it’s those “are you a human?” interstitials that trip up Googlebot.
Speaker 2
(11:44) Oh, those things. (11:45) I hate those. (11:46) Even I struggle with them sometimes. (11:48) Click all the squares with a traffic light.
Speaker 1
(11:51) Exactly. (11:52) Googlebot can’t solve those puzzles either.
Speaker 2
(11:54) So it’s like getting all dressed up for a holiday party, but then getting stuck in the coat check line all night.
Speaker 1
(12:00) Perfect analogy. (12:01) And just like a frustrated party guest, Googlebot might just give up and leave if it can’t get past those interstitials.
Speaker 2
(12:07) Oh, no. (12:08) So what do we do? (12:09) Disable them completely?
Speaker 1
(12:10) Not necessarily. (12:11) You just have to implement them in a way that doesn’t block Googlebot.
Speaker 2
(12:15) OK, how do we do that?
Speaker 1
(12:16) The best way is to send a 503 status code to bots when those interstitials are triggered.
Speaker 2
(12:22) A 503. (12:23) What does that tell Googlebot?
Speaker 1
(12:25) It basically says, hey, we’re just tidying up here. (12:27) Come back later when the party’s started.
Speaker 2
(12:29) Ah, so it’s like a temporary out-of-order sign?
Speaker 1
(12:31) Exactly. (12:32) Googlebot sees that and knows it’s not a permanent block.
Speaker 2
(12:35) Got it. (12:36) So interstitials, use them responsibly.
Speaker 1
(12:38) Yeah.
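
Here is a rough sketch of that idea, my own and not code from Google's post: when a human-verification challenge would fire, answer crawlers with a 503 and a Retry-After header instead of the puzzle. The user-agent check below is naive; in practice you would verify requests against Google's published crawler IP ranges:

```python
# Sketch: serve 503 + Retry-After to crawlers instead of a human-verification page.
from http.server import BaseHTTPRequestHandler, HTTPServer

class InterstitialHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        ua = self.headers.get("User-Agent", "")
        if "Googlebot" in ua:  # naive check; verify against published crawler IPs in practice
            self.send_response(503)  # "temporarily unavailable", not a permanent block
            self.send_header("Retry-After", "3600")  # suggest coming back in an hour
            self.end_headers()
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(b"<html><body>Are you a human? [challenge here]</body></html>")

if __name__ == "__main__":
    HTTPServer(("localhost", 8001), InterstitialHandler).serve_forever()
```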
Speaker 2
(12:38) Now, I wanted to circle back to something we were talking about earlier, hosting resources on a separate hostname with a CDN.
Speaker 1
(12:45) Oh, yeah, good point. (12:46) There’s sometimes this idea that it always slows down page load times for users, but it’s actually more nuanced than that.
Speaker 2
(12:52) Really? (12:53) Fill me in.
Speaker 1
(12:53) Well, it really depends on a few things. (12:56) The CDN configuration, the type of resources, the user’s internet connection. (13:01) For example, if the CDN has servers close to the user, it can actually make things faster.
Speaker 2
(13:07) Because the data has less distance to travel.
Speaker 1
(13:09) Exactly, like ordering takeout from across the street versus across town. (13:13) But the potential downside is that it can add a few extra milliseconds to the page load time because of things like DNS lookups and connection overhead.
Speaker 2
(13:21) So yet another balancing act for us SEOs.
Speaker 1
(13:24) Always. (13:25) You have to weigh the pros and cons and see what works best for your specific site.
Speaker 2
(13:28) Got it. (13:30) Test, analyze, optimize, the SEO mantra.
Speaker 1
(13:33) Pretty much.
Speaker 2
(13:34) Yeah.
Speaker 1
(13:34) Now, before we wrap up our Crawling December deep dive, I think it’s good to zoom out and look at the big picture here.
Speaker 2
(13:39) Yeah, we’ve covered a lot of technical details today, but how does it all come together?
Speaker 1
(13:44) Well, the key takeaway is that all these things we talked about, crawling, caching, faceted navigation, CDNs, they all work together to make your website more visible and accessible to Google.
Speaker 2
(13:55) Right, because if Google can’t find your content or if it takes forever to load, then what’s the point of all our SEO efforts?
Speaker 1
(14:02) Exactly. (14:03) That’s why understanding these sometimes boring technical aspects is so important. (14:07) They’re the foundation of a solid SEO strategy.
Speaker 2
(14:10) Couldn’t agree more. (14:11) So let’s do a quick recap of each crawling December post just to make sure everyone’s on the same page.
Speaker 1
(14:16) Sounds good.
Speaker 2
(14:16) All right, first up, basics of crawling. (14:18) Remember, Googlebot is like a little data vacuum cleaner, gobbling up everything it can.
Speaker 1
(14:24) And that’s why managing your crawl budget is so crucial.
Speaker 2
(14:27) Absolutely, especially for bigger sites. (14:30) Minimize those resources, watch out for those pesky cache-busting parameters, and maybe think about a CDN for those media-heavy sites.
Speaker 1
(14:37) And don’t block anything important in your robots.txt. Let Googlebot see what it needs to see.
Speaker 2
(14:42) Robot baking disasters, no thank you. (14:45) OK, next up, HTTP caching. (14:47) While Google might not rely on it as much as we thought, it’s still essential for site speed and a good user experience.
Speaker 1
(14:53) Yep, and those ETag and Last-Modified headers are your best friends. (14:57) Tell Googlebot when it can skip re-downloading things.
Speaker 2
(15:00) Efficient, speedy websites for everyone. (15:03) Now, who can forget about faceted navigation? (15:05) The e-commerce SEO challenge.
Speaker 1
(15:07) Thankfully, Google’s getting better at handling it, but we still have to manage those URLs, use canonicalization wisely, and keep those filtering schemes streamlined.
Speaker 2
(15:17) Keep it simple, keep it clean. (15:19) Last but not least, those wonderful CDNs. (15:22) They can be real lifesavers for website speed.
Speaker 1
(15:26) True, but remember to watch out for that cold cache, overprotective security settings, and think about how separate hostnames might affect your users.
Speaker 2
(15:35) So many things to consider.
Speaker 1
(15:37) I know, right? (15:37) But that’s the beauty of SEO. (15:39) It’s always evolving.
Speaker 2
(15:41) It keeps us on our toes. (15:42) Well, I think it’s time to wrap up our Crawling December deep dive. (15:46) We’ve gone from the nitty-gritty details of crawl budget to the big existential questions about AI and the future of search.
Speaker 1
(15:53) It has been quite a journey.
Speaker 2
(15:54) It really has. (15:56) Thanks for joining me today, and thanks to all our listeners for tuning in. (15:58) Happy holidays, everyone.
Speaker 1
(15:59) Happy holidays. (16:00) And remember, stay curious.
Final words
This post summarized the Crawling December series that Google introduced this year, which covered four different crawling-related topics. These topics are relevant to every website owner in one way or another. For example, I personally use a CDN, and it has really helped me with speed and delivery issues. It also saves me money, so it’s a win-win for me.
P.S. For transparency, both the summary and the key takeaways were created using NotebookLM. I have, of course, edited the whole thing. The transcription was generated using TurboScribe, and it may or may not be fully accurate.