Tag Archive | "Crawl"

Internal Linking & Mobile First: Large Site Crawl Paths in 2018 & Beyond

Posted by Tom.Capper

By now, you’ve probably heard as much as you can bear about mobile first indexing. For me, there’s been one topic that’s been conspicuously missing from all this discussion, though, and that’s the impact on internal linking and previous internal linking best practices.

In the past, there have been a few popular methods for providing crawl paths for search engines — bulky main navigations, HTML sitemap-style pages that exist purely for internal linking, or blocks of links at the bottom of indexed pages. Larger sites have typically used at least two or often three of these methods. I’ll explain in this post why all of these are now looking pretty shaky, and what I suggest you do about it.

Quick refresher: WTF are “internal linking” & “mobile-first,” Tom?

Internal linking is and always has been a vital component of SEO — it’s easy to forget in all the noise about external link building that some of our most powerful tools to affect the link graph are right under our noses. If you’re looking to brush up on internal linking in general, it’s a topic that gets pretty complex pretty quickly, but there are a couple of resources I can recommend to get started:

I’ve also written in the past that links may be mattering less and less as a ranking factor for the most competitive terms, and though that may be true, they’re still the primary way you qualify for that competition.

A great example I’ve seen recently of what happens if you don’t have comprehensive internal linking is eflorist.co.uk. (Disclaimer: eFlorist is not a client or prospective client of Distilled, nor are any other sites mentioned in this post)

eFlorist has local landing pages for all sorts of locations, targeting queries like “Flower delivery in [town].” However, even though these pages are indexed, they’re not linked to internally. As a result, if you search for something like “flower delivery in London,” despite eFlorist having a page targeted at this specific query (which can be found pretty much only through use of advanced search operators), they end up ranking on page 2 with their “flowers under £30” category page:

¯\_(ツ)_/¯

If you’re looking for a reminder of what mobile-first indexing is and why it matters, these are a couple of good posts to bring you up to speed:

In short, though, Google is increasingly looking at pages as they appear on mobile for all the things it was previously using desktop pages for — namely, establishing ranking factors, the link graph, and SEO directives. You may well have already seen an alert from Google Search Console telling you your site has been moved over to primarily mobile indexing, but if not, it’s likely not far off.

Get to the point: What am I doing wrong?

If you have more than a handful of landing pages on your site, you’ve probably given some thought in the past to how Google can find them and how to make sure they get a good chunk of your site’s link equity. A rule of thumb often used by SEOs is how many clicks a landing page is from the homepage, also known as “crawl depth.”

Mobile-first indexing impacts this on two fronts:

  1. Some of your links aren’t present on mobile (as is common), so your internal linking simply won’t work in a world where Google is going primarily with the mobile-version of your page
  2. If your links are visible on mobile, they may be hideous or overwhelming to users, given the reduced on-screen real estate vs. desktop

If you don’t believe me on the first point, check out this Twitter conversation between Will Critchlow and John Mueller:

In particular, that section I’ve underlined in red should be of concern — it’s unclear how much time we have, but sooner or later, if your internal linking on the mobile version of your site doesn’t cut it from an SEO perspective, neither does your site.

And for the links that do remain visible, an internal linking structure that can be rationalized on desktop can quickly look overbearing on mobile. Check out this example from Expedia.co.uk’s “flights to London” landing page:

Many of these links are part of the site-wide footer, but they vary according to what page you’re on. For example, on the “flights to Australia” page, you get different links, allowing a tree-like structure of internal linking. This is a common tactic for larger sites.

In this example, there’s more unstructured linking both above and below the section screenshotted. For what it’s worth, although it isn’t pretty, I don’t think this is terrible, but it’s also not the sort of thing I can be particularly proud of when I go to explain to a client’s UX team why I’ve asked them to ruin their beautiful page design for SEO reasons.

I mentioned earlier that there are three main methods of establishing crawl paths on large sites: bulky main navigations, HTML-sitemap-style pages that exist purely for internal linking, or blocks of links at the bottom of indexed pages. I’ll now go through these in turn, and take a look at where they stand in 2018.

1. Bulky main navigations: Fail to scale

The most extreme example I was able to find of this is from Monoprice.com, with a huge 711 links in the sitewide top-nav:

Here’s how it looks on mobile:

This is actually fairly usable, but you have to consider the implications of having this many links on every page of your site — this isn’t going to concentrate equity where you need it most. In addition, you’re potentially asking customers to do a lot of work in terms of finding their way around such a comprehensive navigation.

I don’t think mobile-first indexing changes the picture here much; it’s more that this was never the answer in the first place for sites above a certain size. Many sites have tens of thousands (or more), not hundreds of landing pages to worry about. So simply using the main navigation is not a realistic option, let alone an optimal option, for creating crawl paths and distributing equity in a proportionate or targeted way.

2. HTML sitemaps: Ruined by the counterintuitive equivalence of noindex,follow & noindex,nofollow

This is a slightly less common technique these days, but still used reasonably widely. Take this example from Auto Trader UK:

This page isn’t mobile-friendly, although that doesn’t necessarily matter, as it isn’t supposed to be a landing page. The idea is that this page is linked to from Auto Trader’s footer, and allows link equity to flow through into deeper parts of the site.

However, there’s a complication: this page in an ideal world be “noindex,follow.” However, it turns out that over time, Google ends up treating “noindex,follow” like “noindex,nofollow.” It’s not 100% clear what John Mueller meant by this, but it does make sense that given the low crawl priority of “noindex” pages, Google could eventually stop crawling them altogether, causing them to behave in effect like “noindex,nofollow.” Anecdotally, this is also how third-party crawlers like Moz and Majestic behave, and it’s how I’ve seen Google behave with test pages on my personal site.

That means that at best, Google won’t discover new links you add to your HTML sitemaps, and at worst, it won’t pass equity through them either. The jury is still out on this worst case scenario, but it’s not an ideal situation in either case.

So, you have to index your HTML sitemaps. For a large site, this means you’re indexing potentially dozens or hundreds of pages that are just lists of links. It is a viable option, but if you care about the quality and quantity of pages you’re allowing into Google’s index, it might not be an option you’re so keen on.

3. Link blocks on landing pages: Good, bad, and ugly, all at the same time

I already mentioned that example from Expedia above, but here’s another extreme example from the Kayak.co.uk homepage:

Example 1

Example 2

It’s no coincidence that both these sites come from the travel search vertical, where having to sustain a massive number of indexed pages is a major challenge. Just like their competitor, Kayak have perhaps gone overboard in the sheer quantity here, but they’ve taken it an interesting step further — notice that the links are hidden behind dropdowns.

This is something that was mentioned in the post from Bridget Randolph I mentioned above, and I agree so much I’m just going to quote her verbatim:

Note that with mobile-first indexing, content which is collapsed or hidden in tabs, etc. due to space limitations will not be treated differently than visible content (as it may have been previously), since this type of screen real estate management is actually a mobile best practice.

Combined with a more sensible quantity of internal linking, and taking advantage of the significant height of many mobile landing pages (i.e., this needn’t be visible above the fold), this is probably the most broadly applicable method for deep internal linking at your disposal going forward. As always, though, we need to be careful as SEOs not to see a working tactic and rush to push it to its limits — usability and moderation are still important, just as with overburdened main navigations.

Summary: Bite the on-page linking bullet, but present it well

Overall, the most scalable method for getting large numbers of pages crawled, indexed, and ranking on your site is going to be on-page linking — simply because you already have a large number of pages to place the links on, and in all likelihood a natural “tree” structure, by very nature of the problem.

Top navigations and HTML sitemaps have their place, but lack the scalability or finesse to deal with this situation, especially given what we now know about Google’s treatment of “noindex,follow” tags.

However, the more we emphasize mobile experience, while simultaneously relying on this method, the more we need to be careful about how we present it. In the past, as SEOs, we might have been fairly nervous about placing on-page links behind tabs or dropdowns, just because it felt like deceiving Google. And on desktop, that might be true, but on mobile, this is increasingly going to become best practice, and we have to trust Google to understand that.

All that said, I’d love to hear your strategies for grappling with this — let me know in the comments below!

Sign up for The Moz Top 10, a semimonthly mailer updating you on the top ten hottest pieces of SEO news, tips, and rad links uncovered by the Moz team. Think of it as your exclusive digest of stuff you don’t have time to hunt down but want to read!


Moz Blog

More Articles

Posted in Latest NewsComments Off

NEW On-Demand Crawl: Quick Insights for Sales, Prospecting, & Competitive Analysis

Posted by Dr-Pete

In June of 2017, Moz launched our entirely rebuilt Site Crawl, helping you dive deep into crawl issues and technical SEO problems, fix those issues in your Moz Pro Campaigns (tracked websites), and monitor weekly for new issues. Many times, though, you need quick insights outside of a Campaign context, whether you’re analyzing a prospect site before a sales call or trying to assess the competition.

For years, Moz had a lab tool called Crawl Test. The bad news is that Crawl Test never made it to prime-time and suffered from some neglect. The good news is that I’m happy to announce the full launch (as of August 2018) of On-Demand Crawl, an entirely new crawl tool built on the engine that powers Site Crawl, but with a UI designed around quick insights for prospecting and competitive analysis.

While you don’t need a Campaign to run a crawl, you do need to be logged into your Moz Pro subscription. If you don’t have a subscription, you can sign-up for a free trial and give it a whirl.

How can you put On-Demand Crawl to work? Let’s walk through a short example together.


All you need is a domain

Getting started is easy. From the “Moz Pro” menu, find “On-Demand Crawl” under “Research Tools”:

Just enter a root domain or subdomain in the box at the top and click the blue button to kick off a crawl. While I don’t want to pick on anyone, I’ve decided to use a real site. Our recent analysis of the August 1st Google update identified some sites that were hit hard, and I’ve picked one (lilluna.com) from that list.

Please note that Moz is not affiliated with Lil’ Luna in any way. For the most part, it seems to be a decent site with reasonably good content. Let’s pretend, just for this post, that you’re looking to help this site out and determine if they’d be a good fit for your SEO services. You’ve got a call scheduled and need to spot-check for any major problems so that you can go into that call as informed as possible.

On-Demand Crawls aren’t instantaneous (crawling is a big job), but they’ll generally finish between a few minutes and an hour. We know these are time-sensitive situations. You’ll soon receive an email that looks like this:

The email includes the number of URLs crawled (On-Demand will currently crawl up to 3,000 URLs), the total issues found, and a summary table of crawl issues by category. Click on the [View Report] link to dive into the full crawl data.


Assess critical issues quickly

We’ve designed On-Demand Crawl to assist your own human intelligence. You’ll see some basic stats at the top, but then immediately move into a graph of your top issues by count. The graph only displays issues that occur at least once on your site – you can click “See More” to show all of the issues that On-Demand Crawl tracks (the top two bars have been truncated)…

Issues are also color-coded by category. Some items are warnings, and whether they matter depends a lot on context. Other issues, like “Critcal Errors” (in red) almost always demand attention. So, let’s check out those 404 errors. Scroll down and you’ll see a list of “Pages Crawled” with filters. You’re going to select “4xx” in the “Status Codes” dropdown…

You can then pretty easily spot-check these URLs and find out that they do, in fact, seem to be returning 404 errors. Some appear to be legitimate content that has either internal or external links (or both). So, within a few minutes, you’ve already found something useful.

Let’s look at those yellow “Meta Noindex” errors next. This is a tricky one, because you can’t easily determine intent. An intentional Meta Noindex may be fine. An unintentional one (or hundreds of unintentional ones) could be blocking crawlers and causing serious harm. Here, you’ll filter by issue type…

Like the top graph, issues appear in order of prevalence. You can also filter by all pages that have issues (any issues) or pages that have no issues. Here’s a sample of what you get back (the full table also includes status code, issue count, and an option to view all issues)…

Notice the “?s=” common to all of these URLs. Clicking on a few, you can see that these are internal search pages. These URLs have no particular SEO value, and the Meta Noindex is likely intentional. Good technical SEO is also about avoiding false alarms because you lack internal knowledge of a site. On-Demand Crawl helps you semi-automate and summarize insights to put your human intelligence to work quickly.


Dive deeper with exports

Let’s go back to those 404s. Ideally, you’d like to know where those URLs are showing up. We can’t fit everything into one screen, but if you scroll up to the “All Issues” graph you’ll see an “Export CSV” option…

The export will honor any filters set in the page list, so let’s re-apply that “4xx” filter and pull the data. Your export should download almost immediately. The full export contains a wealth of information, but I’ve zeroed in on just what’s critical for this particular case…

Now, you know not only what pages are missing, but exactly where they link from internally, and can easily pass along suggested fixes to the customer or prospect. Some of these turn out to be link-heavy pages that could probably benefit from some clean-up or updating (if newer recipes are a good fit).

Let’s try another one. You’ve got 8 duplicate content errors. Potentially thin content could fit theories about the August 1st update, so this is worth digging into. If you filter by “Duplicate Content” issues, you’ll see the following message…

The 8 duplicate issues actually represent 18 pages, and the table returns all 18 affected pages. In some cases, the duplicates will be obvious from the title and/or URL, but in this case there’s a bit of mystery, so let’s pull that export file. In this case, there’s a column called “Duplicate Content Group,” and sorting by it reveals something like the following (there’s a lot more data in the original export file)…

I’ve renamed “Duplicate Content Group” to just “Group” and included the word count (“Words”), which could be useful for verifying true duplicates. Look at group #7 – it turns out that these “Weekly Menu Plan” pages are very image heavy and have a common block of text before any unique text. While not 100% duplicated, these otherwise valuable pages could easily look like thin content to Google and represent a broader problem.


Real insights in real-time

Not counting the time spent writing the blog post, running this crawl and diving in took less than an hour, and even that small amount of time spent uncovered more potential issues than what I could cover in this post. In less than an hour, you can walk into a client meeting or sales call with in-depth knowledge of any domain.

Keep in mind that many of these features also exist in our Site Crawl tool. If you’re looking for long-term, campaign insights, use Site Crawl (if you just need to update your data, use our “Recrawl” feature). If you’re looking for quick, one-time insights, check out On-Demand Crawl. Standard Pro users currently get 5 On-Demand Crawls per month (with limits increasing at higher tiers).

Your On-Demand Crawls are currently stored for 90 days. When you re-enter the feature, you’ll see a table of all of your recent crawls (the image below has been truncated):

Click on any row to go back to see the crawl data for that domain. If you get the sale and decide to move forward, congratulations! You can port that domain directly into a Moz campaign.

We hope you’ll try On-Demand Crawl out and let us know what you think. We’d love to hear your case studies, whether it’s sales, competitive analysis, or just trying to solve the mysteries of a Google update.

Sign up for The Moz Top 10, a semimonthly mailer updating you on the top ten hottest pieces of SEO news, tips, and rad links uncovered by the Moz team. Think of it as your exclusive digest of stuff you don’t have time to hunt down but want to read!


Moz Blog

Posted in Latest NewsComments Off

SearchCap: Crawl budget, content development & shopping ads

Below is what happened in search today, as reported on Search Engine Land and from other places across the web.

The post SearchCap: Crawl budget, content development & shopping ads appeared first on Search Engine Land.



Please visit Search Engine Land for the full article.


Search Engine Land: News & Info About SEO, PPC, SEM, Search Engines & Search Marketing

Posted in Latest NewsComments Off

SearchCap: Linking law, Google crawl change & map directories

Below is what happened in search today, as reported on Search Engine Land and from other places across the web.

The post SearchCap: Linking law, Google crawl change & map directories appeared first on Search Engine Land.



Please visit Search Engine Land for the full article.


Search Engine Land: News & Info About SEO, PPC, SEM, Search Engines & Search Marketing

Posted in Latest NewsComments Off

Announcing 5 NEW Feature Upgrades to Moz Pro’s Site Crawl, Including Pixel-Length Title Data

Posted by Dr-Pete

While Moz is hard at work on some major new product features (we’re hoping for two more big launches in 2017), we’re also working hard to iterate on recent advances. I’m happy to announce that, based on your thoughtful feedback and our own ever-growing wish lists, we’ve recently launched five upgrades to our Site Crawl.

Check it out!

1. Mark Issues as Fixed

It’s fine to ignore issues that don’t matter to your site or business, but many of you asked for a way to audit fixes or just let us know that you’ve made a fix prior to our next data update. So, from any issues page, you can now select items and “Mark as fixed” (screens below edited for content).

Fixed items will immediately be highlighted and, like Ignored issues, can be easily restored…

Unlike the “Ignore” feature, we’ll also monitor these issues for you and warn you if they reappear. In a perfect world, you’d fix an issue once and be done, but we all know that real web development just doesn’t work out that way.

2. View/Ignore/Fix More Issues

When we launched the “Ignore” feature, many of you were very happy (it was, frankly, long overdue), until you realized you could only ignore issues in chunks of 25 at a time. We have heard you loud and clear (seriously, Carl, stop calling) and have taken two steps. First, you can now view, ignore, and fix issues 100 at a time. This is the default – no action or extra clicks required.

3. Ignore Issues by Type

Second, you can now ignore entire issue types. Let’s say, for example, that Moz.com intentionally has 33,000 Meta Noindex tags (for example). We really don’t need to be reminded of that every week. So, once we make sure none of those are unintentional, we can go to the top of the issue page and click “Ignore Issue Type”:

Look for this in the upper-right of any individual issue page. Just like individual issues, you can easily track all of your ignored issues and start paying attention to them again at any time. We just want to help you clear out the noise so that you can focus on what really matters to you.

4. Pixel-length Title Data

For years now, we’ve known that Google cut display titles by pixel length. We’ve provided research on this subject and have built our popular title tag checker around pixel length, but providing this data at product scale proved to be challenging. I’m happy to say that we’ve finally overcome those challenges, and “Pixel Length” has replaced Character Length in our title tag diagnostics.

Google currently uses a 600-pixel container, but you may notice that you receive warnings below that length. Due to making space to add the “…” and other considerations, our research has shown that the true cut-off point that Google uses is closer to 570 pixels. Site Crawl reflects our latest research on the subject.

As with other issues, you can export the full data to CSV, to sort and filter as desired:

Looks like we’ve got some work to do when it comes to brevity. Long title tags aren’t always a bad thing, but this data will help you much better understand how and when Google may be cutting off your display titles in SERPs and decide whether you want to address it in specific cases.

5. Full Issue List Export

When we rebuilt Site Crawl, we were thrilled to provide data and exports on all pages crawled. Unfortunately, we took away the export of all issues (choosing to divide those up into major issue types). Some of you had clearly come to rely on the all issues export, and so we’ve re-added that functionality. You can find it next to “All Issues” on the main “Site Crawl Overview” page:

We hope you’ll try out all of the new features and report back as we continue to improve on our Site Crawl engine and UI over the coming year. We’d love to hear what’s working for you and what kind of results you’re seeing as you fix your most pressing technical SEO issues.

Find & fix your site issues now

Sign up for The Moz Top 10, a semimonthly mailer updating you on the top ten hottest pieces of SEO news, tips, and rad links uncovered by the Moz team. Think of it as your exclusive digest of stuff you don’t have time to hunt down but want to read!


Moz Blog

Posted in Latest NewsComments Off

New Site Crawl: Rebuilt to Find More Issues on More Pages, Faster Than Ever!

Posted by Dr-Pete

First, the good news — as of today, all Moz Pro customers have access to the new version of Site Crawl, our entirely rebuilt deep site crawler and technical SEO auditing platform. The bad news? There isn’t any. It’s bigger, better, faster, and you won’t pay an extra dime for it.

A moment of humility, though — if you’ve used our existing site crawl, you know it hasn’t always lived up to your expectations. Truth is, it hasn’t lived up to ours, either. Over a year ago, we set out to rebuild the back end crawler, but we realized quickly that what we wanted was an entirely re-imagined crawler, front and back, with the best features we could offer. Today, we launch the first version of that new crawler.

Code name: Aardwolf

The back end is entirely new. Our completely rebuilt “Aardwolf” engine crawls twice as fast, while digging much deeper. For larger accounts, it can support up to ten parallel crawlers, for actual speeds of up to 20X the old crawler. Aardwolf also fully supports SNI sites (including Cloudflare), correcting a major shortcoming of our old crawler.

View/search *all* URLs

One major limitation of our old crawler is that you could only see pages with known issues. Click on “All Crawled Pages” in the new crawler, and you’ll be brought to a list of every URL we crawled on your site during the last crawl cycle:

You can sort this list by status code, total issues, Page Authority (PA), or crawl depth. You can also filter by URL, status codes, or whether or not the page has known issues. For example, let’s say I just wanted to see all of the pages crawled for Moz.com in the “/blog” directory…

I just click the [+], select “URL,” enter “/blog,” and I’m on my way.

Do you prefer to slice and dice the data on your own? You can export your entire crawl to CSV, with additional data including per-page fetch times and redirect targets.

Recrawl your site immediately

Sometimes, you just can’t wait a week for a new crawl. Maybe you relaunched your site or made major changes, and you have to know quickly if those changes are working. No problem, just click “Recrawl my site” from the top of any page in the Site Crawl section, and you’ll be on your way…

Starting at our Medium tier, you’ll get 10 recrawls per month, in addition to your automatic weekly crawls. When the stakes are high or you’re under tight deadlines for client reviews, we understand that waiting just isn’t an option. Recrawl allows you to verify that your fixes were successful and refresh your crawl report.

Ignore individual issues

As many customers have reminded us over the years, technical SEO is not a one-sized-fits-all task, and what’s critical for one site is barely a nuisance for another. For example, let’s say I don’t care about a handful of overly dynamic URLs (for many sites, it’s a minor issue). With the new Site Crawl, I can just select those issues and then “Ignore” them (see the green arrow for location):

If you make a mistake, no worries — you can manage and restore ignored issues. We’ll also keep tracking any new issues that pop up over time. Just because you don’t care about something today doesn’t mean you won’t need to know about it a month from now.

Fix duplicate content

Under “Content Issues,” we’ve launched an entirely new duplicate content detection engine and a better, cleaner UI for navigating that content. Duplicate content is now automatically clustered, and we do our best to consistently detect the “parent” page. Here’s a sample from Moz.com:

You can view duplicates by the total number of affected pages, PA, and crawl depth, and you can filter by URL. Click on the arrow (far-right column) for all of the pages in the cluster (shown in the screenshot). Click anywhere in the current table row to get a full profile, including the source page we found that link on.

Prioritize quickly & tactically

Prioritizing technical SEO problems requires deep knowledge of a site. In the past, in the interest of simplicity, I fear that we’ve misled some of you. We attempted to give every issue a set priority (high, medium, or low), when the difficult reality is that what’s a major problem on one site may be deliberate and useful on another.

With the new Site Crawl, we decided to categorize crawl issues tactically, using five buckets:

  • Critical Crawler Issues
  • Crawler Warnings
  • Redirect Issues
  • Metadata Issues
  • Content Issues

Hopefully, you can already guess what some of these contain. Critical Crawler Issues still reflect issues that matter first to most sites, such as 5XX errors and redirects to 404s. Crawler Warnings represent issues that might be very important for some sites, but require more context, such as meta NOINDEX.

Prioritization often depends on scope, too. All else being equal, one 500 error may be more important than one duplicate page, but 10,000 duplicate pages is a different matter. Go to the bottom of the Site Crawl Overview Page, and we’ve attempted to balance priority and scope to target your top three issues to fix:

Moving forward, we’re going to be launching more intelligent prioritization, including grouping issues by folder and adding data visualization of your known issues. Prioritization is a difficult task and one we haven’t helped you do as well as we could. We’re going to do our best to change that.

Dive in & tell us what you think!

All existing customers should have access to the new Site Crawl as of earlier this morning. Even better, we’ve been crawling existing campaigns with the Aardwolf engine for a couple of weeks, so you’ll have history available from day one! Stay tuned for a blog post tomorrow on effectively prioritizing Site Crawl issues, and be sure to register for the upcoming webinar.

Sign up for The Moz Top 10, a semimonthly mailer updating you on the top ten hottest pieces of SEO news, tips, and rad links uncovered by the Moz team. Think of it as your exclusive digest of stuff you don’t have time to hunt down but want to read!


Moz Blog

Related Articles

Posted in Latest NewsComments Off

Site Crawl, Day 1: Where Do You Start?

Posted by Dr-Pete

When you’re faced with the many thousands of potential issues a large site can have, where do you start? This is the question we tried to tackle when we rebuilt Site Crawl. The answer depends almost entirely on your site and can require deep knowledge of its history and goals, but I’d like to outline a process that can help you cut through the noise and get started.

Simplistic can be dangerous

Previously, we at Moz tried to label every issue as either high, medium, or low priority. This simplistic approach can be appealing, even comforting, and you may be wondering why we moved away from it. This was a very conscious decision, and it boils down to a couple of problems.

First, prioritization depends a lot on your intent. Misinterpreting your intent can lead to bad advice that ranges from confusing to outright catastrophic. Let’s say, for example, that we hired a brand-new SEO at Moz and they saw the following issue count pop up:

Almost 35,000 NOINDEX tags?! WHAT ABOUT THE CHILDREN?!!

If that new SEO then rushed to remove those tags, they’d be doing a lot of damage, not realizing that the vast majority of those directives are intentional. We can make our systems smarter, but they can’t read your mind, so we want to be cautious about false alarms.

Second, bucketing issues by priority doesn’t do much to help you understand the nature of those problems or how to go about fixing them. We now categorize Site Crawl issues into one of five descriptive types:

  • Critical Crawler Issues
  • Crawler Warnings
  • Redirect Issues
  • Metadata Issues
  • Content Issues

Categorizing by type allows you to be more tactical. The issues in our new “Redirect” category, for example, are going to have much more in common, which means they potentially have common fixes. Ultimately, helping you find problems is just step one. We want to do a better job at helping you fix them.

1. Start with Critical Crawler Issues

That’s not to say everything is subjective. Some problems block crawlers (not just ours, but search engines) from getting to your pages at all. We’ve grouped these “Critical Crawler Issues” into our first category, and they currently include 5XX errors, 4XX errors, and redirects to 4XX. If you have a sudden uptick in 5XX errors, you need to know, and almost no one intentionally redirects to a 404.

You’ll see Critical Crawler Issues highlighted throughout the Site Crawl interface:

Look for the red alert icon to spot critical issues quickly. Address these problems first. If a page can’t be crawled, then every other crawler issue is moot.

2. Balance issues with prevalence

When it comes to solving your technical SEO issues, we also have to balance severity with quantity. Knowing nothing else about your site, I would say that a 404 error is probably worth addressing before duplicate content — but what if you have eleven 404s and 17,843 duplicate pages? Your priorities suddenly look very different.

At the bottom of the Site Crawl home, check out “Moz Recommends Fixing”:

We’ve already done some of the math for you, weighting urgency by how prevalent the issue is. This does require some assumptions about prioritization, but if your time is limited, we hope it at least gives you a quick starting point to solve a couple of critical issues.

3. Solve multi-page issues

There’s another advantage to tackling issues with high counts. In many cases, you might be able to solve issues on hundreds (or even thousands) of pages with a single fix. This is where a more tactical approach can save you a lot of time and money.

Let’s say, for example, that I want to dig into my 916 pages on Moz.com missing meta descriptions. I immediately notice that some of these pages are blog post categories. So, I filter by URL:

I can quickly see that these pages account for 392 of my missing descriptions — a whopping 43% of them. If I’m concerned about this problem, then it’s likely that I could solve it with a fairly simple CMS page, wiping out hundreds of issues with a few lines of code.

In the near future, we hope to do some of this analysis for you, but if filtering isn’t doing the job, you can also export any list of issues to CSV. Then, pivot and filter to your heart’s content.

4. Dive into pages by PA & crawl depth

If you can’t easily spot clear patterns, or if you’ve solved some of those big issues, what next? Fixing thousands of problems one URL at a time is only worthwhile if you know those URLs are important.

Fortunately, you can now sort by Page Authority (PA) and Crawl Depth in Site Crawl. PA is our own internal metric of ranking ability (primarily powered by link equity), and Crawl Depth is the distance of a page from the home-page:

Here, I can see that there’s a redirect chain in one of our MozBar URLs, which is a very high-authority page. That’s probably one worth fixing, even if it isn’t part of an obvious, larger group.

5. Watch for spikes in new issues

Finally, as time goes on, you’ll also want to be alert to new issues, especially if they appear in large numbers. This could indicate a sudden and potentially damaging change. Site Crawl now makes tracking new issues easy, including alert icons, graphs, and a quick summary of new issues by category:

Any crawl is going to uncover some new pages (the content machine never rests), but if you’re suddenly seeing hundreds of new issues of a single type, it’s important to dig in quickly and make sure nothing’s wrong. In a perfect world, the SEO team would always know what changes other people and teams made to the site, but we all know it’s not a perfect world.

I hope this gives you at least a few ideas for how to quickly dive into your site’s technical SEO issues. If you’re an existing customer, you already have access to Moz’s new Site Crawl and all of the features discussed in this post.

Sign up for The Moz Top 10, a semimonthly mailer updating you on the top ten hottest pieces of SEO news, tips, and rad links uncovered by the Moz team. Think of it as your exclusive digest of stuff you don’t have time to hunt down but want to read!


Moz Blog

Posted in Latest NewsComments Off

Tackling Tag Sprawl: Crawl Budget, Duplicate Content, and User-Generated Content

Posted by rjonesx.

Alright, so here’s the situation. You have a million-product website. Your competitors have a lot of the same products. You need unique content. What do you do? The same thing everyone does — you turn to user-generated content. Problem solved, right?

User-generated content (UGC) can be an incredibly valuable source of content and organization, helping you build natural language descriptions and human-driven organization of site content. One common feature used by sites to take advantage of user-created content are tags, found everywhere from e-commerce sites to blogs. Webmasters can leverage tags to power site search, create taxonomies and categories of products for browsing, and to provide rich descriptions of site content.

This is a logical and practical approach, but can cause intractable SEO problems if left unchecked. For mega-sites, manually moderating millions of user-submitted tags can be cumbersome (if not wholly impossible). Leaving tags unchecked, though, can create massive problems with thin content, duplicate content, and general content sprawl. In our case study below, three technical SEOs from different companies joined forces to solve a massive tag sprawl problem. The project was led by Jacob Bohall, VP of Marketing at Hive Digital, while computational statistics services were provided by J.R. Oakes of Adapt Partners and Russ Jones of Moz. Let’s dive in.

What is tag sprawl?

We define tag sprawl as the unchecked growth of unique, user-contributed tags resulting in a large amount of near-duplicate pages and unnecessary crawl space. Tag sprawl generates URLs likely to be classified as doorway pages, pages appearing to exist only for the purpose of building an index across an exhaustive array of keywords. You’ve probably seen this in its most basic form in the tagging of posts across blogs, which is why most SEOs recommend a blanket “noindex, follow” across tag pages in WordPress sites. This simple approach can be an effective solution for small blog sites, but is not often the solution for major e-commerce sites that rely more heavily on tags for categorizing products.

The three following tag clouds represent a list of user-generated terms associated with different stock photos. Note: User behavior is generally to place as many tags as possible in an attempt to ensure maximum exposure for their products.

  1. USS Yorktown, Yorktown, cv, cvs-10, bonhomme richard, revolutionary war-ships, war-ships, naval ship, military ship, attack carriers, patriots point, landmarks, historic boats, essex class aircraft carrier, water, ocean
  2. ship, ships, Yorktown, war boats, Patriot pointe, old war ship, historic landmarks, aircraft carrier, war ship, naval ship, navy ship, see, ocean
  3. Yorktown ship, Warships and aircraft carriers, historic military vessels, the USS Yorktown aircraft carrier

As you can see, each user has generated valuable information for the photos, which we would want to use as a basis for creating indexable taxonomies for related stock images. However, at any type of scale, we have immediate threats of:

  • Thin content: Only a handful of products share the user-generated tag when a user creates a more specific/defining tag, e.g. “cvs-10″
  • Duplicate and similar content: Many of these tags will overlap, e.g. “USS Yorktown” vs. “Yorktown,” “ship” vs. “ships,” “cv” vs. “cvs-10,” etc.
  • Bad content: Created by improper formatting, misspellings, verbose tags, hyphenation, and similar mistakes made by users.

Now that you understand what tag sprawl is and how it negatively effects your site, how can we address this issue at scale?

The proposed solution

In correcting tag sprawl, we have some basic (at the surface) problems to solve. We need to effectively review each tag in our database and place them in groups so further action can be taken. First, we determine the quality of a tag (how likely is someone to search for this tag, is it spelled correctly, is it commercial, is it used for many products) and second, we determine if there is another tag very similar to it that has a higher quality.

  1. Identify good tags: We defined a good tag as term capable of contributing meaning, and easily justifiable as an indexed page in search results. This also entailed identifying a “master” tag to represent groups of similar terms.
  2. Identify bad tags: We wanted to isolate tags that should not appear in our database due to misspellings, duplicates, poor format, high ambiguity, or likely to cause a low-quality page.
  3. Relate bad tags to good tags: We assumed many of our initial “bad tags” could be a range of duplicates, i.e. plural/singular, technical/slang, hyphenated/non-hyphenated, conjugations, and other stems. There could also be two phrases which refer to the same thing, like “Yorktown ship” vs. “USS Yorktown.” We need to identify these relationships for every “bad” tag.

For the project inspiring this post, our sample tag database comprised over 2,000,000 “unique” tags, making this a nearly impossible feat to accomplish manually. While theoretically we could have leveraged Mechanical Turk or similar platform to get “manual” review, early tests of this method proved to be unsuccessful. We would need a programmatic method (several methods, in fact) that we could later reproduce when adding new tags.

The methods

Keeping the goal in mind of identifying good tags, labeling bad tags, and relating bad tags to good tags, we employed more than a dozen methods, including: spell correction, bid value, tag search volume, unique visitors, tag count, Porter stemming, lemmatization, Jaccard index, Jaro-Winkler distance, Keyword Planner grouping, Wikipedia disambiguation, and K-Means clustering with word vectors. Each method either helped us determine whether the tag was valuable and, if not, helped us identify an alternate tag that was valuable.

Spell correction

  • Method: One of the obvious issues with user-generated content is the occurrence of misspellings. We would regularly find misspellings where semicolons are transposed for the letter “L” or words have unintended characters at the beginning or end. Luckily, Linux has an excellent built-in spell checker called Aspell which we were able to use to fix a large volume of issues.
  • Benefits: This offered a quick, early win in that it was fairly easy to identify bad tags when they were composed of words that weren’t included in the dictionary or included characters that were simply inexplicable (like a semicolon in the middle of a word). Moreover, if the corrected word or phrase occurred in the tag list, we could trust the corrected phrase as a potentially good tag, and relate the misspelled term to the good tag. Thus, this method help us both filter bad tags (misspelled terms) and find good tags (the spell-corrected term)
  • Limitations: The biggest limitation with this methodology was that combinations of correctly spelled words or phrases aren’t necessarily useful for users or the search engine. For example, many of the tags in the database were concatenations of multiple tags where the user space-delimited rather than comma-delimited their submitted tags. Thus, a tag might consist of correctly spelled terms but still be useless in terms of search value. Moreover, there were substantial dictionary limitations, especially with domain names, brand names, and Internet slang. In order to accommodate this, we added a personal dictionary that included a list of the top 10,000 domains according to Quantcast, several thousand brands, and a slang dictionary. While this was helpful, there were still several false recommendations that needed to be handled. For example, we saw “purfect” correct to “perfect,” despite being a pop-culture reference for cat images. We also noticed some users reference this saying as “purrfect,” “purrrfect,” “purrrrfect,” “purrfeck,” etc. Ultimately, we had to rely on other metrics to determine whether we trusted the misspelling recommendations.

Bid value

  • Method: While a tag might be good in the sense that it is descriptive, we wanted tags that were commercially relevant. Using the estimated cost-per-click of the tag or tag phrase proved useful in making sure that the term could attract buyers, not just visitors.
  • Benefits: One of the great features of this methodology is that it tends to have a high signal-to-noise ratio. Most tags that have high CPCs tend to be commercially relevant and searched frequently enough to warrant inclusion as “good tags.” In many cases we could feel confident that a tag was good just on this metric alone.
  • Limitations: However, the bid value metric comes with some pretty big limitations, too. For starters, Google Keyword Planner’s disambiguation problem is readily apparent. Google combines related keywords together when reporting search volume and CPC data, which means a tag like “facbook” would return the same data as “facebook.” Obviously, we would prefer to map “facbook” to “facebook” rather than keep both tags, so in some cases the CPC metric wasn’t sufficient to identify good tags. A further limitation of the bid value was the difficulty of acquiring CPC data. Google now requires running active Adwords campaigns to get access to CPC value. It is no simple feat to look up 5,000,000 keywords in Google Keyword Planner, even if you have a sufficient account. Luckily, we felt comfortable that historical data would be trustworthy enough, so we didn’t need to acquire fresh data.

Tag search volume

  • Method: Similar to CPC, we could use search volume to determine the potential value of a tag. We had to be careful not to rely on the tag itself, though, since the tag could be so generic that it earns traffic unrelated to the product itself. For example, the tag “USS Yorktown” might get a few hundred searches a month, but “USS Yorktown T-shirt” gets 0. For all of the tags in our index, we tracked down the search volume for the tag plus the product name, in order to make sure we had good estimates of potential product traffic.
  • Benefits: Like CPC, this metric did a very good job of consolidating our tag data set to just keywords that were likely to deliver traffic. In the vast majority of cases, if “tag + product” had search volume, we could feel confident that it is a good term.
  • Limitations: Unfortunately, this method fell victim to the same disambiguation problem that CPC presents. Because Google groups terms together, it is possible that on some occasions two tags will be given the same metrics. For example: “pontoons boat,” “pontoonboat,” “pontoon boats,” “pontoon boat,” “pontoon boating,” and “pontoons boats” were in the same traffic volume group which also included tags like “yacht” and “yachts.” Moreover, there is no accounting for keyword difficulty in this metric. Some tags, when combined with product types, produce keywords that receive substantial traffic but will always be out of reach for a templated tag page.

Unique visitors

  • Method: This method was a no-brainer: protect the tags that already receive traffic from Google. We exported all of the tags from Google Analytics that had received search traffic from Google in the last 12 months. Generally speaking, this should be a fairly safe list of terms.
  • Benefits: When doing experimental work with a client, it is always nice to be able to give them a scenario that almost guarantees improvement. Because we were able to protect tags that already receive traffic by labeling them as good (in the vast majority of cases), we could ensure that the client had a high probability of profiting from the changes we made and minimal risk of any traffic loss.
  • Limitations: Unfortunately, even this method wasn’t perfect. If a product (or set of products) with high enough authority included a poor variation of a tag, then the bad variant would rank and receive traffic. We had to use other strategies to verify our selections from this method and devise a method to encourage a tag swap in the index for the correct version of a term.

Tag count

  • Description: The frequency with which a tag was used on the site was often a strong signal that we could trust the tag, especially when compared with other similar tags. By counting the number of times each tag was used on the site, we could bias our final set of trusted tags in favor of these more popular terms.
  • Benefits: This was a great tie-breaker metric when we had two tags that were very similar but needed to choose just one. For example, sometimes two variants of a phrase were completely acceptable (such as a version with and without a hyphen). We could simply defer to the one with a higher tag count.
  • Limitations: The clear limitation of tag frequency is that many of the most frequent tags were too generic to be useful. The tag “blue” isn’t particularly useful when it just helps people find “blue t-shirts.” The term is too generic and too competitive to warrant inclusion. Additionally, the inclusion of too broad of a tag would simply create a very large crawl vs. traffic-potential ratio. A common tag will have hundreds if not thousands of matching products, creating many pages of products for the single tag. If a tag produces 50 paginated product listings, but only has the potential to drive 10 visitors a year, it might not be worth it.

Porter stemming

  • Method: Stemming is a method used to identify the root word from a tag by scanning the word right to left and using various pattern matching rules to remove characters (suffixes) until you arrive at the word’s stem. There are a couple of popular stemmers available, but we found Porter stemming to be more accurate as a tool for seeing alternative word forms. You can geek out by looking at the Porter stemming algorithm in Snowball here, or you can play with a JS version here.
  • Benefits: Plural and possessive terms can be grouped by their stem for further analysis. Running Porter stemming on the terms “pony” and “ponies” will return “poni” as the stem, which can then be used to group terms for further analysis. You can also run Porter stemming on phrases. For example, “boating accident,” “boat accidents,” “boating accidents,” etc. share the stem “boat accid.” This can be a crude and quick method for grouping variations. Porter stemming also is able to clean text more kindly, where others stemmers can be too aggressive for our efforts; e.g., Lancaster stemmer reduces “woman” to “wom,” while Porter stemmer leaves it as “woman.”
  • Limitations: Stemming is intended for finding a common root for terms and phrases, and does not create any type of indication as to the proper form of a term. The Porter stemming method applies a fixed set of rules to the English language by blanket removing trailing “s,” “e,” “ance,” “ing,” and similar word endings to try and find the stem. For this to work well, you have to have all of the correct rules (and exceptions) in place to get the correct stems in all cases. This can be particularly problematic with words that end in S but are not plural, like “billiards” or “Brussels.” Additionally, this method does not help with mapping related terms such as “boat crash,” “crashed boat,” “boat accident,” etc. which would stem to “boat crash,” “crash boat,” and “boat acci.”

Lemmatization

  • Method: Lemmatization works similarly to stemming. However, instead of using a rule set for editing words by removing letters to arrive at a stem, lemmatization attempts to map the term to its most simple dictionary form, such as WordNet, and return a canonical “lemma” of the word. A crude way to think about lemmatization is just simplifying a word. Here’s an API to check out.
  • Benefits: This method often works better than stemming. Terms like “ship,” “shipped,” and “ships” are all mapped to “ship” by this method, while “shipping” or “shipper,” which are terms that have distinct meaning despite the same stem, are retained. You can create an array of “lemma” from phrases which can be compared to other phrases resolving word order issues. This proved to be a more reliable method for grouping variations than stemming.
  • Limitations: As with many of the methods, context for mapping related terms can be difficult. Lemmatization can provide better filters for context, but to do so generally relies on identifying the word form (noun, adjective, etc) to appropriately map to a root term. Given the inconsistency of the user-generated content, it is inaccurate to assume all words are in adjective form (describing a product), or noun form (the product itself). This inconsistency can present wild results. For example, “strip socks” could be intended as as a tag for socks with a strip of color on them, such as as “striped socks,” or it could be “stripper socks” or some other leggings that would be a match only found if there other products and tags to compare for context. Additionally, it doesn’t create associations between all related words, just textual derivatives, so you are still seeking out a canonical between mailman, courier, shipper, etc.

Jaccard index

  • Method: The Jaccard index is a similarity coefficient measured by Intersection over Union. Now, don’t run off just yet, it is actually quite straightforward.

    Imagine you had two piles with 3 marbles in each: Red, Green, and Blue in the first, Red, Green and Yellow in the second. The “Intersection” of these two piles would be Red and Green, since both piles have those two colors. The “Union” would be Red, Green, Blue and Yellow, since that is the complete list of all the colors. The Jaccard index would be 2 (Red and Green) divided by 4 (Red, Green, Blue, and Yellow). Thus, the Jaccard index of these two piles would be .5. The higher the Jaccard index, the more similar the two sets.
    So what does this have to do with tags? Well, imagine we have two tags: “ocean” and “sea.” We can get a list of all of the products that have the tag “ocean” and “sea.” Finally, we get the Jaccard index of those two sets. The higher the score, the more related they are. Perhaps we find that 70% of the products with the tag “ocean” also have the tag “sea”; we now know that the two are fairly well-related. However, when we run the same measurement to compare “basement” or “casement,” we find that they only have a Jaccard index of .02. Even though they are very similar in terms of characters, they mean quite different things. We can rule out mapping the two terms together.

  • Benefits: The greatest benefit of using the Jaccard index is that it allows us to find highly related tags which may have absolutely no textual characteristics in common, and are more likely to have an overly similar or duplicate results set. While most of the the metrics we have considered so far help us find “good” or “bad” tags, the Jaccard index helps us find “related” tags without having to do any complex machine learning.
  • Limitations: While certainly useful, the Jaccard index methodology has its own problems. The biggest issue we ran into had to do with tags that were used together nearly all the time but weren’t substitutes of one another. For example, consider the tags “babe ruth” and his nickname, “sultan of swat.” The latter tag only occurred on products which also had the “babe ruth” tag (since this was one of his nicknames), so they had quite a high Jaccard index. However, Google doesn’t map these two terms together in search, so we would prefer to keep the nickname and not simply redirect it to “babe ruth.” We needed to dig deeper if we were to determine when we should keep both tags or when we should redirect one to another. As a standalone, this method also was not sufficient at identifying cases where a user consistently misspelled tags or used incorrect syntax, as their products would essentially be orphans without “union.”

Jaro-Winkler distance

  • Method: There are several edit distance and string similarity metrics that we used throughout this process. Edit Distance is simply some measurement of how difficult it is to change one word to another. For example, the most basic edit distance metric, Levenshtein distance, between “Russ Jones” and “Russell Jones” is 3 (you have to add “E”,”L”, and “L” to transform Russ to Russell). This can be used to help us find similar words and phrases. In our case, we used a particular edit distance measure called “Jaro-Winkler distance” which gives higher precedence to words and phrases that are similar at the beginning. For example, “Baseball” would be closer to “Baseballer” than to “Basketball” because the differences are at the very end of the term.
  • Benefits: Edit distance metrics helped us find many very similar variants of tags, especially when the variants were not necessarily misspellings. This was particularly valuable when used in conjunction with the Jaccard index metrics, because we could apply a character-level metric on top of a character-agnostic metric (i.e. one that cares about the letters in the tag and one that doesn’t).
  • Limitations: Edit distance metrics can be kind of stupid. According to Jaro-Winkler distance, “Baseball” and “Basketball” are far more related to one another than “Baseball” and “Pitcher” or “Catcher.” “Round” and “Circle” have a horrible edit distance metric, while “Round” and “Pound” look very similar. Edit distance simply cannot be used in isolation to find similar tags.

Keyword Planner grouping

  • Method: While Google’s choice to combine similar keywords in Keyword Planner has been problematic for predicting traffic, it has actually offered us a new method to identify highly related terms. Whenever two tags share identical metrics from Google Keyword Planner (average monthly traffic, historical traffic, CPC, and competition), we can conclude that there is an increased chance the two are related to one another.
  • Benefits: This method is extremely useful for acronyms (which are particularly difficult to detect). While Google groups together COO and Chief Operating Officer, you can imagine that standard methods like those outlined above might have problems detecting the relationship.
  • Limitations: The greatest drawback for this methodology was that it created numerous false positives among less popular terms. There are just too many keywords which have an annual search volume average of 10, are searched 10 times monthly, and have a CPC and competition of 0. Thus, we had to limit the use of this methodology to more popular terms where there were only a handful of matches.

Wikipedia disambiguation

  • Method: Many of the methods above are great for grouping similar/related terms, but do not provide a high-confidence method for determining the “master” term or phrase to represent a grouping of related/duplicate terms. While considerations can be made for testing all tags against an English language model, the lack of pop culture references and phrases makes it unreliable. To do this effectively, we found Wikipedia to be a trusted source for identifying the proper spelling, tense, formatting, and word order for any given tag. For example, if users tagged a product as “Lord of the Rings,” “LOTR,” and “The Lord of the Rings,” it can be difficult to determine which tag should be preferred (certainly we don’t need all 3). If you search Wikipedia for these terms, you will see that they redirect you to the page titled “The Lord of the Rings.” In many cases, we can trust their canonical variant as the “good tag.” Please note that we don’t encourage scraping any website or violating their terms of use. Wikipedia does offer an export of their entire database that can be used for research purposes.
  • Benefits: When a tag could be mapped to a Wikipedia entry, this method proved to be a highly effective at providing validation that a tag had potential value, or creating a point of reference for related tags. If the Wikipedia community felt a tag or tag phrase was important enough to have an article dedicated to it, then the tag was more likely to be a valuable term vs. random entry or keyword stuffing by the user. Further, the methodology allows for grouping related terms without any bias on word order. Doing a search on Wikipedia creates a search results page (“pontoon boats”), or redirects you to a correction of the article (“disneyworld” becomes “Walt Disney World”). Wikipedia also tends to have entries for some pop culture references, so things that would get flagged as a misspelling, such as “lolcats,” can be vindicated by the existence of a matching Wikipedia article.
  • Limitations: While Wikipedia is effective at delivering a consistent formal tag for disambiguation, it can at times be more sterile than user-friendly. This can run counter to other signals such as CPC or traffic volume methods. For example, “pontoon boats” becomes “Pontoon (Boat)”, or “Lily” becomes “lilium.” All signals indicate the former case as the most popular, but Wikipedia disambiguation suggests the latter to be the correct usage. Wikipedia also contains entries for very broad terms, like each number, year, letter, etc. so simply applying a rule that any Wikipedia article is an allowed tag would continue to contribute to tag sprawl problems.

K-means clustering with word vectors

  • Method: Finally, we attempted to transform the tags into a subset of more meaningful tags using word embeddings and k-means clustering. Generally, the process involved transforming the tags into tokens (individual words), then refining by part-of-speech (noun, verb, adjective), and finally lemmatizing the tokens (“blue shirts” becomes “blue shirt”). From there, we transformed all the tokens into a custom Word2Vec embedding model based on adding the vectors of each resulting token array. We created a label array and a vector array of each tag in the dataset, then ran k-means with 10 percent of the total count of the tags as the value for number of centroids. At first we tested on 30,000 tags and obtained reasonable results.
    Once k-means had completed, we pulled all of the centroids and obtained their nearest relative from the custom Word2Vec model, then we assigned the tags to their centroid category in the main dataset.
    Tag Tokens Tag Pos Tag Lemm. Categorization
    ['beach', 'photographs'] [('beach', 'NN'), ('photographs', 'NN')] ['beach', 'photograph'] beach photo
    ['seaside', 'photographs'] [('seaside', 'NN'), ('photographs', 'NN')] ['seaside', 'photograph'] beach photo
    ['coastal', 'photographs'] [('coastal', 'JJ'), ('photographs', 'NN')] ['coastal', 'photograph'] beach photo
    ['seaside', 'photographs'] [('seaside', 'NN'), ('photographs', 'NN')] ['seaside', 'photograph'] beach photo
    ['seaside', 'posters'] [('seaside', 'NN'), ('posters', 'NNS')] ['seaside', 'poster'] beach photo
    ['coast', 'photographs'] [('coast', 'NN'), ('photographs', 'NN')] ['coast', 'photograph'] beach photo
    ['beach', 'photos'] [('beach', 'NN'), ('photos', 'NNS')] ['beach', 'photo'] beach photo

    The Categorization column above was the centroid selected by Kmeans. Notice how it handled the matching of “seaside” to “beach” and “coastal” to “beach.”

  • Benefits: This method seemed to do a good job of finding associations between the tags and their categories that were more semantic than character-driven. “Blue shirt” might be matched to “clothing.” This was obviously not possible without the semantic relationships found within the vector space.
  • Limitations: Ultimately, the chief limitation that we encountered was trying to run k-means on the full two million tags while ending up with 200,000 categories (centroids). Sklearn for Python allows for multiple concurrent jobs, but only across the initialization of the centroids, which in this case was 11 — meaning that even if you ran on a 60-core processor, the number of concurrent jobs was limited by the number of initialization, which in this case, was again 11. We tried PCA (principal component analysis) to reduce the vector sizes (300 to 10) but the results were overall poor. Finally, because embeddings are generally built based on probabilistic closeness of terms in the corpus on which they were trained, there were matches that you could understand why they matched, but would obviously not have been the correct category (eg “19th century art” was picked as a category for “18th century art”). Finally, context matters and the word embeddings obviously suffer from understanding the difference between “duck” (the animal) and “duck” (the action).

Bringing it all together

Using a combination of the methods above, we were able to develop a series of methodology confidence scores that could be applied to any tag in our dataset, generating a heuristic for how to consider each tag going forward. These were case-level strategies to determine the appropriate methodology. We denoted these as follows:

  • Good Tags: This mostly started as our “do not touch” list of terms which already received traffic from Google. After some confirmation exercises, the list was expanded to include unique terms with rankings potential, commercial appeal, and unique product sets to deliver to customers. For example, a heuristic for this category might look like this:
    1. If tag is identical to Wikipedia entry and
    2. Tag + product has estimated search traffic and
    3. Tag has CPC value then
    4. Mark as “Good Tag”
  • Okay Tags: This represents terms that we would like to retain associated with products and their descriptions, as they could be used within the site to add context to a page, but do not warrant their own indexable space. These tags were mapped to be redirected or canonicaled to a “master,” but still included on a page for topical relevancy, natural language queries, long-tail searches, etc. For example, a heuristic for this category might look like this:
    1. If tag is identical to Wikipedia entry but
    2. Tag + product has no search volume
    3. Vector tag matches a “Good Tag”
    4. Mark as “Okay Tag” and redirect to “Good Tag”
  • Bad Tags to Remap: This grouping represents bad tags that were mapped to a replacement. These tags would literally be deleted and replaced with a corrected version. These were most often misspellings or terms discovered through stemming/lemmatization/etc. where a dominant replacement was identified. For example, a heuristic for this category might look like this:
    1. If tag is not identical to either Wikipedia or vector space and
    2. Tag + product has no search volume
    3. Tag has no volume
    4. Tag Wikipedia entry matches a “Good Tag”
    5. Mark as “Bad Tag to Remap”
  • Bad Tags to Remove: These are tags that were flagged as bad tags that could not be related to a good tag. Essentially, these needed to be removed from our database completely. This final group represented the worst of the worst in the sense that the existence of the tag would likely be considered a negative indicator of site quality. Considerations were made for character length of tags, lack of Wikipedia entries, inability to map to word vectors, no previous traffic, no predicted traffic or CPC value, etc. In many cases, these were nonsense phrases.

All together, we were able to reduce the number of tags by 87.5%, consolidating the site down to a reasonable, targeted, and useful set of tags which properly organized the corpus without wasting either crawl budget or limiting user engagement.

Conclusions: Advanced white hat SEO

It was nearly nine years ago that a well-known black hat SEO called out white hat SEO as being simple, stale, and bereft of innovation. He claimed that “advanced white hat SEO” was an oxymoron — it simply did not exist. I was proud at the time to respond to his claims with a technique Hive Digital was using which I called “Second Page Poaching.” It was a great technique, but it paled in comparison to the sophistication of methods we now see today. I never envisioned either the depth or breadth of technical proficiency which would develop within the white hat SEO community for dealing with unique but persistent problems facing webmasters.

I sincerely doubt most of the readers here will have the specific tag sprawl problem described above. I’d be lucky if even a few of you have run into it. What I hope is that this post might disabuse us of any caricatures of white hat SEO as facile or stagnant and inspire those in our space to their best work.

Sign up for The Moz Top 10, a semimonthly mailer updating you on the top ten hottest pieces of SEO news, tips, and rad links uncovered by the Moz team. Think of it as your exclusive digest of stuff you don’t have time to hunt down but want to read!


Moz Blog

Posted in Latest NewsComments Off

How to Fix Crawl Errors in Google Search Console

Posted by Joe.Robison

A lot has changed in the five years since I first wrote about what was Google Webmaster Tools, now named Google Search Console. Google has unleashed significantly more data that promises to be extremely useful for SEOs. Since we’ve long since lost sufficient keyword data in Google Analytics, we’ve come to rely on Search Console more than ever. The “Search Analytics” and “Links to Your Site” sections are two of the top features that did not exist in the old Webmaster Tools.

While we may never be completely satisfied with Google’s tools and may occasionally call their bluffs, they do release some helpful information (from time to time). To their credit, Google has developed more help docs and support resources to aid Search Console users in locating and fixing errors.

Despite the fact that some of this isn’t as fun as creating 10x content or watching which of your keywords have jumped in the rankings, this category of SEO is still extremely important.

Looking at it through Portent’s epic visualization of how Internet marketing pieces fit together, fixing crawl errors in Search Console fits squarely into the “infrastructure” piece:

If you can develop good habits and practice preventative maintenance, weekly spot checks on crawl errors will be perfectly adequate to keep them under control. However, if you fully ignore these (pesky) errors, things can quickly go from bad to worse.

Crawl Errors layout

One change that has evolved over the last few years is the layout of the Crawl Errors view within Search Console. Search Console is divided into two main sections: Site Errors and URL Errors.

Categorizing errors in this way is pretty helpful because there’s a distinct difference between errors at the site level and errors at the page level. Site-level issues can be more catastrophic, with the potential to damage your site’s overall usability. URL errors, on the other hand, are specific to individual pages, and are therefore less urgent.

The quickest way to access Crawl Errors is from the dashboard. The main dashboard gives you a quick preview of your site, showing you three of the most important management tools: Crawl Errors, Search Analytics, and Sitemaps.

You can get a quick look at your crawl errors from here. Even if you just glance at it daily, you’ll be much further ahead than most site managers.

1. Site Errors

The Site Errors section shows you errors from your website as a whole. These are the high-level errors that affect your site in its entirety, so don’t skip these.

In the Crawl Errors dashboard, Google will show you these errors for the last 90 days.

If you have some type of activity from the last 90 days, your snippet will look like this:

If you’ve been 100% error-free for the last 90 days with nothing to show, it will look like this:

That’s the goal — to get a “Nice!” from Google. As SEOs we don’t often get any validation from Google, so relish this rare moment of love.

How often should you check for site errors?

In an ideal world you would log in daily to make sure there are no problems here. It may get monotonous since most days everything is fine, but wouldn’t you kick yourself if you missed some critical site errors?

At the extreme minimum, you should check at least every 90 days to look for previous errors so you can keep an eye out for them in the future — but frequent, regular checks are best.

We’ll talk about setting up alerts and automating this part later, but just know that this section is critical and you should be 100% error-free in this section every day. There’s no gray area here.

A) DNS Errors

What they mean

DNS errors are important — and the implications for your website if you have severe versions of these errors is huge.

DNS (Domain Name System) errors are the first and most prominent error because if the Googlebot is having DNS issues, it means it can’t connect with your domain via a DNS timeout issue or DNS lookup issue.

Your domain is likely hosted with a common domain company, like Namecheap or GoDaddy, or with your web hosting company. Sometimes your domain is hosted separately from your website hosting company, but other times the same company handles both.

Are they important?

While Google states that many DNS issues still allow Google to connect to your site, if you’re getting a severe DNS issue you should act immediately.

There may be high latency issues that do allow Google to crawl the site, but provide a poor user experience.

A DNS issue is extremely important, as it’s the first step in accessing your website. You should take swift and violent action if you’re running into DNS issues that prevent Google from connecting to your site in the first place.

How to fix

  1. First and foremost, Google recommends using their Fetch as Google tool to view how Googlebot crawls your page. Fetch as Google lives right in Search Console.

    If you’re only looking for the DNS connection status and are trying to act quickly, you can fetch without rendering. The slower process of Fetch and Render is useful, however, to get a side-by-side comparison of how Google sees your site compared to a user.

  2. Check with your DNS provider. If Google can’t fetch and render your page properly, you’ll want to take further action. Check with your DNS provider to see where the issue is. There could be issues on the DNS provider’s end, or it could be worse.
  3. Ensure your server displays a 404 or 500 error code. Instead of having a failed connection, your server should display a 404 (not found) code or a 500 (server error) code. These codes are more accurate than having a DNS error.

Other tools

  • ISUP.me – Lets you know instantly if your site is down for everyone, or just on your end.
  • Web-Sniffer.net – shows you the current HTTP(s) request and response header. Useful for point #3 above.

B) Server Errors

What they mean

A server error most often means that your server is taking too long to respond, and the request times out. The Googlebot that’s trying to crawl your site can only wait a certain amount of time to load your website before it gives up. If it takes too long, the Googlebot will stop trying.

Server errors are different than DNS errors. A DNS error means the Googlebot can’t even lookup your URL because of DNS issues, while server errors mean that although the Googlebot can connect to your site, it can’t load the page because of server errors.

Server errors may happen if your website gets overloaded with too much traffic for the server to handle. To avoid this, make sure your hosting provider can scale up to accommodate sudden bursts of website traffic. Everybody wants their website to go viral, but not everybody is ready!

Are they important?

Like DNS errors, a server error is extremely urgent. It’s a fundamental error, and harms your site overall. You should take immediate action if you see server errors in Search Console for your site.

Making sure the Googlebot can connect to the DNS is an important first step, but you won’t get much further if your website doesn’t actually show up. If you’re running into server errors, the Googlebot won’t be able to find anything to crawl and it will give up after a certain amount of time.

How to fix

In the event that your website is running fine at the time you encounter this error, that may mean there were server errors in the past Though this error may have been resolved for now, you should still make some changes to prevent it from happening again.

This is Google’s official direction for fixing server errors:

“Use Fetch as Google to check if Googlebot can currently crawl your site. If Fetch as Google returns the content of your homepage without problems, you can assume that Google is generally able to access your site properly.”

Before you can fix your server errors issue, you need to diagnose specifically which type of server error you’re getting, since there are many types:

  • Timeout
  • Truncated headers
  • Connection reset
  • Truncated response
  • Connection refused
  • Connect failed
  • Connect timeout
  • No response

Addressing how to fix each of these is beyond the scope of this article, but you should reference Google Search Console help to diagnose specific errors.

C) Robots failure

A Robots failure means that the Googlebot cannot retrieve your robots.txt file, located at [yourdomain.com]/robots.txt.

What they mean

One of the most surprising things about a robots.txt file is that it’s only necessary if you don’t want Google to crawl certain pages.

From Search Console help, Google states:

“You need a robots.txt file only if your site includes content that you don’t want search engines to index. If you want search engines to index everything in your site, you don’t need a robots.txt file — not even an empty one. If you don’t have a robots.txt file, your server will return a 404 when Googlebot requests it, and we will continue to crawl your site. No problem.”

Are they important?

This is a fairly important issue. For smaller, more static websites without many recent changes or new pages, it’s not particularly urgent. But the issue should still be fixed.

If your site is publishing or changing new content daily, however, this is an urgent issue. If the Googlebot cannot load your robots.txt, it’s not crawling your website, and it’s not indexing your new pages and changes.

How to fix

Ensure that your robots.txt file is properly configured. Double-check which pages you’re instructing the Googlebot to not crawl, as all others will be crawled by default. Triple-check the all-powerful line of “Disallow: /” and ensure that line DOES NOT exist unless for some reason you do not want your website to appear in Google search results.

If your file seems to be in order and you’re still receiving errors, use a server header checker tool to see if your file is returning a 200 or 404 error.

What’s interesting about this issue is that it’s better to have no robots.txt at all than to have one that’s improperly configured. If you have none at all, Google will crawl your site as usual. If you have one returning errors, Google will stop crawling until you fix this file.

For being only a few lines of text, the robots.txt file can have catastrophic consequences for your website. Make sure you’re checking it early and often.

2. URL Errors

URL errors are different from site errors because they only affect specific pages on your site, not your website as a whole.

Google Search Console will show you the top URL errors per category — desktop, smartphone, and feature phone. For large sites, this may not be enough data to show all the errors, but for the majority of sites this will capture all known problems.

Tip: Going crazy with the amount of errors? Mark all as fixed.

Many site owners have run into the issue of seeing a large number of URL errors and getting freaked out. The important thing to remember is a) Google ranks the most important errors first and b) some of these errors may already be resolved.

If you’ve made some drastic changes to your site to fix errors, or believe a lot of the URL errors are no longer happening, one tactic to employ is marking all errors as fixed and checking back up on them in a few days.

When you do this, your errors will be cleared out of the dashboard for now, but Google will bring the errors back the next time it crawls your site over the next few days. If you had truly fixed these errors in the past, they won’t show up again. If the errors still exist, you’ll know that these are still affecting your site.

A) Soft 404

A soft 404 error is when a page displays as 200 (found) when it should display as 404 (not found).

What they mean

Just because your 404 page looks like a 404 page doesn’t mean it actually is one. The user-visible aspect of a 404 page is the content of the page. The visible message should let users know the page they requested is gone. Often, site owners will have a helpful list of related links the users should visit or a funny 404 response.

The flipside of a 404 page is the crawler-visible response. The header HTTP response code should be 404 (not found) or 410 (gone).

A quick refresher on how HTTP requests and responses look:

Image source: Tuts Plus

If you’re returning a 404 page and it’s listed as a Soft 404, it means that the header HTTP response code does not return the 404 (not found) response code. Google recommends “that you always return a 404 (not found) or a 410 (gone) response code in response to a request for a non-existing page.”

Another situation in which soft 404 errors may show up is if you have pages that are 301 redirecting to non-related pages, such as the home page. Google doesn’t seem to explicitly state where the line is drawn on this, only making mention of it in vague terms.

Officially, Google says this about soft 404s:

“Returning a code other than 404 or 410 for a non-existent page (or redirecting users to another page, such as the homepage, instead of returning a 404) can be problematic.”

Although this gives us some direction, it’s unclear when it’s appropriate to redirect an expired page to the home page and when it’s not.

In practice, from my own experience, if you’re redirecting large amounts of pages to the home page, Google can interpret those redirected URLs as soft 404s rather than true 301 redirects.

Conversely, if you were to redirect an old page to a closely related page instead, it’s unlikely that you’d trigger the soft 404 warning in the same way.

Are they important?

If the pages listed as soft 404 errors aren’t critical pages and you’re not eating up your crawl budget by having some soft 404 errors, these aren’t an urgent item to fix.

If you have crucial pages on your site listed as soft 404s, you’ll want to take action to fix those. Important product, category, or lead gen pages shouldn’t be listed as soft 404s if they’re live pages. Pay special attention to pages critical to your site’s moneymaking ability.

If you have a large amount of soft 404 errors relative to the total number of pages on your site, you should take swift action. You can be eating up your (precious?) Googlebot crawl budget by allowing these soft 404 errors to exist.

How to fix

For pages that no longer exist:

  • Allow to 404 or 410 if the page is gone and receives no significant traffic or links. Ensure that the server header response is 404 or 410, not 200.
  • 301 redirect each old page to a relevant, related page on your site.
  • Do not redirect broad amounts of dead pages to your home page. They should 404 or be redirected to appropriate similar pages.

For pages that are live pages, and are not supposed to be a soft 404:

  • Ensure there is an appropriate amount of content on the page, as thin content may trigger a soft 404 error.
  • Ensure the content on your page doesn’t appear to represent a 404 page while serving a 200 response code.

Soft 404s are strange errors. They lead to a lot of confusion because they tend to be a strange hybrid of 404 and normal pages, and what is causing them isn’t always clear. Ensure the most critical pages on your site aren’t throwing soft 404 errors, and you’re off to a good start!

B) 404

A 404 error means that the Googlebot tried to crawl a page that doesn’t exist on your site. Googlebot finds 404 pages when other sites or pages link to that non-existent page.

What they mean

404 errors are probably the most misunderstood crawl error. Whether it’s an intermediate SEO or the company CEO, the most common reaction is fear and loathing of 404 errors.

Google clearly states in their guidelines:

“Generally, 404 errors don’t affect your site’s ranking in Google, so you can safely ignore them.”

I’ll be the first to admit that “you can safely ignore them” is a pretty misleading statement for beginners. No — you cannot ignore them if they are 404 errors for crucial pages on your site.

(Google does practice what it preaches, in this regard — going to google.com/searchconsole returns a 404 instead of a helpful redirect to google.com/webmasters)

Distinguishing between times when you can ignore an error and when you’ll need to stay late at the office to fix something comes from deep review and experience, but Rand offered some timeless advice on 404s back in 2009:

“When faced with 404s, my thinking is that unless the page:

A) Receives important links to it from external sources (Google Webmaster Tools is great for this),
B) Is receiving a substantive quantity of visitor traffic,
and/or
C) Has an obvious URL that visitors/links intended to reach

It’s OK to let it 404.”

The hard work comes in deciding what qualifies as important external links and substantive quantity of traffic for your particular URL on your particular site.

Annie Cushing also prefers Rand’s method, and recommends:

“Two of the most important metrics to look at are backlinks to make sure you don’t lose the most valuable links and total landing page visits in your analytics software. You may have others, like looking at social metrics. Whatever you decide those metrics to be, you want to export them all from your tools du jour and wed them in Excel.”

One other thing to consider not mentioned above is offline marketing campaigns, podcasts, and other media that use memorable tracking URLs. It could be that your new magazine ad doesn’t come out until next month, and the marketing department forgot to tell you about an unimportant-looking URL (example.com/offer-20) that’s about to be plastered in tens thousands of magazines. Another reason for cross-department synergy.

Are they important?

This is probably one of the trickiest and simplest problems of all errors. The vast quantity of 404s that many medium to large sites accumulate is enough to deter action.

404 errors are very urgent if important pages on your site are showing up as 404s. Conversely, like Google says, if a page is long gone and doesn’t meet our quality criteria above, let it be.

As painful as it might be to see hundreds of errors in your Search Console, you just have to ignore them. Unless you get to the root of the problem, they’ll continue showing up.

How to fix 404 errors

If your important page is showing up as a 404 and you don’t want it to be, take these steps:

  1. Ensure the page is published from your content management system and not in draft mode or deleted.
  2. Ensure the 404 error URL is the correct page and not another variation.
  3. Check whether this error shows up on the www vs non-www version of your site and the http vs https version of your site. See Moz canonicalization for more details.
  4. If you don’t want to revive the page, but want to redirect it to another page, make sure you 301 redirect it to the most appropriate related page.

In short, if your page is dead, make the page live again. If you don’t want that page live, 301 redirect it to the correct page.

How to stop old 404s from showing up in your crawl errors report

If your 404 error URL is meant to be long gone, let it die. Just ignore it, as Google recommends. But to prevent it from showing up in your crawl errors report, you’ll need to do a few more things.

As yet another indication of the power of links, Google will only show the 404 errors in the first place if your site or an external website is linking to the 404 page.

In other words, if I type in your-website-name.com/unicorn-boogers, it won’t show up in your crawl errors dashboard unless I also link to it from my website.

To find the links to your 404 page, go to your Crawl Errors > URL Errors section:

Then click on the URL you want to fix:

Search your page for the link. It’s often faster to view the source code of your page and find the link in question there:

It’s painstaking work, but if you really want to stop old 404s from showing up in your dashboard, you’ll have to remove the links to that page from every page linking to it. Even other websites.

What’s really fun (not) is if you’re getting links pointed to your URL from old sitemaps. You’ll have to let those old sitemaps 404 in order to totally remove them. Don’t redirect them to your live sitemap.

C) Access denied

Access denied means Googlebot can’t crawl the page. Unlike a 404, Googlebot is prevented from crawling the page in the first place.

What they mean

Access denied errors commonly block the Googlebot through these methods:

  • You require users to log in to see a URL on your site, therefore the Googlebot is blocked
  • Your robots.txt file blocks the Googlebot from individual URLs, whole folders, or your entire site
  • Your hosting provider is blocking the Googlebot from your site, or the server requires users to authenticate by proxy

Are they important?

Similar to soft 404s and 404 errors, if the pages being blocked are important for Google to crawl and index, you should take immediate action.

If you don’t want this page to be crawled and indexed, you can safely ignore the access denied errors.

How to fix

To fix access denied errors, you’ll need to remove the element that’s blocking the Googlebot’s access:

  • Remove the login from pages that you want Google to crawl, whether it’s an in-page or popup login prompt
  • Check your robots.txt file to ensure the pages listed on there are meant to be blocked from crawling and indexing
  • Use the robots.txt tester to see warnings on your robots.txt file and to test individual URLs against your file
  • Use a user-agent switcher plugin for your browser, or the Fetch as Google tool to see how your site appears to Googlebot
  • Scan your website with Screaming Frog, which will prompt you to log in to pages if the page requires it

While not as common as 404 errors, access denied issues can still harm your site’s ranking ability if the wrong pages are blocked. Be sure to keep an eye on these errors and rapidly fix any urgent issues.

D) Not followed

What they mean

Not to be confused with a “nofollow” link directive, a “not followed” error means that Google couldn’t follow that particular URL.

Most often these errors come about from Google running into issues with Flash, Javascript, or redirects.

Are they important?

If you’re dealing with not followed issues on a high-priority URL, then yes, these are important.

If your issues are stemming from old URLs that are no longer active, or from parameters that aren’t indexed and just an extra feature, the priority level on these is lower — but you should still analyze them.

How to fix

Google identifies the following as features that the Googlebot and other search engines may have trouble crawling:

  • JavaScript
  • Cookies
  • Session IDs
  • Frames
  • DHTML
  • Flash

Use either the Lynx text browser or the Fetch as Google tool, using Fetch and Render, to view the site as Google would. You can also use a Chrome add-on such as User-Agent Switcher to mimic Googlebot as you browse pages.

If, as the Googlebot, you’re not seeing the pages load or not seeing important content on the page because of some of the above technologies, then you’ve found your issue. Without visible content and links to crawl on the page, some URLs can’t be followed. Be sure to dig in further and diagnose the issue to fix.

For parameter crawling issues, be sure to review how Google is currently handling your parameters. Specify changes in the URL Parameters tool if you want Google to treat your parameters differently.

For not followed issues related to redirects, be sure to fix any of the following that apply:

  • Check for redirect chains. If there are too many “hops,” Google will stop following the redirect chain
  • When possible, update your site architecture to allow every page on your site to be reached from static links, rather than relying on redirects implemented in the past
  • Don’t include redirected URLs in your sitemap, include the destination URL

Google used to include more detail on the Not Followed section, but as Vanessa Fox detailed in this post, a lot of extra data may be available in the Search Console API.

Other tools

E) Server errors & DNS errors

Under URL errors, Google again lists server errors and DNS errors, the same sections in the Site Errors report. Google’s direction is to handle these in the same way you would handle the site errors level of the server and DNS errors, so refer to those two sections above.

They would differ in the URL errors section if the errors were only affecting individual URLs and not the site as a whole. If you have isolated configurations for individual URLs, such as minisites or a different configuration for certain URLs on your domain, they could show up here.


Now that you’re the expert on these URL errors, I’ve created this handy URL error table that you can print out and tape to your desktop or bathroom mirror.

Conclusion

I get it — some of this technical SEO stuff can bore you to tears. Nobody wants to individually inspect seemingly unimportant URL errors, or conversely, have a panic attack seeing thousands of errors on your site.

With experience and repetition, however, you will gain the mental muscle memory of knowing how to react to the errors: which are important and which can be safely ignored. It’ll be second nature pretty soon.

If you haven’t already, I encourage you to read up on Google’s official documentation for Search Console, and keep these URLs handy for future questions:

We’re simply covering the Crawl Errors section of Search Console. Search Console is a data beast on its own, so for further reading on how to make best use of this tool in its entirety, check out these other guides:

Google has generously given us one of the most powerful (and free!) tools for diagnosing website errors. Not only will fixing these errors help you improve your rankings in Google, they help provide a better user experience to your visitors, and help meet your business goals faster.

Your turn: What crawl errors issues and wins have you experienced using Google Search Console?

Sign up for The Moz Top 10, a semimonthly mailer updating you on the top ten hottest pieces of SEO news, tips, and rad links uncovered by the Moz team. Think of it as your exclusive digest of stuff you don’t have time to hunt down but want to read!


Moz Blog

Posted in Latest NewsComments Off

Google News Now Will Crawl & Index Images Hosted Off Your Domain Name

Google News updated their image crawl process to include images hosted off the publishers domain.

The post Google News Now Will Crawl & Index Images Hosted Off Your Domain Name appeared first on Search Engine Land.



Please visit Search Engine Land for the full article.


Search Engine Land: News & Info About SEO, PPC, SEM, Search Engines & Search Marketing

Posted in Latest NewsComments Off


Advert