Most of the time, an XML sitemap is set up using a plugin or extension and then promptly forgotten about. The assumption is that it is running smoothly on autopilot.
This is not always the case. Many XML sitemaps point crawlers towards pages that are canonicalised to another URL, return 404 errors or are set to NOINDEX. This can affect crawlability and indexation, and it wastes crawl budget.
What is an XML Sitemap?
An XML sitemap is essentially a list of URLs that the search engines use to easily find and crawl every page on a website. It is a website’s version of a book’s index.
How to conduct an XML Sitemap Audit
Checking the XML sitemap is a relatively quick process.
Collect the URLs
Let’s take Matthew Woodward’s XML sitemap as an example. Navigate to the sitemap URL. In this case it is located at sitemap_index.xml because the site uses the Yoast SEO plugin, but on other sites it is sometimes just located at sitemap.xml.
Sitemap page: https://www.matthewwoodward.co.uk/sitemap_index.xml
As we can see, there are two active sitemaps, post-sitemap.xml and page-sitemap.xml.
Open both in a new tab and you should see the list of URLs for the crawler.
You can either use a scraper browser plugin to extract the URL list or just highlight the table and copy and paste the data into Excel.
To tidy up the data, remove the ‘Image’ and ‘Last Mod’ columns and delete row 1, which holds the table headers. You should be left with a list of all the URLs contained in the XML sitemap.
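If you would rather skip the copy and paste step, the same URL list can be pulled straight from the sitemap XML. The sketch below is one possible approach and not part of the original workflow: it assumes the sitemap follows the standard sitemaps.org schema, and the function name `extract_locs` and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_locs(xml_text):
    """Return the <loc> of every <url> (urlset) or <sitemap> (index) entry.

    Image and last-modified data are ignored, which mirrors deleting
    those columns in the spreadsheet.
    """
    root = ET.fromstring(xml_text)
    entries = root.findall("sm:url", SITEMAP_NS) + root.findall("sm:sitemap", SITEMAP_NS)
    locs = []
    for entry in entries:
        # Only the direct <loc> child; nested image <loc> tags live in
        # a different namespace and are not matched.
        loc = entry.find("sm:loc", SITEMAP_NS)
        if loc is not None and loc.text:
            locs.append(loc.text.strip())
    return locs


# Hypothetical sample data, not taken from the real sitemap.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2020-01-01</lastmod>
    <image:image><image:loc>https://example.com/a.png</image:loc></image:image>
  </url>
  <url>
    <loc>https://example.com/about/</loc>
  </url>
</urlset>"""

if __name__ == "__main__":
    for url in extract_locs(SAMPLE):
        print(url)
```

The same function works on a sitemap index such as sitemap_index.xml, where it returns the child sitemap URLs instead of page URLs. The output can be pasted into Screaming Frog in the next step.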
Crawl The URLs With Screaming Frog
Once you have an Excel file with the complete list of URLs, copy the list and open up Screaming Frog.
You will need to change the mode in Screaming Frog from the default Spider mode to List mode.
Then select the paste option to input the URL list saved on your clipboard.
Once the spider has finished the crawl, you will need to check the following columns:
- Status, for redirects and 404 errors
- Meta Robots 1, for NOINDEX tags
Here, we can see there are 10 URLs that return a 301 status code and one URL that returns a 404 page not found error. No pages in this example returned a NOINDEX tag.
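The two checks above (status code and meta robots) can also be expressed as a small script. This is a minimal sketch rather than a replacement for Screaming Frog: it assumes you already have each URL's HTTP status code and HTML from whatever crawler or HTTP client you use, and the names `has_noindex` and `flag_url` are hypothetical.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the content attribute of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            self.directives.append(attr.get("content", "").lower())


def has_noindex(html):
    """True if any robots meta tag in the HTML contains 'noindex'."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return any("noindex" in directive for directive in parser.directives)


def flag_url(status, html=""):
    """Label a sitemap URL based on its status code and page HTML."""
    if 300 <= status < 400:
        return "redirect"   # e.g. a 301: the sitemap should list the target instead
    if status == 404:
        return "not found"  # dead URL: remove it from the sitemap
    if status == 200 and has_noindex(html):
        return "noindex"    # page asks not to be indexed
    if status == 200:
        return "ok"
    return "other"


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
    print(flag_url(301))        # redirect
    print(flag_url(404))        # not found
    print(flag_url(200, page))  # noindex
```

Any URL that is not labelled "ok" is a candidate for removal from, or correction in, the sitemap.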