XOVI Experts

Index cleanup – spring cleaning for Google Index

Florian Elbers | November 3, 2015

Thorough on-page optimisation usually begins with an index cleanup. That is important because we only want to provide Google with those subpages for the index which give users a relevant and satisfactory result for their search queries. Content that is typically not (particularly) meaningful includes internal duplicate content (e.g. pages with URL parameters such as the print version of an article), external duplicate content (identical content on several of your own domains), pagination, error pages and non-HTML file types, which we will discuss later. Security-related content, of course, also does not belong in the index.

By cleaning up the index, we also optimise the individual resources that Google provides for crawling each domain – the so-called crawl budget. Every crawl costs Google resources, e.g. electricity and thus money. Through targeted control of indexing, we can make it clear to Google which subpages are important to us. These can then be crawled more frequently, resulting in greater relevance.

Making your own relevant subpages clear to Google is relatively easy. A flat site architecture helps (see the article by Moz), as does submitting an XML sitemap of the key URLs of your domain to the search engine.
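Such a sitemap is a simple XML file listing the URLs you consider important. Here is a minimal sketch – the URL and date are purely illustrative examples, not taken from a real site:

  <?xml version="1.0" encoding="UTF-8"?>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <!-- one <url> entry per key subpage of the domain -->
    <url>
      <loc>https://www.domain.de/artikel/example.html</loc>
      <lastmod>2015-11-03</lastmod>
    </url>
  </urlset>

The file is typically placed in the root directory of the domain and submitted to Google via the Search Console.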

Identifying rubbish in the index with queries using the search operator “site”

But how do we identify those subpages that are already in the index and should not be there? The easiest way is a search query using the operator “site”, e.g. site:xovi.de. This shows us the number of Google-indexed subpages of our domain and the individual URLs including their search result snippets. Be careful, though: above a certain size, the number of pages reported is only an estimate and not 100% accurate.

Typical queries for an index cleanup are:

  • All URLs of the entire domain: site:domain.de
  • All URLs of the subdomain www: site:www.domain.de
  • All URLs except those of the subdomain www: site:domain.de -site:www.domain.de
  • All URLs of the domain in /test/ folders: site:domain.de/test/
  • All URLs of the domain ending in .html: site:domain.de filetype:html
  • All URLs of the domain with “hello world” in the title: site:domain.de intitle:"hello world"
  • All URLs of the domain with “hello world” in the text: site:domain.de intext:"hello world"

What is nice about the results displayed is that Google roughly sorts the subpages by relevance – important URLs are displayed on the first pages, non-relevant pages or index rubbish will be found on the last pages. Here is a trick to get to the end quickly: just put the parameter &start=990 at the end of the URL in the search query, e.g. https://www.google.de/search?q=site%3Adomain.de&start=990.

For large websites, it is advisable to construct the queries in a series or restrict them to sections of the site, since Google does not actually display all URLs as a search result. A typical sequence for finding all subdomains of your website would be:

  1. Query the URLs without the known www. subdomain: site:domain.de -site:www.domain.de.
  2. Refine search query 1 with all subdomains that are displayed in the first step, e.g. site:domain.de -site:www.domain.de -site:m.domain.de -site:stage.domain.de -site:test.domain.de.
  3. Repeat step 2 until no more indexed pages are shown. You will then have found all indexed subdomains of the domain and can deal with each of them individually.

If there are a great many URLs on a subdomain, as is typically the case with www., it is advisable to proceed folder by folder:

  1. Query the URLs with the www. subdomain: site:www.domain.de
  2. Narrow search query 1 by limiting it to one of the folders that visibly contains many URLs, e.g. site:www.domain.de/artikel/.
  3. If, for example, all articles end in .html and you are sure that these belong in the index, it may be worthwhile to exclude them in order to reveal documents in this folder that you do not wish to provide to Google: site:www.domain.de/artikel/ -filetype:html.
  4. Refine the search query further by successively excluding relevant and non-relevant subfolders and documents.

By using this method, you iteratively home in on superfluous URLs and documents in the index and can later remove them from it. A further advantage is that at the same time you can check the snippets of the key subpages that are to remain in the index and examine them against certain criteria: Are the title and description the right length? Is there a call to action? Are all URLs appealing and self-explanatory? And is the marked-up structured data displayed as rich snippets?

How do I get rid of superfluous content in the Google Index?

To remove URLs and files from the Web index once and for all, there are several methods:

  • noindex meta tag: by placing <meta name="robots" content="noindex" /> in the <head> of an HTML document, the page will be crawled but not indexed. Internal and external links on the page will continue to be followed (equivalent to <meta name="robots" content="noindex, follow" />).
  • Canonical tag: by using a <link> element with the attribute rel="canonical" in the <head> (e.g. <link rel="canonical" href="https://www.domain.de/artikel/" />), subpages with identical or nearly identical content can point Google to the one canonical URL that should be indexed, while the duplicates themselves are ignored.
  • 301 redirect: the 301 status code states that a page has been permanently moved to a new location. Through a 301 redirect in the .htaccess file, one or more URLs can be forwarded to another; only the forwarding destination itself is indexed. The advantage of this method is that any external backlinks, including their link juice, are transferred to the new URL (see the .htaccess sketch after this list).
  • HTTP status code 410 (or 404): the 410 status code means “gone”, i.e. the URL has been permanently removed. A 404, on the other hand, signals that the target file was merely “not found” or never existed. Strictly speaking, 410 is therefore the correct status code for removing content from the index that no longer exists; both status codes work effectively, however (410 is also shown in the sketch after this list).
  • Google Search Console: in the menu under “Google Index” there is the item “Remove URLs”. Here, individual URLs can be entered manually and thus deleted from the index quickly and cleanly. This method is not practical, though, for removing large quantities of subpages.
  • robots.txt: the robots.txt file controls crawling, not indexing! This is still frequently confused today, so once again the tip: pages that are excluded from crawling via this file can still end up in the index. That is why robots.txt is not suitable for an index cleanup; in some cases it can even be an obstacle, because Google cannot see a noindex tag on a page it is not allowed to crawl.
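For the two status-code methods, a minimal .htaccess sketch could look like the following – the paths and the target URL are purely illustrative examples and assume an Apache server with mod_alias enabled:

  # 301: the old article has moved permanently to a new URL
  Redirect 301 /old-article.html https://www.domain.de/new-article.html

  # 410: this page has been removed for good and should drop out of the index
  Redirect gone /deleted-page.html

After uploading the file, it is worth checking the response codes (for example with a browser extension or the crawl reports in the Search Console) before expecting any change in the index.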

Controlling the indexing of non-HTML documents

There are also non-HTML file types that can find their way into the search engine index. The following file types can generally be indexed by Google and retrieved through the filetype queries mentioned above (source):

  • Adobe Flash (.swf)
  • Adobe Portable Document Format (.pdf)
  • Adobe PostScript (.ps)
  • Autodesk Design Web Format (.dwf)
  • Google Earth (.kml, .kmz)
  • GPS Exchange Format (.gpx)
  • Hancom Hanword (.hwp)
  • HTML (.htm, .html or other file extensions)
  • Microsoft Excel (.xls, .xlsx)
  • Microsoft PowerPoint (.ppt, .pptx)
  • Microsoft Word (.doc, .docx)
  • OpenOffice Presentations (.odp)
  • OpenOffice Spreadsheets (.ods)
  • OpenOffice Text (.odt)
  • Rich Text Format (.rtf, .wri)
  • Scalable Vector Graphics (.svg)
  • TeX/LaTeX (.tex)
  • Text (.txt, .text or other file extensions), including source code in major programming languages:
      • Basic source code (.bas)
      • C/C++ source code (.c, .cc, .cpp, .cxx, .h, .hpp)
      • C# source code (.cs)
      • Java source code (.java)
      • Perl source code (.pl)
      • Python source code (.py)
  • Wireless Markup Language (.wml, .wap)
  • XML (.xml)

If you offer these types of files on your website for downloading or other use, it is still recommended to remove them from the index. PDF files, for instance, can be read very well by Google (even links inside the PDF are counted) and offer good ranking opportunities. The problem: users who land on a PDF via Google have no page navigation and therefore cannot click further into the website. They generate no further page views and no clicks on any ads you have placed. For these types of documents, it is therefore advisable to build a landing page that describes the content of the file or teases the PDF for download. The file itself should be tagged noindex. Since a PDF has no HTML <head>, this is done with an X-Robots-Tag: any instruction (e.g. “noindex”) that can be used in a robots meta tag can also be sent as an X-Robots-Tag in the HTTP header. These rules are entered in the server’s .htaccess (or httpd.conf). There are official guidelines from Google for this.
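A minimal sketch of such a rule in the .htaccess file could look like this – it assumes an Apache server with mod_headers enabled and applies the directive to all PDF files on the domain:

  # send a noindex directive in the HTTP header of every PDF file
  <FilesMatch "\.pdf$">
    Header set X-Robots-Tag "noindex"
  </FilesMatch>

The same pattern can be adapted for other file extensions from the list above, or restricted to individual files if only some documents are to be kept out of the index.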

The test for success – did it work?

The success of the measures carried out for the index cleanup can easily be monitored via the Search Console (formerly Google Webmaster Tools): in the menu under “Google Index” there is the item “Index Status”, which shows how many subpages of your website are indexed. In the extended view, you can even monitor the number of pages blocked by robots.txt and the number of subpages removed through previous removal requests. Thus, you are always in full control of your own index status.