Google Site Operator: an Ode to Thee

Let’s go back to basics today.

Car owners typically know that a modern vehicle has a number of automatic safety features. Gauges and sensors show a driver that a taillight is out, or that the windshield wiper fluid is low. People routinely eyeball their tires or check the actual pressure with a tire pressure gauge so they know they’re maximizing the life of their tires and the safety of their commute.

Similarly, website owners have a few quick checks they can run on their sites. One easy way to check a site’s health in Google is via the site: operator. It’s my favorite operator by far (see the Advanced operators documentation). When I talked to website owners at Pubcon in Las Vegas last year, I was struck by how many people weren’t aware of it. These searches are useful to the technical and the non-technical alike.

Adding the site: operator to a search restricts the results to a single site or to one section of a site. Most people use site: to restrict a search to a specific subsection of our index, such as [site:yahoo.com yankees] to only search Yahoo for information on the New York Yankees. They may not know that if you use the site: operator alone, Google will do its best to bring back all the documents it knows about for that subsection of Google’s index. Some examples:

There are more possibilities, including subsections with dynamic URLs, so it’s fun to play around with that operator.
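For anyone who wants to script these checks, here is a minimal sketch in Python of how a site: query can be assembled. The helper names `site_query` and `search_url` are hypothetical, and the search URL pattern is an assumption about how you would open the query in a browser; only the [site:yahoo.com yankees] query comes from the post itself.

```python
from urllib.parse import urlencode


def site_query(target, terms=""):
    """Build a query such as 'site:yahoo.com yankees'.

    `target` can be a domain, a subdomain, or a domain plus a path.
    """
    target = target.strip()
    if not target or " " in target:
        raise ValueError("site: target must be non-empty and contain no spaces")
    # No space between the colon and the target: a space breaks the operator.
    query = f"site:{target}"
    terms = terms.strip()
    if terms:
        query += " " + terms
    return query


def search_url(query):
    """Turn a query into a search URL (assumed URL pattern)."""
    return "https://www.google.com/search?" + urlencode({"q": query})


if __name__ == "__main__":
    print(site_query("yahoo.com", "yankees"))
    print(site_query("example.com/blog"))
    print(search_url(site_query("example.com")))
```

Passing only a target gives the "site operator alone" form described above; adding terms narrows the search within that site.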

Stuff I like about the site: operator:

  • It shows you immediately how page titles and URLs look. The nice thing about seeing all of one’s documents like this is that things you want to think about fixing will stick out at you pretty quickly. They might be page titles or documents you don’t want visible to the public. You can take things into your own hands (ideas below), or if you work with a web development professional or SEO, you can pick up the phone and ask them about what you see in these results.
  • You can look at the estimated results at the top right to gauge roughly how many pages Google knows about.
  • You can use a site: search on other sites to see how they want the world to see themselves through search engines. This could be sites you admire, your competitors, or admired competitors. :)
  • As a member of the Webspam team at Google, I occasionally see sites that are hacked or defaced by rogue enterprises that aim to put revenue-generating pages on sites without the owner’s knowledge, in the hopes that those pages “borrow” the site’s reputation to show up in Google. Most sites don’t have to worry about this. Last year I posted on this topic at SEW. Again, it’s not something to worry about too much. My point is that routine site: checks have a good chance of showing you instantly if rogue pages have been inserted into your site (paying attention to analytics reports can help here too).
  • Google’s not the only search engine to support this operator. While the implementations differ slightly, Yahoo, Live Search, and Ask all support it.

Tips on using the site: operator:

  • You can also add a negative site: operator to do additional filtering. Consider this query: [site:brianwhite.org -site:www.brianwhite.org]. I’m asking Google, “Please give me everything you know on the domain brianwhite.org, but remove results that are on the subdomain ‘www’ on the same domain.” These combinations can be powerful, especially if you have a larger site and use subdomains. One of the joys of living in our “search era” is challenging oneself to combine operators and techniques to get interesting results.
  • Don’t panic if you see problems within your site: search; a lot of them can be fixed. There’s also no guarantee that any particular URL has been seen by a user, since they still have to search for it and have the URL show up in results. More can be learned at the Google Webmaster Help Center.
  • Don’t worry about Supplemental Results. Supplemental results, by themselves, don’t indicate problems. My blog has about half its results marked as Supplemental right now, and I consider that a bonus. Some sites have URLs come in and out of Supplemental status on a continual basis. Google’s index is refreshed very frequently and is highly dynamic. I don’t worry about that ratio as I know that people are finding my site based on my logs and Analytics reports.
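The negative site: combination from the first tip can be sketched in a few lines of Python. The function name `site_query_excluding` and the example.com hostnames are hypothetical placeholders, not anything Google provides.

```python
def site_query_excluding(include, excludes):
    """Combine one site: restriction with any number of -site: exclusions."""
    parts = [f"site:{include}"]
    # Each exclusion is the same operator with a leading minus sign.
    parts += [f"-site:{host}" for host in excludes]
    return " ".join(parts)


if __name__ == "__main__":
    # Everything on example.com except the 'www' and 'blog' subdomains.
    print(site_query_excluding("example.com", ["www.example.com", "blog.example.com"]))
```

On a larger site with several subdomains, excluding them one at a time like this is a quick way to see what is left over on the bare domain.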

Actions you can take as a result:

  • Head to Google’s Webmaster Tools where you can get more insight on how Google crawls your site, discuss what you see with other site owners, and look for help in the documentation section.
  • If there are sections of your site you want to remove from the index quickly, Webmaster Tools has a removal tool.
  • If you have non-urgent things you’d like to clean up, you can think about tidying page titles or modifying your robots.txt and/or META tags to prevent crawling or archiving. I’m thinking of preventing Googlebot from crawling some sections of this blog based on a site: search (I’ll talk about it in an upcoming post).
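Before changing a live robots.txt, it can help to test the rules locally. The sketch below uses Python’s standard `urllib.robotparser` module. The robots.txt content, the `/drafts/` path, and the example.com URLs are all hypothetical, and Python’s parser only approximates how any particular crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: keep Googlebot out of /drafts/, allow everything else.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /drafts/

User-agent: *
Disallow:
"""


def can_crawl(robots_txt, user_agent, url):
    """Return True if the given rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.com/drafts/post.html",
                "https://example.com/about.html"):
        print(url, can_crawl(ROBOTS_TXT, "Googlebot", url))
```

Once the rules behave as intended, a follow-up site: search some time later is a simple way to check whether the blocked section is still showing up.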

I know that some webmasters and site owners have pet uses of site: operator queries. What are your favorite applications, tips, or issues with this operator? I’d love to hear about them. Also, eagle-eyed readers will have noted that there were no lyrical poems, stanzas, or verses to be had in this ode :)

Update Jun 16 2008: The Supplemental Results label went away.


3 Responses to Google Site Operator: an Ode to Thee

  1. JohnMu says:

    Perhaps I’m jumping ahead, but I would love your insider’s take on the “about” number, especially with regard to the site: operator. Where does it come from, what does it mean, should we care or should we just look at the first 10 results and accept that there might be more (or maybe not)?

    I like to go to the last visible page of results (“&start=990”) and check the numbers. I’ve seen such gems as (numbers above 10 are made up, but the order of magnitude is correct):
    “500-507 of about 0” (a lot of these lately)
    “500-507 of about 300”
    “500-507 of about 507” (would be optimal)
    “500-507 of about 900”
    “500-507 of about 500’000’000” (I love those)
    (these are with the extended results activated (“&filter=0”))

    Website owners love numbers. They love to see that their content is being treated with respect by Google :-) so they would really love to see “how many pages are currently indexed”. They’re also afraid of the supplemental index – they’d love to see “how many pages are only in the supplemental index”.

    Ah, and while you’re with the site:-operator, just like the link:-operator people need to remember that a “space” between the “:” (colon) and the domain name breaks the operator. I can’t mention it enough – http://webmastershelp.iblogget.com/2007/03/09/where-did-my-links-go/ . A lot of newer webmasters make that mistake. It would be great if Google could put an info-box up there when people do this kind of query “did you mean ‘site:domain.com’ to show the indexed pages?”.

  2. Brian says:

    John,

    My site is very small right now, so the 33 estimated pages is right on the money :) But I have seen the same as you. I can’t speculate on when we might do better at estimated results.

    Also, good point on the space between the colon and the domain name (or TLD, or domain name + subdirectory etc.).

  3. Hi,

    Just as you have said some pretty good things about the “site” operator, I would like to add a little bit about the “filetype” operator, which is very useful for searching for specific file types on Google. For example, consider the following dork, which I use on a daily basis to fetch text files (proxy lists) out of Google search pages.

    Dork:

    intext:80 | 8080 intext:3124 | 3128 filetype:txt

    It will show only text files in the search results which are basically proxy lists :)

    Hope I added some value to this blog post.
