Schaake.nu

Robots

Posted in /Software on 29 June 2012
This blog was written by Christiaan Schaake

Preface

Nowadays the Internet consists of billions of Web pages all over the world. There are no road signs that direct visitors to your site. So unless you have enough money to start a huge advertising campaign, you depend on search engines.
Search engines are used to find specific information on the Internet. They are constantly crawling the Internet and indexing millions of pages per day. You can easily add your page to a search engine's work list, and within a couple of weeks the search engine will crawl your page.
But the search engine uses a robot or spider, which is a program that looks at your page and tries to extract information about your site. Left on its own, the chance that your site is indexed correctly is very slim. There are, however, ways to help and direct a search spider so that the chances that Internet users find your site through search engines increase dramatically.

There are two methods of directing a search engine on your site.

  • Robots.txt
  • META tags
The robots.txt file tells the search engine spider where to look for information.
The META tags help the spider to get the correct information about a specific web page.

Robots.txt

The robots.txt file is used by search engine spiders to see what they may or may not include in their search. The robots.txt file must always be located at the root of the website. You cannot make a robots.txt for a specific part of the website.
An example of a robots.txt would be my own at:
http://www.schaake.nu/robots.txt
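
Because the file always lives at the root, its location can be derived from any page address on the same site. A minimal sketch in Python, assuming only the standard library; the helper name robots_txt_url and the sample page path are made up for illustration:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt location for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host, replace the path, drop query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# The page path below is hypothetical; only the host comes from the article.
print(robots_txt_url("http://www.schaake.nu/software/robots?page=2"))
```

Whatever page you start from, the result points at the single robots.txt in the site root.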

The robots.txt file consists of two commands. The first command names a specific user agent to which the directives that follow apply, so only that user agent will act on them.
The second command restricts access to a specific directory or file on the website.
Let's explain this with a sample:

# Some stuff we don't want google to see
User-Agent: Googlebot
Disallow: /googlesecrets.html
Disallow: /cgi-bin

# All the other agents may also not index the cgi-bin
User-Agent: *
Disallow: /cgi-bin
With this example, every search engine may index the whole site except /cgi-bin. Googlebot is additionally kept away from the /googlesecrets.html page.
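
One way to check how a well-behaved spider would read these rules is Python's standard urllib.robotparser module. This is only a sketch: the helper may_fetch, the agent name SomeOtherBot and the tested paths are invented for the example, and real spiders may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# Some stuff we don't want google to see
User-Agent: Googlebot
Disallow: /googlesecrets.html
Disallow: /cgi-bin

# All the other agents may also not index the cgi-bin
User-Agent: *
Disallow: /cgi-bin
"""


def may_fetch(agent: str, path: str) -> bool:
    """Ask the standard-library parser whether agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, path)


# SomeOtherBot is a made-up agent that only matches the "*" group.
for agent in ("Googlebot", "SomeOtherBot"):
    for path in ("/index.html", "/googlesecrets.html", "/cgi-bin/form.cgi"):
        print(agent, path, may_fetch(agent, path))
```

Only the Googlebot rows should report False for /googlesecrets.html, while /cgi-bin is refused for both agents.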

Now what if we have no secrets at all and want the complete site to be indexed?

# Allow complete access
User-Agent: *
Disallow:
Or we could forbid one agent from indexing our site at all:

# Disallow the googlebot completely
User-Agent: googlebot
Disallow: /
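
The two extremes, an empty Disallow and a Disallow of /, can be tried out the same way. Again a hedged sketch using Python's urllib.robotparser; the names ALLOW_ALL, BLOCK_GOOGLEBOT, allowed and SomeOtherBot are assumptions made for this example.

```python
from urllib.robotparser import RobotFileParser

ALLOW_ALL = "User-Agent: *\nDisallow:\n"
BLOCK_GOOGLEBOT = "User-Agent: googlebot\nDisallow: /\n"


def allowed(robots_txt: str, agent: str, path: str) -> bool:
    """Parse the given robots.txt text and test one agent/path pair."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


# An empty Disallow value blocks nothing.
print(allowed(ALLOW_ALL, "Googlebot", "/anything.html"))

# "Disallow: /" blocks every path, but only for the named agent.
print(allowed(BLOCK_GOOGLEBOT, "Googlebot", "/index.html"))
print(allowed(BLOCK_GOOGLEBOT, "SomeOtherBot", "/index.html"))
```

This parser compares agent names case-insensitively, so googlebot in the file also applies to an agent that announces itself as Googlebot.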
Note that not every crawler looks at the robots.txt file. Most of the big commercial search engines will respect your robots.txt file, but the crawlers of spammers (who are looking for email addresses) will not be stopped by it.

META tags

The META tags contain information about a specific web page. Most spiders will look at the META tags and use this information instead of trying to collect information about the page themselves.
The drawback of this is that when you have incorrect or outdated information in your META tags, the spider will use this information instead of looking at the page itself.

The following META tags can be used to help a spider.

<META NAME="description" CONTENT="Description of the webpage"/>
The description tag holds a short description of the webpage. Keep it consistent with the <TITLE> tag in the page header. Some search engines will still look at the title tag instead of the description META tag!

<META NAME="keywords" CONTENT="keyword1, keyword2, keyword3"/>
To help a spider collect keywords on your site, you can include the keywords META tag. This tag contains some useful keywords Internet users can use to find your Web page. Keywords are conventionally separated with commas.

<META NAME="robots" CONTENT="index,follow"/>
The robots.txt file can forbid spiders to look at specific pages or complete directories. But sometimes you want more control over the spider on a single page. The robots META tag gives you that control.
The first part of the tag will tell the spider if the current page may be indexed, the second part will tell the spider if it may follow hyperlinks in the current page. Possible options are:
  • index,follow - Spiders may index the page and follow all links on the page.
  • noindex,nofollow - Spiders may not index the page and may not follow any links.
  • index,nofollow - The page may be indexed, but no links may be followed. This is very useful for pages that link to forms.
  • noindex,follow - The page may not be indexed, but the spider may follow all links. A good example would be a dynamic weblog.
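
The four combinations above boil down to two independent yes/no flags. The following Python sketch shows how a spider might decode the CONTENT value, including the ALL and NONE shortcuts described just below. The function name parse_robots_content is invented here, and treating a missing value as index,follow is an assumption of this sketch.

```python
def parse_robots_content(content: str) -> tuple[bool, bool]:
    """Return (may_index, may_follow) for a robots META CONTENT value."""
    may_index, may_follow = True, True  # assumed default: index,follow
    for token in content.lower().split(","):
        token = token.strip()
        if token == "all":
            may_index, may_follow = True, True
        elif token == "none":
            may_index, may_follow = False, False
        elif token == "index":
            may_index = True
        elif token == "noindex":
            may_index = False
        elif token == "follow":
            may_follow = True
        elif token == "nofollow":
            may_follow = False
        # Unknown tokens are simply ignored in this sketch.
    return may_index, may_follow


for value in ("index,follow", "noindex,nofollow", "index,nofollow",
              "noindex,follow", "ALL", "none"):
    print(value, parse_robots_content(value))
```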
There are two shortcuts, ALL and NONE. ALL stands for index,follow and NONE stands for noindex,nofollow
(e.g. <META NAME="robots" CONTENT="all"/>).

<META HTTP-EQUIV="refresh" CONTENT="3600"/>
The refresh META tag asks whoever loads the page to reload it after the given number of seconds. It is written with HTTP-EQUIV rather than NAME and is mainly acted on by browsers. It could be useful for internal search engines, but I see no reason why a public search engine would refresh its indexed content for your specific page on that schedule. It will take weeks before a search engine visits your site again.

<META NAME="revisit-after" CONTENT="30"/>
This directive makes more sense, but I'm not sure whether any search engines look at it. The above example tells the search engine to revisit the site after 30 days. So if a search engine would normally plan a revisit after 14 days, it can wait another 16 days before revisiting your site. This saves bandwidth.

<META NAME="generator" CONTENT="Microsoft Frontpage"/>
This META tag tells the spider which web design tool was used to generate or design this Web page. A search engine could use this to build statistics on the usage of design tools.

<META NAME="language" CONTENT="nl, en"/>
This META tag defines the language used on the Web page. Normally a spider will try to detect the language itself, but with this tag you can force a specific language.

<META NAME="copyright" CONTENT="Copyright 2003 Christiaan Schaake."/>
<META NAME="author" CONTENT="Christiaan Schaake"/>
These two META tags tell the spider who wrote the page and who holds the copyright. A search engine could include this information in the search results.
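
To see what a spider could pick up from these tags, you can read them back from a page yourself. This is a rough sketch using Python's standard html.parser module; the class name MetaCollector and the small sample page are made up for illustration, and a real spider will do far more than this.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect NAME/CONTENT pairs from the META tags of a page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands over tag and attribute names in lowercase.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name is not None:
            self.meta[name.lower()] = attributes.get("content", "")


# Hypothetical sample page, loosely based on the tags discussed above.
PAGE = """\
<HTML><HEAD>
<TITLE>Robots</TITLE>
<META NAME="description" CONTENT="Description of the webpage"/>
<META NAME="author" CONTENT="Christiaan Schaake"/>
<META NAME="language" CONTENT="nl, en"/>
</HEAD><BODY>Hello</BODY></HTML>
"""

collector = MetaCollector()
collector.feed(PAGE)
for name, content in collector.meta.items():
    print(f"{name}: {content}")
```

Running this against your own pages is a quick way to spot missing or outdated META information.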

Not all search engines will look at the META tags, so always use plain text for the important parts of your site. Do not make a welcome page that only contains a big image or a Shockwave animation. And make use of the title tag and alt attributes!

This blog is tagged as Robots, Website
