Web Robots (aka spiders, wanderers, crawlers)


RandomBox


The following is mainly provided for those who have (non web-commerce) personal sites and want visitors by invitation only... but only if you don't care about being ranked by the likes of Google/Yahoo/etc. Maybe this only applies to me; if so, I apologize for even posting this:

The Web Robots Pages
Web Robots are programs that traverse the Web automatically. Some people call them Web Wanderers, Crawlers, or Spiders. The following pages have further information about these Web Robots.
- The Web Robots FAQ - Frequently Asked Questions about Web Robots from Web users, Web authors, and Robot implementors.
- Robots Exclusion - Find out what you can do to direct robots that visit your Web site.
- A List of Robots - A database of currently known robots, with descriptions and contact details.
- The Robots Mailing List - An archived mailing list for discussion of technical aspects of designing, building, and operating Web Robots.
- Articles and Papers - Background reading for people interested in Web Robots.
- Related Sites - Some references to other sites that concern Web Robots.
I just call them 'bots' and me no likey>>
How do I prevent robots scanning my site?
The quick way to prevent robots visiting your site is to put these two lines into the /robots.txt file on your server:

User-agent: *
Disallow: /

but it's easy to be more selective than that.

Where do I find out how /robots.txt files work?
You can read the whole standard specification, but the basic concept is simple: by writing a structured text file you can indicate to robots that certain parts of your server are off-limits to some or all robots. It is best explained with an example:

# /robots.txt file for http://webcrawler.com/
# mail webmaster@webcrawler.com for constructive criticism

User-agent: webcrawler
Disallow:

User-agent: lycra
Disallow: /

User-agent: *
Disallow: /tmp
Disallow: /logs

The first two lines, starting with '#', specify a comment.
The first paragraph specifies that the robot called 'webcrawler' has nothing disallowed: it may go anywhere.
The second paragraph indicates that the robot called 'lycra' has all relative URLs starting with '/' disallowed. Because all relative URLs on a server start with '/', this means the entire site is closed off.
The third paragraph indicates that all other robots should not visit URLs starting with /tmp or /logs. Note the '*' is a special token meaning "any other User-agent"; you cannot use wildcard patterns or regular expressions in either User-agent or Disallow lines.

Two common errors:
- Wildcards are _not_ supported: instead of 'Disallow: /tmp/*' just say 'Disallow: /tmp/'.
- You shouldn't put more than one path on a Disallow line (this may change in a future version of the spec).
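If you want to see how a well-behaved bot actually interprets those rules, here is a minimal sketch using the urllib.robotparser module from Python's standard library. The rules are the same ones from the example above; the 'randombot' user-agent and the test URLs are just made-up values for illustration.

from urllib.robotparser import RobotFileParser

# The example rules from the /robots.txt snippet above.
rules = """\
User-agent: webcrawler
Disallow:

User-agent: lycra
Disallow: /

User-agent: *
Disallow: /tmp
Disallow: /logs
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# 'webcrawler' has nothing disallowed, so it may fetch anything.
print(parser.can_fetch("webcrawler", "http://webcrawler.com/tmp/scan.html"))  # True

# 'lycra' is disallowed from '/', i.e. the entire site.
print(parser.can_fetch("lycra", "http://webcrawler.com/index.html"))  # False

# Any other robot ('randombot' is a made-up name) is only kept out of /tmp and /logs.
print(parser.can_fetch("randombot", "http://webcrawler.com/tmp/x.html"))  # False
print(parser.can_fetch("randombot", "http://webcrawler.com/about.html"))  # True

In real use you would point the parser at your own site with set_url("http://yoursite.example/robots.txt") and call read() instead of parse(). Keep in mind this only tells you what a polite robot would do; rude bots can simply ignore /robots.txt.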
If you are arachnophobic like me, then you may continue reading here:
WWW Robots Related Sites
- Bot Spot - "The Spot for All Bots on the Net".
- The Web Robots Pages - Martijn Koster's pages on robots, specifically robot exclusion.
- Japanese Search Engines - A comprehensive index for searching, submitting, and navigating using Japanese search engines.
- Search Engine Watch - A site with information about many search engines, including comparisons. Some information is available to subscribers only.
- RoboGen - A visual editor for Robot Exclusion Files; it lets you create agent rules by logging into your FTP server and selecting files and directories.
But whatever you do, I beg of you to NOT ever kill real-spiders >> EVER! If you find them inside your house, gingerly escort them outside but otherwise "Live and Let Live" is a great motto to follow, when it comes to real spiders!