File: README.txt

Role: Documentation
Content type: text/plain
Description: Usage Examples
Class: Robots_txt
Test if a URL may be crawled looking at robots.txt
Author: Andy Pieters
Date: 16 years ago
Size: 1,604 bytes
 

Contents

The Robots Exclusion Standard is considered proper netiquette, so any kind of script that exhibits crawling-like behaviour is expected to abide by it.

The intended use of this class is to feed it a URL before you intend to visit it. The class will automatically attempt to read the site's robots.txt file and will return a boolean value indicating whether you are allowed to visit that URL.

Crawl-delay and Request-rate values are capped at a maximum of 60 seconds. The class will block until the detected Crawl-delay (or Request-rate) allows visiting the URL. For instance, if Crawl-delay is set to 3, the Robots_txt::urlAllowed() method will block for 3 seconds when called a second time. An internal clock keeps track of the last visit time, so if the delay has already expired, the method will not block.

Example usage:

    foreach($arrUrlsToVisit as $strUrlToVisit) {
        if(Robots_txt::urlAllowed($strUrlToVisit,$strUserAgent)) {
            # visit url, do processing. . .
        }
    }

The simple example above will ensure you abide by the wishes of the site owners.

Note: an unofficial, non-standard extension exists that limits the times at which crawlers are allowed to visit a site. I choose to ignore this extension because I feel it is unreasonable.

Note: you are only *required* to specify your userAgent the first time you call the urlAllowed() method, and only the first value is ever used.

Example usage:

    var_dump(Robots_txt::urlAllowed('http://slashdot.org/','Slurp'));
    var_dump(Robots_txt::urlAllowed('http://slashdot.org/test','Slurp'));
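
To make the blocking behaviour described above concrete, here is a minimal timing sketch. It assumes the class file is named Robots_txt.php, that the target site's robots.txt declares a Crawl-delay of 3 for your user agent, and it uses a placeholder URL and agent string; adjust these to your own setup.

    <?php
    # Timing sketch. Assumptions: the class file is Robots_txt.php, the target
    # site's robots.txt declares "Crawl-delay: 3" for this user agent, and the
    # URL and agent strings below are placeholders.
    require_once 'Robots_txt.php';

    $strUserAgent = 'ExampleBot';              # hypothetical agent string
    $strUrl       = 'http://www.example.com/'; # hypothetical URL

    $fltStart = microtime(true);
    var_dump(Robots_txt::urlAllowed($strUrl, $strUserAgent)); # first call: no delay yet
    var_dump(Robots_txt::urlAllowed($strUrl, $strUserAgent)); # second call: blocks until the Crawl-delay has passed
    $fltElapsed = microtime(true) - $fltStart;

    # With a Crawl-delay of 3, the two calls together should take roughly
    # three seconds or more.
    echo 'Elapsed: ', round($fltElapsed, 2), " seconds\n";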