Web graphs, crawlers, meta tags, robots.txt, and more!

Web graphs

A web graph is a directed graph whose nodes are web pages and whose edges are the hyperlinks between them. It represents the structure of the web, showing how pages are connected to one another.
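For illustration, a tiny web graph can be sketched as a directed adjacency list in Python, where each page maps to the pages it links to; the page names below are made up.

    # A minimal sketch of a web graph as an adjacency list.
    # Each key is a page; each value is the list of pages it links to.
    web_graph = {
        "a.example/index": ["a.example/about", "b.example/"],
        "a.example/about": ["a.example/index"],
        "b.example/":      ["a.example/index", "c.example/post"],
        "c.example/post":  [],
    }

    # Out-degree: how many outgoing links each page has.
    for page, links in web_graph.items():
        print(page, "->", len(links), "outgoing links")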

Applications

Web graphs are used to compute link-based importance measures such as PageRank, to detect spam and link farms, and to study how the web is structured and how it evolves over time.
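As a rough illustration of one such application, here is a small power-iteration sketch of PageRank over an adjacency-list web graph. The damping factor, iteration count, and the tiny example graph are conventional illustrative choices, not values from the text.

    # Power-iteration sketch of PageRank on an adjacency-list graph.
    def pagerank(graph, damping=0.85, iterations=50):
        n = len(graph)
        ranks = {page: 1.0 / n for page in graph}
        for _ in range(iterations):
            new_ranks = {page: (1.0 - damping) / n for page in graph}
            for page, links in graph.items():
                if links:
                    share = damping * ranks[page] / len(links)
                    for target in links:
                        new_ranks[target] += share
                else:
                    # Dangling page: spread its rank over all pages.
                    for target in graph:
                        new_ranks[target] += damping * ranks[page] / n
            ranks = new_ranks
        return ranks

    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    print(pagerank(graph))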

Web Crawlers

A web crawler, also known as a spider or spiderbot, is an internet bot that systematically browses the web to find information. Web search engines in particular use web crawling to keep their content current, to build indices of other sites' content, and to copy pages for later processing by the search engine.

How does it work?
  1. A web crawler first starts with a list of URLs to visit. These initial URLs are called seeds. The crawler visits these URLs by communicating with the web servers that respond to them.

  2. The crawler then identifies all the hyperlinks in the web pages and adds them to a list of URLs to visit, called the crawl frontier. URLs from the frontier are visited recursively according to a set of policies (see the crawler sketch after this list).

  3. If the crawler is archiving websites, it copies and saves the information it collects. The archives are usually stored in such a way that they can be viewed, read, and navigated the way they would be on the live web, preserved as snapshots.
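A minimal sketch of steps 1 and 2 in Python, assuming a breadth-first visiting policy; politeness delays, robots.txt checks, and most error handling are omitted, and the seed URL is a placeholder.

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        """Collects href attributes of <a> tags from a page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                href = dict(attrs).get("href")
                if href:
                    self.links.append(href)

    def crawl(seeds, max_pages=20):
        frontier = deque(seeds)   # crawl frontier: URLs still to visit
        visited = set()
        while frontier and len(visited) < max_pages:
            url = frontier.popleft()
            if url in visited:
                continue
            visited.add(url)
            try:
                html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
            except (OSError, ValueError):
                continue          # skip unreachable or malformed URLs
            extractor = LinkExtractor()
            extractor.feed(html)
            for href in extractor.links:
                frontier.append(urljoin(url, href))   # grow the frontier
        return visited

    print(crawl(["https://example.com/"]))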

Web crawlers often have difficulty avoiding duplicate content because of the enormous number of possible URLs. Many different combinations of HTTP GET parameters refer to the same content; for example, a page that offers several sort orders may expose a different URL for each order even though the underlying content is identical. The crawler must recognize these combinations as variants of the same page in order to retrieve only unique content.
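One common mitigation is to canonicalize URLs before adding them to the frontier, so that different orderings of the same GET parameters map to a single key. A small sketch, with illustrative URLs and parameter names:

    from urllib.parse import urlparse, urlencode, parse_qsl, urlunparse

    def canonicalize(url):
        parts = urlparse(url)
        # Sort query parameters so equivalent URLs compare equal.
        query = urlencode(sorted(parse_qsl(parts.query)))
        return urlunparse((parts.scheme, parts.netloc.lower(), parts.path,
                           parts.params, query, ""))  # drop the fragment

    a = "https://shop.example/list?sort=price&page=1"
    b = "https://shop.example/list?page=1&sort=price"
    print(canonicalize(a) == canonicalize(b))   # True: treated as one page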

Meta Tags

Meta elements are tags used in HTML and XHTML documents to provide structured metadata about a Web page.
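As a sketch of how a crawler might consume such metadata, the snippet below extracts meta name/content pairs (including a robots directive) from a made-up HTML document using Python's standard html.parser:

    from html.parser import HTMLParser

    class MetaReader(HTMLParser):
        """Collects name/content pairs from <meta> tags."""
        def __init__(self):
            super().__init__()
            self.meta = {}

        def handle_starttag(self, tag, attrs):
            if tag == "meta":
                a = dict(attrs)
                if "name" in a and "content" in a:
                    self.meta[a["name"].lower()] = a["content"]

    page = """<html><head>
    <meta name="description" content="An example page">
    <meta name="robots" content="noindex, nofollow">
    </head><body></body></html>"""

    reader = MetaReader()
    reader.feed(page)
    print(reader.meta)   # {'description': 'An example page', 'robots': 'noindex, nofollow'}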

robots.txt

robots.txt is the filename used for implementing the Robots Exclusion Protocol, a standard used by websites to indicate to visiting web crawlers and other web robots which portions of the website they are allowed to visit.
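A minimal sketch of honoring robots.txt from a crawler, using Python's standard urllib.robotparser; the site and user-agent string are placeholders:

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()                     # fetch and parse the file

    if rp.can_fetch("MyCrawler/1.0", "https://example.com/private/page"):
        print("allowed to crawl")
    else:
        print("disallowed by robots.txt")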