The Inverted Index
For background, see this natural language processing playlist.
Web Crawling
At its core, a web crawler is a program that downloads web pages, collects the links on each page, and visits those links to repeat the process. That description understates the complexity of the task considerably, and a practical crawler needs many additional features.
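As a minimal sketch of that basic loop, the following Python uses only the standard library; the seed URL, page limit, and timeout are illustrative assumptions rather than values from the original text.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, max_pages=100):
    frontier = deque([seed])  # URLs waiting to be downloaded
    seen = {seed}             # URLs already discovered, to avoid revisits
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()
        try:
            with urlopen(url, timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except OSError:
            continue          # skip pages that fail to download
        fetched += 1
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return seen


if __name__ == "__main__":
    crawl("https://example.com")  # placeholder seed URL
```

This breadth-first loop is the skeleton; each of the characteristics discussed next adds machinery on top of it.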
To figure out which features a web crawler must implement, start by listing the characteristics the system should have. This can be difficult because many characteristics conflict with one another, forcing trade-offs.
Features and Characteristics
- Scalability: the web is enormous, and indexing any sizable portion of it requires a scalable system. Distributing the crawler across many machines is a good strategy here.
- Niceness: it is considered rude for a crawler to send very frequent requests to a given server, and some servers will block a crawler's IP address if it crawls too aggressively. A common remedy is a minimum delay between requests to the same host, as sketched after this list.
- Freshness: the web is always changing, so the index should be kept up to date.
- Robustness: the web contains spider traps, whether placed maliciously to stop crawlers or created by accident. A crawler should be resilient when it encounters them; the depth cap in the sketch after this list is one crude safeguard.
- Extensibility: the web is always changing, and a crawler should be able to adapt easily.
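Two of these characteristics lend themselves to small sketches. The Python below assumes a crawler that records a per-host timestamp for niceness and tracks how many hops each URL is from the seed for robustness; the delay and depth constants are illustrative assumptions, not recommended values.

```python
import time
from urllib.parse import urlparse

MIN_DELAY_SECONDS = 2.0  # assumed politeness interval per host, not a standard
MAX_DEPTH = 10           # assumed cap on link-following depth

_last_request = {}       # host -> monotonic time of our last request to it


def host_is_ready(url):
    """Niceness: allow a request only if enough time has passed for this host."""
    host = urlparse(url).netloc
    now = time.monotonic()
    if now - _last_request.get(host, float("-inf")) < MIN_DELAY_SECONDS:
        return False     # too soon; leave the URL in the frontier and retry later
    _last_request[host] = now
    return True


def within_depth_budget(depth):
    """Robustness: refuse URLs discovered too many hops from the seed.

    Spider traps often manifest as unbounded chains of generated links,
    so capping depth is a crude but effective safeguard.
    """
    return depth <= MAX_DEPTH
```

In a real crawler these checks would sit in front of the frontier: a URL is handed to a fetcher only once its host is ready and its depth is within budget.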