(More notes to myself)
Inspired by CommonCrawl, but their data isn’t all that I want. After building out my redback system, which I’ve now let sit and die for the last two years (it worked, but had some small scaling issues), here’s a revised idea based on some of the CommonCrawl thinking…
Built in Tornado, with a simple way to dequeue objects from the store, running multiple async fetches. Once an object is fetched, run a simple XPath that can extract secondary real-time fetch objects (e.g. images that are interesting). Store everything in the Store.
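The secondary-fetch extraction could be as small as this; a minimal sketch using the stdlib’s HTMLParser in place of a real XPath expression (lxml’s `//img/@src` would do the same job), with hypothetical names throughout:

```python
from html.parser import HTMLParser


class ImageExtractor(HTMLParser):
    """Collect <img src> values from a fetched page body.

    Stand-in for the XPath step; kept to the stdlib so the
    sketch is self-contained.
    """

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.images.append(src)


def extract_images(html):
    """Return the secondary fetch objects (image URLs) found in html."""
    parser = ImageExtractor()
    parser.feed(html)
    return parser.images
```

In the real system each returned URL would just be queued as another async fetch alongside the page fetches.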
For now either Riak or Mongo – with a mapper.
Sort through all of the “non-processed” documents and extract links based on frequency.
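The mapper pass might be nothing more than a frequency count over each non-processed document’s extracted links (the document shape here is an assumption, taking the links as already parsed out):

```python
from collections import Counter


def rank_links(documents):
    """Map over non-processed documents, counting how often each
    outbound link appears; the most-referenced links get fetched first.
    """
    counts = Counter()
    for doc in documents:
        counts.update(doc["links"])
    # most_common() returns (link, count) pairs, highest count first.
    return counts.most_common()
```

With Riak or Mongo this would run as a map/reduce job over the store rather than in-process.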
— TODO – find reference to CommonCrawl doc.
Basic Process – The first iterations will take a “long” time, but as the schedule table fills the fetcher will always be busy.
- Seed with one URL in the “todo” table
- Fetch and Store (with robots.txt check)
- Map over the data to extract more pages to fetch; check to ensure that a page hasn’t already been fetched before storing it in the “todo” table.
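The steps above can be sketched as a single-process loop; a toy version assuming an injected `fetch(url)` that returns `(body, outlinks)` — in the real system that’s the async Tornado fetcher, and the tables live in Riak/Mongo rather than Python dicts:

```python
import hashlib
from urllib import robotparser


def crawl(seed_url, fetch, robots_txt, max_pages=100):
    """Seed -> fetch and store (with robots.txt check) -> extract -> repeat."""
    robots = robotparser.RobotFileParser()
    robots.parse(robots_txt.splitlines())

    todo = [seed_url]   # the "todo" table, seeded with one URL
    blob = {}           # the "blob" table: url -> (body md5, body)
    seen = {seed_url}   # don't re-queue pages already fetched or queued

    while todo and len(blob) < max_pages:
        url = todo.pop(0)
        if not robots.can_fetch("*", url):   # robots.txt check
            continue
        body, outlinks = fetch(url)
        blob[url] = (hashlib.md5(body.encode()).hexdigest(), body)
        for link in outlinks:
            if link not in seen:
                seen.add(link)
                todo.append(link)
    return blob
```

One simplification to be aware of: a production version would also fetch robots.txt per host instead of taking it as an argument.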
Now the “blob” table will be full of HTML (and image) documents that can be used in any way that one wants.
Blob Table –
- Time Fetched
- Body MD5
Schedule Table –
- Time Inserted
- Reference Count (e.g. PageRank)
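The two tables above might look like this as Mongo-style documents; field names are assumptions, and the schedule table’s reference count is the crude PageRank stand-in mentioned above:

```python
import hashlib
import time


def blob_record(url, body):
    """One row of the blob table: time fetched plus body MD5."""
    return {
        "_id": url,
        "time_fetched": time.time(),
        "body_md5": hashlib.md5(body).hexdigest(),
        "body": body,
    }


def schedule_record(url, reference_count=1):
    """One row of the schedule table: time inserted plus reference count."""
    return {
        "_id": url,
        "time_inserted": time.time(),
        "reference_count": reference_count,
    }
```

The body MD5 gives cheap dedup (same content at two URLs), and sorting the schedule table by reference count gives the fetch ordering.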