Chicken Scratches on Crawler

(More notes to myself)

Inspired by CommonCrawl, but their data isn’t all that I want.  I built out my redback system, then let it sit and die for the last two years (it worked, but had some small scaling issues).  Here is a revised idea based on some of the CommonCrawl thinking…

– Fetcher

Built in Tornado, with a simple way to dequeue objects from the store, running multiple async fetches.  Once an object is fetched, a simple XPath extracts secondary real-time fetch objects (e.g. images that are interesting).  Store everything in the Store.

– Store

For now either Riak or Mongo – with a mapper.

– Mapper

Sort through all of the “non-processed” documents and extract links based on frequency.
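A rough sketch of what the mapper could look like, assuming the unprocessed documents arrive as (url, html) pairs.  The names LinkCollector and map_links are made up for illustration, and it uses the standard library HTML parser rather than XPath just to stay dependency free.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute link targets from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page they came from.
                self.links.append(urljoin(self.base_url, value))


def map_links(documents):
    """Count how often each URL is linked across the given documents.

    `documents` is an iterable of (url, html_body) pairs (an assumption
    about what the store hands back).
    """
    counts = Counter()
    for url, body in documents:
        collector = LinkCollector(url)
        collector.feed(body)
        counts.update(collector.links)
    return counts


docs = [
    ("http://example.com/", '<a href="/a">A</a> <a href="/b">B</a>'),
    ("http://example.com/a", '<a href="/b">B again</a>'),
]
link_counts = map_links(docs)
```

Calling link_counts.most_common() then gives the links ordered by frequency, which is one possible way to decide what goes into the “todo” table first.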

– TODO – find reference to CommonCrawl doc.

Basic Process – The first iterations will take a “long” time to complete, but as the schedule table fills the fetcher will always be busy.

  • Seed with one URL in the “todo” table
  • Fetch and Store (with robots.txt check)
  • Map over the data to extract more pages to fetch; after checking that a page hasn’t already been fetched, store it in the “todo” table.
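The three steps above can be sketched as a small loop.  This is a toy version under heavy assumptions: the crawl function, the fetch and extract_links callables, and the fake PAGES “web” are all invented for illustration, and plain in-memory structures stand in for the store.  Only the robots.txt check uses a real API, the standard library urllib.robotparser.

```python
from collections import deque
from urllib.robotparser import RobotFileParser


def crawl(seed, fetch, extract_links, robots, max_pages=100):
    """Seed, fetch and store, then map new links back into the todo queue."""
    todo = deque([seed])   # stands in for the "todo" table
    blob = {}              # stands in for the "blob" table: url -> body
    while todo and len(blob) < max_pages:
        url = todo.popleft()
        # Skip pages we already have, and pages robots.txt disallows.
        if url in blob or not robots.can_fetch("*", url):
            continue
        body = fetch(url)
        blob[url] = body
        for link in extract_links(url, body):
            if link not in blob:
                todo.append(link)
    return blob


# A fake three-page web so the sketch runs without a network.
PAGES = {
    "http://example.com/": ["http://example.com/a", "http://example.com/private/x"],
    "http://example.com/a": ["http://example.com/"],
    "http://example.com/private/x": [],
}

robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])

blob = crawl(
    "http://example.com/",
    fetch=lambda url: " ".join(PAGES[url]),
    extract_links=lambda url, body: body.split(),
    robots=robots,
)
```

A real fetcher would run many of these fetches concurrently and keep one robots.txt parser per host; the loop here is sequential only to keep the flow readable.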

Now the “blob” table will be full of HTML (and Image) documents that can be used in any way that one wants.

Blob Table -

  • URL
  • Time Fetched
  • Body MD5
  • Headers
  • Body

Schedule Table -

  • URL
  • Time Inserted
  • Reference Count (e.g. PageRank)
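One way the two tables might look as plain records.  The field names follow the lists above; everything else (Blob, ScheduleEntry, make_blob, schedule, and the idea of bumping the reference count on every repeat sighting) is an assumption for the sake of the sketch, not a settled design.

```python
import hashlib
import time
from dataclasses import dataclass, field


@dataclass
class Blob:
    url: str
    time_fetched: float
    body_md5: str
    headers: dict
    body: bytes


@dataclass
class ScheduleEntry:
    url: str
    time_inserted: float = field(default_factory=time.time)
    reference_count: int = 1  # crude stand-in for a PageRank-like score


def make_blob(url, headers, body):
    """Build a blob row, hashing the body so duplicates can be spotted."""
    return Blob(url, time.time(), hashlib.md5(body).hexdigest(), headers, body)


def schedule(todo, blobs, url):
    """Queue a URL unless it was already fetched; count repeat sightings."""
    if url in blobs:
        return
    if url in todo:
        todo[url].reference_count += 1
    else:
        todo[url] = ScheduleEntry(url)


todo, blobs = {}, {}
schedule(todo, blobs, "http://example.com/")
schedule(todo, blobs, "http://example.com/")   # second sighting bumps the count
page = make_blob("http://example.com/", {"Content-Type": "text/html"}, b"<html></html>")
```

With Riak or Mongo the two dicts would become buckets or collections keyed by URL; the records themselves should carry over mostly unchanged.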


BackboneJS and RequireJS configuration

One of the big challenges is getting backbonejs and requirejs working together without jumping through lots of hoops.  After reading quite a few blog posts and Stack Overflow answers, I finally came up with a simple canonical solution to the problem.

The assumed directory layout is something like:

           ...third party stuff like backbone, jquery, etc.etc.
  <script data-main="/static/app/config" src="/static/vendor/require.js"></script>

Cutting to the chase, here’s the configuration file I use:

// This is your: static/app/config.js file
require.config({
    // Load main.js (relative to this file's directory) once configured
    deps: ["main"],

    paths: {
        // Libraries
        jquery: "../vendor/jquery",
        underscore: "../vendor/underscore",
        backbone: "../vendor/backbone",
        handlebars: "../vendor/handlebars",
        bootstrap: "../vendor/bootstrap"
    },

    // Non-AMD libraries: declare the global each one exports
    shim: {
        jquery: {
            exports: '$'
        },
        underscore: {
            exports: '_'
        },
        handlebars: {
            exports: 'Handlebars'
        },
        backbone: {
            deps: ['underscore', 'jquery'],
            exports: 'Backbone'
        }
    }
});

With that line in your HTML file and a configuration similar to what I’ve shown, you should be up and running.  You can now create your main.js file, which has the following simple format that you’ll be extending shortly.

// This is your: static/app/main.js file
//   - this is loaded as the deps: ["main"] in your configuration
require([
  // Libs
  "jquery", "underscore", "backbone"
],
function ($, _, Backbone) {
  // Application code goes here.
});

Punctuation in Language Design is Good

It’s all about CoffeeScript (and quietly about Ruby).

Why do people feel the need to remove punctuation?  I came across the following code snippet:

class TenFarms.View extends Backbone.View
  constructor: ->
    functions = _.difference _.functions(this), _.functions(TenFarms.View.prototype)
    _.bindAll.apply _, [this, "render"].concat(functions)

What I like and dislike – in line # order

  1. This is much better than the JavaScript equivalent – hands down.
  2. Nice, define a function.
  3. The beef
    GOOD – Sure you don’t need “var”
    BAD – Why no parens around _.difference(_.functions(this), …)
  4. Again, we’re depending on whitespace to cause things to happen.
    Think about this: is “A B” a function call?  What about “A B, C D” – did I forget a comma, or is that two function calls, “A(B, C(D))”?
  5. “super” – that’s good, better than the JavaScript version.

Bottom line: when I’m reading Ruby or CoffeeScript I’m driven somewhat batty trying to work out when a function call happens.  All because they dropped the (…) requirement and now have some ambiguous cases.  Part of me has to wonder sometimes if the heavy emphasis on TDD is because of the adoption of such programming languages.

Template Languages

Somehow I keep on building websites, re-building or otherwise…  Sometimes I think it’s just practice, like playing a piano: most of the time it doesn’t have any true purpose, but it allows me to try different techniques and approaches.  When I think of big systems (NoSQL datastores, for example), it’s really hard to just play; it takes weeks to build the base system…  Then if you want to rebuild the IO subsystem it’s pretty involved, though I’ve been known to do this…

Now back to templates!

I like the Zend Framework in PHP; it really helps abstract you from where the mess comes in (if you live in PHP).  However, it doesn’t relieve you of the worst risk in the language, which is that your templates are still in PHP.  One of the biggest points that drove me crazy at Yahoo! was their input filtering policy, which mandated that some characters be removed from input to prevent people from accidentally letting them flow out to the HTML.  The challenge with this approach is that all of a sudden you can’t have “&” or “<” in an input string…  You’re masking a class of bugs with a policy.
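To make the difference concrete, here is a tiny sketch of the two approaches.  The filter_input function is my own stand-in for that kind of policy (not Yahoo!’s actual filter), and escape_output just wraps Python’s html.escape.

```python
import html


def filter_input(text):
    """Policy-style filtering: strip the 'dangerous' characters on the way in."""
    return text.replace("&", "").replace("<", "").replace(">", "")


def escape_output(text):
    """Escape on the way out: the stored data stays exactly as entered."""
    return html.escape(text, quote=True)


name = "AT&T <Research>"
filtered = filter_input(name)    # the data itself is now damaged
escaped = escape_output(name)    # safe in HTML, and reversible
```

The filtered string has lost information for good, while the escaped one can be turned back into the original with html.unescape.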

The following three lines are three different template systems for the same function:

   <input id="name" name="nm" type="text" value="<?= htmlspecialchars($filter['nm']) ?>" autocomplete="off" />
   <input id="name" name="nm" type="text" value="{{ query.get('nm','') }}" autocomplete="off"/>
   <input id="name" name="nm" type="text" value="{{ query.nm }}" autocomplete="off"/>

These are all real lines, two of which are in production – one is demonstration.

Case #1 – You need to explicitly escape your output, which means you mandate code reviews, since bad things happen when people get tired of typing htmlspecialchars().  Which they do, because 30% of your code is “internal” safe variables that you don’t need to escape…  But of course you do, since the source data might have come in via an AJAX API that you implemented 6 months later…

Case #2 – Good, escaping by default.  But it turns out that now you’re spelling out a default for everything.  Typically the default you want is the empty string, yet failing to provide one yields “None” or “null”.  More typing by rote: good, but not great.

Case #3 – I like this approach.  Two big value adds: defaults are sensible, and you don’t have to worry about the kind of data you’re dealing with.  Is it an object, a dictionary, an array ( result.location.3.title ), or a function?  By the time you get to the template level, you really just want a sensible default and an ERROR kicked into a logfile.  Don’t throw an error if result.location only has two items.
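A minimal sketch of that kind of forgiving lookup, roughly in the spirit of how Django-style templates resolve dotted names.  The resolve and render_value functions and the logger name are hypothetical, written only to illustrate the behaviour described above.

```python
import html
import logging

log = logging.getLogger("template")


def resolve(context, path, default=""):
    """Walk a dotted path through dicts, sequences, attributes and callables."""
    value = context
    for part in path.split("."):
        try:
            if isinstance(value, dict):
                value = value[part]
            elif isinstance(value, (list, tuple)):
                value = value[int(part)]
            else:
                value = getattr(value, part)
            if callable(value):
                value = value()
        except (KeyError, IndexError, ValueError, AttributeError, TypeError):
            # Log the problem, but never blow up the page over it.
            log.error("template lookup failed: %s (at %r)", path, part)
            return default
    return value


def render_value(context, path):
    """Resolve a name, then HTML-escape it, so escaping is the default."""
    value = resolve(context, path)
    return html.escape("" if value is None else str(value), quote=True)


ctx = {
    "query": {"nm": 'Tom & "Jerry"'},
    "result": {"location": [{"title": "A"}, {"title": "B"}]},
}
name_attr = render_value(ctx, "query.nm")
missing = render_value(ctx, "result.location.3.title")  # only two items exist
```

The out-of-range lookup yields an empty string and a log entry instead of an exception, which is the trade-off argued for in Case #3.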

Case #4 – Strange languages like HAML.  At the end of the day HTML is the lingua franca between Engineering and Design…  Why introduce translation?

At the end of the day, really think about how you’re abstracting things.  The goal of a good system  is the ability to get the maximum amount of work done with the minimum amount of syntax.  Since the more we type, the more mistakes happen, the more review is needed…

FYI – We’re looking at PHP, Tornado and Django (it could have been Handlebars).  Tornado makes no claims about being a full-featured template system, but until it gets “big” it’s quick to whip things up in.  I really should wire up the 15 line Tornado<->Django connector that pulls Django templates into Tornado, but that’s another day.

How to destroy brand value

As most of us know there is that (Y!) branding at the bottom of your iPhone Weather app.  What’s funny is that fundamentally this is eroding the brand value of Yahoo!

The other day I pulled up the weather on my phone, saw the forecast and said flatly “That is not the weather for today”.  So the (Y!) branded experience says that Yahoo! does not provide correct or accurate information.  Think about this for a minute: the next time I need weather, or any other piece of information, I personally am going to distrust Yahoo! and look for a better information provider.

Wait – now why don’t I hold Apple to the same standard?  It is an iPhone after all.  Apple has done a good job of distancing itself from the contents of the app and the quality of the network, while keeping the cool factor of the device itself.  Now all I want is the ability to remove the Yahoo! Weather app and replace it with my “vendor X” weather app which I trust.

Quality vs. Quantity

Two instances of things got to me last week, in two different venues.

First – I hate Walmart.  About once every two years I walk into the store thinking that I might buy something, and then walk out in disgust.  Sometimes it’s the store across the street from the office, which is just plain dirty.  In this case I needed a duffle bag, nothing fancy, just something to put a few days of clothes into.  A quick visit to the store in Carson City is in order…  The person in the store is helpful and directs me to the right place (HUGE store)…  Now I stare at the bags that they have: they are out and out cheap, maybe they would last a week or two before they totally disintegrate, and of course they cost next to nothing.  Instead of getting a bag that is destined to fail at some unknown moment in the near future, I walk out and find another store to get something that will last a real amount of time and cost a tiny bit more.

Second – Riding the chairlift with somebody who was reminiscing about having a picnic style lunch in the Alps while skiing, then later in the day talking to somebody about eating lunch in Portillo.  Both conversations tied back to dining at Mammoth (which is better than most mountains).  The challenge we observed is the scale at which they’re preparing food.

I could have titled this post Boutique or Bulk.

  • As a consumer I want the highest quality experience.
  • As a capitalistic society we focus on price.

As a marketplace we’re amazed at what happens when quality comes first:

  • Disney -
  • Apple – the iPhone (’nuff said)
  • Ritz-Carlton -