Synergy is Fun

One could say I’ve got too many projects with too much free time, but another way to look at things is that constant exploration can put a smile on your face.  I’ve been working on a few projects:

GearTracker (http://geartracker.com)

This is my big project, which I really need some product marketing help on.  I’ve gotten most of the infrastructure in place, but need somebody to come in and help me put the finishing 20% on top to make the first-time experience right, and probably to catch where I’ve just been wrong-headed in my thinking.

Note – Really, could use help…

Spiro (https://github.com/koblas/spiro)

One of the things that I keep on needing is a “midscale” web crawler.  I would like to be able to crawl 100M pages without much thought, while obeying all of robots.txt, being nice to sites, etc.  What’s interesting is that the crawler is the easy part; it’s being nice to sites that is difficult.
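To make “being nice” a bit more concrete, here is a minimal sketch of the per-host bookkeeping, using only the Python standard library. It is not Spiro’s actual code: the names USER_AGENT, DEFAULT_DELAY and HostPolicy are hypothetical, and the one-second fallback delay is an assumption.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-crawler"  # hypothetical user agent string
DEFAULT_DELAY = 1.0             # assumed fallback delay in seconds


class HostPolicy:
    """robots.txt rules plus the time of the last request, for one host."""

    def __init__(self, robots_txt):
        self.rules = RobotFileParser()
        self.rules.parse(robots_txt.splitlines())
        self.last_fetch = None

    def allowed(self, url):
        # True when robots.txt permits this user agent to fetch the URL.
        return self.rules.can_fetch(USER_AGENT, url)

    def wait_needed(self, now):
        # Seconds to wait before the next request to this host.
        delay = self.rules.crawl_delay(USER_AGENT) or DEFAULT_DELAY
        if self.last_fetch is None:
            return 0.0
        return max(0.0, self.last_fetch + delay - now)

    def mark_fetched(self, now):
        self.last_fetch = now
```

A fetcher would keep one HostPolicy per hostname, skip URLs where allowed() is false, and put a URL back on the queue whenever wait_needed() is greater than zero.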

Distal (https://github.com/koblas/distal)

This is where synergies come from!  Originally I started Spiro with the idea of having a nice Bootstrap and Backbone UI, but it was clunky.  In one of those moments of frustration I started looking at EmberJS, which is way overkill… Ok, not really overkill, but not what I wanted.

So after looking at EmberJS I was inspired to produce a view framework that really tried to mimic the best parts, but keep the Backbone spirit alive.  Thus Distal was born.

What’s great is that I’ve had a chance to use it in GearTracker – which is where I’ve started to understand the whole client side UI development model, then take it back to Spiro to use again and break some of the ideas that I had in GearTracker.

It’s all about Synergies!

Ideas build on ideas, tools build on tools.  Those moments where you want a big old toggle switch to turn off your crawler can end up being 15 lines of JS (+15 of boilerplate) and 10 lines of Python, and you’ve got a working control on a page that talks to the server in real time to enable and disable things.

You may have just said “big deal”, but it really is a big deal.  I didn’t have to build a bunch of templates or do a full server reload of the page.  It’s a button, and it operates like a button.

Chicken Scratches on Crawler

(More notes to myself)

Inspired by CommonCrawl, but their data isn’t all that I want.  After building out my redback system, which I’ve now let sit and die for the last two years (it worked, but had some small scaling issues), here is a revised idea based on some of the CommonCrawl thinking…

– Fetcher

Built in Tornado, with a simple way to dequeue objects from the store, running multiple async fetches.  Once an object is fetched, have a simple XPath that can extract secondary real-time fetch objects (e.g. images that are interesting).  Store everything in the Store.

– Store

For now either Riak or Mongo – with a mapper.

– Mapper

Sort through all of the “non-processed” documents and extract links based on frequency.

– TODO – find reference to CommonCrawl doc.

Basic Process – The first iterations will take a “long” time to run, but as the schedule table fills the fetcher will always be busy.

  • Seed with one URL in the “todo” table
  • Fetch and Store (with robots.txt check)
  • Map over the data to extract more pages to fetch; after checking to ensure that a page hasn’t already been fetched, store it in the “todo” table.
  • REPEAT
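
The loop above can be sketched in a few lines of Python. This is only an illustration under some loud assumptions: the tables are in-memory structures rather than Riak or Mongo, the fetch and allowed callables are hypothetical stand-ins for the Tornado fetcher and the robots.txt check, a crude regex replaces the XPath step, and the mapper is folded into the loop instead of running as a separate pass.

```python
import re
from collections import deque
from urllib.parse import urljoin

# Crude link extractor; a real mapper would use an HTML parser or XPath.
LINK_RE = re.compile(r'href="([^"]+)"')


def crawl(seed, fetch, allowed=lambda url: True, max_pages=100):
    """Run the seed / fetch / map / repeat loop and return the blob table."""
    todo = deque([seed])  # the "todo" table
    blobs = {}            # the "blob" table: url -> body
    while todo and len(blobs) < max_pages:
        url = todo.popleft()
        if url in blobs or not allowed(url):
            continue
        body = fetch(url)  # fetch and store
        blobs[url] = body
        # Map step: queue every link that is neither fetched nor queued yet.
        for href in LINK_RE.findall(body):
            link = urljoin(url, href)
            if link not in blobs and link not in todo:
                todo.append(link)
    return blobs
```

Passing in a fetch function that reads from a dictionary of canned pages is an easy way to exercise the loop without touching the network.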

Now the “blob” table will be full of HTML (and Image) documents that can be used in any way that one wants.

Blob Table -

  • URL
  • Time Fetched
  • Body MD5
  • Headers
  • Body

Schedule Table -

  • URL
  • Time Inserted
  • Reference Count (e.g. PageRank)
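
As a sketch of what the two records might look like in code, here are plain Python dataclasses. The field names follow the lists above; everything else (computing the MD5 from the body, the schedule() helper, using a dict as the table) is an assumption for illustration, not a description of a real schema.

```python
import hashlib
import time
from dataclasses import dataclass, field


@dataclass
class Blob:
    """One row of the blob table."""
    url: str
    headers: dict
    body: bytes
    time_fetched: float = field(default_factory=time.time)

    @property
    def body_md5(self):
        # Derived from the body, so identical documents are easy to spot.
        return hashlib.md5(self.body).hexdigest()


@dataclass
class ScheduleEntry:
    """One row of the schedule table."""
    url: str
    time_inserted: float = field(default_factory=time.time)
    reference_count: int = 0


def schedule(table, url):
    """Insert the URL if it is new and bump its reference count (hypothetical helper)."""
    entry = table.setdefault(url, ScheduleEntry(url))
    entry.reference_count += 1
    return entry
```

Counting references at insert time gives the fetcher a cheap priority signal: the URLs that many pages link to float to the top.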

 

BackboneJS and RequireJS configuration

One of the big challenges is getting BackboneJS and RequireJS working together without jumping through lots of hoops.  After reading quite a few blog posts and Stack Overflow answers, I finally came up with a simple canonical solution to the problem.

The assumed directory layout is something like:

static/
       app/
           config.js
           main.js
       vendor/
           ...third party stuff like backbone, jquery, etc.

The HTML page loads RequireJS with a single script tag:

  <script data-main="/static/app/config" src="/static/vendor/require.js"></script>

Cutting to the chase, here’s the configuration file I use:

// This is your: static/app/config.js file
//
require.config({
    deps: ["main"],

    paths: {
        // Libraries
        jquery: "../vendor/jquery",
        underscore: "../vendor/underscore",
        backbone: "../vendor/backbone",
        handlebars: "../vendor/handlebars",
        bootstrap: "../vendor/bootstrap"
    },

    shim: {
        jquery: {
            exports: '$'
        },
        underscore: {
            exports: '_'
        },
        handlebars: {
            exports: 'Handlebars'
        },
        backbone: {
            deps: ['underscore', 'jquery'],
            exports: 'Backbone'
        }
    }
});

With that script tag in your HTML file and a configuration similar to what I’ve shown, you should be up and running.  You can now create your main.js file, which has the following simple format that you’ll be extending shortly.

// This is your: static/app/main.js file
//   - this is loaded via deps: ["main"] in your configuration
require([
  // Libs
  "jquery",
  "underscore",
  "backbone"
],

function ($, _, Backbone) {
  // Application start-up code goes here.
});

Punctuation in Language Design is Good

It’s all about CoffeeScript (and quietly about Ruby).

Why do people feel the need to remove punctuation?  I came across the following code snippet:

class TenFarms.View extends Backbone.View
  constructor: ->
    functions = _.difference _.functions(this), _.functions(TenFarms.View.prototype)
    _.bindAll.apply _, [this, "render"].concat(functions)
    super

What I like and dislike – in line # order

  1. This is much better than the JavaScript equivalent – hands down.
  2. Nice, define a function.
  3. The beef
    GOOD – Sure you don’t need “var”
    BAD – Why no parens around _.difference(_.functions(this), …)?
  4. Again, we’re depending on whitespace to cause things to happen.
    Think about this: is “A B” a function call?  What about “A B, C D”: did I forget a comma, or is that two function calls, “A(B, C(D))”?
  5. “super”, that’s good – better than the JavaScript version.

Bottom line: when I’m reading Ruby or CoffeeScript, I’m driven somewhat batty trying to work out when a function call happens.  All because they dropped the (…) requirement and now have some ambiguous cases.  Part of me has to wonder sometimes if the heavy emphasis on TDD is because of the adoption of such programming languages.

Template Languages

Somehow I keep on building websites, re-building or otherwise…  Sometimes I think it’s just practice, like playing a piano: most of the time it doesn’t have any true purpose, but it allows me to try different techniques and approaches.  When I think of big systems, NoSQL datastores for example, it’s really hard to just play; it takes weeks to build the base system…  Then if you want to rebuild the IO subsystem it’s pretty involved, though I’ve been known to do this…

Now back to templates!

I like the Zend Framework in PHP, it really helps abstract you from where the mess comes in (if you live in PHP).  However, it doesn’t relieve you of the worst risk in the language which is your templates are still in PHP.  One of the biggest points that drove me crazy at Yahoo! was their input filtering policy, where they mandated that some characters had to be removed from input to prevent people from accidentally letting them flow out to the HTML.   The challenge with this approach is that now all of a sudden you can’t have “&” or “<” in an input string…  You’re masking a class of bugs with a policy.
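
To show the difference between the two approaches, here is a small Python sketch. The exact characters that policy stripped are not something I have written down, so the set in strip_dangerous() is an assumption, and both function names are made up for this example.

```python
import html


def strip_dangerous(text):
    """Input filtering: delete characters that are unsafe in HTML (assumed set)."""
    return "".join(ch for ch in text if ch not in "&<>\"'")


def render_value(text):
    """Output escaping: keep the data intact and encode it where it meets HTML."""
    return html.escape(text, quote=True)


name = "AT&T <3"
print(strip_dangerous(name))  # the stored data is permanently altered
print(render_value(name))     # the data is preserved, only the markup is encoded
```

Filtering changes what the user typed forever, while escaping is reversible and only happens at the point where the value is written into the page.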

The following three lines are three different template systems for the same function:

   <input id="name" name="nm" type="text" value="<?= htmlspecialchars($filter['nm']) ?>" autocomplete="off" />
   <input id="name" name="nm" type="text" value="{{ query.get('nm','') }}" autocomplete="off"/>
   <input id="name" name="nm" type="text" value="{{ query.nm }}" autocomplete="off"/>

These are all real lines, two of which are in production – one is demonstration.

Case #1 – You need to explicitly escape your output, which means you have to mandate code reviews, since bad things happen when people get tired of typing htmlspecialchars().  Which they do, because 30% of your code is “internal” safe variables, and you don’t need to escape those…  But, of course you do, since the source data might have come in via an AJAX API that you implemented 6 months later…

Case #2 – Good, escaping by default.  But it turns out that you now have to supply a sensible default for everything.  Typically the default is the empty string, and failing to provide one yields “None” or “null” in the output.  More typing by rote; good, but not great.

Case #3 – I like this approach.  Two big value adds: defaults are sensible, and you don’t have to worry about the kind of data you’re dealing with.  Is it an object, is it a dictionary, is it an array ( result.location.3.title ), is it a function…  By the time you get to the template level, you really just want a sensible default and an ERROR kicked into a logfile.  Don’t throw an error if result.location only has two items.

Case #4 – Strange languages like HAML.  At the end of the day HTML is the Lingua Franca between Engineering and Design…  Why introduce translation?
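
Going back to Case #3, the lookup behaviour I’m describing can be sketched in Python roughly as follows. This is not how Django or any other template engine actually implements it; resolve() and the “templates” logger name are hypothetical, and the sketch only covers dictionaries, lists, attributes and callables.

```python
import html
import logging

log = logging.getLogger("templates")  # hypothetical logger name


def resolve(context, path, default=""):
    """Look up a dotted path such as 'result.location.3.title'.

    Missing keys, short lists and missing attributes are logged and
    replaced by the default instead of raising. The result is HTML-escaped.
    """
    value = context
    for part in path.split("."):
        try:
            if isinstance(value, dict):
                value = value[part]
            elif isinstance(value, (list, tuple)):
                value = value[int(part)]
            else:
                value = getattr(value, part)
        except (KeyError, IndexError, ValueError, AttributeError):
            log.error("template lookup failed: %s (at %r)", path, part)
            return default
        if callable(value):
            value = value()
    if value is None:
        return default
    return html.escape(str(value))
```

With a context where result.location has only two items, resolve(context, "result.location.3.title") returns the empty string and leaves one error line in the log, which is exactly the trade-off I want at the template level.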

At the end of the day, really think about how you’re abstracting things.  The goal of a good system  is the ability to get the maximum amount of work done with the minimum amount of syntax.  Since the more we type, the more mistakes happen, the more review is needed…

FYI – We’re looking at PHP, Tornado and Django (it could have been Handlebars).  Tornado makes no claims about being a full-featured template system, but until it gets “big” it’s quick to whip things up in.  I really should wire up the 15 line Tornado<->Django connector that pulls Django templates into Tornado, but that’s another day.

How to destroy brand value

As most of us know there is that (Y!) branding at the bottom of your iPhone Weather app.  What’s funny is that fundamentally this is eroding the brand value of Yahoo!

The other day I pulled up the weather on my phone, saw the forecast and said flatly “That is not the weather for today”.  So, the (Y!) branded experience says that you do not provide correct or accurate information.  Think about this for a minute: the next time I need weather, or any other piece of information, I personally am going to distrust Yahoo! and look for a better information provider.

Wait – now why don’t I hold Apple to the same standard?  It is an iPhone after all.  Apple has done a good job of distancing itself from the contents of the app and the quality of the network, and of keeping the focus on the cool factor of the device itself.  Now all I want is the ability to remove the Yahoo! Weather app and replace it with my “vendor X” weather app which I trust.