Reproducible installation

The learning curve on Chef is a lot steeper than I would like.

Can’t say that I’ve got it running the way I would like.

But, I can now provision a devbox on AWS in 5 minutes with basic packages and users in a sane state.

It took 2 days of fiddling.

Now back to developing systems on top of basic environments.

Things left to learn –

  • How to get private keys onto the machine so git integration is smooth
  • How to check out private repos with Chef onto the boxes
  • The big challenge is how to “auto” configure an environment with AWS Elastic IPs or other tidbits.



Twitter is the Forest

Was reminded this morning.

I check my Facebook page for updates, but I never check my Twitter page. I’ll watch the Twitter feed during the day.

I’m thinking it’s like a tree in the woods: if it falls and nobody’s watching, do you care? If it’s important, it’s on Facebook.


Fizz Buzz

Seen in a thread on Hacker News about Fizz Buzz and “interesting” functional ways to solve it. I realized that there are many ways to boil the ocean, but this feels like a nice compromise between data/program separation and language.

Note this uses a “Bazz” variant of the FizzBuzz problem, where Bazz is printed for every multiple of 7.
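
The original solution isn't reproduced in this excerpt, so what follows is only a minimal Python sketch of the general idea, assuming the rules are kept as data (a list of divisor/word pairs) and the program just walks that list. The names `RULES` and `fizzbuzz` are made up for illustration.

```python
# The rules live in data: (divisor, word) pairs, checked in order.
RULES = [(3, "Fizz"), (5, "Buzz"), (7, "Bazz")]


def fizzbuzz(n, rules=RULES):
    """Join the words of every rule whose divisor divides n, else return n as text."""
    words = "".join(word for divisor, word in rules if n % divisor == 0)
    return words or str(n)


if __name__ == "__main__":
    # 105 = 3 * 5 * 7 is the first number that triggers all three words.
    for i in range(1, 106):
        print(fizzbuzz(i))
```

Adding another word means adding a pair to `RULES`; the function itself doesn't change.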


MLB Needs the Lance Rule

The Oakland A’s win last night got me thinking about MLB.  The A’s have had a great end to the season, while the Giants have fumbled around at the end.  But wait, they had to suspend Melky for drug use…  Could there be a correlation?

Sorry, I’m not going to do a bunch of statistics, but more a thought experiment.

We all know the tale of Lance and his Tour jerseys, but MLB doesn’t have the same remediation.  Melky gets fined and suspended for a bunch of games, but there are no retroactive penalties.  So, if you have one or more players on your team who can get you to a playoff seat, you happily turn a blind eye.  Sure, there might be some $$ penalties, but if you get into the playoffs that’s probably a wash in terms of the revenue you can earn.

What does that really mean?  As a team, you will turn a blind eye to the use of performance-enhancing drugs as long as it helps you get to the playoffs.  The player may be suspended, but may show up in time for a few more games, or you could trade them…

I would propose that MLB adopt the “Lance Rule”: if you’re found to have a player using PE drugs, you retroactively forfeit your last 7 weeks of games.  That would pretty much ensure that the team’s goals match the outcome.


Synergy is Fun

One could say I’ve got too many projects with too much free time, but another way to look at things is that constant exploration can put a smile on your face.  I’ve been working on a few projects:

GearTracker (

This is my big project, which I really need some product marketing help on.  I’ve gotten most of the infrastructure in place, but need somebody to come in and help me put the finishing 20% on top to make the first-time experience right, and probably catch where I’ve just been wrongheaded in my thinking.

Note – Really, could use help…

Spiro (

One of the things that I keep on needing is a “midscale” web crawler.  I would like to be able to crawl 100M pages without much thought, while obeying all of robots.txt, being nice to sites, etc.  What’s interesting is that the crawler is the easy part; it’s being nice to sites that is difficult.
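
As a rough illustration of the "being nice" part, here is a small sketch using Python's standard `urllib.robotparser`. It is not Spiro's code: the user agent string, the helper names, and the one-second default delay are all assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "spiro-sketch"  # hypothetical user agent, not Spiro's real one


def build_parser(robots_txt):
    """Parse the text of a robots.txt file that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser, url, user_agent=USER_AGENT):
    """True if the site's robots.txt allows this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


def polite_delay(parser, default=1.0, user_agent=USER_AGENT):
    """Seconds to wait between requests: Crawl-delay if given, else a default."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


EXAMPLE = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

if __name__ == "__main__":
    rules = build_parser(EXAMPLE)
    print(may_fetch(rules, "http://example.com/index.html"))
    print(may_fetch(rules, "http://example.com/private/page.html"))
    print(polite_delay(rules))
```

The harder part is the bookkeeping around this: one parser and one "last fetched" timestamp per host, so the delay is actually honored across many concurrent fetches.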

Distal (

This is where synergies come from!  Originally I started Spiro with the idea of having a nice Bootstrap and Backbone UI, but it was clunky.  In one of those moments of frustration I started looking at EmberJS, which is way overkill… OK, not really overkill, but not what I wanted.

So after looking at EmberJS I was inspired to produce a view framework that really tried to mimic the best parts, but keep the Backbone spirit alive.  Thus Distal was born.

What’s great is that I’ve had a chance to use it in GearTracker – which is where I’ve started to understand the whole client side UI development model, then take it back to Spiro to use again and break some of the ideas that I had in GearTracker.

It’s all about Synergies!

Ideas build on ideas, and tools build on tools.  Those moments where you want a big old toggle switch to turn off your crawler can end up being 15 lines of JS (+15 of boilerplate) and 10 lines of Python.  And you’ve got a working control on a page, which does a real-time update of the server to enable and disable things.

You might say “big deal”, but it really is a big deal: I didn’t have to build a bunch of templates or do a full server reload of the page.  It’s a button, and it operates like a button.


Chicken Scratches on Crawler

(More notes to myself)

Inspired by CommonCrawl, but their data isn’t all that I want.  I built out my redback system and have now let it sit and die for the last two years (it worked, but had some small scaling issues).  This is a revised idea based on some of the CommonCrawl thinking…

— Fetcher

Built in Tornado, with a simple way to dequeue objects from the store and run multiple async fetches.  Once an object is fetched, a simple XPath can extract secondary real-time fetch objects (e.g. images that are interesting).  Everything gets stored in the Store.

— Store

For now either Riak or Mongo – with a mapper.

— Mapper

Sort through all of the “non-processed” documents and extract links based on frequency.
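
A rough sketch of what the mapper step could look like, using only the Python standard library. It assumes the unprocessed documents arrive as (url, html) pairs and that "frequency" means how many times a link target is seen across them; `LinkCollector` and `count_links` are made-up names.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collect absolute http(s) link targets from the <a href> tags of one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop any #fragment.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                if absolute.startswith(("http://", "https://")):
                    self.links.append(absolute)


def count_links(documents):
    """documents: iterable of (url, html) pairs. Returns Counter: link -> times seen."""
    counts = Counter()
    for url, html in documents:
        collector = LinkCollector(url)
        collector.feed(html)
        counts.update(collector.links)
    return counts
```

Calling `count_links(docs).most_common()` then gives the candidate URLs ordered from most to least referenced.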

— TODO – find reference to CommonCrawl doc.

Basic Process – The first iterations will take a “long” time, but as the schedule table fills the fetcher will always be busy.

  • Seed with one URL in the “todo” table
  • Fetch and Store (with robots.txt check)
  • Map over the data to extract more pages to fetch; after checking that a page hasn’t already been fetched, store it in the “todo” table.

Now the “blob” table will be full of HTML (and image) documents that can be used in any way that one wants.
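
The steps above can be sketched as a small in-memory loop. This is only an illustration: a deque and a dict stand in for the “todo” and “blob” tables, and `fetch`, `extract_links` and `allowed` are hypothetical callables the caller supplies (the real fetcher would be async and the store would be Riak or Mongo).

```python
from collections import deque


def crawl(seed_url, fetch, extract_links, allowed, max_pages=100):
    """Seed, fetch, store, map, repeat.

    fetch(url) returns the page body or None on failure,
    extract_links(url, body) returns an iterable of URLs,
    allowed(url) is the robots.txt check.
    """
    todo = deque([seed_url])   # stands in for the "todo" table
    queued = {seed_url}        # every URL ever put on the todo list
    blobs = {}                 # stands in for the "blob" table: url -> body

    while todo and len(blobs) < max_pages:
        url = todo.popleft()
        if not allowed(url):
            continue
        body = fetch(url)
        if body is None:
            continue
        blobs[url] = body
        # Map step: queue only the links we have not seen before.
        for link in extract_links(url, body):
            if link not in queued:
                queued.add(link)
                todo.append(link)
    return blobs
```

With a single seed the queue starts almost empty, which is why the first iterations are slow; once it fills, there is always something to fetch.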

Blob Table –

  • URL
  • Time Fetched
  • Body MD5
  • Headers
  • Body

Schedule Table –

  • URL
  • Time Inserted
  • Reference Count (e.g. PageRank)
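
For reference, the two tables could be written down as plain Python dataclasses. The field names follow the lists above; the types, the defaults, and the class names `Blob` and `ScheduleEntry` are assumptions for the sketch, not a settled schema.

```python
import hashlib
import time
from dataclasses import dataclass, field


@dataclass
class Blob:
    """One row of the blob table: a fetched document."""
    url: str
    headers: dict
    body: bytes
    time_fetched: float = field(default_factory=time.time)
    body_md5: str = ""

    def __post_init__(self):
        # Derive the checksum from the body unless one was supplied.
        if not self.body_md5:
            self.body_md5 = hashlib.md5(self.body).hexdigest()


@dataclass
class ScheduleEntry:
    """One row of the schedule table: a URL waiting to be fetched."""
    url: str
    time_inserted: float = field(default_factory=time.time)
    reference_count: int = 1  # how many pages have linked here so far
```

Storing the body MD5 makes it cheap to notice when a re-fetched page hasn't changed, or when two URLs serve the same content.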

