Posted in March 2008

Love/Hate with python

Hate: some of the modules. Gadzooks, people, get a clue:

  • html2text uses the SGML parser, so “normal” HTML will fail

After fixing it up to use HTMLParser, it still sucks! HTMLParser is built using regular expressions, and guess what… bad HTML still fails.
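For reference, here is a minimal sketch of the HTMLParser approach to pulling text out of markup. It is not html2text's code and not the fix-up described above; it assumes Python 3, where the module lives at html.parser, and the names TextExtractor and html_to_text are made up for illustration. It only collects text nodes, so it will not rescue every piece of broken markup.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: keeps text nodes, skips <script> and <style> bodies."""

    SKIP = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True decodes entities such as &nbsp; and &amp; for us.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is not inside a skipped element.
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered after the last tag
    # Collapse whitespace runs; str.split() also splits on non-breaking spaces.
    return " ".join("".join(parser.parts).split())


if __name__ == "__main__":
    print(html_to_text("<p>Hello <b>world</b></p><script>var x = 1;</script>"))
```

Unclosed tags are simply ignored here, which is the point: the extractor never needs the document to be well formed, it just needs the parser to keep handing over text.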

Love: I had an HTML=>Text converter lying around in C++, and it took all of 30 minutes to wire it into Python.  It works…

Iterative programming … and clustering

The good part about writing everything in a simple language is that it makes iterative programming easy…  Time to fix and other such fun issues would be a pain in the butt if I were doing all of my Clustering 101 development work in C++, though it might reduce the run time.

That said, I finally got a full run in last night. It took 5 hours (ah, the joys of sleep) and I discovered that I’d accomplished … something!  Oh, but wait: everything was high-frequency terms, no real surprise.   Now that I’ve got a working system, it’s time to really focus on making meaningful token vectors as the inputs.
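One common way to keep high-frequency terms from dominating token vectors is TF-IDF weighting. The sketch below is an illustration of that idea, not what the clustering run above used; the function name tfidf_vectors and the toy documents are assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs is a list of token lists; returns one {term: weight} dict per doc."""
    n = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        # A term that appears in every document gets log(n / n) = 0.
        vectors.append({t: count * math.log(n / df[t]) for t, count in tf.items()})
    return vectors


if __name__ == "__main__":
    toy = [["we", "have", "video"], ["we", "have", "google"], ["we", "game", "game"]]
    for vec in tfidf_vectors(toy):
        print(vec)
```

With this weighting, a word like "we" that shows up in every message contributes nothing to the vector, while a word concentrated in a few messages gets the largest weight.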

For your enjoyment, here’s some of the cluster leaders:

  • we, have, about
  • as, was, but
  • use, how, your
  • their, are, can
  • are, google, your
  • video, game, can
  • we, our, as
  • as, an, about
  • are, as, will
  • new, has, company
  • nbsp, are, have  [hmm... my de-HTMLing has a bug]
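A token like "nbsp" usually means an HTML entity (&nbsp;) reached the tokenizer undecoded. That is a guess at the bug, not a diagnosis of the actual converter. A minimal sketch of decoding entities before tokenizing, using the standard library's html.unescape; the function name clean_tokens and the token pattern are assumptions:

```python
import html
import re


def clean_tokens(text):
    # html.unescape turns "&nbsp;" into U+00A0 and "&amp;" into "&",
    # so the entity names never show up as words.
    decoded = html.unescape(text)
    # Keep runs of lowercase letters, digits and apostrophes as tokens.
    return re.findall(r"[a-z0-9']+", decoded.lower())


if __name__ == "__main__":
    print(clean_tokens("are&nbsp;have &amp; will"))
```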

Clustering …

I’ve been playing with clustering my email, just a sample set of 300 or so messages.  It’s been a while since I’ve done any “NLP” work and it’s really quite fun.

Some learnings:

As the dimensionality of the space increases, everything starts to sit at the origin:

  • initially you might have 120 unique words in an email message, some of which are repeated multiple times (e.g. “the”, “linkedin”, …)
  • but a second email message might have a different set of 120 words, with an overlap of 40.
  • hang on a second: most of those 40 words are stop words, or words that have a high overall corpus frequency.
  • Solutions: stemming and eliminating stop words, or more…
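The overlap point in the list above can be sketched in a few lines. This is an illustration with a tiny hand-picked stop list and made-up messages, not the code used on the email sample; the names STOP_WORDS, content_words and jaccard are assumptions, and stemming is left out because it would need a third-party library.

```python
# A tiny, hand-picked stop list for illustration; a real one would be much longer.
STOP_WORDS = frozenset(
    {"the", "a", "an", "and", "of", "to", "is", "in", "we", "are", "as", "have"}
)


def content_words(tokens, stop_words=STOP_WORDS):
    """Return the set of tokens that are not stop words."""
    return {t for t in tokens if t not in stop_words}


def jaccard(a, b):
    """Size of the intersection over size of the union of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


if __name__ == "__main__":
    msg1 = "we have the new clustering code".split()
    msg2 = "we have the quarterly sales report".split()
    print("raw overlap:     ", jaccard(set(msg1), set(msg2)))
    print("filtered overlap:", jaccard(content_words(msg1), content_words(msg2)))
```

In the toy example the two messages share only "we", "have" and "the", so the apparent similarity disappears once the stop words are removed.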

Two useful papers…

Totally useful resource for clustering and text categorization:

http://www.softlab.ntua.gr/facilities/public/AD/Text%20Categorization/


Long days and quiet mornings

Really just a blog post to test my RSS feed… but it’s a simple thought.

Good programmers are lazy programmers

Good programmers enjoy working; they just don’t like re-inventing the wheel.  There is a whole class of programmers who suffer from NIH (not invented here) and who then dutifully re-invent, re-write, and re-build a system.  Some of them are even so “wise” as to re-use standard nomenclature in their re-inventions, so that when they present the XYZ project they can proclaim that they’re using ABC technology.  However, they fail to mention that the ABC technology is not the industry standard but a system of their own design.  This of course means that nobody on earth (outside of one or two engineers) can support it, modify it or change it.

Good programmers look at a problem, figure out how to leverage the most out of a community (open source rocks), and solve the problems that they need to solve.  Most of the time the open source system doesn’t meet 100% of the requirements, or requires a specific way of thinking about the problem.  That is a small cost to pay in terms of overall productivity.

Example: When I started at Wink we had no MVC framework for our PHP infrastructure.   So I took a quick look around and, not seeing anything that was quite a match for our problem, went off and wrote something that was a closer match to our needs.  Time passed; the MVC frameworks got better, gained more features, etc.  Ours of course was getting cruftier and was showing a general lack of thought (it started as a three-day hack).   There was a very painful moment when I said that we needed to move to an open source solution for our MVC bits…  pain …   But after a few days of having people learn a new approach to the problem, productivity increased, and now we have a much more scalable approach to the problem.

Did I suffer from NIH at the inception?  Or can I say that I did a detailed analysis of the alternatives?   Maybe, maybe not.  At the end of the day, I’m happy moving to a standard solution if it meets 90% of my needs (if it met 100% of my needs, I wouldn’t have a job).   What drives me batty is programmers who believe that there is only one way, their way, to solve a problem and who do not embrace the work of the community.