The good part about writing everything in a simple language, it that it makes iterative programming easy… Time to fix and other such fun issues would be a pain in the but if I was doing all of my Clustering 101 development work in C++, though it might reduce the run time down.
That said, finally got a full run in last night, took 5 hours — ah the joys of sleep — and discovered that I’d accomplished … something! Oh, but wait everything was high frequency terms, no real surprise. Now, that I’ve got a working system, it time to really focus in on making meaningful token vectors as the inputs.
For your enjoyment, here’s some of the cluster leaders:
- we, have, about
- as, was, but
- use, how, your
- their, are, can
- are, google, your
- video, game, can
- we, our, as
- as, an, about
- are, as, will
- new, has, company
- nbsp, are, have [hmm... my de-HTMLing has a bug]