Clustering …

I’ve been playing with clustering my email, just a sample set of 300 or so messages.  It’s been a while since I’ve done any “NLP” work and it’s really quite fun.

Some learnings:

As the dimensionality of the space increases, everything starts to sit at the origin:

  • initially you might have 120 unique words in an email message, some of which are repeated multiple times (e.g. “the”, “linkedin”, …)
  • but a second email message might have a different set of 120 words, with an overlap of 40.
  • hang on a second: most of those 40 words are stop words, or words whose overall corpus frequency is quite high.
  • Solutions: stemming and eliminating stop words, or more…
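
To make the stop-word point concrete, here is a minimal sketch, not the code I actually ran. The two sample messages, the tiny STOP_WORDS list and the helper names (tokens, content_tokens, overlap) are all made up for illustration, and I'm using Jaccard overlap as one convenient way to measure how many words two messages share. Stemming is left out to keep it short.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is",
              "for", "on", "your", "you", "has", "at"}

def tokens(text):
    """Lowercase the text and return its set of unique alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def content_tokens(text):
    """Same as tokens(), but with the stop words removed."""
    return tokens(text) - STOP_WORDS

def overlap(a, b):
    """Jaccard overlap: shared words divided by all distinct words."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Two made-up messages about unrelated things.
msg1 = "The invitation to connect on LinkedIn is waiting for you"
msg2 = "The invoice for your order is attached to the email"

raw = overlap(tokens(msg1), tokens(msg2))
filtered = overlap(content_tokens(msg1), content_tokens(msg2))

print(f"overlap with stop words:    {raw:.2f}")
print(f"overlap without stop words: {filtered:.2f}")
```

With stop words left in, the two messages look somewhat alike purely because of words like "the", "to", "is" and "for". Once those are dropped, the apparent similarity disappears, which is roughly what I was seeing on the real sample.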

Two useful papers…

Totally useful resource for clustering and text categorization:

http://www.softlab.ntua.gr/facilities/public/AD/Text%20Categorization/