Clustering …

Mar 26, 2008 · 1 min read · clustering nlp ·

I’ve been playing with clustering my email, just a sample set of 300 or so messages. It’s been a while since I’ve done any “NLP” work and it’s really quite fun.

Some learnings:

As the dimensionality of the space increases, everything starts to sit at the origin:

initially you might have 120 unique words in an email message, some of which are repeated multiple times (e.g. “the”, “linkedin”, …)
but a second email message might have a different set of 120 words, with an overlap of 40.
hang on a second: most of those 40 words are stop words or words that have a high overall corpus frequency.
Solutions: stemming and eliminating stop words, or more…
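
To make the stop-word problem concrete, here is a minimal sketch in Python using only the standard library. It is not the code I ran on my email: the two messages, the tiny STOP_WORDS set, and the crude_stem helper are all made up for illustration (a real setup would use a full stop-word list and a proper stemmer such as Porter's). It compares two unrelated messages by cosine similarity, before and after cleanup.

```python
import math
from collections import Counter

# Hypothetical, tiny stop-word list for illustration; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on", "it"}


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (word.strip(".,;:!?\"'()") for word in text.lower().split())
    return [t for t in tokens if t]


def crude_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def term_counts(text, remove_stop_words=True, stem=True):
    """Bag-of-words counts, optionally with stop-word removal and stemming."""
    tokens = tokenize(text)
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if stem:
        tokens = [crude_stem(t) for t in tokens]
    return Counter(tokens)


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Two made-up messages on unrelated subjects.
msg_a = "The invitation to connect on LinkedIn is waiting for the reply"
msg_b = "The report for the meeting is in the folder on the desk"

raw = cosine_similarity(
    term_counts(msg_a, remove_stop_words=False, stem=False),
    term_counts(msg_b, remove_stop_words=False, stem=False),
)
cleaned = cosine_similarity(term_counts(msg_a), term_counts(msg_b))
print(f"raw similarity: {raw:.2f}, after cleanup: {cleaned:.2f}")
```

In this toy case the only words the two messages share are stop words, so the raw score looks deceptively high and drops to zero once they are removed. On real mail the effect is less clean, but the direction should be the same.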

Two useful papers…

Using TF-IDF Anomalies to Cluster Documents on Subject Matter
Simultaneous Categorization of Text Documents and Identification of Cluster-dependent Keywords

Totally useful resource for clustering and text categorization:

http://www.softlab.ntua.gr/facilities/public/AD/Text%20Categorization/