I’ve been playing with clustering my email, just a sample set of 300 or so messages. It’s been a while since I’ve done any “NLP” work and it’s really quite fun.
As the dimensionality of the space increases, everything starts to sit at the origin:
- initially you might have 120 unique words in an email message, some of which are repeated multiple times (e.g. “the”, “linkedin”, …)
- but a second email message might have a different set of 120 words, with an overlap of 40.
- hang on a second, most of those 40 words are stop words or words that have a high overall corpus frequency.
- Solutions: stemming and eliminating stop words, or more…
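To make the stop-word problem concrete, here is a minimal sketch in plain Python of the cleanup described above: drop stop words, stem, weight what is left with TF-IDF, and compare messages with cosine similarity. Everything in it is an assumption for illustration: the tiny `STOP_WORDS` list, the `crude_stem` suffix stripper (a stand-in for a real stemmer such as Porter's), the `tf * log(N / df)` weighting, and the three made-up sample messages. It is not the code I actually ran on my email.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "on",
              "for", "you", "your", "this", "that", "with", "at", "be"}


def crude_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase, keep runs of letters, drop stop words, then stem."""
    words = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(w) for w in words if w not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document, tf * log(N / df)."""
    token_lists = [tokenize(d) for d in docs]
    n_docs = len(token_lists)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(t for tokens in token_lists for t in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append({t: c * math.log(n_docs / doc_freq[t])
                        for t, c in counts.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up sample messages, purely for illustration.
emails = [
    "You have a new connection request on LinkedIn",
    "LinkedIn: your connection accepted the request",
    "The quarterly budget meeting is moved to Friday",
]
vecs = tfidf_vectors(emails)
print("0 vs 1:", round(cosine(vecs[0], vecs[1]), 3))
print("0 vs 2:", round(cosine(vecs[0], vecs[2]), 3))
```

With the stop words gone, the two LinkedIn-style messages still share a few weighted terms, while the first and third share none and score 0. Note that a term appearing in every document gets a weight of `log(1) = 0` under this weighting, which is the same "high corpus frequency" effect from the list above handled by the weighting itself.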
Two useful papers…
- Using TF-IDF Anomalies to Cluster Documents on Subject Matter
- Simultaneous Categorization of Text Documents and Identification of Cluster-dependent Keywords
Totally useful resource for clustering and text categorization: