Tagged with nlp

Sentiment analysis on Twitter

I continue to poke away at Twitter data and what it means.  One of the things I think about, because everybody else is doing it, is sentiment analysis.

What’s interesting is that this posting reminded me of a Facebook “problem”: somebody posts a note on Facebook like “Broke my arm” and you end up clicking “Like” if you want to follow the conversation around it, but of course you don’t really like it.  Here’s an interesting case from the Twitter data.

Kristin has of course indicated that this is “unhappy”, and her followers have concurred.  In this case I would consider it agreement with the original sentiment, rather than disagreement…

[Screen capture]


Content aggregation vs. Human Editors

This idea came up in a conversation the other day with somebody… The crux of it is that many startups, toy websites, and research projects have been created over the years to aggregate blogs, tweets, and lives into a single stream: everything from Facebook, FriendFeed, Google Reader, Netvibes, etc. Hey, even I wrote one (feedini.com; it’s probably not running at the moment).

The problem is that none of these experiences can compare to a newspaper in a few ways:

  • Focus
    We’ve all read and re-read the same story over 37 different blogs or news outlets, which gets to be a tedious pain.
  • Breadth
    If I’m interested in “food” I’m interested in food; reading a single blog isn’t going to make me feel like I’m reading the food section of the newspaper.
  • Diversity
    I don’t read the sports section, but every now and then I find that because it’s there I see something that’s worth reading.

This got me thinking: everybody has been trying to make an NLP/machine learning system that creates the “right” content for you.  What happens if that’s the wrong idea?

My quick proposal is that somebody should create an editor’s desk.  Think of it as Bloglines meets About.com: I can create a personal newspaper that I can focus and drive any way I want, create editorial content or republish content quickly and easily, etc., with the added benefit that I could format it like a news site (rather than a blog).

This way my readership could look at a collection of stories that I was publishing, or technically republishing (think AP wire).  I could establish the editorial voice of the site and have conversations/discussions that were close to the readership (think Hacker News).  Then make sure there is a solid set of editor tools. This is where my inner geek gets off: this is where NLP/machine learning could help an editor focus in on important content to republish on their site.

Thus if you wanted to make a food portal, you would create “food.example.com”, select a layout from the inventory (à la Bloglines templates), then feed it a collection of interesting feeds, and it would suggest a bunch more that were similar… Voilà, you’re now 80% of the way to making an editorial portal.

Maybe it’s just democratizing Huffington Post…


Clustering …

I’ve been playing with clustering my email, just a sample set of 300 or so messages.  It’s been a while since I’ve done any “NLP” work and it’s really quite fun.

Some learnings:

As the dimensionality of the space increases everything starts to sit at the origin:

  • initially you might have 120 unique words in an email message, some of which are repeated multiple times (e.g. “the”, “linkedin”, …)
  • but a second email message might have a different set of 120 words, with an overlap of 40.
  • hang on a second: most of those 40 words are stop words or words that have quite a high overall corpus frequency.
  • Solutions: stemming and eliminating stop words, or more…
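
To make the overlap problem concrete, here is a rough sketch in Python (not the code I actually ran on my email). The two messages, the tiny stop list, and the `crude_stem` helper are all made up for illustration; a real setup would use a proper stop list and a real stemmer such as Porter's. It compares the cosine similarity of two bag-of-words vectors before and after filtering.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
              "for", "on", "you", "your", "this", "that", "with", "from"}


def crude_stem(word):
    """Strip a few common suffixes. A stand-in for a real stemmer, not Porter's."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text, filtered=True):
    """Lowercase and split into words; optionally drop stop words and stem."""
    words = re.findall(r"[a-z]+", text.lower())
    if not filtered:
        return words
    return [crude_stem(w) for w in words if w not in STOP_WORDS]


def cosine(a, b):
    """Cosine similarity between two term-count vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# Two unrelated (invented) messages that still share plenty of stop words.
msg_a = "The invitation from LinkedIn is waiting for you in the inbox"
msg_b = "The meeting notes from this week are in the shared folder for you"

raw = cosine(Counter(tokenize(msg_a, filtered=False)),
             Counter(tokenize(msg_b, filtered=False)))
clean = cosine(Counter(tokenize(msg_a)), Counter(tokenize(msg_b)))

print(f"raw similarity:      {raw:.2f}")
print(f"filtered similarity: {clean:.2f}")
```

On these two toy messages the raw similarity comes entirely from words like “the”, “from”, and “for”; once those are dropped the messages share nothing, which is what you would hope for two unrelated emails.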

Two useful papers…

Totally useful resource for clustering and text categorization:

http://www.softlab.ntua.gr/facilities/public/AD/Text%20Categorization/
