Sentiment analysis on Twitter

I continue to poke away at Twitter data and what it means.  One of the things I think about, because everybody else is doing it, is sentiment analysis.

What’s interesting is that this posting reminded me of a Facebook “problem”: somebody posts a note on Facebook like “Broke my arm” and you end up clicking “Like” if you want to follow the conversation around it, but of course you don’t really like it.  Here’s an interesting case from the Twitter data.

Kristin has of course indicated that this is “unhappy”, and her followers have concurred.  In this case I would consider it agreement with the original sentiment, rather than disagreement…
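A rough sketch of that reading, in Python: an endorsement (a “Like”, a favorite, a retweet) is counted as agreement with the sentiment the author declared, not as a positive signal of its own. The function name and the labels are made up for illustration; nothing here comes from the Twitter or Facebook APIs.

```python
from collections import Counter


def tally_sentiment(author_label, endorsement_count):
    """Count the author and every endorser under the author's own label.

    author_label is whatever the poster declared ("unhappy", "happy", ...).
    An endorsement is read as "I agree", not as "I like this".
    """
    tally = Counter()
    tally[author_label] += 1 + endorsement_count
    return tally


# "Broke my arm", marked unhappy by the author, with three "Likes".
result = tally_sentiment("unhappy", 3)
print(result["unhappy"])  # 4: the author plus three people who agree
print(result["happy"])    # 0: the "Likes" did not count as happiness
```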

[Screen capture]


Los Altos has changed its name – according to Google

Los Altos has changed its name to Morgan Hill…



Interview Puzzle System

Just finished the “greplin challenge” ( http://challenge.greplin.com/ ).  Sure, it was a good distraction, but it has gotten me thinking about idea #79 for a weekend project: build a system that hosts programming challenge questions for companies, has a basic interface for validating inputs/outputs, and submits results.

Users register and get scoreboards of completed questions, etc.

Companies can review candidates and potentially see what other problems they’ve solved (right answer on the Company Q challenge)…   Oooh valuable…

Growth opportunities: maybe some quizzes, like elance, to rate your skills.

Anything else?  Thrift server as the basis of the API for some challenge questions.
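To make the idea a bit more concrete, here is a minimal sketch of the validation and scoreboard pieces in Python. Everything in it is hypothetical: the class names, the sample company and the choice to store only a hash of the expected output are assumptions for illustration, not a design that exists anywhere.

```python
import hashlib


def _digest(text):
    # Store only a hash so the expected answer never sits around in plain text.
    return hashlib.sha256(text.strip().encode("utf-8")).hexdigest()


class Challenge(object):
    def __init__(self, company, name, expected_output):
        self.company = company
        self.name = name
        self.expected_digest = _digest(expected_output)

    def check(self, submitted_output):
        """True if the submitted output matches, ignoring outer whitespace."""
        return _digest(submitted_output) == self.expected_digest


class Scoreboard(object):
    def __init__(self):
        self.solved = {}  # user -> set of (company, challenge name)

    def submit(self, user, challenge, submitted_output):
        ok = challenge.check(submitted_output)
        if ok:
            key = (challenge.company, challenge.name)
            self.solved.setdefault(user, set()).add(key)
        return ok

    def score(self, user):
        return len(self.solved.get(user, ()))


board = Scoreboard()
level1 = Challenge("ExampleCo", "level-1", "racecar")
print(board.submit("alice", level1, "racecar\n"))  # True, trailing newline ignored
print(board.submit("bob", level1, "kayak"))        # False
print(board.score("alice"), board.score("bob"))    # 1 0
```

A company-side view would then just be a query over the solved sets, which is where the “what else have they solved” value comes from.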

Human readable base conversion

Code review time… In a conversation about URL shorteners and “Coke Rewards” I realized that there was a case where I needed to generate character strings that human beings could reliably type back in. The typical Base62 systems make this hard: O, o and 0 are ambiguous, and upper vs. lower case adds even more room for error.

Here’s the quick module I put together: a base converter whose default digit set is safe for humans to read and re-enter.

import types

class BaseConverter(object):
    """ Convert a number between two bases of digits, by default it's a human safe set 

    >>> v = BaseConverter(BaseConverter.BASE10)
    >>> v.to_decimal(22)
    22
    >>> v.from_decimal(22)
    '22'

    >>> v = BaseConverter(BaseConverter.BASE2)
    >>> v.to_decimal(22)
    Traceback (most recent call last):
        ...
    ValueError: character '2' not in base
    >>> v.to_decimal(10)
    2
    >>> v.to_decimal('10')
    2
    >>> v.from_decimal(22)
    '10110'

    >>> v = BaseConverter()
    >>> v.to_decimal(22)
    58
    >>> v.from_decimal(123123)
    '5h17'
    >>> v.to_decimal('5H17')
    123123

    >>> v = BaseConverter(BaseConverter.BASE62)
    >>> v.from_decimal(257938572394)
    '4XYBxik'
    >>> v.to_decimal('4XYBxik')
    257938572394

    >>> v = BaseConverter((('Zero ',),('One ',)))
    >>> v.from_decimal(BaseConverter(BaseConverter.BASE2).to_decimal('1101'))
    'One One Zero One '

    """

    HUMAN_TABLE = (
        ('0','O','o','Q','q'),
        ('1','I','i','L','l','J','j'),
        ('2','Z','z'),
        ('3',),
        ('4',),
        ('5','S','s'),
        ('6',),
        ('7',),
        ('8',),
        ('9',),
        ('a','A',),
        ('b','B',),
        ('c','C',),
        ('d','D',),
        ('e','E',),
        ('f','F',),
        ('g','G',),
        ('h','H',),
        ('k','K',),
        ('m','M',),
        ('n','N',),
        ('p','P',),
        ('r','R',),
        ('t','T',),
        ('u','U','V','v'),
        ('w','W',),
        ('x','X',),
        ('y','Y',),
    )

    BASE2  = "01"
    BASE10 = "0123456789"
    BASE62 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
    BASE16 = (
        ('0',),
        ('1',),
        ('2',),
        ('3',),
        ('4',),
        ('5',),
        ('6',),
        ('7',),
        ('8',),
        ('9',),
        ('A','a',),
        ('B','b',),
        ('C','c',),
        ('D','d',),
        ('E','e',),
        ('F','f',),
    )

    def __init__(self, digitset=HUMAN_TABLE):
        if isinstance(digitset, str):
            # A plain string is shorthand for one single-character tuple per digit.
            self.digitset = [(v,) for v in digitset]
        else :
            self.digitset = digitset

        self.base = len(self.digitset)
        self.output_map = {}

        self.output_digits = [v[0] for v in self.digitset]
        self.input_set = {}
        for idx, l in enumerate(self.digitset) :
            for k in l :
                self.input_set[k] = idx

        #print 'OUT DIGITS', self.output_digits
        #print 'INPUT SET', self.input_set

    def from_decimal(self, i):
        return self.convert(i, self.BASE10, self.output_digits)

    def to_decimal(self, s):
        return int(self.convert(s, self.input_set, self.BASE10))

    def convert(self, number, fromdigits, todigits) :
        fd    = fromdigits
        fbase = self.base
        if isinstance(fromdigits, str):
            # A string of digits: its length is the base, its positions the values.
            fbase = len(fromdigits)
            fd    = dict((ch, idx) for idx, ch in enumerate(fromdigits))

        return self._convert(number, fbase, fd, todigits)

    @staticmethod
    def _convert(number, fbase, fromdigits, todigits) :
        # Based on http://code.activestate.com/recipes/111286/
        number = str(number)

        if number[0] == '-':
            number = number[1:]
            neg = 1
        else:
            neg = 0

        # make an integer out of the number
        x     = 0
        #print "fbase = ", len(fromdigits)
        for digit in number :
            try :
                x = x * fbase + fromdigits[digit]
            except KeyError:
                raise ValueError("character '%s' not in base" % digit)

        # create the result in base 'len(todigits)'
        tbase = len(todigits)
        if x == 0:
            res = todigits[0]
        else:
            res = ""
            while x > 0:
                #print "divmod(%d, %d) = %r" % (x, tbase, divmod(x,tbase))
                x, digit = divmod(x, tbase)
                res = todigits[digit] + res
            if neg:
                res = '-' + res
        return res

binary   = BaseConverter(BaseConverter.BASE2)
hex      = BaseConverter(BaseConverter.BASE16)
base62   = BaseConverter(BaseConverter.BASE62)
human    = BaseConverter()

if __name__ == '__main__' :
    import doctest
    import random
    doctest.testmod()
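The core trick, stripped down to a standalone Python 3 sketch: every digit has one canonical output character and any number of accepted input aliases, so a misread code still decodes to the same number. The table is copied from HUMAN_TABLE above; the names encode and decode exist only for this illustration and are not part of the module.

```python
# Each string is one digit: first character is written out, all are accepted.
HUMAN_TABLE = (
    "0OoQq", "1IiLlJj", "2Zz", "3", "4", "5Ss", "6", "7", "8", "9",
    "aA", "bB", "cC", "dD", "eE", "fF", "gG", "hH", "kK", "mM",
    "nN", "pP", "rR", "tT", "uUVv", "wW", "xX", "yY",
)

OUTPUT = [aliases[0] for aliases in HUMAN_TABLE]
INPUT = {ch: value
         for value, aliases in enumerate(HUMAN_TABLE)
         for ch in aliases}
BASE = len(HUMAN_TABLE)  # 28 digits


def encode(n):
    """Non-negative integer -> canonical human-safe string."""
    if n < 0:
        raise ValueError("only non-negative integers are handled here")
    if n == 0:
        return OUTPUT[0]
    digits = []
    while n > 0:
        n, d = divmod(n, BASE)
        digits.append(OUTPUT[d])
    return "".join(reversed(digits))


def decode(text):
    """Human-typed string (aliases allowed) -> integer."""
    n = 0
    for ch in text:
        if ch not in INPUT:
            raise ValueError("character %r not in base" % ch)
        n = n * BASE + INPUT[ch]
    return n


print(encode(123123))  # 5h17
print(decode("5H17"))  # 123123
print(decode("SHI7"))  # 123123, with S read as 5 and I read as 1
```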

Grow the Pie – don’t slice it smaller

If you’ve hung around me long enough you’ve probably heard me say “grow the pie”, and you’ve probably wondered what in the world I’m talking about.  The basic premise is that if you have a business, don’t keep slicing the market smaller and smaller.  The real goal of a business is to grow the marketplace.

Simple example…  My random website project of the week…

I’ve had a site parked for a long time generating $$, but wanted to throw a new “idea” at the world.  The challenge is that the new idea will of course cannibalize all of the $$ until it gets big enough to be useful.  So, realizing that 90% of the traffic to the domain comes from a foreign country, I’ve created a “local” experience: the old traffic gets the same experience as before, and the audience that I understand gets the new one.  This is effectively growing the pie: I’ve got everything I used to have plus everything that is new.
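The split itself can be tiny. Here is a sketch in Python, with loud assumptions: the country code is taken as already known (how it is looked up is not covered here), "XX" is a placeholder because the country is not named in this post, and sending visitors with an unknown country to the old experience is my guess at the revenue-safe default.

```python
# Placeholder: the country that supplies most of the existing traffic.
LEGACY_COUNTRIES = {"XX"}


def choose_experience(country_code):
    """Return "legacy" for the existing audience and "new" for everyone else."""
    if country_code is None:
        # Unknown visitors keep the old experience so parking revenue is not put at risk.
        return "legacy"
    if country_code.upper() in LEGACY_COUNTRIES:
        return "legacy"
    return "new"


print(choose_experience("xx"))  # legacy
print(choose_experience("US"))  # new
print(choose_experience(None))  # legacy
```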

Async life and twitter

The project of the week is something that I’ve been putting off for a very long time: getting something running on Extra that’s more than just a nothing site.  Part of the problem is that it’s a good domain name that I’ve had parked for a very long time, and it makes real $$ in parking revenue, which I would rather not endanger.

FYI — The real purpose of this post is to document the code fragment at the bottom…  Though for those other readers, take a look at how I’ve played around with some attributes of twitter feeds on extra.com [discussion threading and tagging].

One of the many ideas that I’ve had is to build a celeb-following website.  After building Notewave last week as a demonstration that an async chat-style site is possible, my thoughts got bigger, so here’s what I needed:

  • Tornado for the async webserver; I had already built the Django-to-Tornado connector.
  • A Twitter stream reader.  I have a few lying around, all built on the Twisted framework, but I want to get out of the NIH habit, so I ended up using this twitter+twisted project on GitHub.
  • We’ll skip over all of the OAuth pain for some other twitter usages.

The original implementation had one process reading the Twitter stream and then doing an HTTP POST to the webserver to notify it that a post had arrived from Twitter.  This was nice, but I was now getting reports of 700K web requests, which made my logs big and made it just about impossible to figure out if anybody was using the service (ok, yes, Google Analytics is there).  So, this morning’s 5am project was to get AMQP back into the running.
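For the publishing side, here is a minimal sketch of what replaces the per-tweet HTTP POST. The exchange name celeb.tweet_ids and the {"id": ...} JSON body match what the Tornado listener further down expects. The rest is assumption: the real reader runs under Twisted with txAMQP, while this sketch uses pika’s blocking connection to keep it short, and the fanout exchange type is a guess.

```python
import json

EXCHANGE = "celeb.tweet_ids"  # same exchange the listener binds its queue to


def encode_tweet(tweet_id):
    # The listener does json.loads(body)['id'], so send exactly that shape.
    return json.dumps({"id": tweet_id})


def publish_tweet_id(channel, tweet_id):
    """Publish one tweet id on an already-open AMQP channel."""
    channel.basic_publish(exchange=EXCHANGE,
                          routing_key="",
                          body=encode_tweet(tweet_id))


def main():
    # Imported here so the helpers above can be used without pika installed.
    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    # Assumption: a fanout exchange, so every bound queue sees every tweet id.
    channel.exchange_declare(exchange=EXCHANGE, exchange_type="fanout")
    publish_tweet_id(channel, 12345)
    connection.close()
```

Calling main() needs a RabbitMQ broker on localhost; the two helper functions work on their own.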

I’ve used AMQP before: I wrote a full-scale web crawler that had a few components that used AMQP as the message system (it was actually AMQP + Thrift).  It worked, and message passing systems are really very sweet to work with.  The challenge here is that I had two different async frameworks (Twisted and Tornado) that I needed to get AMQP integrated with.

The Twisted one was pretty easy: there’s txAMQP, which is “ok”, and I’ve got a wrapper around it from my web crawler that actually makes it easy to use.  The Tornado one was a bit more difficult: there is an async Python AMQP implementation, but it didn’t support Tornado as the server.  So, off to dig around through mailing lists and other sources.

Finally found what I wanted: an unofficial AMQP+Tornado+Pika port…  This worked great, except the documentation is so lacking!  Which really brings us to the point of this posting…  The quick integration for this project.

# Tornado listener
from pika.tornado_adapter import TornadoConnection
import pika
import json

class Handler(object) :
    def __init__(self, amqp=None, channel=None) :
        self.amqp    = amqp
        self.channel = channel

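    def startup(self) :
        # Runs as the connection callback, so the queue is only declared and
        # bound once the connection is actually up.
        channel = self.amqp.channel()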
        channel.queue_declare(queue='extra_ui', durable=False, exclusive=False, auto_delete=True)
        channel.queue_bind(queue='extra_ui', exchange='celeb.tweet_ids')

        channel.basic_consume(self.recv, 'extra_ui')

    def recv(self, channel, method, header, body) :
        v = json.loads(body)

        from views.chat import post_notify
        post_notify(v['id'])

        channel.basic_ack(delivery_tag=method.delivery_tag)

def init() :
    handler = Handler()

    amqp = TornadoConnection(pika.ConnectionParameters('localhost',
                                                       heartbeat = 10,
                                                       credentials = pika.PlainCredentials('guest', 'guest')),
                             callback=handler.startup)

    handler.amqp = amqp

Doing my quick blog-post code review says that I could have done things much better: move the connection establishment into the Handler class, etc.  But at the time I was more interested in getting things working, at what was by now 6:30am…  The big thing I found was that the queue bits needed to be in the callback after the connection was established; otherwise RabbitMQ dropped things on the floor for out-of-order reasons.
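A rough sketch of that cleanup, with the connection setup moved into the Handler and the queue work still done only in the open callback. The channel calls are copied from the listing above, so they follow the pika port used there and may not match other pika versions. The open_connection and on_tweet parameters are my own additions for illustration, and letting startup accept an optional connection argument is an assumption about how the callback gets invoked.

```python
import json


class Handler(object):
    QUEUE = "extra_ui"
    EXCHANGE = "celeb.tweet_ids"

    def __init__(self, on_tweet):
        self.on_tweet = on_tweet  # called with each tweet id that arrives
        self.amqp = None
        self.channel = None

    def connect(self, open_connection):
        # open_connection(callback) returns the connection object and arranges
        # for callback to run once the connection is established.
        self.amqp = open_connection(self.startup)

    def startup(self, connection=None):
        # Queue setup happens here, after the connection is up, never before.
        amqp = connection if connection is not None else self.amqp
        self.channel = amqp.channel()
        self.channel.queue_declare(queue=self.QUEUE, durable=False,
                                   exclusive=False, auto_delete=True)
        self.channel.queue_bind(queue=self.QUEUE, exchange=self.EXCHANGE)
        self.channel.basic_consume(self.recv, self.QUEUE)

    def recv(self, channel, method, header, body):
        self.on_tweet(json.loads(body)["id"])
        channel.basic_ack(delivery_tag=method.delivery_tag)
```

With the port from the listing above, open_connection would be something along the lines of a small function that builds the TornadoConnection and passes the callback through as callback=..., and on_tweet would be post_notify.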
