Performance of Python, PHP and Perl

I had a 7GB text file that I needed to parse (to prepare for a DB import). Out of habit I pulled out Perl and whipped up a quick program to parse it and generate some loadable files. While watching it run I got to thinking: why Perl? (Yes, I know habits are hard to break.) So while it ran, I rewrote the program in PHP and Python.

Performance Numbers (on 5 million lines worth of the file)

$ time ./split.pl p.test            # Perl 5.8.8

real    0m38.577s
user    0m33.554s
sys     0m0.848s

$ time ./split.py p.test            # Python 2.4.4

real    0m44.895s
user    0m42.975s
sys     0m0.900s

$ time php split.php p.test         # PHP 5.2.6RC4

real    1m10.887s
user    0m51.251s
sys     0m18.677s
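
For a quick sense of the relative cost, the timings above can be turned into ratios against the Perl run. This is only a small sketch: the helper names (`to_seconds`, `relative_to`) are made up for illustration, and the numbers are copied by hand from the output above rather than measured again.

```python
import re

# Timings copied by hand from the `time` output in this post.
TIMINGS = {
    "perl":   {"real": "0m38.577s", "user": "0m33.554s", "sys": "0m0.848s"},
    "python": {"real": "0m44.895s", "user": "0m42.975s", "sys": "0m0.900s"},
    "php":    {"real": "1m10.887s", "user": "0m51.251s", "sys": "0m18.677s"},
}


def to_seconds(value):
    """Convert a `time`-style value such as '1m10.887s' to seconds."""
    match = re.match(r"^(\d+)m(\d+(?:\.\d+)?)s$", value)
    if match is None:
        raise ValueError("unexpected time format: %r" % value)
    return int(match.group(1)) * 60 + float(match.group(2))


def relative_to(baseline, field):
    """Return each language's time for `field` as a multiple of the baseline's."""
    base = to_seconds(TIMINGS[baseline][field])
    return {lang: to_seconds(t[field]) / base for lang, t in TIMINGS.items()}


if __name__ == "__main__":
    for field in ("real", "user", "sys"):
        ratios = relative_to("perl", field)
        summary = ", ".join("%s=%.2fx" % (lang, r) for lang, r in sorted(ratios.items()))
        print(field, summary)
```

By this measure PHP takes about 1.84x Perl's wall-clock time and about 1.53x its user time, and a large share of PHP's extra wall-clock time shows up as sys time rather than user time.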

So it appears that Perl is the right choice for this job. Python is a good second choice, but PHP is roughly 50% slower than Perl in user time and over 80% slower in wall-clock time (most likely due to not having compiled regular expressions). I might also note that I’m not fond of the Python if/else problem with a chained expression match, where I want to “side effect” out the results of the match. Is there a better syntax?

Here’s the code for your viewing pleasure and possible commentary.

perl

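#!/usr/bin/perl
use strict;
use warnings;

# Running total of counts, keyed by first token.
my %first;

open(my $full, '>', 'full.txt') or die "Cannot open full.txt: $!";

while (<>) {
    # Sample input lines:
    # __SINGLE_TOKEN__ adrianenamorado                 1
    # __MULTI_TOKEN__ a aaron yalow        1
    chomp;    # strips only the newline; chop would remove any last character
    # Lazy (.*?) so that (\d+) captures the whole count, not just its last digit.
    if (/^__MULTI_TOKEN__\s+(\S+)\s+(.*?)\t?\s*(\d+)\s*$/) {
        $first{$1} += $3;
        print {$full} $1, " ", $2, "\t", $3, "\n";
    } elsif (/^__SINGLE_TOKEN__\s+(\S+)\s*\t?\s*(\d+)\s*$/) {
        $first{$1} += $2;
    } else {
        print "Unknown: ", $_, "\n";
    }
}

close($full);

open(my $out, '>', 'first.txt') or die "Cannot open first.txt: $!";
while (my ($k, $c) = each %first) {
    print {$out} $k, "\t", $c, "\n";
}
close($out);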

python

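#!/usr/bin/env python
# Written for Python 2 (print statement, dict.iteritems).
import sys, re

# Running total of counts, keyed by first token.
first = dict()

ofd = open("full.txt", 'w')

# Raw strings keep the backslashes intact; the lazy (.*?) lets (\d+)
# capture the whole count rather than only its last digit.
mre = re.compile(r'^__MULTI_TOKEN__\s+(\S+)\s+(.*?)\t?\s*(\d+)\s*$')
sre = re.compile(r'^__SINGLE_TOKEN__\s+(\S+)\s*\t?\s*(\d+)\s*$')

ifd = open(sys.argv[1], 'r')

for line in ifd:
    line = line.strip()
    m = mre.match(line)
    if m:
        # Accumulate as an integer, as the Perl and PHP versions do.
        first[m.group(1)] = first.get(m.group(1), 0) + int(m.group(3))
        # write() avoids the extra spaces that print inserts between items.
        ofd.write("%s %s\t%s\n" % (m.group(1), m.group(2), m.group(3)))
    else:
        m = sre.match(line)
        if m:
            first[m.group(1)] = first.get(m.group(1), 0) + int(m.group(2))
        else:
            print "Unknown: %s" % line

ifd.close()
ofd.close()

ofd = open("first.txt", 'w')
for (k, c) in first.iteritems():
    ofd.write("%s\t%d\n" % (k, c))
ofd.close()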
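
On the "is there better syntax?" question: one way to avoid the nested if/else is to keep the patterns in a list of (pattern, handler) pairs and loop over them, taking the first match. The sketch below is only an illustration of that idea, not a drop-in replacement for the script that was timed. It assumes a modern Python 3, the names `HANDLERS`, `handle_multi`, `handle_single` and `process` are made up here, and the patterns are modeled on the ones in this post but with a lazy `(.*?)` so that multi-digit counts are captured whole.

```python
import re
from collections import defaultdict

# Patterns modeled on the ones in this post, written as raw strings.
# Assumption: a lazy (.*?) is used so (\d+) captures the whole count.
MULTI = re.compile(r'^__MULTI_TOKEN__\s+(\S+)\s+(.*?)\t?\s*(\d+)\s*$')
SINGLE = re.compile(r'^__SINGLE_TOKEN__\s+(\S+)\s*\t?\s*(\d+)\s*$')


def handle_multi(m, first, full):
    # Accumulate the count and keep the full record for full.txt.
    first[m.group(1)] += int(m.group(3))
    full.append("%s %s\t%s" % (m.group(1), m.group(2), m.group(3)))


def handle_single(m, first, full):
    # Single-token lines only contribute to the per-token totals.
    first[m.group(1)] += int(m.group(2))


# Hypothetical dispatch table: the first pattern that matches wins.
HANDLERS = [(MULTI, handle_multi), (SINGLE, handle_single)]


def process(lines):
    """Return (totals, full records, unrecognised lines) for an iterable of lines."""
    first = defaultdict(int)
    full = []
    unknown = []
    for line in lines:
        line = line.strip()
        for pattern, handler in HANDLERS:
            m = pattern.match(line)
            if m:
                handler(m, first, full)
                break
        else:
            # The inner loop finished without a break: nothing matched.
            unknown.append(line)
    return dict(first), full, unknown
```

Adding a third line type then means adding one pattern and one handler rather than another level of nesting. On Python 3.8 or later, an assignment expression (`if (m := MULTI.match(line)):` followed by `elif (m := SINGLE.match(line)):`) is another option that keeps the chain flat; neither approach was available in the Python 2.4.4 used for the timings here.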

php

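<?php
// Running total of counts, keyed by first token.
$first = array();

$fd = fopen("full.txt", 'w');
$in = fopen($argv[1], 'r');

// Compare against false so a final line containing only "0" does not end the loop early.
while (($line = fgets($in)) !== false) {
    $line = trim($line);
    // Lazy (.*?) so that (\d+) captures the whole count, not just its last digit.
    if (preg_match('/^__MULTI_TOKEN__\s+(\S+)\s+(.*?)\t?\s*(\d+)\s*$/', $line, $m)) {
        // Initialise the key on first sight to avoid an undefined-index notice.
        if (!isset($first[$m[1]])) { $first[$m[1]] = 0; }
        $first[$m[1]] += $m[3];
        fprintf($fd, "%s %s\t%d\n", $m[1], $m[2], $m[3]);
    } else if (preg_match('/^__SINGLE_TOKEN__\s+(\S+)\s*\t?\s*(\d+)\s*$/', $line, $m)) {
        if (!isset($first[$m[1]])) { $first[$m[1]] = 0; }
        $first[$m[1]] += $m[2];
    } else {
        print "Unknown: {$line}\n";
    }
}

fclose($in);
fclose($fd);

$fd = fopen("first.txt", 'w');
foreach ($first as $k => $c) {
    fprintf($fd, "%s\t%d\n", $k, $c);
}
fclose($fd);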