
I have a script in Python to process a log file: it parses the values and joins them simply with a tab.

import re
import sys

p = re.compile(
    "([0-9/]+) ([0-9]+):([0-9]+):([0-9]+) I.*" +
    "worker\\(([0-9]+)\\)(?:@([^]]*))?.*\\[([0-9]+)\\] " +
    "=RES= PS:([0-9]+) DW:([0-9]+) RT:([0-9]+) PRT:([0-9]+) IP:([^ ]*) " +
    "JOB:([^!]+)!([0-9]+) CS:([\\.0-9]+) CONV:([^ ]*) URL:[^ ]+ KEY:([^/]+)([^ ]*)"
)

for line in sys.stdin:
  line = line.strip()
  if len(line) == 0: continue
  result = p.match(line)
  if result is not None:
    print "\t".join([x if x is not None else "." for x in result.groups()])

However, the script is quite slow and it takes a long time to process the data.

How can I achieve the same behaviour in a faster way? Perl/sed/PHP/Bash/...?

Thanks

Vojtěch
  • Define 'slowly'. Have you determined it's the regular expression? The filesystem, the network, whatever else is part of the whole? How many lines per time unit does it process? I'm not saying it is not your (complex) regular expression, but you may want to make sure that that is your problem. – Martijn Pieters Sep 21 '12 at 12:57
  • Also, for Python strings inside parentheses, the `+` concatenation is redundant. Python automatically concatenates such strings. – Martijn Pieters Sep 21 '12 at 12:58
  • `if len(line) == 0:` can be shortened to `if not len(line):` which can be shortened to `if not line:` which cannot be shortened any more. – eumiro Sep 21 '12 at 13:02
  • The `.*` parts presumably cause a lot of backtracking; try to get rid of them (replace them with `[^...]`). Also, could you post an example of your input? – georg Sep 21 '12 at 13:03
  • grep/awk are the speed references for regex: http://stackoverflow.com/a/11192394/718618 – Cédric Julien Sep 21 '12 at 13:04
  • To allow people to test their answers, provide 5-10 lines of sample input. Maybe exhibit some edge cases. – James Waldby - jwpat7 Sep 21 '12 at 13:04
  • Martijn, Eumiro, thg435: Thanks for the helpful comments! Cédric Julien: This is what I was looking for. Thanks! – Vojtěch Sep 21 '12 at 13:27
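To make the commenters' point about backtracking concrete, here is a small timing sketch (the sample line and both patterns are made up for illustration; they only roughly mimic the question's format):

import re
import timeit

# A made-up line, shaped roughly like what the question's regex expects.
line = ("01/02 12:34:56 I worker(7)@host[123] =RES= PS:1 DW:2 RT:3 PRT:4 "
        "IP:1.2.3.4 JOB:x!5 CS:0.5 CONV:ok URL:http://e KEY:a/b")

# Greedy .* runs to the end of the line, then backtracks for every token after it.
greedy = re.compile(r"([0-9/]+) .* =RES= PS:([0-9]+) .* KEY:([^/]+)")

# A negated character class stops at the first disallowed character instead.
tight = re.compile(r"([0-9/]+) [^=]*=RES= PS:([0-9]+) [^K]*KEY:([^/]+)")

for name, pattern in (("greedy", greedy), ("tight", tight)):
    seconds = timeit.timeit(lambda: pattern.match(line), number=100000)
    print("%-7s %.3fs" % (name, seconds))

The gap grows further on lines that almost match but fail near the end, because the greedy version retries from every possible position before giving up.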

3 Answers


It is hard to know without seeing your input, but it looks like your log file is made up of fields that are separated by spaces and do not contain any spaces internally. If so, you could split on whitespace first to put the individual log fields into a list, e.g.

line.split()      # Split on any whitespace

or

line.split(' ')   # Split on a single space character

After that, use a few small regexes or even simple string operations to extract the data you want from the individual fields.

It would likely be much more efficient, because the bulk of the line processing is done with a simple rule. You wouldn't have the pitfalls of potential backtracking, and you would have more readable code that is less likely to contain mistakes.

I don't know Python, so I can't write out a full code example, but that is the approach I would take in Perl.
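In Python, the same split-first idea might look roughly like this (a sketch only: the `=RES=` marker comes from the question's regex, but the field names and positions are assumptions, since no sample input was posted):

import sys

for line in sys.stdin:
    parts = line.split()                 # one cheap split instead of one big regex
    if "=RES=" not in parts:             # fast pre-filter: skip irrelevant lines early
        continue
    out = [parts[0], parts[1]]           # assumed: date and time come first
    for part in parts:
        # "PS:1234"-style fields come apart with plain string operations.
        if part.startswith(("PS:", "DW:", "RT:", "PRT:", "IP:")):
            out.append(part.split(":", 1)[1])
    print("\t".join(out))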

– dan1111

I write Perl, not Python, but I recently used this technique to parse very big logs:

  1. Divide the input file into chunks (for example, FileLen/NumProcessors bytes each).
  2. Adjust the start and end of every chunk to a \n boundary so each worker gets only complete lines.
  3. fork() to create NumProcessors workers, each of which reads its own byte range from the file and writes its own output file.
  4. Merge the output files if needed.

Sure, you should work on optimizing the regexp too; for example, use `.*` less, since it creates a lot of backtracking, which is slow. But in any case, the odds are that your bottleneck is the CPU time spent in this regexp, so working on 8 CPUs should help.
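A rough Python version of the same recipe, using the multiprocessing module in place of a raw fork() (the file name, the worker count, and the per-line work are placeholders):

import os
from multiprocessing import Pool

LOG = "big.log"    # placeholder input file
WORKERS = 8        # e.g. one worker per CPU core

def chunk_offsets(path, n):
    # Steps 1 and 2: cut the file into n byte ranges, aligned to newlines.
    size = os.path.getsize(path)
    offsets = [0]
    with open(path, "rb") as f:
        for i in range(1, n):
            f.seek(size * i // n)
            f.readline()               # skip ahead to the next full line
            offsets.append(f.tell())
    offsets.append(size)
    return list(zip(offsets, offsets[1:]))

def process_chunk(bounds):
    # Step 3: each worker reads its own byte range and writes its own file.
    start, end = bounds
    out_name = "%s.out.%d" % (LOG, start)
    with open(LOG, "rb") as f, open(out_name, "wb") as out:
        f.seek(start)
        while f.tell() < end:
            out.write(f.readline())    # placeholder: apply the regex here instead
    return out_name

if __name__ == "__main__":
    partials = Pool(WORKERS).map(process_chunk, chunk_offsets(LOG, WORKERS))
    # Step 4: concatenate the files named in `partials` if one output is needed.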

– Galimov Albert

In Perl it is possible to use precompiled regexps, which are much faster if you are using them many times.

http://perldoc.perl.org/perlretut.html#Compiling-and-saving-regular-expressions

"The qr// operator showed up in perl 5.005. It compiles a regular expression, but doesn't apply it."

If the data is large, it is worth processing it in parallel by splitting it into pieces. There are several modules on CPAN that make this easier.
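The Python counterpart of qr// is re.compile(), which the question already uses. A tiny made-up benchmark shows why this matters less in Python: the re module caches compiled patterns, so the module-level functions only pay for an extra cache lookup (as the comment below also points out):

import re
import timeit

pattern = r"PS:([0-9]+) DW:([0-9]+)"    # a made-up fragment for illustration
line = "PS:123 DW:456"
compiled = re.compile(pattern)

t_compiled = timeit.timeit(lambda: compiled.match(line), number=100000)
t_module = timeit.timeit(lambda: re.match(pattern, line), number=100000)

print("precompiled: %.3fs  module-level: %.3fs" % (t_compiled, t_module))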

– user1126070
  • He *is* using a precompiled regex. But it probably doesn't matter, since Python automatically caches regexes, same as Perl. The problem here is the regex itself, most likely those `.*`'s @thg435 [mentioned](http://stackoverflow.com/questions/12531035/fastest-way-of-processing-regexp#comment16871195_12531035) – Alan Moore Sep 21 '12 at 17:55