Extract multiple data fields from a non-uniformly formatted string

Question

Boring background story:

My banking reports come in .pdf and .csv formats, but the older ones are only available to me as pdf's. I wanted to save the data in the same format as the newer ones to make it easier searchable, but the pdf's were protected, and after fiddling with unlockers and pdf-to-text converters, I ended up with terribly formatted files.

I have files where each line is a transaction (like this: 23.12 22.12.09 Verfügung Geldautomat\t63050000 / 9000481400\tGA NR00002317 BLZ63050000 0\t22.12/14.17UHR ESELSBERGW EUR 50,00\t-50,00)

Replacing the data fields with numbers yields the following order and separators:

1 2 3\t7 / 6\t5\t4\t8

But I want it in this format (the fields 0, 9 and 10, which are not present in the original data are static)

"0";"1";"3";"4";"5";"6";"7";"8";"9";"10"

this is my current approach (without the I/O part)

def readtrans(line):
    d1, d2, rest = line.split(' ', 2)
    d3, rest, d5, d4, d8 = rest.split('\t')
    d7, d6 = rest.split(' / ')
    return [d1, d2, d3, d4, d5, d6, d7, d8]

unfortunately it crashed at line 3 of the first file, because fields 5 and 6 are empty for certain values of field 3. After adding an if-clause to work around this, the script advanced to line 5 only to crash again, because field 4 might also contain Tabs. I could work around this one too, but I took it as a cue to look for a more flexible solution.

Most of the time, when i need to extract data from text, i look at the delimiters and split() accordingly. It might not be very efficient but its faster than looking up the regex syntax which I rarely use and repeatedly forget. Is it a viable approach in this situation or would regex be better suited? Can regex even handle this task and if yes, would it still be readable? How would you solve it?

Edit: its true, that i will never use this code again, (this was my solution by the way) but this is a very common problem

def readtrans(line):
    d1, d2, rest = line.split(' ', 2)
    if rest[0] == 'A':
        d3, d7, d4, d8 = rest.split('\t')
        d5 = ''
        d6 = ''
    else:  
        d3, d7d6, d5, d4d8 = rest.split('\t', 3)
        d7, d6 = d7d6.split(' / ')
        rest = d4d8.split('\t')
        d8 = rest[-1]
        d4 = ' '.join(rest[:-1])
    return [d1, d2, d3, d4, d5, d6, d7, d8]

After thinking how to rephrase my question, I realized that it was basically a duplicate of this one Convert string to variables (like format(), but in reverse)

Armed with the new knowledge, i fabricated this short pattern which parses my example correctly

import re
example = '23.12 22.12.09 Verfügung Geldautomat\t63050000 / 9000481400\tGA NR00002317 BLZ63050000 0\t22.12/14.17UHR ESELSBERGW EUR 50,00\t-50,00'
x = re.search(r'(\S+) (\S+) ([\S| ]+)\t(\S+) / (\S+)\t([\S| ]+)\t([\S| ]+)\t([\S| ]+)', example)
print x.groups()
>>>('23.12',
'22.12.09',
'Verf\xc3\xbcgung Geldautomat',
'63050000',
'9000481400',
'GA NR00002317 BLZ63050000 0',
'22.12/14.17UHR ESELSBERGW EUR 50,00',
'-50,00')

the key was using re.groups()

Can you update your question with some examples of lines that cause your code to crash? Also, StackOverflow is probably swallowing your `\t` characters, so it would be nice if you could explicitly indicate where those go. — senshin, Jan 04 '14 at 23:41
Don't forget try/except is good for handling expected errors. — John Mee, Jan 04 '14 at 23:43
@senshin: The errors are from split producing more or less elements than expected, as I said, i can fix those but Im more interested in the general approach of extracting multiple substrings with different patterns than this specific problem. — DenDenDo, Jan 05 '14 at 00:07

score 1 · Answer 1 · answered Jan 05 '14 at 00:07

I'm going to make a couple of assumptions: 1) You'll probably never use this code again 2) There are only a few possible formats

I wouldn't bother with working out a RE for this, as it doesn't need to be so robust. (see assumption 1).

I would probably try to figure out some way to determine which format the particular line I'm reading is using. Then with some if statements, send it to through the appropriate delimiting steps to get the fields in the order you want. (see assumption 2).

I quickly came up with an example, you'll obviously need to change quite a bit to make it work for your case, but you get the idea. The hardest part may be figuring out a way to determine which decoder to use...I used the 'location of the tab' in my case.

def decoder1(line):
    parts = line.split("\t")
    d1, d2 = parts[0].split(",")
    d3, d4, d5, d6, d7, d8, d9 = parts[1].split(",")
    return [d1, d2, d3, d4, d5, d6, d7, d8, d9]


def decoder2(line):
    parts = line.split("\t")
    d1 = parts[0]
    d2, d3, d4, d5, d6, d7, d8, d9 = parts[1].split(",")
    return [d1, d2, d3, d4, d5, d6, d7, d8, d9]


def decoder3(line):
    parts = line.split("\t")
    d1, d2, d3, d4, d5, d6, d7 = parts[0].split(",")
    d8, d9 = parts[1].split(",")

    return [d1, d2, d3, d4, d5, d6, d7, d8, d9]


if __name__ =="__main__":
    lines = [
            "1,2\t3,4,5,6,7,8,9",
            "1\t2,3,4,5,6,7,8,9",
            "1,2,3,4,5,6,7\t8,9"
            ]

    for line in lines:
        tablocation = len((line.split("\t")[0]).split(","))
        if tablocation == 2:
            res = decoder1(line)
        elif tablocation == 1:
            res = decoder2(line)
        elif tablocation == 7:
            res = decoder3(line)
        else:
            print "Must be a new format for %s" %line
            res = "NA"
        print res

If you've got more than a few 'decoder options' then it might be worth spending the time to develop some REs. But without knowing all the possible variations you have, it's going to be hard to provide much more help than what I've shown in the approach above.

score 1 · Answer 2 · answered Jan 05 '14 at 01:17

There's a bit of noise in your question but here's what I think you're asking:

How do I specify multiple delimiters to split on, some of which may be more than one character long?

The answer to that is use re.split():

s = '1 2 3\t7 / 6\t5\t4\t8'

import re

re.split(r'\s/\s|\s|\t',s)
Out[13]: ['1', '2', '3', '7', '6', '5', '4', '8']

You can rearrange the order into your final output as you see fit.

Note: usually in these multiple-delimiter problems you can arbitrarily specify the order of the tokens to split on. Not so here.

re.split(r'\s|\t|\s/\s',s)
Out[14]: ['1', '2', '3', '7', '/', '6', '5', '4', '8']

You need to look for \s/\s before you look for just \s, since the latter is a substring of the former.

Extract multiple data fields from a non-uniformly formatted string

2 Answers2