Boring background story:
My banking reports come in .pdf and .csv formats, but the older ones are only available to me as PDFs. I wanted to save the data in the same format as the newer ones to make it easier to search, but the PDFs were protected, and after fiddling with unlockers and PDF-to-text converters, I ended up with terribly formatted files.
I have files where each line is a transaction, like this:
23.12 22.12.09 Verfügung Geldautomat\t63050000 / 9000481400\tGA NR00002317 BLZ63050000 0\t22.12/14.17UHR ESELSBERGW EUR 50,00\t-50,00
Replacing the data fields with numbers yields the following order and separators:
1 2 3\t7 / 6\t5\t4\t8
But I want it in this format (fields 0, 9 and 10, which are not present in the original data, are static):
"0";"1";"3";"4";"5";"6";"7";"8";"9";"10"
This is my current approach (without the I/O part):
def readtrans(line):
    d1, d2, rest = line.split(' ', 2)
    d3, rest, d5, d4, d8 = rest.split('\t')
    d7, d6 = rest.split(' / ')
    return [d1, d2, d3, d4, d5, d6, d7, d8]
Unfortunately, it crashed at line 3 of the first file, because fields 5 and 6 are empty for certain values of field 3. After adding an if-clause to work around this, the script advanced to line 5 only to crash again, because field 4 may also contain tabs. I could work around this one too, but I took it as a cue to look for a more flexible solution.
Most of the time, when I need to extract data from text, I look at the delimiters and split() accordingly. It might not be very efficient, but it's faster than looking up the regex syntax, which I rarely use and repeatedly forget. Is it a viable approach in this situation, or would regex be better suited? Can regex even handle this task, and if so, would it still be readable? How would you solve it?
Edit: It's true that I will never use this code again, but this is a very common problem. This was my solution, by the way:
def readtrans(line):
    d1, d2, rest = line.split(' ', 2)
    if rest[0] == 'A':
        d3, d7, d4, d8 = rest.split('\t')
        d5 = ''
        d6 = ''
    else:
        d3, d7d6, d5, d4d8 = rest.split('\t', 3)
        d7, d6 = d7d6.split(' / ')
        rest = d4d8.split('\t')
        d8 = rest[-1]
        d4 = ' '.join(rest[:-1])
    return [d1, d2, d3, d4, d5, d6, d7, d8]
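For completeness, here is a minimal sketch of the output step that the code above leaves out: turning a parsed list [d1, ..., d8] into one line of the target format. The values of the static fields 0, 9 and 10 are not given in the post, so STATIC_0, STATIC_9 and STATIC_10 are hypothetical placeholders, as are the helper names to_row and format_row. The sketch follows the target format as written, which has no column for field 2.

```python
import csv
import io

# Hypothetical placeholders: the real static values are not stated in the post.
STATIC_0 = "static0"
STATIC_9 = "static9"
STATIC_10 = "static10"


def to_row(fields):
    """Reorder [d1, ..., d8] into the ten target columns."""
    d1, d2, d3, d4, d5, d6, d7, d8 = fields
    # d2 is dropped because the target format lists no field 2.
    return [STATIC_0, d1, d3, d4, d5, d6, d7, d8, STATIC_9, STATIC_10]


def format_row(row):
    """Render one row as a ';'-separated line with every value quoted."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=";", quoting=csv.QUOTE_ALL,
                        lineterminator="\n")
    writer.writerow(row)
    return buf.getvalue()


# The sample transaction, already split into d1..d8.
parsed = ["23.12", "22.12.09", "Verfügung Geldautomat",
          "22.12/14.17UHR ESELSBERGW EUR 50,00",
          "GA NR00002317 BLZ63050000 0",
          "9000481400", "63050000", "-50,00"]
print(format_row(to_row(parsed)), end="")
```

Letting the csv module do the quoting means a value that happens to contain a semicolon or a double quote is still written correctly, which manual string joining would not guarantee.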
After thinking about how to rephrase my question, I realized that it was basically a duplicate of this one: Convert string to variables (like format(), but in reverse)
Armed with this new knowledge, I came up with this short pattern, which parses my example correctly:
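import re
example = '23.12 22.12.09 Verfügung Geldautomat\t63050000 / 9000481400\tGA NR00002317 BLZ63050000 0\t22.12/14.17UHR ESELSBERGW EUR 50,00\t-50,00'
# [\S ]+ matches non-whitespace characters and spaces, so each group stops at the next tab.
# (Inside a character class, | is a literal pipe, not alternation, so it is left out.)
x = re.search(r'(\S+) (\S+) ([\S ]+)\t(\S+) / (\S+)\t([\S ]+)\t([\S ]+)\t([\S ]+)', example)
print(x.groups())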
>>>('23.12',
'22.12.09',
'Verf\xc3\xbcgung Geldautomat',
'63050000',
'9000481400',
'GA NR00002317 BLZ63050000 0',
'22.12/14.17UHR ESELSBERGW EUR 50,00',
'-50,00')
The key was using capture groups and the match object's groups() method.
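On the readability question, one option is to combine named groups with re.VERBOSE, which allows whitespace and comments inside the pattern. The following is only a sketch: the group names d1 to d8 mirror the field numbers used above, parse_transaction is a hypothetical helper name, and the pattern assumes the line layout of the sample. It lets field 4 contain tabs by matching greedily up to the last tab, but it does not cover the lines where fields 5 and 6 are empty.

```python
import re

# In VERBOSE mode, unescaped whitespace in the pattern is ignored,
# so a literal space is written as [ ].
TRANSACTION = re.compile(r"""
    (?P<d1>\S+)[ ]        # field 1
    (?P<d2>\S+)[ ]        # field 2
    (?P<d3>[^\t]+)\t      # field 3: may contain spaces, ends at the first tab
    (?P<d7>\S+)[ ]/[ ]    # field 7, followed by ' / '
    (?P<d6>\S+)\t         # field 6
    (?P<d5>[^\t]+)\t      # field 5
    (?P<d4>.+)\t          # field 4: greedy, so it may contain tabs
    (?P<d8>[^\t]+)$       # field 8: everything after the last tab
""", re.VERBOSE)


def parse_transaction(line):
    """Return a dict of the named fields, or None if the line does not match."""
    match = TRANSACTION.match(line.rstrip("\n"))
    if match is None:
        return None
    return match.groupdict()


example = ('23.12 22.12.09 Verfügung Geldautomat\t63050000 / 9000481400\t'
           'GA NR00002317 BLZ63050000 0\t'
           '22.12/14.17UHR ESELSBERGW EUR 50,00\t-50,00')
fields = parse_transaction(example)
print(fields["d3"], fields["d8"])
```

Returning None for lines that do not match makes it easy to collect the odd cases in a separate list and inspect them, instead of having the script crash on the first one.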