2

I have this really irritating problem resulting from a dodgy PDF digitisation.

Anyway, a series of rows with different columns, ideally, would be represented like this:

Code  Cost  Quantity
ABC  45.00  4
FED  60.00  5
GHK  30.00  5

With regex it is easy to split these into rows and then get each individual column.

However, I find one particularly annoying bit of text always comes out like this instead:

Code  Cost  Quantity
ABC FED GHK   45.00 60.00 30.00  4 5 5

I cannot for the life of me figure out how to get the regex to separate out each of these overlapped rows, as in the first example. Positive lookaheads can get me some of the way, but what typically happens is that ill get ABC 45.00 4 and then FED 45.00 4, the lookaheads I've built don't iterate through all of the separate columns.

My suspicion is I could use a named pattern or something, matching the first set:

(?>(?<match1>((?>\s|\b)\w{3}\s).+\s+\s(\d+\.\d{2})\s.*\s+\s(\d{1})\s.*))

and then somehow Reuse that capture group, iterating it.

Sticking a positive look ahead only iterates the first group, so I'm obviously doing something stupid:

https://regex101.com/r/Uxx8bZ/1

In theory I could separate out the rows some other way (e.g. for every big space, that's one column) but it seems like this should be possible.

Help appreciated!

1 Answers1

1

In the example data, the accompanying data is 3 times whitespace chars followed by non whitespace chars to the right.

If that structure is always the same, you can capture the uppercase chars, and capture the other 2 fields inside a lookahead assertion.

([A-Z]+)(?=\s+\S+\s+\S+\s+(\d+(?:\.\d+)?)\s+\S+\s+\S+\s+(\d+))
  • ([A-Z]+) Capture 1+ times a char A-Z in group 1
  • (?= Positive lookahead, assert to the right
    • \s+\S+\s+\S+\s+(\d+(?:\.\d+)?) After 3 fields capture 1+ digits with an optional decimal part in group 2
    • \s+\S+\s+\S+\s+(\d+) After 3 fields, capture 1+ digits in group 3
  • ) Close lookahead

See a regex demo

An example using re.findall to retieve the capture group values:

import re

pattern = r"([A-Z]+)(?=\s+\S+\s+\S+\s+(\d+(?:\.\d+)?)\s+\S+\s+\S+\s+(\d+))"
s = r"ABC FED GHK   45.00 60.00 30.00  4 5 5"
print(re.findall(pattern, s))

Output

[('ABC', '45.00', '4'), ('FED', '60.00', '5'), ('GHK', '30.00', '5')]
The fourth bird
  • 154,723
  • 16
  • 55
  • 70
  • Thanks a great deal for your response - This looks like it could be the path to go down. The only problem is, I don't know how many rows there are, it could be 1 or as many 20. Is there a simple way of changing your code to account for that? I tried this awful thing but I haven't quite got it: ([A-Z]+)(?=(?:\s+\S+)+(\S+)(?:\s+\S+)+(\S+)) – Gareth Ward Jun 09 '21 at 15:45
  • @GarethWard What do you mean by rows? Do you mean that now you have 3, and there could be 20 parts. In that case you can use a quantifier `([A-Z]+)(?=(?:\s+\S+){3}\s*(\d+(?:\.\d+)?)(?:\s+\S+){3}\s+(\d+))` https://regex101.com/r/ArU5GE/1 Note that this approach only works when the column value is a single "word" – The fourth bird Jun 09 '21 at 15:47
  • Thanks again I'm excited by this! By rows, I mean that in the original pattern, there are 3 'groups' that overlap, i.e. 3 rows, [('ABC', '45.00', '4'), ('FED', '60.00', '5'), ('GHK', '30.00', '5')]. There could be just the one, or maybe 20, for example: [('ABC', '45.00', '4'), ('FED', '60.00', '5'), ('GHK', '30.00', '5'), ('GHK', '30.00', '5'), ('GHK', '30.00', '5'), ('GHK', '30.00', '5'), ('GHK', '30.00', '5'), ('GHK', '30.00', '5'), ('GHK', '30.00', '5'), ('GHK', '30.00', '5')] so 5 'rows' would look like abc abc abc abc abc 40.00 40.00 40.00 40.00 40.00 4 5 6 5 6, if you get me! – Gareth Ward Jun 09 '21 at 15:58
  • @GarethWard Yes like this `([A-Za-z]+)(?=(?:\s+\S+){4}\s*(\d+(?:\.\d+)?)(?:\s+\S+){4}\s+(\d+))` https://regex101.com/r/9Cn9T6/1 – The fourth bird Jun 09 '21 at 16:05
  • So I can see how to change it if I know how many rows there are before I do the REGEX (and that should be quite simple to work out, so this should get me where I want to go hopefully!). I'm guessing that having it work without knowing how many rows there are going to be would be too big an ask, i.e. a single regex pattern that works on a string containing 3 rows, 4 row, 5 rows etc simultaneously. If I put in a {1,} for example, that just breaks it because it matches as many times as possible. Thanks for your input! – Gareth Ward Jun 09 '21 at 16:56