I have this really irritating problem resulting from a dodgy PDF digitisation.
Anyway, a series of rows with different columns, ideally, would be represented like this:
Code Cost Quantity
ABC 45.00 4
FED 60.00 5
GHK 30.00 5
With regex it is easy to split these into rows and then get each individual column.
However, I find one particularly annoying bit of text always comes out like this instead:
Code Cost Quantity
ABC FED GHK 45.00 60.00 30.00 4 5 5
I cannot for the life of me figure out how to get the regex to separate out each of these overlapped rows, as in the first example. Positive lookaheads can get me some of the way, but what typically happens is that ill get ABC 45.00 4 and then FED 45.00 4, the lookaheads I've built don't iterate through all of the separate columns.
My suspicion is I could use a named pattern or something, matching the first set:
(?>(?<match1>((?>\s|\b)\w{3}\s).+\s+\s(\d+\.\d{2})\s.*\s+\s(\d{1})\s.*))
and then somehow Reuse that capture group, iterating it.
Sticking a positive look ahead only iterates the first group, so I'm obviously doing something stupid:
https://regex101.com/r/Uxx8bZ/1
In theory I could separate out the rows some other way (e.g. for every big space, that's one column) but it seems like this should be possible.
Help appreciated!