Problem
I just built this expression in the regex101 editor to extract figures from a form that has been converted to txt. You can view the regex and sample data here: https://regex101.com/r/P1458h/1/.
^
(\d{1,3})\s+
(?:(?![\d,.]+\n).)+
([\d.,]+)\n
Problem: it seems pretty inefficient, at 141k+ steps. Any idea how I can improve it?
Explanation
The data source is a multi-line txt extracted from a PDF, resulting in a less-than-perfect output.
I'm trying to extract the box numbers and any number that is present (filled in) for particular lines. If you check the link above you can see the full sample. For example:
Below is a screenshot of Regex101 showing positive matches. The topmost match shows the box number (155) and the filled-in number (34243).
Restrictions/good to know:
- I need this to work in Python, and can use the new regex module if necessary.
- The number may not always contain a comma (,), and it always sits immediately before the newline (\n) at the end of the line.
- Only match if a number/value is filled in (e.g. 34243 in the above example). So in that example, the line with box number 170 should not match.
- The format changes lower down the form; I'm happy to ignore that.
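To make the restrictions above concrete, here is a minimal sketch of how the pattern could be run in Python with the standard `re` module. It assumes the verbose and multiline flags (the pattern is written across several lines and anchored with `^`). The sample text and the name `BOX_PATTERN` are made up for illustration and are not the real form data.

```python
import re

# The pattern from the question. Assumed flags: VERBOSE (whitespace in the
# pattern is ignored) and MULTILINE (^ matches at the start of every line).
BOX_PATTERN = re.compile(
    r"""
    ^
    (\d{1,3})\s+            # box number at the start of the line
    (?:(?![\d,.]+\n).)+     # label text, up to a number that ends the line
    ([\d.,]+)\n             # the filled-in value, right before the newline
    """,
    re.VERBOSE | re.MULTILINE,
)

# Hypothetical sample text, invented for this sketch (not the real form).
SAMPLE = (
    "155 Total income 34,243\n"   # filled in, with a comma
    "170 Other income\n"          # nothing filled in, should not match
    "180 Net amount 1200\n"       # filled in, without a comma
)

matches = BOX_PATTERN.findall(SAMPLE)
print(matches)  # [('155', '34,243'), ('180', '1200')]
```

Each tuple holds the box number and the value. The line for box 170 is skipped because it has no number before the newline.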
Any help would be appreciated! Thanks.