2

Problem

I just built this expression within the regex101 editor to extract figures from a form, which has been converted to txt. You can view the regex and sample data here: https://regex101.com/r/P1458h/1/.

^
(\d{1,3})\s+
(?:(?![\d,.]+\n).)+
([\d.,]+)\n

Problem: it seems pretty inefficient, taking 141k+ steps. Any idea how I can improve it?

Explanation

The data source is a multi-line txt extracted from a PDF, resulting in a less-than-perfect output.

I'm trying to extract the box numbers and any number that is present (filled in) for particular lines. If you check the link above you can see the full sample. For example:

Below is a screenshot of Regex101 showing positive matches. The topmost line match shows the box number (155), and the number (34243).

[Screenshot: Regex101 showing the positive matches]

Restrictions/good to know:

  • I need this to work in Python, and can use the new regex module if necessary.
  • The number may not always have a comma (,), and is always immediately before the end of the line (\n).
  • Only match if there is a number/value filled in (e.g. 34243 in the above example). So in that example, the line with box number 170 should not match.
  • The format changes lower down the form; I'm happy to ignore that.
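For reference, here is a minimal sketch of how the pattern above could be run from Python with the standard `re` module. The sample lines and the helper name `extract_boxes` are made up for illustration; the real form text is in the regex101 link.

```python
import re

# Hypothetical sample lines; the real form text will look different.
SAMPLE = (
    "155 Total income from all sources 34,243\n"
    "170 Deductions claimed\n"
    "180 Net amount 1234.50\n"
)

# The pattern from the question, laid out the same way as on regex101.
# re.VERBOSE ignores whitespace and line breaks inside the pattern;
# re.MULTILINE lets ^ match at the start of every line.
ORIGINAL = re.compile(
    r"""
    ^
    (\d{1,3})\s+            # box number
    (?:(?![\d,.]+\n).)+     # label text, stopping before the final number
    ([\d.,]+)\n             # value at the end of the line
    """,
    re.VERBOSE | re.MULTILINE,
)


def extract_boxes(text):
    """Return (box_number, value) pairs for lines that have a value."""
    return [(m.group(1), m.group(2)) for m in ORIGINAL.finditer(text)]


if __name__ == "__main__":
    for box, value in extract_boxes(SAMPLE):
        print(box, value)
```

On this sample, box 170 is skipped because its line has no value, which is the behaviour described in the restrictions.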

Any help would be appreciated! Thanks.

John Kugelman
Adam McCann
  • 2
    Your pattern is equal to `^(\d{1,3})\s+.*?([\d.,]+)\n`. Lazy dot pattern is always more efficient than a tempered greedy token with the right-hand boundary pattern. See [When Not to Use this Technique](http://www.rexegg.com/regex-quantifiers.html#tempered_greed). – Wiktor Stribiżew Nov 14 '18 at 15:10
  • 3
    This might be better on [codereview.se] – ChrisGPT was on strike Nov 14 '18 at 15:10
  • 1
    @Chris No, Code Review is for code that already does what you want it to do. This code does not do what the OP wants it to do. – Lightness Races in Orbit Nov 14 '18 at 15:11
  • 3
    @LightnessRacesinOrbit, doesn't it? It looks like "this works, but not very efficiently" to me. Are questions about improving performance a better fit here or on Code Review? I believe the latter. – ChrisGPT was on strike Nov 14 '18 at 15:14
  • 2
    I wonder what is the expected answer here. How much improvement in steps is considered a "right" answer here? BTW, is the pattern dynamic or static? – Wiktor Stribiżew Nov 14 '18 at 15:16
  • Since I cannot post an answer: `^(\d{1,3})\s+(?:(?![\d,.]+\n).+)\s([\d.,]+)\n` you can use this which has 70k steps. If I could post an answer I would have given details. https://regex101.com/r/P1458h/5 – scriptmonster Nov 14 '18 at 15:41
  • 1
    @Chris The way I understood it, Code Review is "I'm done - check it out - any feedback?". This question on the other hand is concretely asking for a problem to be fixed: the problem is "this code is too slow and I need to write different code that's faster". – Lightness Races in Orbit Nov 14 '18 at 15:47
  • @scriptmonster: You now can. – Lightness Races in Orbit Nov 14 '18 at 15:48
  • @LightnessRacesinOrbit, maybe. I don't participate in that community so I'm not an expert. But if it's to be on-topic here I think it needs to be made _much_ more concrete. Doesn't this look like a ["somebody please help me" question](https://meta.stackoverflow.com/q/284236/354577)? As Wiktor says, what would objectively qualify as a "right" answer? And how can this question help other users? – ChrisGPT was on strike Nov 14 '18 at 15:55
  • 1
    @LightnessRacesinOrbit there are better answers now :) – scriptmonster Nov 14 '18 at 15:56

3 Answers

4

After optimizing your regex, I came up with this:

^
(\d{1,3})
\b
.+?
\b
([\d.,]+)
\n

Updated Regex Demo: it takes 20438 steps for the same number of matches.

You may also replace the last \n with $ if your input has different line endings.
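As a rough illustration in Python, one case where `$` makes a difference is a final line with no trailing newline. The sample text below is hypothetical, and both patterns are written on one line instead of using the extended layout.

```python
import re

# Hypothetical text whose last line has no trailing newline.
text = "155 Total income 34,243\n160 Other income 1,200.50"

with_newline = re.compile(r"^(\d{1,3})\b.+?\b([\d.,]+)\n", re.MULTILINE)
with_dollar = re.compile(r"^(\d{1,3})\b.+?\b([\d.,]+)$", re.MULTILINE)

newline_matches = with_newline.findall(text)
dollar_matches = with_dollar.findall(text)

print(newline_matches)  # [('155', '34,243')]
print(dollar_matches)   # [('155', '34,243'), ('160', '1,200.50')]
```

Note that in Python's `re`, `$` with `re.MULTILINE` matches only before `\n` or at the end of the string, so Windows-style `\r\n` endings would still need separate handling, for example by normalizing the text first.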

anubhava
0

I get the same matches by changing the middle part to simply .+?. There's no need to have a negative lookahead. Instead you can use .+ and add ? to make the + non-greedy so that it doesn't consume digits from the final number.

I also recommend using $ to match end-of-line.

^
(\d{1,3})
.+?
([\d.,]+)
$

Demo: 18 matches, 73263 steps
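To see why the `?` matters, here is a small Python comparison of the greedy and non-greedy forms on a single made-up line:

```python
import re

line = "155 Total income 34,243"  # hypothetical sample line

# Greedy: .+ runs to the end of the line and gives back only one character,
# so the last group captures just the final digit.
greedy = re.search(r"^(\d{1,3}).+([\d.,]+)$", line)

# Non-greedy: .+? grows one character at a time, so the last group
# captures the whole trailing number.
lazy = re.search(r"^(\d{1,3}).+?([\d.,]+)$", line)

print(greedy.groups())  # ('155', '3')
print(lazy.groups())    # ('155', '34,243')
```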

John Kugelman
  • Interestingly, using $ to match end of line rather than \n uses another ~1k steps. But I guess it is more reliable. – Adam McCann Nov 14 '18 at 15:31
0

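A slight, ellipsis-resistant improvement on the accepted version: the final group must start with a digit, so a run of dots on its own is not captured as a value.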

^(\d{1,3})\s.+?\b(\d[\d.,]*)$

20178 steps
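One reading of the claim above, sketched in Python: if an empty box's label ends in a dotted leader, a final group of `[\d.,]+` can capture the dots, while `\d[\d.,]*` cannot. The sample lines are hypothetical, and `accepted` here is the `$` variant of the accepted pattern.

```python
import re

accepted = re.compile(r"^(\d{1,3})\b.+?\b([\d.,]+)$", re.MULTILINE)
stricter = re.compile(r"^(\d{1,3})\s.+?\b(\d[\d.,]*)$", re.MULTILINE)

# Hypothetical lines: an empty box whose label ends in dots, then a filled box.
text = "170 Deductions claimed....\n155 Total income 34,243\n"

accepted_matches = accepted.findall(text)
stricter_matches = stricter.findall(text)

print(accepted_matches)  # [('170', '....'), ('155', '34,243')]
print(stricter_matches)  # [('155', '34,243')]
```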

PS: Previous version:

^(\d{1,3})\s.+?\b(\d+,?\d+\.?\d+)\n

https://regex101.com/r/BahUUo/3/

20750 steps, 18 matches.

This will fail with small numbers, since the final group needs at least three digits.
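A quick Python check of the two final groups in isolation shows the difference. The sample values are made up, and `fullmatch` is used so that each value is tested on its own, outside the full line pattern.

```python
import re

previous_tail = re.compile(r"\d+,?\d+\.?\d+")  # final group of the previous version
current_tail = re.compile(r"\d[\d.,]*")        # final group of the current version

samples = ["7", "12", "1234", "3,200.00"]  # hypothetical values

# Each of the three \d+ parts needs at least one digit, so values with
# fewer than three digits cannot match the previous version.
previous_ok = [s for s in samples if previous_tail.fullmatch(s)]
current_ok = [s for s in samples if current_tail.fullmatch(s)]

print(previous_ok)  # ['1234', '3,200.00']
print(current_ok)   # ['7', '12', '1234', '3,200.00']
```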

PS: Updated following scriptmonster's comments.

Serge
  • 3
    This answer is wrong. It matches only 8 times. The last part `(\d+(?:,)\d+)` is wrong. Some numbers do not contain a comma, and some contain both a comma and a dot. For example, `1234` and `3,200.00` do not match. – scriptmonster Nov 14 '18 at 16:07
  • 1
    still not correct, you got it wrong. The last part is not matching correctly please check matched lines. – scriptmonster Nov 14 '18 at 20:57