
I am trying to parse a report. The following is a sample of the text that I need to parse:

7605625112 DELIVERED N 1 GORDON CONTRACTORS I SIPLAST INC Freight Priority 2000037933 $216.67 1,131 ROOFING MATERIALS
04/23/2021 02:57 PM K WRISHT N 4 CAPITOL HEIGHTS, MD ARKADELPHIA, AR Prepaid 2000037933 -$124.23 170160-00
04/27/2021 12:41 PM 2 40 20743-3706 71923 $.00 055 $.00
2 WBA HOT $62.00 0
$12.92 $92.44
$167.36
7605625123 DELIVERED N 1 SECHRIST HALL CO SIPLAST INC Freight Priority 2000037919 $476.75 871 PAIL,UN1263,PAINT,3,
04/23/2021 02:57 PM S CHAVEZ N 39 HARLINGEN, TX ARKADELPHIA, AR Prepaid 2000037919 -$378.54
04/27/2021 01:09 PM 2 479 78550 71923 $.00 085 $95.35
2 HRL HOT $62.00 21
$13.55 $98.21
$173.76

The text consists of two or more blocks. Each block starts with a match for "[0-9]{10}\sDELIVERED" and ends with the last currency string before the next block.

If I test with "(?s)([0-9]{10}\sDELIVERED)(.*)(?<=\$167.36\n)", I successfully get the first block, but if I use "(?s)([0-9]{10}\sDELIVERED)(.*)(?<=\$\d\d\d.\d\d\n)", it grabs everything.

If someone can show me the changes I need to make to return two or more blocks, I would greatly appreciate it.

logi-kal
  • `(?<=` is a positive lookbehind. I can only assume you wanted a positive lookahead `(?=` instead. – MonkeyZeus Jul 28 '21 at 15:03
  • Probably, all you need is `(?sm)^\d{10}\sDELIVERED.*?(?=\R\d{10}\sDELIVERED|\z)`, see [the regex demo](https://regex101.com/r/MKjOo6/1). Depending on the regex flavor, you will need to adjust the `\z` and `\R` constructs. – Wiktor Stribiżew Jul 28 '21 at 15:13
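As a rough illustration of the "adjust the `\z` and `\R` constructs" remark, here is a minimal Python sketch. It assumes Python's `re` module, substitutes `\n` for `\R` and `\Z` for `\z`, and uses an abbreviated copy of the sample report (three lines per block instead of six). The `SAMPLE` and `blocks` names are only for illustration.

```python
import re

# Abbreviated copy of the sample report: first and last two lines of each block.
SAMPLE = (
    "7605625112 DELIVERED N 1 GORDON CONTRACTORS I SIPLAST INC Freight Priority 2000037933 $216.67 1,131 ROOFING MATERIALS\n"
    "$12.92 $92.44\n"
    "$167.36\n"
    "7605625123 DELIVERED N 1 SECHRIST HALL CO SIPLAST INC Freight Priority 2000037919 $476.75 871 PAIL,UN1263,PAINT,3,\n"
    "$13.55 $98.21\n"
    "$173.76\n"
)

# Lazy match from a block header up to (but not including) the newline that
# precedes the next header, or up to the end of the string for the last block.
# Assumption: \n stands in for \R and \Z for \z in Python's re module.
PATTERN = re.compile(r"(?sm)^\d{10}\sDELIVERED.*?(?=\n\d{10}\sDELIVERED|\Z)")

# The last block runs to the end of the string, so strip its trailing newline.
blocks = [m.group(0).rstrip() for m in PATTERN.finditer(SAMPLE)]

for block in blocks:
    print(block)
    print("-----")
```

With this input, `blocks` holds one string per report block, each ending at its final currency string.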

1 Answer


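* is a greedy quantifier, so it tries to match as many characters as possible. With (?s), .* first runs to the end of the text and then backtracks only as far as the last position where the lookbehind succeeds, which is why your second pattern returns everything as one match. See also Repetition with Star and Plus.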
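To make the greedy behaviour concrete, here is a small Python sketch on a made-up two-record string (the `A1`/`A2` text is a toy input, not the real report). It only contrasts `.*` with the lazy `.*?`. A lazy quantifier alone would not be enough for the report above, because each block contains several currency strings.

```python
import re

# Toy input (hypothetical): two records, each ending in a currency string.
text = "A1 x $1.00\nA2 y $2.00\n"

# Greedy: .* runs to the end, then backtracks to the LAST currency string.
greedy = re.findall(r"(?s)A\d.*\$\d\.\d\d", text)

# Lazy: .*? expands one character at a time and stops at the FIRST currency string.
lazy = re.findall(r"(?s)A\d.*?\$\d\.\d\d", text)

print(greedy)  # one match spanning both records
print(lazy)    # one match per record
```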

To fix it, you can use this regex:

(?s)(\d{10}\sDELIVERED)((.(?!\d{10}\sDELIVERED))*)(?<=\$\d\d\d.\d\d)

Here I replaced .* with (.(?!\d{10}\sDELIVERED))*, so that each character is consumed only if it is not immediately followed by \d{10}\sDELIVERED. The match therefore cannot run into the start of the next block.

See a demo here
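For reference, this is how the regex above might be applied from Python. It is only a sketch: the choice of Python's `re` module is my assumption (the question does not name a language), and the input is an abbreviated copy of the sample report with three lines per block.

```python
import re

# Abbreviated copy of the sample report: first and last two lines of each block.
SAMPLE = (
    "7605625112 DELIVERED N 1 GORDON CONTRACTORS I SIPLAST INC Freight Priority 2000037933 $216.67 1,131 ROOFING MATERIALS\n"
    "$12.92 $92.44\n"
    "$167.36\n"
    "7605625123 DELIVERED N 1 SECHRIST HALL CO SIPLAST INC Freight Priority 2000037919 $476.75 871 PAIL,UN1263,PAINT,3,\n"
    "$13.55 $98.21\n"
    "$173.76\n"
)

# The regex from the answer, unchanged. The lookbehind has a fixed width of
# seven characters, which Python's re module requires.
PATTERN = re.compile(
    r"(?s)(\d{10}\sDELIVERED)((.(?!\d{10}\sDELIVERED))*)(?<=\$\d\d\d.\d\d)"
)

# group(0) is the whole block, from the header to the last currency string.
blocks = [m.group(0) for m in PATTERN.finditer(SAMPLE)]

for block in blocks:
    print(block)
    print("-----")
```

Note that the lookbehind expects three digits before the decimal separator, so a block whose final amount is shorter or longer than that would need a different ending condition.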

logi-kal
  • `(.(?!\d{10}\sDELIVERED))*` is a corrupt tempered greedy token, it must be written as `(?:(?!\d{10}\sDELIVERED).)*` – Wiktor Stribiżew Jul 28 '21 at 15:12
  • @WiktorStribiżew It depends on what the OP wants to do. I supposed s/he wanted to use the group 0. – logi-kal Jul 28 '21 at 15:14
  • It does not matter what groups are needed and which are not, the problem is with the syntax. TGT is meant to restrict what is to follow, but placing `.` in front skips the first char. [More here](https://stackoverflow.com/a/37343088/3832970). – Wiktor Stribiżew Jul 28 '21 at 15:15
  • @WiktorStribiżew I still don't see the problem. The complexity looks the same to me; written that way you would just save a handful of steps, so I would not use the word "must". – logi-kal Jul 28 '21 at 15:30
  • You would see a problem if the trailing delimiter occurred right after the leading delimiter. Most probably, this will never happen in 99.9999% of cases. However, the `.` must be placed after the lookahead. – Wiktor Stribiżew Jul 28 '21 at 15:32
  • Thanks to all for your prompt responses. The pattern noted above, `(?s)(\d{10}\sDELIVERED)((.(?!\d{10}\sDELIVERED))*)(?<=\$\d\d\d.\d\d)`, solves my problem. – Matt Paisley Jul 28 '21 at 15:33
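The difference between the two forms discussed in these comments can be seen on a toy input. The sketch below is in Python and uses made-up `START`/`END` delimiters in place of the real block header. It compares the lookahead-first form `(?:(?!END).)*` with the dot-first form `(?:.(?!END))*`.

```python
import re

# Lookahead first: a character is consumed only if END does not start there.
LOOKAHEAD_FIRST = re.compile(r"START((?:(?!END).)*)")

# Dot first: a character is consumed only if END does not start right AFTER it,
# so the character directly following START is never checked.
DOT_FIRST = re.compile(r"START((?:.(?!END))*)")

def body(pattern, text):
    """Return what the tempered token captured after START."""
    return pattern.match(text).group(1)

for sample in ("STARTabcEND", "STARTEND"):
    print(repr(sample),
          repr(body(LOOKAHEAD_FIRST, sample)),
          repr(body(DOT_FIRST, sample)))
```

In the ordinary case the dot-first form stops one character early (`ab` instead of `abc`). When the trailing delimiter immediately follows the leading one, it runs straight through the delimiter and captures `END`. In the report above that one-character difference only drops the newline before the next block, which is why the accepted pattern still works there.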