I have a file that looks like this, and I want to extract data from it using regex:

RID:  RSS-130                                         SERVICE                        PAGE:              2   
REPORTING FOR:      100019912 SSSE                      INTSERVICE                    PROC DATE:   15SEP21   
ROLLUP FOR:          100076212 SSSE                          REPORT                        REPORT DATE: 15SEP21   
ENTITY:  1000208212 SSSE                                                                                                 
                                                                                                                                      
                                                                                                                                      
                                                                                                                                      
                                                                                                                                      
 ACQT                                                                                                               
                                                                                                                                      
                                                                                                                                      
   PUR                                                                                                                         
     SAME                      10SEP21                 120            12,263,518             19,48.5                        
                                                                                                                                      
   T PUR                                              120            12,263,518             19,48.5

The regexes I wrote to extract the data:

regex_1 = PROC DATE:\s*(\w+).?*     # to get 15SEP21   
regex_2 = T PUR\s*([0-9,]*\s*[0-9,]*)  # to get the first two elements of the line after T PUR

This works, but the file contains multiple records just like this one under different RID values, for example RID: RSS-140. I want to extract only the information that follows RID: RSS-130 and ACQT, stop when that record is over, and not carry on extracting data from whatever comes under the next RID. How can I do that?

The desired output would be:

[(15SEP21;120;12,263,518)] for the record that comes under RID: RSS-130 and after ACQT only

Max
  • FYI: Remove `.?*` as it is just some mess in your first regex. Also, if you only want to get 2 details from a block of text, you should not use two separate regexps. You need to *capture* these two substrings within one call to regex. The only thing needed is to define the stop condition for the search, what is the boundary pattern for a block of text. Is it `RID:\s+RSS-\d+`? Also, what is your code? And what is the expected output for the above input? – Wiktor Stribiżew Sep 26 '21 at 12:36
  • Hey @WiktorStribiżew, thank you for your answer. `.?*` is there because I intend to combine the two regexes. As for the expected output, I modified my question and added it. – Max Sep 26 '21 at 12:54
  • Did the solution below work? – Wiktor Stribiżew Sep 26 '21 at 19:46

1 Answer

I suggest leveraging a tempered greedy token here:

(?s)PROC DATE:\s*(?P<date>\w+)(?:(?!RID:\s+RSS-\d).)*T PUR\s+(?P<num>\d[.,\d]*)\s+(?P<val>\d[\d,]*)

See the regex demo. Details:

  • (?s) - an inline re.S / re.DOTALL modifier
  • PROC DATE: - a literal text
  • \s* - zero or more whitespaces
  • (?P<date>\w+) - Group "date": one or more word chars
  • (?:(?!RID:\s+RSS-\d).)* - any single char, zero or more but as many as possible occurrences, that does not start a RID:\s+RSS-\d pattern (block start pattern, RID:, one or more whitespaces, RSS- and a digit)
  • T PUR - a literal string
  • \s+ - one or more whitespaces
  • (?P<num>\d[.,\d]*) - Group "num": a digit and then zero or more commas, dots and digits
  • \s+ - one or more whitespaces
  • (?P<val>\d[\d,]*) - Group "val": a digit and then zero or more commas or digits.

See the Python demo:

import re
text = "RID:  RSS-130                                         SERVICE                        PAGE:              2   \nREPORTING FOR:      100019912 SSSE                      INTSERVICE                    PROC DATE:   15SEP21   \nROLLUP FOR:          100076212 SSSE                          REPORT                        REPORT DATE: 15SEP21   \nENTITY:  1000208212 SSSE                                                                                                 \n                                                                                                                                      \n                                                                                                                                      \n                                                                                                                                      \n                                                                                                                                      \n ACQT                                                                                                               \n                                                                                                                                      \n                                                                                                                                      \n   PUR                                                                                                                         \n     SAME                      10SEP21                 120            12,263,518             19,48.5                        \n                                                                                                                                      \n   T PUR                                              120            12,263,518             19,48.5"
rx = r"PROC DATE:\s*(?P<date>\w+)(?:(?!RID:\s+RSS-\d).)*T PUR\s+(?P<num>\d[.,\d]*)\s+(?P<val>\d[\d,]*)"
m = re.search(rx, text, re.DOTALL)
if m:
    print(m.groupdict())

# => {'date': '15SEP21', 'num': '120', 'val': '12,263,518'}
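
To illustrate why the tempered greedy token matters, here is a minimal sketch that compares it with a plain lazy `.*?` on two records. The shortened sample text and the names `naive_rx` and `tempered_rx` are made up for this illustration; the sketch assumes every record starts with a `RID:  RSS-<digits>` line.

```python
import re

# Hypothetical, shortened sample: the first record (RSS-140) has no "T PUR" line.
text = (
    "RID:  RSS-140   SERVICE   PAGE: 1\n"
    "PROC DATE:   14SEP21\n"
    " ACQT\n"
    "   PUR\n"
    "RID:  RSS-130   SERVICE   PAGE: 2\n"
    "PROC DATE:   15SEP21\n"
    " ACQT\n"
    "   T PUR     120   12,263,518   19,48.5\n"
)

tail = r"T PUR\s+(?P<num>\d[.,\d]*)\s+(?P<val>\d[\d,]*)"

# Plain lazy dot: free to run across the next "RID:" line.
naive_rx = r"PROC DATE:\s*(?P<date>\w+).*?" + tail
# Tempered greedy token: each char is consumed only if no new record starts there.
tempered_rx = r"PROC DATE:\s*(?P<date>\w+)(?:(?!RID:\s+RSS-\d).)*" + tail

naive = re.search(naive_rx, text, re.DOTALL)
tempered = re.search(tempered_rx, text, re.DOTALL)

print(naive.groupdict())     # date from RSS-140 paired with totals from RSS-130
print(tempered.groupdict())  # date and totals both from the RSS-130 record
```

The lazy version pairs the date of the first record with the totals of the second one, while the tempered version cannot cross the `RID:` boundary and therefore only matches inside a single record.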

If you MUST check for T PUR after ACQT, modify the pattern to

(?s)PROC DATE:\s*(?P<date>\w+)(?:(?!RID:\s+RSS-\d|ACQT).)*ACQT(?:(?!RID:\s+RSS-\d).)*T PUR\s+(?P<num>\d[.,\d]*)\s+(?P<val>\d[\d,]*)

See this regex demo.
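
If matches must also be limited to records headed by RID: RSS-130, one possible extension is to anchor the pattern on that header and collect every match with `re.finditer`. This is only a sketch under assumptions: `RECORD_RX`, `extract_rss130`, and the shortened sample text are hypothetical names and data, and each record is assumed to begin with its `RID:` line.

```python
import re

RECORD_RX = re.compile(
    r"RID:\s+RSS-130\b"                  # only records whose RID is RSS-130
    r"(?:(?!RID:\s+RSS-\d).)*?"          # stay inside the current record
    r"PROC DATE:\s*(?P<date>\w+)"
    r"(?:(?!RID:\s+RSS-\d|ACQT).)*ACQT"  # require ACQT before the next record
    r"(?:(?!RID:\s+RSS-\d).)*?"          # still inside the same record
    r"T PUR\s+(?P<num>\d[.,\d]*)\s+(?P<val>\d[\d,]*)",
    re.DOTALL,
)

def extract_rss130(text):
    """Return (date, num, val) tuples for RSS-130 records that contain ACQT."""
    return [(m["date"], m["num"], m["val"]) for m in RECORD_RX.finditer(text)]

# Hypothetical, shortened sample with one RSS-140 and one RSS-130 record.
sample = (
    "RID:  RSS-140   SERVICE   PAGE: 1\n"
    "REPORTING FOR: 1 SSSE   PROC DATE:   14SEP21\n"
    " ACQT\n"
    "   T PUR     7   1,000   1.5\n"
    "RID:  RSS-130   SERVICE   PAGE: 2\n"
    "REPORTING FOR: 1 SSSE   PROC DATE:   15SEP21\n"
    " ACQT\n"
    "   PUR\n"
    "   T PUR     120   12,263,518   19,48.5\n"
)

print(extract_rss130(sample))
# => [('15SEP21', '120', '12,263,518')]
```

Because every gap in the pattern is a tempered token, an RSS-130 record that lacks ACQT or T PUR simply yields no tuple instead of borrowing values from the record that follows it.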

Wiktor Stribiżew