0
new_state_line = """
08 FEB 20 HME FEB08 WEBLW HGH @10:08 359.00
08 FEB 20 HME FEB08 WEBLW HGH @10:10 550.00 912.00
18 FEB 20 JJ MAYOR  WINNER 34.06 875.94
28 FEB 20 ADVICE CONFIRMS RBC280W5F82WW  SOMETING GIVEN 3,459.00 4,333.94
02 MAR 20 STAGECOACH SHOW STOP 59.50 4,277.44
"""

I wrote the following regex pattern:

pattern = r'(\d{2}\s[A-Z]{3}\s\d{2}) (.+)\s([0-9,]+\.[0-9]+)\s*(([0-9,]+\.[0-9]+)|$)'

import re

for each_line in new_state_line.split('\n'):
    reg = re.search(pattern, each_line.upper())
    if reg:
        print(reg.group(3), reg.group(4))

This gives the output:

359.00
912.00 
875.94 
4,333.94 
4,277.44 

Expecting to see output similar to:

359.00    None
550.00    912.00
34.06     875.94
3,459.00  4,333.94
59.50     4,277.44

This is in Python. Can someone help me write the regex pattern? I'm quite lost here.

Andrew
Dave
  • It would help if you explained what you need. I assume you want the last two numbers from each line, and that when only one number is found, the second should be treated as missing. Is this correct? Do you need to validate anything that comes before the numbers? Do the numbers always have two decimal places? With `(\d{1,3},)?\d{1,3}\.\d{2}`, you capture each one of those numbers. Could that work? – Andrew Jun 18 '20 at 02:07
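
A minimal sketch of the suggestion in the comment above, assuming only the amounts are needed and not the rest of the line. One adjustment that is not in the comment: the optional thousands group is written as non-capturing, `(?:...)`, because `re.findall` returns the contents of a capturing group rather than the whole match. The `sample` list is just two lines taken from the question's data.

```python
import re

# Pattern from the comment, with the optional thousands group made
# non-capturing so that findall() returns whole matches.
amount = re.compile(r'(?:\d{1,3},)?\d{1,3}\.\d{2}')

# Two lines copied from the question's data, for illustration only.
sample = [
    "08 FEB 20 HME FEB08 WEBLW HGH @10:08 359.00",
    "28 FEB 20 ADVICE CONFIRMS RBC280W5F82WW  SOMETING GIVEN 3,459.00 4,333.94",
]

# One list of amounts per line; a line with a single amount yields one item.
found = [amount.findall(line) for line in sample]
print(found)
```

Times such as `@10:08` are skipped because they contain no decimal point followed by two digits.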

2 Answers

1

Your second capturing group is too greedy, and is eating the first of the two number values you want. Adding a '?' to the quantifier will make it lazy and leave the numeric values you want for your third capturing group. Like so:

(\d{2}\s[A-Z]{3}\s\d{2}) (.+?)\s([0-9,]+\.[0-9]+)\s*(([0-9,]+\.[0-9]+)|$)
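
As a quick check, here is a sketch that runs the lazy pattern against the data from the question. The variable names `rows` and `m` are only for illustration. One thing worth noting: when a line has a single amount, `group(4)` is an empty string (the `$` branch matched), while the inner `group(5)` is `None`, so `group(5)` is the one to print if you want `None` for the missing value.

```python
import re

# Same pattern as above: `.+?` is lazy, so it stops before the first amount.
pattern = r'(\d{2}\s[A-Z]{3}\s\d{2}) (.+?)\s([0-9,]+\.[0-9]+)\s*(([0-9,]+\.[0-9]+)|$)'

new_state_line = """
08 FEB 20 HME FEB08 WEBLW HGH @10:08 359.00
08 FEB 20 HME FEB08 WEBLW HGH @10:10 550.00 912.00
18 FEB 20 JJ MAYOR  WINNER 34.06 875.94
28 FEB 20 ADVICE CONFIRMS RBC280W5F82WW  SOMETING GIVEN 3,459.00 4,333.94
02 MAR 20 STAGECOACH SHOW STOP 59.50 4,277.44
"""

rows = []
for line in new_state_line.split('\n'):
    m = re.search(pattern, line.upper())
    if m:
        # group(3) is the first amount; group(5) is the second amount,
        # or None when the line ends right after the first one.
        rows.append((m.group(3), m.group(5)))
        print(m.group(3), m.group(5))
```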

  • Thanks a lot for that! That was exactly what I was missing. I would say "BINGO!" – Dave Jun 18 '20 at 12:24
  • @Dave, let me ask again, as you did not reply to my comment on your question: does your regex really need to capture the whole line and not just the numbers at the end? If your output is what you said, you shouldn't need such a complex regex. – Andrew Jun 19 '20 at 21:39
  • Yes, I do need them all. – Dave Jun 19 '20 at 22:03
0

Actually, there is a much simpler way, like this:

new_state_line = """
08 FEB 20 HME FEB08 WEBLW HGH @10:08 359.00
08 FEB 20 HME FEB08 WEBLW HGH @10:10 550.00 912.00
18 FEB 20 JJ MAYOR  WINNER 34.06 875.94
28 FEB 20 ADVICE CONFIRMS RBC280W5F82WW  SOMETING GIVEN 3,459.00 4,333.94
02 MAR 20 STAGECOACH SHOW STOP 59.50 4,277.44
"""

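lines = new_state_line.split("\n")
result = []
for line in lines:
    data = line.split()
    if not data:
        continue  # skip blank lines
    try:
        # strip thousands separators so float() accepts values like 3,459.00
        float(data[-2].replace(",", ""))
        result.append((data[-2], data[-1]))
    except (ValueError, IndexError):
        # second-to-last token is not a number: only one amount on this line
        result.append((data[-1],))

print(result)

# [('359.00',), ('550.00', '912.00'), ('34.06', '875.94'), ('3,459.00', '4,333.94'), ('59.50', '4,277.44')]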

Much simpler, right? If you find my answer helpful, please accept it.

Richardson