Multi line string matcher with optional intervening phrase

Question

I would like to grab text that is spread across two lines.

For example:

PO Number Dept.number
4000813852 7

I would like to get PO Number 4000813852. The data is laid out like a table, but within the whole document it appears as normal text.

I have used re.MULTILINE with a pattern like r'PO Number.*\n[0-9]+'

It works in this case, but it is not the best solution, because PO Number may appear in the middle of the header line, as in:

Invoice Number PO Number Dept.number
123456666     4000813852  7

The latter case is underspecified and a bad match for regex: you would somehow have to guess which number belongs to the PO, and we would need far more data to fit anything to it. You are also capturing far too much text with your regex, as you do not use capturing groups. Write a line-based/column-based parser and feed it the part that starts with the complete line containing *PO Number* and runs to the end of the next line. – Patrick Artner, Aug 05 '18 at 12:18
This is a near-duplicate of [Regular expression matching a multiline block of text](https://stackoverflow.com/questions/587345/regular-expression-matching-a-multiline-block-of-text). Your only complication is adding an optional extra expression for 'Dept.number' in between 'PO Number' and \d+ – smci, Aug 05 '18 at 23:24
Actually, unless you can apply special knowledge like "PO Numbers are 10 digits, dept numbers are 1-3" to a multiline regex, @PatrickArtner is right. First, capture the field names from the first line. Then, figure out which fields from the second line you want. – smci, Aug 06 '18 at 01:48
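
The line-based approach suggested in these comments could be sketched roughly as follows. This is only an illustration, not code from either commenter: the function name parse_header_and_values and the KNOWN_FIELDS list are hypothetical, and the sketch assumes that the header names are known in advance and that no value contains whitespace.

```python
import re

# Hypothetical list of header names expected in the document (an assumption).
KNOWN_FIELDS = ["Invoice Number", "PO Number", "Dept.number"]


def parse_header_and_values(text, fields=KNOWN_FIELDS):
    """Pair each header line containing 'PO Number' with the line after it."""
    header_re = re.compile("|".join(re.escape(name) for name in fields))
    lines = text.splitlines()
    records = []
    for header, values in zip(lines, lines[1:]):
        names = header_re.findall(header)   # field names, in column order
        if "PO Number" not in names:
            continue                        # not a header line we care about
        cells = values.split()              # values are whitespace-separated
        if len(cells) == len(names):        # only accept a clean 1:1 mapping
            records.append(dict(zip(names, cells)))
    return records


first = "PO Number Dept.number\n4000813852 7"
second = "Invoice Number PO Number Dept.number\n123456666     4000813852  7"

print(parse_header_and_values(first)[0]["PO Number"])   # 4000813852
print(parse_header_and_values(second)[0]["PO Number"])  # 4000813852
```

Because the value is picked by column position rather than by digit count, this does not depend on PO numbers having a particular length.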

score 2 · Accepted Answer · answered Aug 05 '18 at 17:23

You can do this with two capture groups and the re.DOTALL option enabled. The expression assumes that the number you are interested in is the only one with 10 digits in your text.

The expression is:

(PO\sNumber).*(\d{10})

Python snippet:

import re

first_string = """PO Number Dept.number
4000813852 7"""

second_string = """Invoice Number PO Number Dept.number
123456666     4000813853  7"""

# re.DOTALL lets '.' match the newline, so the greedy '.*' can span both lines.
pattern = r'(PO\sNumber).*(\d{10})'

PO_first = re.search(pattern, first_string, re.DOTALL)
print(PO_first.group(1) + " " + PO_first.group(2))

PO_second = re.search(pattern, second_string, re.DOTALL)
print(PO_second.group(1) + " " + PO_second.group(2))

Output:

PO Number 4000813852
PO Number 4000813853
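
To see why the 10-digit assumption matters, here is a small sketch with an invented input. The "Account" column and its 12-digit value are hypothetical and not part of the question. Because `.*` is greedy, the expression above takes the last run of 10 digits it can find, even when that run sits inside a longer number. Adding `\b` word boundaries around the group is one possible way to require a number of exactly 10 digits.

```python
import re

# Hypothetical input: the "Account" column and its 12-digit value are invented.
text = """PO Number Account
4000813852 123456789012"""

# Expression from the answer: greedy '.*' backtracks from the end of the text.
greedy = re.search(r'(PO\sNumber).*(\d{10})', text, re.DOTALL)

# Variant with word boundaries, so only a standalone 10-digit number matches.
bounded = re.search(r'(PO\sNumber).*\b(\d{10})\b', text, re.DOTALL)

print(greedy.group(2))   # 3456789012 (tail of the 12-digit number)
print(bounded.group(2))  # 4000813852
```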

score 1 · Answer 2 · answered Aug 05 '18 at 20:36

With a single regex:

import re

data="""PO Number Dept.number
    4000813852 7
    Invoice Number PO Number Dept.number
    123456666     4000813852  7
    """

# The dot in Dept\.number is escaped so that it only matches a literal '.'.
print(re.findall(r"(PO Number)\s*Dept\.number\s*(?:(?:\d+)\s+(\d+)|(\d+))\s+\d", data))

Output:

[('PO Number', '', '4000813852'), ('PO Number', '4000813852', '')]

I don't use re.MULTILINE, because \s matches a newline, too.
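
A quick sketch of that point, using the first sample from the question: `\s` crosses the line break without any flag, while re.MULTILINE only changes where `^` and `$` are allowed to match.

```python
import re

text = "PO Number Dept.number\n4000813852 7"

# \s+ matches the newline between the two lines; no flag is needed.
no_flags = re.search(r"Dept\.number\s+(\d+)", text)

# With re.MULTILINE, ^ also matches at the start of the second line.
with_multiline = re.search(r"^(\d+)", text, re.MULTILINE)

# Without it, ^ matches only at the very start of the string.
without_multiline = re.search(r"^(\d+)", text)

print(no_flags.group(1))        # 4000813852
print(with_multiline.group(1))  # 4000813852
print(without_multiline)        # None
```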

answered Aug 05 '18 at 20:36

kantal


It's not one string, it's two different ones. – Paolo Aug 06 '18 at 06:28
@UnbearableLightness "it's not one string," Not a problem, you can apply my regex for a single string, too. – kantal Aug 06 '18 at 10:55
