
I ran a PDF through a series of processes to extract the text from it, and that part worked. Now I want to extract specific text from the resulting documents.

The document is set up as a multi-line string (I believe; when I paste it into Word, a paragraph mark appears at the end of each line):

Send Unit: COMPLETE

NOA Selection: 20-0429.07

#For some reason, in this editor, despite the next line having > in front of it, the following line (Pni/Trk) keeps wrapping up to the line above. This doesn't happen in the actual doc.

Pni/Trk: 3 Panel / 3 Track

Panel Stack: STD

Width: 142.0000

The information I want to extract is the value following "NOA Selection:".

I know I can use a regex along these lines:

pattern = re.compile(r'NOA\sSelection:\s\d*-\d*\.\d*')

but I only want the numbers after the NOA selection, especially because NOA Selection will always be the same but the format of the numbers/letters/./-/etc. can vary pretty wildly. This looked promising but it is in Java and I haven't had much luck recreating it in Python.

I think I need to use (?<=...), but haven't been able to implement it.

Also, several of the examples show the string stored in the Python file as a variable, but I'm trying to read it from a .txt file, so I might be going wrong there. This is what I have so far:

with open('export1.txt', 'r') as d:    
    contents = d.read()    
    p = re.compile('(?<=NOA)')
    s = re.search(p, contents)
    print(s.group())

Thank you for any help you can provide.

2 Answers


With the samples you have shown, you could also try the following. For the sample 20-0429.07, I have made the .07 part optional in the regex, so values such as 20-0429 will match as well.

import re
val = """Send Unit: COMPLETE

NOA Selection: 20-0429.07"""
matches = re.findall(r'NOA\s+Selection:\s+(\d+-\d+(?:\.\d+)?)', val)
print(matches)  # ['20-0429.07']

Explanation (the annotated pattern below is for illustration only):

NOA\s+Selection:\s+  ## match "NOA", one or more whitespace characters, "Selection:", then one or more whitespace characters
(\d+-\d+(?:\.\d+)?)  ## capturing group: one or more digits, a hyphen, one or more digits,
                     ## then an optional non-capturing group matching a dot followed by digits
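To make the effect of the optional group concrete, here is a small sketch that runs the same pattern against three inputs. The second and third input strings are made up for illustration (the third uses the value mentioned in the comment below); they are not taken from the original export file.

```python
import re

# Same pattern as above; the trailing (?:\.\d+)? makes the ".07" part optional.
pattern = re.compile(r'NOA\s+Selection:\s+(\d+-\d+(?:\.\d+)?)')

# Value with a decimal part, as in the question's sample.
with_decimal = pattern.findall("NOA Selection: 20-0429.07")

# Hypothetical value without the decimal part.
without_decimal = pattern.findall("NOA Selection: 20-0429")

# Hypothetical value that is not digits-hyphen-digits: nothing is captured.
other_format = pattern.findall("NOA Selection: 2004/z01.03BZ")

print(with_decimal)     # ['20-0429.07']
print(without_decimal)  # ['20-0429']
print(other_format)     # []
```

The last case shows the limit of this pattern: it only fits values shaped like digits-hyphen-digits, so a more permissive capture is needed if the format varies.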
RavinderSingh13
    Thank you for taking the time to reply. I greatly value your explanation. Too often I come here looking for an answer and get what I need without really understanding it. However, your code did not do everything I wanted. Please don't take it personally; I may not have communicated clearly what I needed. Your code as written did bring back all the data I was looking for, but when I modified the data to include extra characters, such as 2004/z01.03BZ, nothing was matched. I was looking for a solution that would pull everything after NOA Selection. – Christopher Brown Apr 22 '21 at 15:14

Keeping it simple, you could use re.findall here:

import re

inp = """Send Unit: COMPLETE

NOA Selection: 20-0429.07"""

matches = re.findall(r'\bNOA Selection: (\S+)', inp)
print(matches)  # ['20-0429.07']
Tim Biegeleisen
  • Hello, thank you for your reply. I ran your code and it worked perfectly. No matter what I put after "NOA Selection", for example "NOA Selection: 2004/z01.03BZ", I got the rest of the string, which is what I wanted. – Christopher Brown Apr 22 '21 at 15:12
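For completeness, here is a hedged sketch that ties the accepted approach back to the two open points in the question: reading from a .txt file and using (?<=...). The attempt in the question printed an empty string because (?<=NOA) is a zero-width assertion: it matches a position, not any text, so something has to follow it in the pattern. The function names and the sample text below are hypothetical, and a temporary file stands in for export1.txt so the sketch can run anywhere.

```python
import re
import tempfile
from pathlib import Path

# Made-up stand-in for the contents of export1.txt.
sample = """Send Unit: COMPLETE
NOA Selection: 2004/z01.03BZ
Pni/Trk: 3 Panel / 3 Track
"""

def noa_selections(text):
    """Return every run of non-whitespace that follows 'NOA Selection: '."""
    return re.findall(r'\bNOA Selection: (\S+)', text)

def noa_selection_lookbehind(text):
    """Same idea with a fixed-width lookbehind; returns the first value or None."""
    m = re.search(r'(?<=NOA Selection: )\S+', text)
    return m.group() if m else None

# Write the sample to a temporary file, then read it back the same way
# the question reads export1.txt.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "export1.txt"
    path.write_text(sample)
    with open(path, 'r') as d:
        contents = d.read()

values = noa_selections(contents)
first = noa_selection_lookbehind(contents)
print(values)  # ['2004/z01.03BZ']
print(first)   # 2004/z01.03BZ
```

Both variants stop at the first whitespace character after the value. The capturing-group form is usually the simpler choice; the lookbehind form works here only because Python's re module requires lookbehinds to have a fixed width, and "NOA Selection: " does.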