Match words only if preceded by specific pattern

Question

I have two strings from an NWS bulletin:

LTUS41 KCAR 141558 AAD TMLB Forecast for the National Parks 
KHNX 141001 RECHNX Weather Service San Joaquin Valley

I want to extract one field from each string with a regular expression: "AAD" from the first string and "RECHNX" from the second. I have tried:

( )\w{3} #for the first string

and

\w{6} #for the 2nd string

But these also match every other 3- and 6-character string that comes before the one I want.

Do you have some rule for extracting the text besides its length? A length of `3` or `6` is a very broad criterion and may match other tokens too. – Pushpesh Kumar Rajwanshi, Mar 26 '19 at 16:26
If I understand correctly what you want, you need to add word boundaries `\b` to your search. Use it like this: `\b[a-zA-Z]{3}\b` for 3-character strings. From https://stackoverflow.com/questions/29689516/find-words-of-length-4-using-regular-expression – Jona, Mar 26 '19 at 16:31
If you want to match either 3 or 6 uppercase chars from your example data, you could use word boundaries with an alternation `\b(?:[A-Z]{3}|[A-Z]{6})\b` [example](https://regex101.com/r/Nnx1GX/1) – The fourth bird, Mar 26 '19 at 18:57

glhr · Accepted Answer · 2019-03-27T15:01:24.770

1

Assuming the fields you want to extract are always in capital letters and preceded by 6 digits and a space, this regular expression would do the trick:

(?<=\d{6}\s)[A-Z]+

Demo: https://regex101.com/r/dsDHTs/1
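
As a rough illustration, the lookbehind pattern can be tried in Python against the two sample lines from the question. The variable names are my own, and the sketch assumes the wanted field always directly follows a 6-digit group and one whitespace character.

```python
import re

# Pattern from the answer: one or more capital letters that are
# immediately preceded by six digits and one whitespace character.
FIELD_AFTER_DIGITS = re.compile(r'(?<=\d{6}\s)[A-Z]+')

lines = [
    "LTUS41 KCAR 141558 AAD TMLB Forecast for the National Parks",
    "KHNX 141001 RECHNX Weather Service San Joaquin Valley",
]

# findall returns every match in a line; each sample line yields one.
fields = [FIELD_AFTER_DIGITS.findall(line) for line in lines]
print(fields)  # [['AAD'], ['RECHNX']]
```

The lookbehind is fixed-width (six digits plus one whitespace character), which is what Python's `re` module requires for lookbehind assertions.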

Edit: if you want to match up to two alphanumeric uppercase words preceded by 6 digits and a space, you can use:

(?<=\d{6}\s)([A-Z0-9]+\b)\s(?:([A-Z0-9]+\b))*

Demo: https://regex101.com/r/dsDHTs/5
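
A minimal Python sketch of how the two capture groups might be read back. The helper `extract_fields` is hypothetical and not part of the answer; it only wraps the pattern above and drops the second group when it did not participate in the match.

```python
import re

# Pattern from the answer: the first word after the 6-digit group is
# required; a second uppercase/digit word is captured only if present.
TWO_FIELDS = re.compile(r'(?<=\d{6}\s)([A-Z0-9]+\b)\s(?:([A-Z0-9]+\b))*')

def extract_fields(line):
    """Return the captured words after the 6-digit group (hypothetical helper)."""
    match = TWO_FIELDS.search(line)
    if match is None:
        return []
    # Group 2 is None when no second all-uppercase word follows.
    return [g for g in match.groups() if g is not None]

print(extract_fields("LTUS41 KCAR 141558 AAD RR4HNX"))
print(extract_fields("KHNX 141001 RECHNX Weather Service San Joaquin Valley"))
```

On the first sample line from the question this pattern would also capture `TMLB`, since it is an uppercase word directly after `AAD`.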

If you have a specific list of valid fields, you could also simply use:

(AAD|TMLB|RECHNX|RR4HNX)

Demo: https://regex101.com/r/dsDHTs/3
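
If the list of valid fields lives in code, the alternation could be built from it, roughly as below. The word boundaries and the `re.escape` call are my additions rather than part of the answer's pattern, and `VALID_FIELDS` is an assumed name.

```python
import re

# Known field codes, taken from the alternation in the answer.
VALID_FIELDS = ["AAD", "TMLB", "RECHNX", "RR4HNX"]

# Word boundaries (an addition to the answer's pattern) keep a code from
# matching inside a longer token; re.escape guards any special characters.
KNOWN_FIELD = re.compile(r'\b(?:' + '|'.join(map(re.escape, VALID_FIELDS)) + r')\b')

found = KNOWN_FIELD.findall("LTUS41 KCAR 141558 AAD RR4HNX")
print(found)  # ['AAD', 'RR4HNX']
```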

edited Mar 27 '19 at 15:01

answered Mar 26 '19 at 16:34

glhr


Thanks, this is exactly what I was looking for and that demo is super helpful! – klex52s Mar 26 '19 at 19:20
What if you had a combo like `LTUS41 KCAR 141558 AAD RECHNX` and you only wanted to extract "AAD RECHNX"? – klex52s Mar 26 '19 at 19:38
@klex52s it depends. I could extend the regular expression to include all uppercase words after the 6 digits, but that would also include `TMLB` for example. Are `AAD` and `RECHNX` the only strings you want to match? – glhr Mar 27 '19 at 06:31
Yes, just those two strings. – klex52s Mar 27 '19 at 12:23
Actually, what about this: if my string was `LTUS41 KCAR 141558 AAD RR4HNX`, how would I match both AAD and RR4HNX? I've tried matching after a 6-digit string, but then it only finds AAD. How would I match letters AND digits in a string? – klex52s Mar 27 '19 at 13:57

score 0 · Answer 2 · answered Mar 26 '19 at 16:32


Since the substring you want to extract is a word that follows a number, separated by a space, you can use re.search with the following regex (given your input stored in s):

re.search(r'\b\d+ (\w+)', s).group(1)
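
Since `re.search` returns `None` when nothing matches, calling `.group(1)` directly would raise an `AttributeError` on such input. A small sketch with a guard, where `word_after_number` is a hypothetical wrapper name:

```python
import re

def word_after_number(s):
    """Return the first word that follows a run of digits, or None (hypothetical helper)."""
    match = re.search(r'\b\d+ (\w+)', s)
    return match.group(1) if match else None

print(word_after_number("LTUS41 KCAR 141558 AAD TMLB Forecast for the National Parks"))
print(word_after_number("KHNX 141001 RECHNX Weather Service San Joaquin Valley"))
print(word_after_number("no digits here"))
```

The `\b` before `\d+` is what keeps the `41` inside `LTUS41` from being treated as the number, because there is no word boundary between `S` and `4`.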


blhsing


score 0 · Answer 3 · answered Mar 26 '19 at 16:53

To read the first four groups of word characters from each line, you can use a pattern like (\w+) (\w+) (\w+) (\w+).

Then take group 4 from the first line and group 3 from the second line.

Look at the following program. It prints four groups from each source line:

import re

txt = """LTUS41 KCAR 141558 AAD TMLB Forecast for the National Parks
KHNX 141001 RECHNX Weather Service San Joaquin Valley"""

n = 0
pat = re.compile(r'(\w+) (\w+) (\w+) (\w+)')
for line in txt.splitlines():
    n += 1
    print(f'{n:2}: {line}')
    mtch = pat.search(line)
    if mtch:
        gr = [ mtch.group(i) for i in range(1, 5) ]
        print(f'    {gr}')

The result is:

 1: LTUS41 KCAR 141558 AAD TMLB Forecast for the National Parks 
    ['LTUS41', 'KCAR', '141558', 'AAD']
 2: KHNX 141001 RECHNX Weather Service San Joaquin Valley
    ['KHNX', '141001', 'RECHNX', 'Weather']
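
Building on the program above, picking the wanted group per line could look like the following sketch. The `wanted_group` mapping is hypothetical: it hard-codes that line 1 uses group 4 and line 2 uses group 3, and a bulletin with a different layout would need another rule.

```python
import re

txt = """LTUS41 KCAR 141558 AAD TMLB Forecast for the National Parks
KHNX 141001 RECHNX Weather Service San Joaquin Valley"""

pat = re.compile(r'(\w+) (\w+) (\w+) (\w+)')

# Hypothetical mapping: line number -> index of the group to keep.
wanted_group = {1: 4, 2: 3}

fields = []
for n, line in enumerate(txt.splitlines(), start=1):
    mtch = pat.search(line)
    if mtch:
        fields.append(mtch.group(wanted_group[n]))

print(fields)  # ['AAD', 'RECHNX']
```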
