Regular expressions (regex): save parts of a sentence

Question

I am new to Python and regular expressions, and I have been trying to find a way to parse a sentence so that I can take parts of it and assign them to their own variables.

An example sentence is: Laura Compton, a Stock Broker from Los Angeles, California

My objective is to have: name = "Laura Compton" (this is the easy one; I can target the anchor link with no problem), position = "Stock Broker", city = "Los Angeles", state = "California".

All of the sentences I need to iterate over follow the same pattern: the name is always in the anchor tag, and the position always follows the comma after the closing anchor. Sometimes it is preceded by "a" or "an", so I would like to strip those off. The city and state always follow the word "from".

martineau · Answer 1 · 2017-12-01T13:06:22.313

2

You can use named groups within patterns to capture substrings, which makes referring to them easier and the code doing so slightly more readable:

import re

data = ['Laura Compton, a Stock Broker from Los Angeles, California',
        'Miles Miller, a Soccer Player from Seattle, Washington']

pattern = (r'^(?P<name>[^,]+)\, an? (?P<position>.+) from '
           r'(?P<city>[^,]+)\, +(?P<state>.+)')

FIELDS = 'name', 'position', 'city', 'state'

for sentence in data:
    matches = re.search(pattern, sentence)
    name, position, city, state = matches.group(*FIELDS)
    print(', '.join([name, position, city, state]))

Output produced from sample data:

Laura Compton, Stock Broker, Los Angeles, California
Miles Miller, Soccer Player, Seattle, Washington
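
As a minimal sketch of the same idea, the named groups can also be collected into a dictionary with `groupdict()`, and sentences that do not match can be skipped instead of raising an `AttributeError` on `None`. The `parse_sentence` helper, the `PATTERN` name, and the second and third sample strings are assumptions made for illustration; they are not part of the original data.

```python
import re

# Same pattern as above, written without the (optional) backslashes before the commas.
PATTERN = (r'^(?P<name>[^,]+), an? (?P<position>.+) from '
           r'(?P<city>[^,]+), +(?P<state>.+)')


def parse_sentence(sentence):
    """Return a dict of the named groups, or None if the sentence does not match."""
    match = re.search(PATTERN, sentence)
    return match.groupdict() if match else None


samples = ['Laura Compton, a Stock Broker from Los Angeles, California',
           'Ann Lee, an Engineer from Austin, Texas',      # hypothetical "an" case
           'This sentence does not follow the pattern']    # hypothetical non-match

for sentence in samples:
    print(parse_sentence(sentence))
```

Returning `None` for a non-matching sentence leaves it to the caller to decide whether to skip the line, log it, or stop.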

A.M. Kuchling wrote a good tutorial titled Regular Expression HOWTO that you ought to check out.

edited Dec 01 '17 at 13:06

answered Nov 29 '17 at 00:33


Might be smart to compile here if `pattern` is being used repetitively. – Brad Solomon Nov 29 '17 at 00:39
@BradSolomon: Not so much, because the `re` module automatically caches compiled versions of the most recently used regexes, so how often one is used is often irrelevant. – martineau Nov 29 '17 at 00:42
I guess you are right @martineau. So really the only reason in 3.x to use compile is the second reason given [here](https://stackoverflow.com/a/47269110/7954504)? – Brad Solomon Nov 29 '17 at 00:49

@Brad: I suppose so. Personally, I seldom bother because it usually isn't worth the trouble. Compiling regexes is usually a very insignificant part of the overall processing being done, so even if it didn't automatically cache them and it happened many times I wouldn't be too worried about it. – martineau Nov 29 '17 at 00:58
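
To illustrate the point discussed in these comments, here is a rough sketch that compares the module-level call with an explicitly compiled pattern. It reuses the pattern from the answer; the names `compiled`, `via_module`, and `via_compiled` are only illustrative.

```python
import re

pattern = (r'^(?P<name>[^,]+)\, an? (?P<position>.+) from '
           r'(?P<city>[^,]+)\, +(?P<state>.+)')

# Compile once up front; the compiled object exposes search() directly.
compiled = re.compile(pattern)

sentence = 'Miles Miller, a Soccer Player from Seattle, Washington'

via_module = re.search(pattern, sentence)   # module-level call, relies on re's internal cache
via_compiled = compiled.search(sentence)    # method on the precompiled pattern object

# Both approaches capture the same fields.
print(via_module.groupdict() == via_compiled.groupdict())
```

Both forms produce the same match, so choosing between them is largely a matter of style unless profiling shows otherwise.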

score 1 · Answer 2 · answered Nov 28 '17 at 23:38

You can try this:

import re
s = "Laura Compton, a Stock Broker from Los Angeles, California"
new_s = re.findall(r'^[a-zA-Z\s]+|(?<=a\s)[a-zA-Z\s]+(?=from)|(?<=an\s)[a-zA-Z\s]+(?=from)|(?<=from\s)[a-zA-Z\s]+(?=,)|(?<=,\s)[a-zA-Z\s]+$', s)
headers = ['name', 'title', 'city', 'state']
data = {a:b for a, b in zip(headers, new_s)}

Output:

{'city': 'Los Angeles', 'state': 'California', 'name': 'Laura Compton', 'title': 'Stock Broker '}
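
Note that the title in the output above keeps a trailing space ('Stock Broker '), because the character class also matches the whitespace just before "from". Here is a small sketch of one way to tidy that up, assuming it is acceptable to strip surrounding whitespace from every captured piece; the names `parts` and `header` are illustrative.

```python
import re

s = "Laura Compton, a Stock Broker from Los Angeles, California"

# Same alternation as above, split across two raw strings for readability.
parts = re.findall(
    r'^[a-zA-Z\s]+|(?<=a\s)[a-zA-Z\s]+(?=from)|(?<=an\s)[a-zA-Z\s]+(?=from)'
    r'|(?<=from\s)[a-zA-Z\s]+(?=,)|(?<=,\s)[a-zA-Z\s]+$', s)

headers = ['name', 'title', 'city', 'state']

# strip() removes the whitespace left around each captured piece.
data = {header: part.strip() for header, part in zip(headers, parts)}
print(data)
```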
