Whitespace followed by brackets (non lazy) in Python using regex

Question

I am trying to do the following: for each string in a list, extract everything before the first occurrence (there may be more than one) of a whitespace character followed by a round bracket "(".

I have tried the following:

re.findall("(.*)\s\(", line)

but it gives the wrong results for e.g. the following strings:

Carrollton (University of West Georgia)[2]*Dahlonega (North Georgia College & State University)[2]

Thanks in advance

For the following strings, what output do you expect, and what is actually output? — Austin, Apr 13 '20 at 14:48
thanks, not sure I understand what 'r' is in your suggestion. If I try this """re.findall("(\S+)\s+\(", line)""" I get the same problem as before — Bernardino Sassoli de' Bianchi, Apr 13 '20 at 14:52
@Austin, thanks. The actual output is: "CarrolltonGeorgia)[2]*Dahlonega". The expected output is "Carrollton". — Bernardino Sassoli de' Bianchi, Apr 13 '20 at 14:55

Tshiteej · Answer 1 · 2020-04-13T15:41:30.337

1

You can use lookahead for this. Try this regex:

[a-z A-Z]+(?=[ ]+[\(]+)

edited Apr 13 '20 at 15:41

answered Apr 13 '20 at 14:57

Tshiteej


Thanks, the problem with that is that I get 'Vista' as the output of 'Isla Vista (University of California, Santa Barbara)[2]'. I am trying instead to get 'Isla Vista'. – Bernardino Sassoli de' Bianchi Apr 13 '20 at 15:05

score 1 · Accepted Answer · answered Apr 13 '20 at 18:18

To extract anything before the first occurrence of a whitespace char followed by a round bracket "(", you may use re.search (this method returns the first match only):

re.search(r'^(.*?)\s\(', text, re.S).group(1)
re.search(r'^\S*(?:\s(?!\()\S*)*', text).group()

See the regex #1 demo and the regex #2 demo. Note that the second one, though longer, is much more efficient since it follows the unroll-the-loop principle.
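The efficiency claim can be checked locally with a rough timing sketch. The sample string is taken from the question; the variable names and the repetition count are arbitrary assumptions. Absolute timings depend on the machine and Python version, and the gap may be small on short inputs like this one.

```python
import re
import timeit

# Sample string taken from the question.
text = 'Isla Vista (University of California, Santa Barbara)[2]'

lazy = re.compile(r'^(.*?)\s\(', re.S)           # regex #1
unrolled = re.compile(r'^\S*(?:\s(?!\()\S*)*')   # regex #2

# Both patterns should extract the same prefix from this input.
lazy_result = lazy.search(text).group(1)
unrolled_result = unrolled.search(text).group()
print(lazy_result, '|', unrolled_result)

# Rough timing only; the repetition count is an arbitrary choice.
n = 100_000
t_lazy = timeit.timeit(lambda: lazy.search(text), number=n)
t_unrolled = timeit.timeit(lambda: unrolled.search(text), number=n)
print('lazy:     {:.3f}s'.format(t_lazy))
print('unrolled: {:.3f}s'.format(t_unrolled))
```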

Details

^ - start of string
(.*?) - Group 1: any 0+ chars as few as possible,
\s\( - a whitespace and ( char.
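To see why the lazy group matters, here is a minimal sketch comparing the greedy pattern from the question with the lazy one, using the sample string from the question. The variable names are illustrative only.

```python
import re

# String from the question; it contains two whitespace + "(" sequences.
line = ('Carrollton (University of West Georgia)[2]*'
        'Dahlonega (North Georgia College & State University)[2]')

# Greedy: .* first consumes the whole string, then backtracks,
# so the match ends at the LAST whitespace + "(".
greedy = re.search(r'(.*)\s\(', line).group(1)

# Lazy: .*? grows one character at a time,
# so the match ends at the FIRST whitespace + "(".
lazy = re.search(r'^(.*?)\s\(', line, re.S).group(1)

print(greedy)  # Carrollton (University of West Georgia)[2]*Dahlonega
print(lazy)    # Carrollton
```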

Or, better:

^\S* - start of string and then 0+ non-whitespace chars
(?:\s(?!\()\S*)* - 0 or more occurrences of
- \s(?!\() - a whitespace char not followed with (
- \S* - 0+ non-whitespace chars

See Python demo:

import re

strs = [
    'Isla Vista (University of California, Santa Barbara)[2]',
    'Carrollton (University of West Georgia)[2]',
    'Dahlonega (North Georgia College & State University)[2]',
]
# Unrolled pattern: runs of non-whitespace joined by whitespace not followed by "("
rx = re.compile(r'^\S*(?:\s(?!\()\S*)*', re.S)
for s in strs:
    m = rx.search(s)
    if m:
        print('{} => {}'.format(s, m.group()))
    else:
        print("{}: No match!".format(s))
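The two patterns differ when a string contains no whitespace followed by "(": regex #1 finds no match, so calling .group(1) on the result of re.search raises AttributeError, while regex #2 matches the entire string. Below is a minimal sketch of a helper that guards the first case. The name prefix_before_paren, the fallback of returning the whole string, and the 'Athens' sample are assumptions for illustration, not part of the answer above.

```python
import re

LAZY = re.compile(r'^(.*?)\s\(', re.S)           # regex #1
UNROLLED = re.compile(r'^\S*(?:\s(?!\()\S*)*')   # regex #2


def prefix_before_paren(text):
    """Return the text before the first whitespace + "(" (hypothetical helper).

    Falls back to the whole string when no such sequence exists (an assumption).
    """
    m = LAZY.search(text)
    return m.group(1) if m else text


samples = [
    'Isla Vista (University of California, Santa Barbara)[2]',
    'Athens',  # made-up input with no bracket at all
]
for s in samples:
    # Show the guarded regex #1 result next to the regex #2 result.
    print(repr(prefix_before_paren(s)), repr(UNROLLED.search(s).group()))
```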

Wiktor, thank you so much, this was very helpful and a great answer. — Bernardino Sassoli de' Bianchi, Apr 14 '20 at 18:29
