
I am trying to do the following: from a list of strings, extract everything before the first occurrence (there may be more than one) of a whitespace character followed by an opening round bracket "(".

I have tried the following:

re.findall(r"(.*)\s\(", line)

but it gives the wrong results for, e.g., the following string:

Carrollton (University of West Georgia)[2]*Dahlonega (North Georgia College & State University)[2]

Thanks in advance

Wiktor Stribiżew

2 Answers


You can use lookahead for this. Try this regex:

[a-z A-Z]+(?=[ ]+[\(]+)
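
A minimal sketch of how this lookahead pattern behaves with re.findall on the string from the question. One caveat, stated as an assumption worth checking against your own data: the character class only admits ASCII letters and spaces, so a name containing digits, hyphens, or accented letters would be truncated.

```python
import re

# Letters and spaces, matched only when followed by space(s) and "(".
# The lookahead checks for the bracket without consuming it.
pattern = re.compile(r'[a-z A-Z]+(?=[ ]+[\(]+)')

line = ('Carrollton (University of West Georgia)[2]*'
        'Dahlonega (North Georgia College & State University)[2]')

names = pattern.findall(line)
print(names)
```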
Tshiteej

To extract anything before the first occurrence of a whitespace char followed by a round bracket ( you may use re.search (this method is meant to extract the first match only):

re.search(r'^(.*?)\s\(', text, re.S).group(1)
re.search(r'^\S*(?:\s(?!\()\S*)*', text).group()

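See the regex #1 demo and the regex #2 demo. Note that the second one, though longer, is much more efficient since it follows the unroll-the-loop principle: it consumes runs of non-whitespace chars in one step instead of expanding the lazy .*? one char at a time.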
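
The two expressions also differ when the string contains no whitespace followed by (. The sketch below illustrates this: with the first pattern re.search returns None, so calling .group(1) directly would raise AttributeError, while the second pattern matches the whole string. The helper names before_bracket_lazy and before_bracket_unrolled are hypothetical and exist only for this illustration.

```python
import re

LAZY = re.compile(r'^(.*?)\s\(', re.S)
UNROLLED = re.compile(r'^\S*(?:\s(?!\()\S*)*')

def before_bracket_lazy(text):
    """Text before the first whitespace + '(', or None if there is none."""
    m = LAZY.search(text)
    return m.group(1) if m else None

def before_bracket_unrolled(text):
    """Text before the first whitespace + '(', or the whole text."""
    # This pattern can match an empty string, so search() never returns None.
    return UNROLLED.search(text).group()

with_bracket = 'Carrollton (University of West Georgia)[2]'
without_bracket = 'Carrollton'

print(before_bracket_lazy(with_bracket), before_bracket_unrolled(with_bracket))
print(before_bracket_lazy(without_bracket), before_bracket_unrolled(without_bracket))
```

Which behavior is preferable depends on whether a missing bracket should count as "no result" or as "keep the string as is".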

Details

  • ^ - start of string
  • (.*?) - Group 1: any 0+ chars, as few as possible (with re.S, . also matches line breaks),
  • \s\( - a whitespace and ( char.

Or, better:

  • ^\S* - start of string and then 0+ non-whitespace chars
  • (?:\s(?!\()\S*)* - 0 or more occurrences of
    • \s(?!\() - a whitespace char not followed by (
    • \S* - 0+ non-whitespace chars
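
The breakdown above can also be kept next to the pattern itself. This is a sketch using re.VERBOSE, which ignores whitespace in the pattern and allows # comments; the name UNROLLED_VERBOSE is hypothetical.

```python
import re

UNROLLED_VERBOSE = re.compile(r'''
    ^ \S*            # start of string, then 0+ non-whitespace chars
    (?:              # 0 or more occurrences of:
        \s (?!\()    #   a whitespace char not followed by an opening bracket
        \S*          #   0+ non-whitespace chars
    )*
''', re.VERBOSE)

sample = 'Isla Vista (University of California, Santa Barbara)[2]'
result = UNROLLED_VERBOSE.match(sample).group()
print(result)
```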

See Python demo:

import re

strs = [
    'Isla Vista (University of California, Santa Barbara)[2]',
    'Carrollton (University of West Georgia)[2]',
    'Dahlonega (North Georgia College & State University)[2]',
]
# re.S is not required here (the pattern has no "."), but it is harmless.
rx = re.compile(r'^\S*(?:\s(?!\()\S*)*', re.S)
for s in strs:
    m = rx.search(s)
    # The pattern can match an empty string, so the else branch is only a safeguard.
    if m:
        print('{} => {}'.format(s, m.group()))
    else:
        print("{}: No match!".format(s))
Wiktor Stribiżew