
My name regex has proven faulty on a couple of entries:

find_name = re.search(r'^[^\d]*', clean_content)

The above would output something like this on a few entries:

TERRI BROWSING APT A # current output

So I need a way to trim that off, since it's tripping up the rest of my program. The only identifier I can think of is the second space: if I could somehow detect it, I could remove every character after it.

I only need the first and last name; i.e.

TERRI BROWSING # desired

After removing those characters I could just .strip() the trailing space; I just need a way to remove everything after the second space, or maybe a way to capture only two words and nothing more.

Dr Upvote
  • Maybe you need to also validate the first two words that must be uppercase ASCII letters? `re.match("[A-Z]+\s+[A-Z]+", s)`? Otherwise, `\S` based regex does not seem necessary, you may as well use `split`. – Wiktor Stribiżew Jul 29 '19 at 20:33
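
The validation idea in the comment above could be sketched as follows. The helper name extract_name is hypothetical, and returning None for non-matching input is an assumption about how bad entries should be handled, not something stated in the thread.

```python
import re

def extract_name(s):
    """Return the first two uppercase ASCII words of s, or None (hypothetical helper)."""
    # re.match anchors at the start of the string, so no leading ^ is needed.
    m = re.match(r'[A-Z]+\s+[A-Z]+', s)
    return m.group() if m else None

print(extract_name('TERRI BROWSING APT A'))
print(extract_name('123 MAIN ST'))
```

The first call prints the two-word name; the second prints None because the entry does not start with an uppercase word.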

3 Answers


You don't even need regex since you can use simple splits and joins:

text = 'TERRI BROWSING APT A'
' '.join(text.split(' ')[0:2])
# 'TERRI BROWSING'
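
A slightly more forgiving variant, as a sketch: split(' ') yields empty strings when two spaces appear in a row, whereas split() with no separator splits on any run of whitespace. The helper name first_two_words is hypothetical.

```python
def first_two_words(text):
    """Return the first two whitespace-separated words of text (hypothetical helper)."""
    # With sep=None, split() collapses runs of whitespace and ignores leading
    # and trailing whitespace; maxsplit=2 stops scanning after the second word.
    return ' '.join(text.split(None, 2)[:2])

print(first_two_words('TERRI BROWSING APT A'))   # TERRI BROWSING
print(first_two_words('TERRI  BROWSING APT A'))  # TERRI BROWSING
print(first_two_words('TERRI'))                  # TERRI
```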
Carsten

You can do:

^\S+\s+\S+
  • ^ matches the start of the string

  • \S+ matches one or more non-whitespace characters

  • \s+ matches one or more whitespace characters


Also, assuming the whitespace is actually a space character, you can find the index of the second space using str.find and slice the string up to that point:

text[:text.find(' ', text.find(' ') + 1)] 

Example:

In [326]: text = 'TERRI BROWSING APT A'

In [327]: re.search(r'^\S+\s+\S+', text).group()
Out[327]: 'TERRI BROWSING'

In [338]: text[:text.find(' ', text.find(' ') + 1)]
Out[338]: 'TERRI BROWSING'
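
Both one-liners assume the entry has at least two words followed by more text. re.search returns None when nothing matches, so calling .group() on it raises AttributeError, and str.find returns -1 when there is no second space, which makes the slice silently drop the last character. A guarded sketch, with hypothetical helper names and the assumption that short entries should be returned as they are:

```python
import re

def name_by_regex(text):
    """First two words via regex; falls back to the stripped text (hypothetical helper)."""
    match = re.search(r'^\S+\s+\S+', text)
    return match.group() if match else text.strip()

def name_by_find(text):
    """First two words via str.find; returns text unchanged if there is no second space."""
    first = text.find(' ')
    if first == -1:
        return text
    second = text.find(' ', first + 1)
    return text if second == -1 else text[:second]

print(name_by_regex('TERRI BROWSING APT A'))  # TERRI BROWSING
print(name_by_find('TERRI BROWSING'))         # TERRI BROWSING
```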
heemayl

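If you want to remove the rest, you could match zero or more non-whitespace characters (\S*) followed by a space, twice, and capture that in a group. Then match any character zero or more times and replace the whole match with the first capturing group using re.sub. Note that the group includes the second space, so the result keeps a trailing space that .strip() can remove.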

^(\S* \S* ).*


import re

print(re.sub(r"^(\S* \S* ).*", r"\1", "TERRI BROWSING APT A"))

Result

TERRI BROWSING
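
As a sketch of a variant that needs no .strip() afterwards, the capturing group can stop before the second space. The pattern below is an adaptation, not the one from this answer; the helper name trim_to_name is hypothetical.

```python
import re

def trim_to_name(text):
    """Keep the first two words and drop everything after them (hypothetical helper)."""
    # The group covers word, whitespace, word; .* consumes the rest of the line.
    # If the pattern does not match, re.sub returns text unchanged.
    return re.sub(r'^(\S+\s+\S+).*', r'\1', text)

print(repr(trim_to_name('TERRI BROWSING APT A')))  # 'TERRI BROWSING'
print(repr(trim_to_name('TERRI')))                 # 'TERRI'
```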

The fourth bird