After looking at a few similar questions, I have not been able to successfully implement a substring split on my data. In my case, I have a list of strings, and each string contains a substring I need to extract. The data is NBA positions: I need to pull out the positions (any of 'PG', 'SG', 'SF', 'PF', or 'C') from each string. Some strings have more than one position. Here is the data.

text = ['Chi\xa0SG, SF\xa0\xa0DTD','Cle\xa0PF']

The code should ideally look at the first string, 'Chi\xa0SG, SF\xa0\xa0DTD', and return the two positions, ['SG', 'SF']. It should look at the second string and return ['PF'].

Bobe Kryant
    can you add complete expected output for clarity? for ex: is this what you are looking for? `[re.findall(r'\b(PG|SG|SF|PF|C)\b', s) for s in text]` – Sundeep Oct 17 '16 at 04:55

2 Answers

Leverage (zero width) lookarounds:

(?<!\w)(?:PG|SG|SF|PF|C)(?!\w)
  • (?<!\w) is a zero-width negative lookbehind, making sure the match is not preceded by a word character (letter, digit, or underscore)

  • (?:PG|SG|SF|PF|C) is a non-capturing group matching any of the desired positions; the group is needed so that both lookarounds apply to every alternative rather than only to PG and C

  • (?!\w) is a zero-width negative lookahead, making sure the match is not followed by a word character

Example:

In [6]: import re

In [7]: s = 'Chi\xa0SG, SF\xa0\xa0DTD'

In [8]: re.findall(r'(?<!\w)(?:PG|SG|SF|PF|C)(?!\w)', s)
Out[8]: ['SG', 'SF']
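
To run this over the whole list from the question, something like the following sketch should do. It assumes Python 3 (where '\xa0' is not a word character); the names POSITION_RE and extract_positions are made up for illustration. The alternation is wrapped in a non-capturing group because `|` binds more loosely than the lookarounds: without the group, the lookbehind guards only PG and the lookahead guards only C, which the last few lines demonstrate.

```python
import re

# Non-capturing group so both lookarounds apply to every alternative.
POSITION_RE = re.compile(r'(?<!\w)(?:PG|SG|SF|PF|C)(?!\w)')

def extract_positions(s):
    """Return the standalone position codes found in s, in order."""
    return POSITION_RE.findall(s)

text = ['Chi\xa0SG, SF\xa0\xa0DTD', 'Cle\xa0PF']
positions = [extract_positions(s) for s in text]
print(positions)  # [['SG', 'SF'], ['PF']]

# For comparison: the ungrouped pattern picks 'SF' out of 'SFO',
# because the lookahead is attached only to the final alternative, C.
ungrouped = re.findall(r'(?<!\w)PG|SG|SF|PF|C(?!\w)', 'SFO\xa0C')
grouped = extract_positions('SFO\xa0C')
print(ungrouped)  # ['SF', 'C']
print(grouped)    # ['C']
```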
heemayl

heemayl's response is the most correct, but you could probably get away with splitting on the non-breaking spaces, taking the second field, and splitting that field on commas.

s = 'Chi\xa0SG, SF\xa0\xa0DTD'
# The second \xa0-delimited field holds the positions, e.g. 'SG, SF'
fin = [p.strip() for p in s.split('\xa0')[1].split(',')]

I can't test this at the moment as I'm on a Chromebook, but it should work.
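
For what it's worth, here is a rough sketch of a split-based variant run against both sample strings. It assumes the positions always sit in the second '\xa0'-delimited field, which holds for the two strings in the question but may not hold for other rows; VALID and positions_by_split are hypothetical names. It splits on '\xa0' before splitting on commas because the comma split alone leaves ' SF\xa0\xa0DTD' as the second piece, whose last two characters are 'TD'.

```python
VALID = {'PG', 'SG', 'SF', 'PF', 'C'}

def positions_by_split(s):
    """Split-based extraction; assumes positions are in field 1."""
    fields = s.split('\xa0')
    if len(fields) < 2:
        return []
    # Split the position field on commas and drop anything unexpected.
    candidates = [p.strip() for p in fields[1].split(',')]
    return [p for p in candidates if p in VALID]

text = ['Chi\xa0SG, SF\xa0\xa0DTD', 'Cle\xa0PF']
result = [positions_by_split(s) for s in text]
print(result)  # [['SG', 'SF'], ['PF']]
```

The regex approach is more forgiving if the layout of the strings varies; this one only works while the field order stays fixed.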