1

Revision of prior question:

How can I replace all ", " (i.e. comma then space) with "_" except when ", " (comma then space) is followed by the word "LLC" or "Inc" (then do nothing)?

I want to change:

  1. "TEXAS ENERGY MUTUAL, LLC, BOBBY GILLIAM, STEVE PEREIRA and ANDY STITT"
  2. "Grape, LLC, Andrea Gray, Jack Smith"
  3. "Stephen Winters, Apple, pear, Inc, Sarah Smith"

To this:

  1. "TEXAS ENERGY MUTUAL, LLC_BOBBY GILLIAM_STEVE PEREIRA_ANDY STITT"
  2. "Grape, LLC_Andrea Gray_Jack Smith"
  3. "Stephen Winters_Apple_pear, Inc_Sarah Smith"

I thought it would start with some variation of the code below but I cannot figure out the except conditions.

df['Column_Name'] = df['Column_Name'].str.replace(', ','_')

Cheers!

Andrew
  • You can find the indices of `, `, `, LLC` and `, Inc`, then keep only the `, ` indices that do not intersect with the last two. I recommend converting the indices to a `set`. – Yossi Levi Oct 25 '20 at 05:32
  • Is the number of spaces fixed to one after `,`? Or can there be 0, 1 or 2 or more spaces? `, (?!Inc|LLC)` won't work then, else, it is a solution (a word boundary might be handy here, but it depends on the actual requirements). – Wiktor Stribiżew Oct 25 '20 at 12:04
  • Try `replace(r',(?!\s+(?:LLC|Inc)\b)\s+', '_')` – Wiktor Stribiżew Oct 25 '20 at 18:09
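
A minimal sketch of the pattern from the comment above, using only the standard `re` module. The sample strings are made up for illustration, and it assumes every comma that should be replaced is followed by at least one whitespace character.

```python
import re

# A comma that is NOT followed by whitespace plus the whole word LLC or Inc,
# together with the whitespace that follows the comma.
PATTERN = re.compile(r',(?!\s+(?:LLC|Inc)\b)\s+')

samples = [
    "Grape,  LLC,   Andrea Gray, Jack Smith",   # irregular spacing (made-up input)
    "Incline, Incorporated, Inc, Sarah Smith",  # "Incorporated" is not the word "Inc"
]

results = [PATTERN.sub('_', s) for s in samples]
for line in results:
    print(line)
```

The `\b` is what keeps a name like "Incorporated" from being treated as "Inc"; the plain `, (?!Inc|LLC)` pattern would leave that comma alone.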

3 Answers

1

Use Python's regex module `re` with the pattern `, (?!Inc|LLC)` to find every occurrence of `, ` that is not followed by `Inc` or `LLC`:

import re

strings = ["Banana, orange", "Grape, LLC", "Apple, pear, Inc"]

[re.sub(", (?!Inc|LLC)",'_',string) for string in strings]
#['Banana_orange', 'Grape, LLC', 'Apple_pear, Inc']
Thân LƯƠNG Đình
chai
  • If LLC is not at the end of the string, it will not work. I'm not sure from the few examples whether that can happen. – Yossi Levi Oct 25 '20 at 05:37
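
For what it's worth, the lookahead is evaluated at each `, ` separately, so the position of `LLC` in the string should not matter. Below is a quick check against the three strings from the question, as a sketch only. The second pattern, `WITH_AND`, is an assumption: it is added because the expected output for the first string also turns ` and ` into `_`, which the question text does not state explicitly.

```python
import re

# The three inputs from the question.
strings = [
    "TEXAS ENERGY MUTUAL, LLC, BOBBY GILLIAM, STEVE PEREIRA and ANDY STITT",
    "Grape, LLC, Andrea Gray, Jack Smith",
    "Stephen Winters, Apple, pear, Inc, Sarah Smith",
]

COMMA_ONLY = r", (?!Inc|LLC)"          # pattern from this answer
WITH_AND = r", (?!Inc|LLC)| and "      # assumption: " and " is also a separator

comma_only = [re.sub(COMMA_ONLY, "_", s) for s in strings]
with_and = [re.sub(WITH_AND, "_", s) for s in strings]

for line in with_and:
    print(line)
```

With `COMMA_ONLY`, the first string keeps its ` and `; `WITH_AND` is one way to get the exact output listed in the question.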
1

You can replace using a regex with a negative lookahead:

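# Negative lookahead: match ", " only when it is not followed by Inc or LLC.
# regex=True is needed on pandas >= 2.0, where str.replace defaults to literal matching.
df['Column_Name'].str.replace(r', (?!Inc|LLC)', '_', regex=True)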

Output:

0    TEXAS ENERGY MUTUAL, LLC_BOBBY GILLIAM_STEVE P...
1                    Grape, LLC_Andrea Gray_Jack Smith
2          Stephen Winters_Apple_pear, Inc_Sarah Smith
Name: ColumnName, dtype: object

ernest_k
  • This doesn't add the "_". – Andrew Oct 25 '20 at 17:52
  • @Andrew is that the case in the output I've shown here? Not sure what you mean. – ernest_k Oct 25 '20 at 17:55
  • @Andrew Then why do you post another question? Fix it here. – Wiktor Stribiżew Oct 25 '20 at 18:04
  • @Andrew I'm not seeing what I'm missing. When I use your new input, it produces exactly what you're expecting. I need you to clarify *"this doesn't add the _"* – ernest_k Oct 25 '20 at 18:18
  • Well, there's no need to accept the answer unless you find it helpful... but the pattern is a regex using a negative lookahead assertion, that is: it matches `, ` only when it is not followed by `Inc` or `LLC`. `(?!...)` is the *negative* lookahead, which asserts that what follows is not `...`. – ernest_k Oct 25 '20 at 18:50
0

The simple way, without a regex:

def replace(text):
    # Split on ", " and rejoin, keeping ", " only in front of LLC or Inc.
    parts = text.split(', ')
    buf = parts[0]
    for part in parts[1:]:
        # str.startswith accepts a tuple of prefixes.
        if part.startswith(('LLC', 'Inc')):
            buf += ', ' + part
        else:
            buf += '_' + part
    return buf

Then try replace('a, b, LLC, d').

Doz Parp