2

I have found some helpful examples of splitting columns, but none as specific as what I am trying to do.

I have a column with state & school information for a list of people that looks like this:

Col1
Miami, FLSchoolA
Charlotte, NCSchoolB

I want to separate after the string , ST but obviously the states are different. I would like this example to look like:

Col1             Col2
Miami, FL        SchoolA
Charlotte, NC    SchoolB

Any help would be much appreciated.

jpp
  • 159,742
  • 34
  • 281
  • 339
a.powell
  • 1,572
  • 4
  • 28
  • 39
  • https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.split.html – JR ibkr Jan 29 '19 at 15:59
  • @JRibkr `.str.split` isn't particularly useful because here we need to include the splitting condition in the first string. It might be possible with a look behind, but the actual pattern to split on is empty, which might make it impossible. +1 to anyone who figures it out though. – ALollz Jan 29 '19 at 16:05

3 Answers3

3

You can use .str.extract for this with 2 capturing groups.:

df.Col1.str.extract('(.*,\s[A-Z]{2})(.*)')

               0        1
0      Miami, FL  SchoolA
1  Charlotte, NC  SchoolB
ALollz
  • 57,915
  • 7
  • 66
  • 89
  • If you want to be truly pedantic and make sure you only include states, and not for instance, `, GH` then change the regex to `f'(.*,\s[{"|".join(states)}]{{2}})(.*)')` using `states` from @JoshFriedlander. But probably overkill, if your data are clean enough. – ALollz Jan 29 '19 at 16:34
  • When I assign the `.str.extract` function to a column, I lose the second part of the split (the school). How can I store these in the data frame? – a.powell Jan 29 '19 at 19:36
  • @a.powell. A safe alternative would be to concatenate the result `pd.concat([df, df.Col1.str.extract('(.*,\s[A-Z]{2})(.*)')], axis=1)` which will just append those two split columns to your original. But if you dont care about the original anymore, why not just use the `DataFrame` output above (just rename the columns)? – ALollz Jan 29 '19 at 19:42
  • when I attempt to just use the data frame it returns the second column empty. Any clue why? – a.powell Jan 29 '19 at 19:46
  • @a.powell I do not know. With your provided data the split seems to work, If you can create a [mcve] that reproduces the problem it might be best to ask another question as long discussions in comments are generally frowned upon. You can use `df.to_dict()` to easily provide us with the problematic data. – ALollz Jan 29 '19 at 19:49
2

You can use str.index with slicing:

df = pd.DataFrame({'Col1': ['Miami, FLSchoolA', 'Charlotte, NCSchoolB']})

def splitter(val):
    idx = val.index(',') + 4
    return val[:idx], val[idx:]

df['Col1'], df['Col2'] = list(zip(*map(splitter, df['Col1'])))

print(df)

#             Col1     Col2
# 0      Miami, FL  SchoolA
# 1  Charlotte, NC  SchoolB
jpp
  • 159,742
  • 34
  • 281
  • 339
0

The other answers here don't seem to include the specific need to look for states. My suggestion:

df = pd.DataFrame({'Col1': ['Miami, FLSchoolA', 'Charlotte, NCSchoolB']})

states = """AK AL AR AZ CA CO CT DE FL GA HI IA ID IL IN KS KY LA MA MD ME
 MI MN MO MS MT NC ND NE NH NJ NM NV NY OH OK OR PA RI SC SD TN TX UT VA VT
 WA WI WV WY""".split()

def splitter(ser):
    state = [x for x in states if x in ser]
    if len(state) > 0:
        st = ', ' + state[0]
        a = ser.split(st)
        return [a[0] + st, a[1]]

df.Col1.transform(splitter)

Returns a series of lists of two, which can easily be separated into columns:

0        ['Miami, FL', 'SchoolA']
1    ['Charlotte, NC', 'SchoolB']
Name: Col1, dtype: object

There's probably a way to write this condition into pd.Series.str.split, so you can fit in the expand=True and do it in one step.

Josh Friedlander
  • 10,870
  • 5
  • 35
  • 75