splitting string columns while the first part before the splitting pattern is missing

Question

I'm trying to split a string column into different columns and tried How to split a column into two columns?

The pattern of the strings look like the following:

import pandas as pd
import numpy as np

>>> data = {'ab': ['a - b', 'a - b', 'b', 'c', 'whatever']}
>>> df = pd.DataFrame(data=data)

          ab
0      a - b
1      a - b
2          b
3          c
4   whatever

>>> df['a'], df['b'] = df['ab'].str.split('-', n=1).str

         ab          a          b
0     a - b          a          b
1     a - b          a          b
2         b          b        NaN
3         c          c        NaN
4  whatever    whatever       NaN

The expected result is

         ab          a          b
0     a - b          a          b
1     a - b          a          b
2         b        NaN          b
3         c        NaN          c
4  whatever        NaN   whatever

The method I came up with is

df.loc[~ df.ab.str.contains(' - '), 'b'] = df['ab']
df.loc[~ df.ab.str.contains(' - '), 'a'] = np.nan

Is there more generic/efficient way to do this task?

BENY · Answer 1 · 2019-10-03T02:58:32.700

1

I will using get_dummies

s=df['ab'].str.get_dummies(' - ')

s=s.mask(s.eq(1),s.columns.tolist()).mask(s.eq(0))
s
Out[7]: 
     a  b
0    a  b
1    a  b
2  NaN  b

Update

df.ab.str.split(' - ',expand=True).apply(lambda x : pd.Series(sorted(x,key=pd.notnull)),axis=1)
Out[22]: 
      0         1
0     a         b
1     a         b
2  None         b
3  None         c
4  None  whatever

edited Oct 03 '19 at 02:58

answered Oct 03 '19 at 02:18

BENY

317,841
20
164
234

I didn't realize that my example was not general enough.. sorry about that. – steven Oct 03 '19 at 02:38
May I asking the reason for downvote ? OP change the question after I posted the answer – BENY Oct 03 '19 at 02:55

Onyambu · Accepted Answer · 2019-10-03T02:50:37.653

1

we can extractall as long as we know the specific strings to extract:

df.ab.str.extract(r"(a)?(?:\s-\s)?(b)?")
Out[47]: 
     0    1
0    a    b
1    a    b
2  NaN    b
3    a  NaN

data used:

data = {'ab': ['a - b', 'a - b', 'b','a']}
df = pd.DataFrame(data=data)

with your edit, it seems your aim is to put anything that is by itself on the second column. You could do:

df.ab.str.extract(r"(\S*)(?:\s-\s)?(\b\S+)")
Out[59]: 
   0         1
0  a         b
1  a         b
2            b
3            c
4     whatever

edited Oct 03 '19 at 02:50

answered Oct 03 '19 at 02:24

Onyambu

67,392
3
24
53

the thing is "b" just an example... I should make it clear... editing now. – steven Oct 03 '19 at 02:25
just edited and could please update the regex. Thx. – steven Oct 03 '19 at 02:39
@steven it seems your aim is to put anything that is by itself on the second column. Just do `df.ab.str.extract(r"(\S*)(?:\s-\s)?(\b\S+)")` – Onyambu Oct 03 '19 at 02:49
Thx! btw, what method would you use to put the split columns back to the df? – steven Oct 03 '19 at 02:54
1

@steven `pd.concat([df,df.ab.str.extract(r"(\S*)(?:\s-\s)?(\b\S+)")],axis=1)` – Onyambu Oct 03 '19 at 02:55
Hi, buddy. the regex seems only works with a single word. What if I have words like "what ever" instead of "whatever". Could you update the regex for this case? Thanks – steven Oct 03 '19 at 18:56
@steven you could use `df.ab.str.extract(r"(\S+(?=.*-))?(?:\s-\s)?(\b[^-]+)")` – Onyambu Oct 03 '19 at 19:31

splitting string columns while the first part before the splitting pattern is missing

2 Answers2