I have a column whose data I want to parse and then place into other columns. So far the best I can do is use the apply method:

def parse_parent_names(row):
    # drop the empty strings produced by the leading and trailing pipes
    split = row.person_with_parent_names.split('|')[1:-1]
    return split

df['parsed'] = df.apply(parse_parent_names, axis=1)

The data is a pandas DataFrame with a column that holds names separated by a pipe (|):

'person_with_parent_names'
|John|Doe|Bobba|
|Fett|Bobba|
|Abe|Bea|Cosby|

The rightmost name is the person and the leftmost is the "grandest parent". I'd like to transform this into three columns, like:

'grandfather'    'father'    'person'
John             Doe         Bobba
                 Fett        Bobba
Abe              Bea         Cosby

But with apply, the best I can get is

'parsed'
[John, Doe, Bobba]
[Fett, Bobba]
[Abe, Bea, Cosby]

I could use apply three times, but it would not be efficient to read the entire dataset three times.

herculanodavi
  • Look here https://stackoverflow.com/questions/39050539/adding-multiple-columns-to-pandas-simultaneously. It looks like the thing you need. – balderman Mar 17 '19 at 15:32

1 Answer

Change your function to count the | characters and choose the slice with a conditional (ternary) expression, then pass the resulting lists to the DataFrame constructor:

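import pandas as pd

def parse_parent_names(row):
    # row is a single string such as '|John|Doe|Bobba|'
    # four pipes mean three names, so drop both empty edge strings
    m = row.count('|') == 4
    # with two names, keep the leading empty string as a blank grandfather
    split = row.split('|')[1:-1] if m else row.split('|')[:-1]
    return split

cols = ['grandfather','father','person']
# build the new frame in a single pass over the column
df1 = pd.DataFrame([parse_parent_names(x) for x in df.person_with_parent_names],
                    columns=cols)
print(df1)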
  grandfather father person
0        John    Doe  Bobba
1               Fett  Bobba
2         Abe    Bea  Cosby
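
The function above covers rows with two or three names. If some rows might hold only one name, a possible variation is to left-pad each list with empty strings so the person always lands in the last column. This is only a sketch: the helper right_align_names and the sample frame are made up for illustration, and it assumes no row holds more than three names.

```python
import pandas as pd

COLS = ['grandfather', 'father', 'person']

def right_align_names(value, width=len(COLS)):
    """Split a '|a|b|c|' string and left-pad with '' up to `width` names.

    Hypothetical helper; assumes `value` holds at most `width` names.
    """
    names = value.strip('|').split('|')
    return [''] * (width - len(names)) + names

# sample data copied from the question
df = pd.DataFrame({'person_with_parent_names': ['|John|Doe|Bobba|',
                                                '|Fett|Bobba|',
                                                '|Abe|Bea|Cosby|']})

# one pass over the column, reusing the original index so join lines up
parsed = pd.DataFrame([right_align_names(v) for v in df['person_with_parent_names']],
                      columns=COLS, index=df.index)
result = df.join(parsed)
print(result[COLS])
```

Passing index=df.index keeps the new columns aligned with the original rows even when df does not have a default 0..n-1 index.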
jezrael