362

How to do this in pandas:

I have a function extract_text_features that operates on a single text column and returns multiple output columns. Specifically, the function returns 6 values.

The function works, but there doesn't seem to be any proper return type (pandas DataFrame / numpy array / Python list) for which the output can be correctly assigned with:

df.ix[:, 10:16] = df.textcol.map(extract_text_features)
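For concreteness, here is a minimal sketch of the setup (the body of extract_text_features below is a hypothetical stand-in, not the real function):

import pandas as pd

df = pd.DataFrame({'textcol': ['some text', 'more words here']})

# Hypothetical stand-in: returns 6 values per input string.
def extract_text_features(s):
    return len(s), s.count(' '), s.upper(), s.lower(), s[:3], s[-3:]

# The naive assignment fails: .map() yields a Series of tuples, which
# does not broadcast across six destination columns.
# df.ix[:, 10:16] = df.textcol.map(extract_text_features)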

So I think I need to drop back to iterating with df.iterrows(), as per this?

UPDATE: Iterating with df.iterrows() is at least 20x slower, so I surrendered and split out the function into six distinct .map(lambda ...) calls.

UPDATE 2: this question was asked back around v0.11.0, before the usability of df.apply was improved and before df.assign() was added in v0.16. Hence much of the question and the older answers are no longer relevant.
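(For readers on a modern pandas, a minimal sketch of the idiom the answers below converge on; result_type='expand' requires pandas >= 0.23, and the function and column names here are placeholders:)

import pandas as pd

df = pd.DataFrame({'textcol': ['ab', 'cde']})

# Hypothetical 2-value analogue of the 6-value function.
def extract_text_features(s):
    return len(s), s.upper()

# result_type='expand' spreads each returned tuple across the new columns.
df[['f1', 'f2']] = df.apply(lambda row: extract_text_features(row.textcol),
                            axis=1, result_type='expand')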

smci
  • I don't think you can do multiple assignment the way you have it written: `df.ix[: ,10:16]`. I think you'll have to `merge` your features into the dataset. – Zelazny7 Apr 26 '13 at 20:52
  • For those wanting a much more performant solution [check this one below](https://stackoverflow.com/a/47097625/3707607) which does not use `apply` – Ted Petrou Nov 03 '17 at 14:08
  • Most numeric operations with pandas can be vectorized - this means they are much faster than conventional iteration. OTOH, some operations (such as string and regex) are inherently hard to vectorize. In this case, it is important to understand _how_ to loop over your data. For more information on when and how looping over your data should be done, please read [For loops with Pandas - When should I care?](https://stackoverflow.com/questions/54028199/for-loops-with-pandas-when-should-i-care/54028200#54028200). – cs95 Jan 04 '19 at 10:15
  • @coldspeed: the main issue was not choosing which was the higher-performance among several options, it was fighting pandas syntax to get this to work at all, back around [v0.11.0](https://github.com/pandas-dev/pandas/releases?after=v0.13.0_ahl1). – smci Jan 04 '19 at 11:56
  • Indeed, the comment is intended for future readers who're looking for iterative solutions, who either don't know any better, or who know what they're doing. – cs95 Jan 04 '19 at 20:42
  • Of all answers below, most practical and efficient method I found is [this answer](https://stackoverflow.com/a/42072756/4617501). This avoids the overhead of `pd.Series` creation for each row which made it work 30x faster in my case. – Pushkar Nimkar May 19 '20 at 06:20
  • @PushkarNimkar: You're neglecting the actual string functions themselves, so it'll be << 30x. But by all means please add your own answer, and benchmark runtime against other approaches. – smci May 19 '20 at 19:45

16 Answers

285

I usually do this using zip:

>>> df = pd.DataFrame([[i] for i in range(10)], columns=['num'])
>>> df
    num
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9

>>> def powers(x):
...     return x, x**2, x**3, x**4, x**5, x**6

>>> df['p1'], df['p2'], df['p3'], df['p4'], df['p5'], df['p6'] = \
...     zip(*df['num'].map(powers))

>>> df
        num     p1      p2      p3      p4      p5      p6
0       0       0       0       0       0       0       0
1       1       1       1       1       1       1       1
2       2       2       4       8       16      32      64
3       3       3       9       27      81      243     729
4       4       4       16      64      256     1024    4096
5       5       5       25      125     625     3125    15625
6       6       6       36      216     1296    7776    46656
7       7       7       49      343     2401    16807   117649
8       8       8       64      512     4096    32768   262144
9       9       9       81      729     6561    59049   531441
ostrokach
  • But what do you do if you have 50 columns added like this rather than 6? – max Nov 04 '15 at 23:21
  • @max `temp = list(zip(*df['num'].map(powers))); for i, c in enumerate(columns): df[c] = temp[c]` – ostrokach Nov 05 '15 at 00:35
  • @ostrokach I think you meant `for i, c in enumerate(columns): df[c] = temp[i]`. Thanks to this, I really got the purpose of `enumerate` :D – rocarvaj Feb 26 '16 at 04:25
  • This is by far the most elegant and readable solution I've come across for this. Unless you're getting performance problems, the idiom `zip(*df['col'].map(function))` is probably the way to go. – François Leblanc Aug 01 '17 at 20:36
  • @rocarvaj while I am adding a comment rather too late, if anyone comes across and realize my mistake I'd appreciate the insight. obviously the two thats posted here don't work, it seems to be `for i, c in enumerate(temp): df[c] = temp[c]` – dia Aug 25 '18 at 15:27
  • @XiaoyuLu See https://stackoverflow.com/questions/3394835/args-and-kwargs – ostrokach Oct 18 '18 at 15:29
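Putting the corrected snippet from this comment thread together, a minimal sketch for assigning an arbitrary number of new columns (the column names are placeholders):

import pandas as pd

df = pd.DataFrame({'num': range(10)})

def powers(x):
    return x, x**2, x**3, x**4, x**5, x**6

columns = ['p1', 'p2', 'p3', 'p4', 'p5', 'p6']
# zip(*...) transposes the per-row tuples into one tuple per new column.
temp = list(zip(*df['num'].map(powers)))
for i, c in enumerate(columns):
    df[c] = temp[i]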
234

In 2020, I use apply() with the argument result_type='expand':

applied_df = df.apply(lambda row: fn(row.text), axis='columns', result_type='expand')
df = pd.concat([df, applied_df], axis='columns')

fn() should return a dict; its keys will be the new column names.
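For instance, a minimal sketch of the dict-returning variant (fn and the column names are placeholders):

import pandas as pd

df = pd.DataFrame({'text': ['ab', 'cde']})

# The dict keys become the names of the new columns.
def fn(s):
    return {'length': len(s), 'upper': s.upper()}

applied_df = df.apply(lambda row: fn(row.text),
                      axis='columns', result_type='expand')
df = pd.concat([df, applied_df], axis='columns')
# df columns are now: text, length, upper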

Alternatively you can do a one-liner by also specifying the column names:

df[["col1", "col2", ...]] = df.apply(lambda row: fn(row.text), axis='columns', result_type='expand')
CircleOnCircles
  • That is how you do it, nowadays! – Make42 Jul 13 '19 at 11:43
  • This worked out of the box in 2020 while many other answers did not. Also it doesn't use `pd.Series`, which is always nice regarding performance issues – Théo Rubenach Mar 12 '20 at 09:48
  • This is a good solution. The only problem is, you can't choose the name for the 2 newly added columns. You need to later do df.rename(columns={0:'col1', 1:'col2'}) – pedram bashiri Mar 27 '20 at 16:00
  • @pedrambashiri If the function you pass to `df.apply` returns a `dict`, the columns will come out named according to the keys. – Seb Apr 16 '20 at 12:09
  • This is the best answer! Often you have a situation where from a single dataframe column or series you have to create a dataframe of multiple new columns based on a transformation on the original column/series. The transformation function often returns k-tuples, and these k-tuples must be separated into k columns, based on some order. @Ben's answer clearly does this very neatly. Thanks! – srm Aug 14 '20 at 13:24
  • Also, as pointed out somewhere else here, this was introduced in Pandas 0.23.0. For earlier versions I don't think there is a really fast way of doing this k-tuple -> k columns transformation. – srm Aug 14 '20 at 13:34
  • The selected answer should be updated to this. Or the question asked again. – vkubicki Apr 21 '21 at 10:00
  • all I needed from this answer was `result_type='expand'`. E.g. `df[new_cols] = df.apply(extract_text_features, axis=1, result_type='expand')` just works. Although you'd need to know names of the new columns. – Ufos Mar 22 '22 at 16:25
  • Doesn't work for me (fn applied returns a df). Anyone's thoughts? – jtlz2 Mar 31 '22 at 14:12
  • @jtlz2 you can't use `apply` with a function that returns a df. `apply` normally returns a single value for each input, looping over each row. This question/answer shows how to return multiple values, but they are still just for one row of the source df, so you should either update your function as such, or skip using `apply` and go a different route (such as iteration followed by `concat`, etc) – fantabolous Apr 19 '23 at 00:50
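(Following up on that last comment, a minimal sketch of the iterate-then-concat route for a function that returns a whole DataFrame per row; per_row_df is a hypothetical example:)

import pandas as pd

df = pd.DataFrame({'text': ['ab', 'cde']})

# Hypothetical function returning a small DataFrame per input value.
def per_row_df(s):
    return pd.DataFrame({'char': list(s)})

# apply() cannot expand a per-row DataFrame, so iterate and concat instead.
pieces = [per_row_df(s) for s in df['text']]
result = pd.concat(pieces, ignore_index=True)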
134

Building off of user1827356's answer, you can do the assignment in one pass using df.merge:

df.merge(df.textcol.apply(lambda s: pd.Series({'feature1':s+1, 'feature2':s-1})), 
    left_index=True, right_index=True)

    textcol  feature1  feature2
0  0.772692  1.772692 -0.227308
1  0.857210  1.857210 -0.142790
2  0.065639  1.065639 -0.934361
3  0.819160  1.819160 -0.180840
4  0.088212  1.088212 -0.911788

EDIT: Please be aware of the huge memory consumption and low speed: https://ys-l.github.io/posts/2015/08/28/how-not-to-use-pandas-apply/ !

Zelazny7
  • Just out of curiosity, is it expected to use up a lot of memory by doing this? I am doing this on a dataframe that holds 2.5mil rows, and I nearly ran into memory problems (also it is much slower than returning just 1 column). – Jeffrey04 Nov 04 '15 at 07:54
  • 'df.join(df.textcol.apply(lambda s: pd.Series({'feature1':s+1, 'feature2':s-1})))' would be a better option I think. – skt7 Mar 03 '18 at 20:28
  • @ShivamKThakkar why do you think your suggestion would be a better option? Would it be more efficient you think or have less memory cost? – tsando May 08 '18 at 10:34
  • Please consider the speed and the memory required: https://ys-l.github.io/posts/2015/08/28/how-not-to-use-pandas-apply/ – Make42 Jul 13 '19 at 11:31
  • Is this still true as of 2023 (the huge memory consumption) ? – Eric Burel Aug 23 '23 at 14:48
94

This is what I've done in the past

df = pd.DataFrame({'textcol' : np.random.rand(5)})

df
    textcol
0  0.626524
1  0.119967
2  0.803650
3  0.100880
4  0.017859

df.textcol.apply(lambda s: pd.Series({'feature1':s+1, 'feature2':s-1}))
   feature1  feature2
0  1.626524 -0.373476
1  1.119967 -0.880033
2  1.803650 -0.196350
3  1.100880 -0.899120
4  1.017859 -0.982141

Editing for completeness

pd.concat([df, df.textcol.apply(lambda s: pd.Series({'feature1':s+1, 'feature2':s-1}))], axis=1)
    textcol feature1  feature2
0  0.626524 1.626524 -0.373476
1  0.119967 1.119967 -0.880033
2  0.803650 1.803650 -0.196350
3  0.100880 1.100880 -0.899120
4  0.017859 1.017859 -0.982141
user1827356
  • concat() looks simpler than merge() for connecting the new cols to the original dataframe. – cumin Sep 29 '17 at 14:19
  • nice answer, you don't need to use a dict or a merge if you specify the columns outside of the apply: `df[['col1', 'col2']] = df['col3'].apply(lambda x: pd.Series(['val1', 'val2']))` – Matt Feb 25 '20 at 10:45
87

This is the correct and easiest way to accomplish this for 95% of use cases:

>>> df = pd.DataFrame(zip(*[range(5)]), columns=['num'])
>>> df
    num
0    0
1    1
2    2
3    3
4    4

>>> def example(x):
...     x['p1'] = x['num']**2
...     x['p2'] = x['num']**3
...     x['p3'] = x['num']**4
...     return x

>>> df = df.apply(example, axis=1)
>>> df
    num  p1  p2  p3
0    0   0   0    0
1    1   1   1    1
2    2   4   8   16
3    3   9  27   81
4    4  16  64  256
Michael David Watson
  • shouldn't you write: df = df.apply(example(df), axis=1) correct me if I am wrong, I am just a newbie – user299791 Jun 16 '17 at 19:06
  • @user299791, No in this case you are treating example as a first class object so you are passing in the function itself. This function will be applied to each row. – Michael David Watson Jun 19 '17 at 17:38
  • hi Michael, your answer helped me in my problem. Definitely your solution is better than the original pandas' df.assign() method, cuz this is one time per column. Using assign(), if you want to create 2 new columns, you have to use df1 to work on df to get new column1, then use df2 to work on df1 to create the second new column...this is quite monotonous. But your method saved my life!!! Thanks!!! – ACuriousCat Jul 31 '18 at 05:49
  • Won't that run the column assignment code once per row? Wouldn't it be better to return a `pd.Series({k:v})` and serialize the column assignment like in Ewan's answer? – Denis de Bernardy Jul 23 '19 at 15:09
  • If it helps anyone, while this approach is correct and also the simplest of all the presented solutions, updating the row directly like this ended up being surprisingly slow - an order of magnitude slower than the apply with 'expand' + pd.concat solutions – Dmytro Bugayev Jun 30 '20 at 17:33
  • I actually agree that this solution is the most elegant or clean answer so far, should be accepted. – flgn Mar 30 '21 at 09:24
  • Making columns before calling apply significantly speeds up the execution. Something like this: example = example.assign(p1=None, p2=None, p3=None) – Fariborz Ghavamian Aug 08 '22 at 02:37
63

Just use result_type="expand"

df = pd.DataFrame(np.random.randint(0,10,(10,2)), columns=["random", "a"])
df[["sq_a","cube_a"]] = df.apply(lambda x: [x.a**2, x.a**3], axis=1, result_type="expand")
Abhishek
  • It helps to point out that option is [new in 0.23](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html). The question was asked back on 0.11 – smci Jun 08 '19 at 02:22
  • Nice, this is simple and still works neatly. This is the one I was looking for. Thanks – Isaac Sim Feb 14 '20 at 04:49
  • Duplicates an earlier answer: https://stackoverflow.com/a/52363890/823470 – tar Mar 05 '20 at 14:51
  • @tar actually the second line is different and was quite helpful for me to see! – Aaron Gibralter Nov 13 '20 at 03:49
43

For me this worked:

Input df

df = pd.DataFrame({'col x': [1,2,3]})
   col x
0      1
1      2
2      3

Function

def f(x):
    return pd.Series([x*x, x*x*x])

Create 2 new columns:

df[['square x', 'cube x']] = df['col x'].apply(f)

Output:

   col x  square x  cube x
0      1         1       1
1      2         4       8
2      3         9      27
Joe
23

Summary: If you only want to create a few columns, use df[['new_col1','new_col2']] = df[['data1','data2']].apply(function_of_your_choosing, axis=1)

For this solution, the number of new columns you are creating must be equal to the number of columns you use as input to the .apply() function. If you want to do something else, have a look at the other answers.

Details: Let's say you have a two-column dataframe. The first column is a person's height when they are 10; the second is said person's height when they are 20.

Suppose you need to calculate both the mean and the sum of each person's heights. That's two values per row.

You could do this via the following, soon-to-be-applied function:

def mean_and_sum(x):
    """
    Calculates the mean and sum of two heights.
    Parameters:
    :x -- the values in the row this function is applied to. Could also work on a list or a tuple.
    """

    sum=x[0]+x[1]
    mean=sum/2
    return [mean,sum]

You might use this function like so:

 df[['height_at_age_10','height_at_age_20']].apply(mean_and_sum, axis=1)

(To be clear: this apply function takes in the values from each row in the subsetted dataframe and returns a list.)

However, if you do this:

df['Mean_&_Sum'] = df[['height_at_age_10','height_at_age_20']].apply(mean_and_sum, axis=1)

you'll create 1 new column that contains the [mean,sum] lists, which you'd presumably want to avoid, because that would require another Lambda/Apply.

Instead, you want to break out each value into its own column. To do this, you can create two columns at once:

df[['Mean','Sum']] = df[['height_at_age_10','height_at_age_20']].apply(mean_and_sum, axis=1)
Evan W.
  • For pandas 0.23, you'll need to use the syntax: `df["mean"], df["sum"] = df[['height_at_age_10','height_at_age_20']].apply(mean_and_sum, axis=1)` – SummerEla Oct 26 '18 at 01:45
  • This function might raise an error. The return statement must be `return pd.Series([mean,sum])` – Kanishk Mair Mar 08 '20 at 22:21
14

I've looked at several ways of doing this, and the method shown here (returning a pandas Series) doesn't seem to be the most efficient.

If we start with a largeish dataframe of random data:

# Set up a dataframe of random numbers and create a column of ':'-joined strings
df = pd.DataFrame(np.random.randn(10000,3),columns=list('ABC'))
df['D'] = df.apply(lambda r: ':'.join(map(str, (r.A, r.B, r.C))), axis=1)
columns = 'new_a', 'new_b', 'new_c'

The example shown here:

# Create the dataframe by returning a series
def method_b(v):
    return pd.Series({k: v for k, v in zip(columns, v.split(':'))})
%timeit -n10 -r3 df.D.apply(method_b)

10 loops, best of 3: 2.77 s per loop

An alternative method:

# Create a dataframe from a series of tuples
def method_a(v):
    return v.split(':')
%timeit -n10 -r3 pd.DataFrame(df.D.apply(method_a).tolist(), columns=columns)

10 loops, best of 3: 8.85 ms per loop

By my reckoning it's far more efficient to take a series of tuples and then convert that to a DataFrame. I'd be interested to hear people's thoughts, though, if there's an error in my working.

RFox
13

The accepted solution is going to be extremely slow for lots of data. The solution with the greatest number of upvotes is a little difficult to read and also slow with numeric data. If each new column can be calculated independently of the others, I would just assign each of them directly without using apply.

Example with fake character data

Create 100,000 strings in a DataFrame

df = pd.DataFrame(np.random.choice(['he jumped', 'she ran', 'they hiked'],
                                   size=100000, replace=True),
                  columns=['words'])
df.head()
        words
0     she ran
1     she ran
2  they hiked
3  they hiked
4  they hiked

Let's say we wanted to extract some text features as done in the original question. For instance, let's extract the first character, count the occurrence of the letter 'e' and capitalize the phrase.

df['first'] = df['words'].str[0]
df['count_e'] = df['words'].str.count('e')
df['cap'] = df['words'].str.capitalize()
df.head()
        words first  count_e         cap
0     she ran     s        1     She ran
1     she ran     s        1     She ran
2  they hiked     t        2  They hiked
3  they hiked     t        2  They hiked
4  they hiked     t        2  They hiked

Timings

%%timeit
df['first'] = df['words'].str[0]
df['count_e'] = df['words'].str.count('e')
df['cap'] = df['words'].str.capitalize()
127 ms ± 585 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

def extract_text_features(x):
    return x[0], x.count('e'), x.capitalize()

%timeit df['first'], df['count_e'], df['cap'] = zip(*df['words'].apply(extract_text_features))
101 ms ± 2.96 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Surprisingly, you can get better performance by looping through each value

%%timeit
a,b,c = [], [], []
for s in df['words']:
    a.append(s[0]), b.append(s.count('e')), c.append(s.capitalize())

df['first'] = a
df['count_e'] = b
df['cap'] = c
79.1 ms ± 294 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Another example with fake numeric data

Create 1 million random numbers and test the powers function from above.

df = pd.DataFrame(np.random.rand(1000000), columns=['num'])


def powers(x):
    return x, x**2, x**3, x**4, x**5, x**6

%%timeit
df['p1'], df['p2'], df['p3'], df['p4'], df['p5'], df['p6'] = \
       zip(*df['num'].map(powers))
1.35 s ± 83.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Assigning each column is 25x faster and very readable:

%%timeit 
df['p1'] = df['num'] ** 1
df['p2'] = df['num'] ** 2
df['p3'] = df['num'] ** 3
df['p4'] = df['num'] ** 4
df['p5'] = df['num'] ** 5
df['p6'] = df['num'] ** 6
51.6 ms ± 1.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

I made a similar response with more details here on why apply is typically not the way to go.

Ted Petrou
11

I've posted the same answer in two other similar questions. The way I prefer to do this is to wrap the return values of the function in a Series:

def f(x):
    return pd.Series([x**2, x**3])

And then use apply as follows to create separate columns:

df[['x**2','x**3']] = df.apply(lambda row: f(row['x']), axis=1)
Dmytro Bugayev
3
def extract_text_features(feature):
    ...
    ...
    return pd.Series((feature1, feature2)) 

df[['NewFeature1', 'NewFeature2']] = df[['feature']].apply(extract_text_features, axis=1)

Here a dataframe with a single feature is converted into two new features. Give this a try too.

2

This works for me:

import pandas as pd
import numpy as np
future = pd.DataFrame(
    pd.date_range('2022-09-01',periods=360),
    columns=['date']
)

def featurize(datetime):
    return pd.Series({
        'month':datetime.month,
        'year':datetime.year,
        'dayofweek':datetime.dayofweek,
        'dayofyear':datetime.dayofyear
    })
    
future.loc[
    :,['month','year','dayofweek','dayofyear']
    ] = future.date.apply(featurize)

future.head()

Output:

    date    month   year    dayofweek   dayofyear
0   2022-09-01  9   2022    3           244
1   2022-09-02  9   2022    4           245
2   2022-09-03  9   2022    5           246
3   2022-09-04  9   2022    6           247
4   2022-09-05  9   2022    0           248
meowmeow
  • Neat. I had asked the original question back on pandas 0.11; what's the earliest pandas version this works on? Which syntax enhancements does it rely on? – smci Sep 30 '22 at 13:43
  • I've personally only tested this on my current version of pandas, which is pandas==1.4.3 but I think it should be pretty compatible with older versions. It looks like '.loc' was around in 0.11: https://pandas.pydata.org/pandas-docs/version/1.0/whatsnew/v0.11.0.html – meowmeow Sep 30 '22 at 14:03
  • I think the key is creating a Series from a dictionary that matches the column labels – meowmeow Sep 30 '22 at 14:05
1

You can return the entire row instead of individual values:

df = df.apply(extract_text_features, axis=1)

where the function returns the row

def extract_text_features(row):
    row['new_col1'] = value1
    row['new_col2'] = value2
    return row
  • No I don't want to apply `extract_text_features` to every column of the df, only to the text column `df.textcol` – smci Jun 24 '18 at 19:29
0

I have a more complicated situation, where the dataset has a nested structure:

import json
data = '{"TextID":{"0":"0038f0569e","1":"003eb6998d","2":"006da49ea0"},"Summary":{"0":{"Crisis_Level":["c"],"Type":["d"],"Special_Date":["a"]},"1":{"Crisis_Level":["d"],"Type":["a","d"],"Special_Date":["a"]},"2":{"Crisis_Level":["d"],"Type":["a"],"Special_Date":["a"]}}}'
df = pd.DataFrame.from_dict(json.loads(data))
print(df)

output:

        TextID                                            Summary
0  0038f0569e  {'Crisis_Level': ['c'], 'Type': ['d'], 'Specia...
1  003eb6998d  {'Crisis_Level': ['d'], 'Type': ['a', 'd'], 'S...
2  006da49ea0  {'Crisis_Level': ['d'], 'Type': ['a'], 'Specia...

The Summary column contains dict objects, so I use apply with from_dict and stack to expand each row's dict:

df2 = df.apply(
    lambda x: pd.DataFrame.from_dict(x[1], orient='index').stack(), axis=1)
print(df2)

output:

  Crisis_Level Special_Date Type     
             0            0    0    1
0            c            a    d  NaN
1            d            a    a    d
2            d            a    a  NaN

Looks good, but it's missing the TextID column. To get the TextID column back, I've tried three approaches:

  1. Modify apply to return multiple columns:

    df_tmp = df.copy()
    
    df_tmp[['TextID', 'Summary']] = df.apply(
        lambda x: pd.Series([x[0], pd.DataFrame.from_dict(x[1], orient='index').stack()]), axis=1)
    print(df_tmp)
    

    output:

        TextID                                            Summary
    0  0038f0569e  Crisis_Level  0    c
    Type          0    d
    Spec...
    1  003eb6998d  Crisis_Level  0    d
    Type          0    a
        ...
    2  006da49ea0  Crisis_Level  0    d
    Type          0    a
    Spec...
    

    But this is not what I want: the Summary structure is flattened.

  2. Use pd.concat:

    df_tmp2 = pd.concat([df['TextID'], df2], axis=1)
    print(df_tmp2)
    

    output:

        TextID (Crisis_Level, 0) (Special_Date, 0) (Type, 0) (Type, 1)
    0  0038f0569e                 c                 a         d       NaN
    1  003eb6998d                 d                 a         a         d
    2  006da49ea0                 d                 a         a       NaN
    

    Looks fine; the MultiIndex column structure seems to be preserved as tuples. But check the type of the columns:

    df_tmp2.columns
    

    output:

    Index(['TextID', ('Crisis_Level', 0), ('Special_Date', 0), ('Type', 0),
        ('Type', 1)],
        dtype='object')
    

    It's just a regular Index class, not a MultiIndex.

  3. Use set_index:

    Turn all the columns you want to preserve into the row index, apply your complicated function, and then call reset_index to get the columns back:

    df_tmp3 = df.set_index('TextID')
    
    df_tmp3 = df_tmp3.apply(
        lambda x: pd.DataFrame.from_dict(x[0], orient='index').stack(), axis=1)
    
    df_tmp3 = df_tmp3.reset_index(level=0)
    print(df_tmp3)
    

    output:

        TextID Crisis_Level Special_Date Type     
                            0            0    0    1
    0  0038f0569e            c            a    d  NaN
    1  003eb6998d            d            a    a    d
    2  006da49ea0            d            a    a  NaN
    

    Check the type of columns

    df_tmp3.columns
    

    output:

    MultiIndex(levels=[['Crisis_Level', 'Special_Date', 'Type', 'TextID'], [0, 1, '']],
            codes=[[3, 0, 1, 2, 2], [2, 0, 0, 0, 1]])
    

So, if your apply function returns MultiIndex columns and you want to preserve them, you may want to try the third method.

allenyllee
0

Although the question specifies that the function should be applied to a Series, most of the answers seem to be applying the function to a DataFrame, with the function getting the relevant column from each row. This seems somewhat inelegant and potentially slow.

Say the function f takes a value in column df["argument"] and returns two values. The nicest way I've found to do it by applying to the column Series is this:

df[["value_1", "value_2"]] = df["argument"].apply(f).to_list()

Unlike DataFrame.apply, Series.apply unfortunately has no result_type parameter to expand the result into a DataFrame for assignment. But pandas handles the assignment just as well if you convert the result to a list of tuples.
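A complete minimal sketch of this pattern (f and the column names are placeholders):

import pandas as pd

df = pd.DataFrame({'argument': [1, 2, 3]})

# Hypothetical function returning two values per input.
def f(x):
    return x ** 2, x ** 3

# .apply(f) yields a Series of tuples; .to_list() turns it into a list of
# tuples, which pandas unpacks across the two destination columns.
df[["value_1", "value_2"]] = df["argument"].apply(f).to_list()
#    argument  value_1  value_2
# 0         1        1        1
# 1         2        4        8
# 2         3        9       27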

Denziloe