2

Given df_people:

   Name
0  Tom
1  Jerry

and df_colors (no header row):

0  Red
1  Green
2  Blue

What is considered an optimum way to take the data in df_colors and add it to df_people such that df_people would look like this when combined:

   Name   Color_0  Color_1  Color_2
0  Tom    Red      Green    Blue
1  Jerry  Red      Green    Blue

Below is what I have so far, which works, but I was wondering if there is a better or more concise way.

# Store data for new columns in a dictionary
new_columns = {}
for index_people, row_people in df_people.iterrows():
    for index_colors, row_colors in df_colors.iterrows():
        key = 'Color_' + str(index_colors)
        if (key in new_columns):
            new_columns[key].append(row_colors[0])
        else:
            new_columns[key] = [row_colors[0]]

# Add dictionary data as new columns
for key, value in new_columns.items():
    df_people[key] = value

UPDATE

Thank you all for providing answers. Since the real dataframes are GBs in size, speed was crucial, so I ended up going with the fastest method. Here is the code to the test cases:

# Import required modules
import pandas as pd
import timeit

# Original
def method_1():
    df_people = pd.DataFrame([['Tom'], ['Jerry']], columns=['Name'])
    df_colors = pd.DataFrame([['Red'], ['Green'], ['Blue']], columns=None)
    # Store data for new columns in a dictionary
    new_columns = {}
    for index_people, row_people in df_people.iterrows():
        for index_colors, row_colors in df_colors.iterrows():
            key = 'Color_' + str(index_colors)
            if (key in new_columns):
                new_columns[key].append(row_colors[0])
            else:
                new_columns[key] = [row_colors[0]]
    # Add dictionary data as new columns
    for key, value in new_columns.items():
        df_people[key] = value

# YOBEN_S - https://stackoverflow.com/a/60805881/452587
def method_2():
    df_people = pd.DataFrame([['Tom'], ['Jerry']], columns=['Name'])
    df_colors = pd.DataFrame([['Red'], ['Green'], ['Blue']], columns=None)
    _s = pd.concat([df_colors]*len(df_people), axis=1)
    _s.columns = df_people.index
    df_people = df_people.join(_s.T.add_prefix('Color_'))

# Dani Mesejo - https://stackoverflow.com/a/60805898/452587
def method_3():
    df_people = pd.DataFrame([['Tom'], ['Jerry']], columns=['Name'])
    df_colors = pd.DataFrame([['Red'], ['Green'], ['Blue']], columns=None)
    # Create mock key
    _m1 = df_people.assign(key=1)
    # Set new column names, transpose, and create mock key
    _m2 = df_colors.set_index('Color_' + df_colors.index.astype(str)).T.assign(key=1)
    df_people =  _m1.merge(_m2, on='key').drop('key', axis=1)

# Erfan - https://stackoverflow.com/a/60806018/452587
def method_4():
    df_people = pd.DataFrame([['Tom'], ['Jerry']], columns=['Name'])
    df_colors = pd.DataFrame([['Red'], ['Green'], ['Blue']], columns=None)
    df_colors = df_colors.T.reindex(df_people.index).ffill().add_prefix('Color_')
    df_people = df_people.join(df_colors)

print('Method 1:', timeit.timeit(method_1, number=10000))
print('Method 2:', timeit.timeit(method_2, number=10000))
print('Method 3:', timeit.timeit(method_3, number=10000))
print('Method 4:', timeit.timeit(method_4, number=10000))

Output:

Method 1: 36.029883089
Method 2: 27.042384837999997
Method 3: 68.22421793800001
Method 4: 32.94155895

In my effort to simplify the scenario, unfortunately I oversimplified it. It's too late to rephrase the question now, so I think I will post a related question at a later date. The real scenario involves mathematics as well, so instead of simply appending columns in df_colors to df_people, I also need to perform some calculations against a column in the corresponding row for each added cell.

UPDATE 2

I've made the sample dataframes larger (thanks jezrael) and added two new methods.

# Import required modules
import numpy as np
import pandas as pd
import timeit

# Original
def method_1():
    df_people = pd.DataFrame(['Tom', 'Jerry', 'Bob', 'John', 'Bill', 'Tim', 'Harry', 'Rick'] * 1000, columns=['Name'])
    df_colors = pd.DataFrame(['Red', 'Green', 'Blue'] * 10, columns=None)
    # Store data for new columns in a dictionary
    new_columns = {}
    for index_people, row_people in df_people.iterrows():
        for index_colors, row_colors in df_colors.iterrows():
            key = 'Color_' + str(index_colors)
            if (key in new_columns):
                new_columns[key].append(row_colors[0])
            else:
                new_columns[key] = [row_colors[0]]
    # Add dictionary data as new columns
    for key, value in new_columns.items():
        df_people[key] = value

# YOBEN_S - https://stackoverflow.com/a/60805881/452587
def method_2():
    df_people = pd.DataFrame(['Tom', 'Jerry', 'Bob', 'John', 'Bill', 'Tim', 'Harry', 'Rick'] * 1000, columns=['Name'])
    df_colors = pd.DataFrame(['Red', 'Green', 'Blue'] * 10, columns=None)
    _s = pd.concat([df_colors]*len(df_people), axis=1)
    _s.columns = df_people.index
    df_people = df_people.join(_s.T.add_prefix('Color_'))

# sammywemmy - https://stackoverflow.com/a/60805964/452587
def method_3():
    df_people = pd.DataFrame(['Tom', 'Jerry', 'Bob', 'John', 'Bill', 'Tim', 'Harry', 'Rick'] * 1000, columns=['Name'])
    df_colors = pd.DataFrame(['Red', 'Green', 'Blue'] * 10, columns=None)
    # Create a new column in df_people with aggregate of df_colors;
    df_people['Colors'] = df_colors[0].str.cat(sep=',')
    # Concatenate df_people['Name'] and df_people['Colors'];
    # split column, expand into a dataframe, and add prefix
    df_people = pd.concat([df_people.Name, df_people.Colors.str.split(',', expand=True).add_prefix('Color_')], axis=1)

# Dani Mesejo - https://stackoverflow.com/a/60805898/452587
def method_4():
    df_people = pd.DataFrame(['Tom', 'Jerry', 'Bob', 'John', 'Bill', 'Tim', 'Harry', 'Rick'] * 1000, columns=['Name'])
    df_colors = pd.DataFrame(['Red', 'Green', 'Blue'] * 10, columns=None)
    # Create mock key
    _m1 = df_people.assign(key=1)
    # Set new column names, transpose, and create mock key
    _m2 = df_colors.set_index('Color_' + df_colors.index.astype(str)).T.assign(key=1)
    df_people =  _m1.merge(_m2, on='key').drop('key', axis=1)

# Erfan - https://stackoverflow.com/a/60806018/452587
def method_5():
    df_people = pd.DataFrame(['Tom', 'Jerry', 'Bob', 'John', 'Bill', 'Tim', 'Harry', 'Rick'] * 1000, columns=['Name'])
    df_colors = pd.DataFrame(['Red', 'Green', 'Blue'] * 10, columns=None)
    df_colors = df_colors.T.reindex(df_people.index).ffill().add_prefix('Color_')
    df_people = df_people.join(df_colors)

# jezrael - https://stackoverflow.com/a/60826723/452587
def method_6():
    df_people = pd.DataFrame(['Tom', 'Jerry', 'Bob', 'John', 'Bill', 'Tim', 'Harry', 'Rick'] * 1000, columns=['Name'])
    df_colors = pd.DataFrame(['Red', 'Green', 'Blue'] * 10, columns=None)
    _a = np.broadcast_to(df_colors[0], (len(df_people), len(df_colors)))
    df_people = df_people.join(pd.DataFrame(_a, index=df_people.index).add_prefix('Color_'))

print('Method 1:', timeit.timeit(method_1, number=3))
print('Method 2:', timeit.timeit(method_2, number=3))
print('Method 3:', timeit.timeit(method_3, number=3))
print('Method 4:', timeit.timeit(method_4, number=3))
print('Method 5:', timeit.timeit(method_5, number=3))
print('Method 6:', timeit.timeit(method_6, number=3))

Output:

Method 1: 74.512771493
Method 2: 1.0007798979999905
Method 3: 0.40823360299999933
Method 4: 0.08115736700000298
Method 5: 0.11704620100000795
Method 6: 0.04700596800000767

UPDATE 3

I've posted a related question for transposing and calculating, which more accurately reflects the real dataset:

Fastest way to transpose and calculate in Pandas?

thdoan
  • 18,421
  • 1
  • 62
  • 57

6 Answers6

1

We can do

s=pd.concat([df1]*len(df),axis=1)
s.columns=df.index
df=df.join(s.T.add_prefix('color_'))
    Name color_0 color_1 color_2
0    Tom     Red   Green    Blue
1  Jerry     Red   Green    Blue
BENY
  • 317,841
  • 20
  • 164
  • 234
1

You can improve performance by numpy.broadcast_to method:

df_people = pd.DataFrame([['Tom'], ['Jerry']], columns=['Name'])
df_colors = pd.DataFrame([['Red'], ['Green'], ['Blue']], columns=None)

a = np.broadcast_to(df_colors[0], (len(df_people), len(df_colors)))
df = df_people.join(pd.DataFrame(a, index=df_people.index).add_prefix('Color_'))
print (df)
    Name Color_0 Color_1 Color_2
0    Tom     Red   Green    Blue
1  Jerry     Red   Green    Blue

import timeit

def method_2():
    df_people = pd.DataFrame([['Tom'], ['Jerry']], columns=['Name'])
    df_colors = pd.DataFrame([['Red'], ['Green'], ['Blue']], columns=None)
    _s = pd.concat([df_colors]*len(df_people), axis=1)
    _s.columns = df_people.index
    df_people = df_people.join(_s.T.add_prefix('Color_'))

def method_5():
    df_people = pd.DataFrame([['Tom'], ['Jerry']], columns=['Name'])
    df_colors = pd.DataFrame([['Red'], ['Green'], ['Blue']], columns=None)
    a = np.broadcast_to(df_colors[0], (len(df_people), len(df_colors)))
    df_people = df_people.join(pd.DataFrame(a, index=df_people.index).add_prefix('Color_'))

print('Method 2:', timeit.timeit(method_2, number=10000))
Method 2: 27.919169027998578

print('Method 5:', timeit.timeit(method_5, number=10000))
Method 5: 21.452649746001043

But I think better is test in large DataFrame, e.g. here for 3k rows and 30 columns, then timings are different:

# Import required modules
import pandas as pd
import timeit

# Original
def method_1():
    df_people = pd.DataFrame(['Tom','Jerry','Bob'] * 1000, columns=['Name'])
    df_colors = pd.DataFrame(['Red','Green', 'Blue'] * 10, columns=None)
    # Store data for new columns in a dictionary
    new_columns = {}
    for index_people, row_people in df_people.iterrows():
        for index_colors, row_colors in df_colors.iterrows():
            key = 'Color_' + str(index_colors)
            if (key in new_columns):
                new_columns[key].append(row_colors[0])
            else:
                new_columns[key] = [row_colors[0]]
    # Add dictionary data as new columns
    for key, value in new_columns.items():
        df_people[key] = value

# YOBEN_S - https://stackoverflow.com/a/60805881/452587
def method_2():
    df_people = pd.DataFrame(['Tom','Jerry','Bob'] * 1000, columns=['Name'])
    df_colors = pd.DataFrame(['Red','Green', 'Blue'] * 10, columns=None)
    _s = pd.concat([df_colors]*len(df_people), axis=1)
    _s.columns = df_people.index
    df_people = df_people.join(_s.T.add_prefix('Color_'))

# Dani Mesejo - https://stackoverflow.com/a/60805898/452587
def method_3():
    df_people = pd.DataFrame(['Tom','Jerry','Bob'] * 1000, columns=['Name'])
    df_colors = pd.DataFrame(['Red','Green', 'Blue'] * 10, columns=None)
    # Create mock key
    _m1 = df_people.assign(key=1)
    # Set new column names, transpose, and create mock key
    _m2 = df_colors.set_index('Color_' + df_colors.index.astype(str)).T.assign(key=1)
    df_people =  _m1.merge(_m2, on='key').drop('key', axis=1)

# Erfan - https://stackoverflow.com/a/60806018/452587
def method_4():
    df_people = pd.DataFrame(['Tom','Jerry','Bob'] * 1000, columns=['Name'])
    df_colors = pd.DataFrame(['Red','Green', 'Blue'] * 10, columns=None)
    df_colors = df_colors.T.reindex(df_people.index).ffill().add_prefix('Color_')
    df_people = df_people.join(df_colors)

def method_5():
    df_people = pd.DataFrame(['Tom','Jerry','Bob'] * 1000, columns=['Name'])
    df_colors = pd.DataFrame(['Red','Green', 'Blue'] * 10, columns=None)
    a = np.broadcast_to(df_colors[0], (len(df_people), len(df_colors)))
    df_people = df_people.join(pd.DataFrame(a, index=df_people.index).add_prefix('Color_'))

print('Method 1:', timeit.timeit(method_1, number=3))
print('Method 2:', timeit.timeit(method_2, number=3))
print('Method 3:', timeit.timeit(method_3, number=3))
print('Method 4:', timeit.timeit(method_4, number=3))
print('Method 5:', timeit.timeit(method_5, number=3))

Method 1: 34.91457201199955
Method 2: 0.7901797180002177
Method 3: 0.05690281799979857
Method 4: 0.05774562500118918
Method 5: 0.026483284000278218
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • jezrael, my [new question](https://stackoverflow.com/q/60843541/452587) is closer to what I'm trying to do. I tried using `np.broadcast_to()` but failed :(. If you could adapt your solution to the new question and it's faster than what I have now, I will surely accept it. This time I will wait 48 hours instead of 24 hours. – thdoan Mar 25 '20 at 06:04
0

You could do:

import pandas as pd

# input sample data
df1 = pd.DataFrame([['Tom'], ['Jerry']], columns=['name'])
df2 = pd.DataFrame([['Red'], ['Gree'], ['Blue']], columns=None)

# create mock key
m1 = df1.assign(key=1)

# set new column names, transpose and create mock key
m2 = df2.set_index('Color_' + df2.index.astype(str)).T.assign(key=1)

result = m1.merge(m2, on='key').drop('key', axis=1)

print(result)

Output

    name Color_0 Color_1 Color_2
0    Tom     Red    Gree    Blue
1  Jerry     Red    Gree    Blue
Dani Mesejo
  • 61,499
  • 6
  • 49
  • 76
0

Another possible solution:

#create a new column in df1, with aggregate of df2:
 #i set the header for df2 column as 'color'
 df1['color'] = df2['color'].str.cat(sep=',')
#concatenate df1['Name'] and df1['Color'] as below:
pd.concat([df1.Name,
            #split column, expand into a dataframe and add prefix
           df1.color.str.split(',',expand=True).add_prefix('color_')],
          axis=1)

    Name    color_0 color_1 color_2
0   Tom       Red   Green   Blue
1   Jerry     Red   Green   Blue
sammywemmy
  • 27,093
  • 4
  • 17
  • 31
  • Thanks for the answer. I got this error: 'DataFrame' object has no attribute 'color' (I think it's referring to `df2.color`). – thdoan Mar 24 '20 at 05:51
  • I think the problem is that df_colors has no column names, so df_colors['color'] doesn't exist. Changing to column index fixed it: `df_colors[0]`. I'll update the timings with your method as well. – thdoan Mar 24 '20 at 16:58
0

Using DataFrame.reindex, DataFrame.ffill and DataFrame.add_prefix:

df2 = df2.T.reindex(df1.index).ffill().add_prefix('Color_')
df1 = df1.join(df2)

    Name Color_0 Color_1 Color_2
0    Tom     Red   Green    Blue
1  Jerry     Red   Green    Blue
Erfan
  • 40,971
  • 8
  • 66
  • 78
0

you can use:

colors = df_colors.T.append(df_colors.T).add_prefix('Color_').reset_index(drop=True)
pd.concat([df_people, colors], axis=1)

output:

enter image description here

kederrac
  • 16,819
  • 6
  • 32
  • 55