I have a Pandas DataFrame that looks like this:

df = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': [4, 5, 6],
                   'col3': [7, 8, 9]})

df
    col1    col2    col3
0      1       4       7
1      2       5       8
2      3       6       9

I would like to create a Pandas DataFrame like this:

df_new
    col1    col2    col3
0      1       4       7
1      1       5       8
2      1       6       9
3      2       4       7
4      2       5       8
5      2       6       9
6      3       4       7
7      3       5       8
8      3       6       9

Is there a built-in Pandas method, or a combination of built-in methods, that can achieve this?

Even if there are duplicates in df, I would like the output to be the same format. In other words:

df
    col1    col2    col3
0      1       4       7
1      2       5       8
2      2       6       8

df_new
    col1    col2    col3
0      1       4       7
1      1       5       8
2      1       6       8
3      2       4       7
4      2       5       8
5      2       6       8
6      2       4       7
7      2       5       8
8      2       6       8
Mark Rotteveel
Erniu
    Can you include some more details about your scaling logic? It looks like you're taking the Cartesian product between col1 and the rest of the columns which would just be a [standard cross join](/a/53699198/15497888) between col1 and [col2, col3] which would be significantly faster than manual scaling. Something like `df[['col1']].merge(df[['col2', 'col3']], how='cross').reset_index(drop=True)` Is this correct or do you actually just need to scale your DataFrame a fixed number of times based on column values or otherwise? – Henry Ecker May 26 '23 at 04:54
    @HenryEcker. Great use of `how='cross'`! It could also be `to_tile=['col1']; df[to_tile].merge(df[df.columns.difference(to_tile)], how='cross').reset_index(drop=True)` – Corralien May 26 '23 at 06:24

6 Answers


I would love to see a more Pythonic or pandas-only answer, but this one also works well!

import pandas as pd
import numpy as np

n=3

df = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': [4, 5, 6],
                   'col3': [7, 8, 9]})

# Edited and added this new method.
df2 = pd.DataFrame({df.columns[0]:np.repeat(df['col1'].values, n)})
df2[df.columns[1:]] = df.iloc[:,1:].apply(lambda x: np.tile(x, n))

""" Old method.
for col in df.columns[1:]:
   df2[col] = np.tile(df[col].values, n)

"""
print(df2)
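As a quick sanity check (a sketch using the question's 3-row df and the loop variant of the old method), the repeat/tile construction matches the cross join suggested in the comments:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': [4, 5, 6],
                   'col3': [7, 8, 9]})
n = len(df)

# repeat col1, tile the remaining columns
df2 = pd.DataFrame({'col1': np.repeat(df['col1'].to_numpy(), n)})
for col in df.columns[1:]:
    df2[col] = np.tile(df[col].to_numpy(), n)

# should be identical to a cross join of col1 with the rest
expected = df[['col1']].merge(df[['col2', 'col3']], how='cross')
assert df2.equals(expected)
```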
rr_goyal

I would also have gone for a cross merge, as suggested by @HenryEcker in the comments:

out = df[['col1']].merge(df[['col2', 'col3']], how='cross').reset_index(drop=True)

Output:

   col1  col2  col3
0     1     4     7
1     1     5     8
2     1     6     9
3     2     4     7
4     2     5     8
5     2     6     9
6     3     4     7
7     3     5     8
8     3     6     9
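The cross merge also handles the duplicate case from the question directly (a sketch with the question's second df):

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 2],
                   'col2': [4, 5, 6],
                   'col3': [7, 8, 8]})

# every col1 value (duplicates included) is paired with every (col2, col3) row
out = df[['col1']].merge(df[['col2', 'col3']], how='cross')
```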

Comparison of the different approaches:

[benchmark figure: runtime of each approach as the DataFrame length grows]

Note that @sammywemmy's approach behaves differently when rows are duplicated, which makes its timing not directly comparable.

mozway
  • Interesting timing data. It kind of looks like the other solutions might be going to catch mine as length increases... – Nick May 26 '23 at 07:51
  • for the nick_concat solution, I suspect the copies affect the speed (a copy for drop, a copy for assign, a copy for reset_index). – sammywemmy May 26 '23 at 08:13
  • @Nick I added some more rows (it previously crashed on another computer as the combinations are memory expensive). Looks like they converge with still a difference – mozway May 26 '23 at 08:57
    @sammywemmy regarding your approach, it doesn't provide the same output as the others when there are duplicated rows (thus the different timing) – mozway May 26 '23 at 08:58
    @mozway Cool stuff. Interesting to see itertools as the slowest option. I love questions that get such a variety of responses and someone doing the timing. – Nick May 26 '23 at 09:05
  • @Nick I did the timings on repeats of OP's dataset (`lambda n: pd.concat([df]*(n//3), ignore_index=True)` where `n` is the X-axis value), it might also be interesting to test randomly generated data. It might behave differently :) – mozway May 26 '23 at 09:13
  • I tried using `random.choices(range(100), k=100)` for `col1`, `col2` and `col3` and the clear winner was @rr_goyal answer. Times for 1000 iterations: rr_goyal 0.44, mozway 1.3, PaulS 2.87, Nick 8.17 :( I don't have janitor so couldn't test – Nick May 26 '23 at 12:21
  • Using `random.choices(range(10000), k=10000)` rr_goyal 1.16, mozway 3.13, Nick 3.71, PaulS 34.18(!) – Nick May 26 '23 at 12:27
  • I couldn't get 100k to run (ran out of memory on my 64GB PC), but for 30000 rr_goyal: 11.66, mozway 71.01, Nick 87.8. Didn't wait for PaulS to finish. – Nick May 26 '23 at 12:35
  • Just realised I missed out Corallien's answer, at 30k it's 13.33 so second fastest. – Nick May 26 '23 at 12:38
  • @Nick feel free to edit those in my answer if you like – mozway May 26 '23 at 13:13
    Great to learn a variety of new approaches! – rr_goyal May 28 '23 at 23:12

You could concatenate copies of the dataframe, with col1 replaced in each copy by each of the values in col1:

out = df.drop('col1', axis=1)
out = pd.concat([out.assign(col1=c1) for c1 in df['col1']]).reset_index(drop=True)

Output:

   col2  col3  col1
0     4     7     1
1     5     8     1
2     6     9     1
3     4     7     2
4     5     8     2
5     6     9     2
6     4     7     3
7     5     8     3
8     6     9     3

If you prefer, you can then re-order the columns back to the original using

out = out[['col1', 'col2', 'col3']]
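The two steps can be combined into one expression, reusing df.columns so the reordering works for any column names (a sketch):

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': [4, 5, 6],
                   'col3': [7, 8, 9]})

# one copy of the other columns per col1 value, then restore column order
out = pd.concat(
    [df.drop(columns='col1').assign(col1=c1) for c1 in df['col1']],
    ignore_index=True,
)[df.columns]
```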
Nick

You can use np.repeat and np.tile to get the expected output:

import numpy as np

N = 3
cols_to_repeat = ['col1']  # 1, 1, 1, 2, 2, 2
cols_to_tile = ['col2', 'col3']  # 1, 2, 1, 2, 1, 2

data = np.concatenate([np.tile(df[cols_to_tile].values.T, N).T,
                       np.repeat(df[cols_to_repeat].values, N, axis=0)], axis=1)
out = pd.DataFrame(data, columns=cols_to_tile + cols_to_repeat)[df.columns]

Output:

>>> out
   col1  col2  col3
0     1     4     7
1     1     5     8
2     1     6     9
3     2     4     7
4     2     5     8
5     2     6     9
6     3     4     7
7     3     5     8
8     3     6     9

You can create a generic function:

def repeat(df: pd.DataFrame, to_repeat: list[str], to_tile: list[str] = None) -> pd.DataFrame:
    to_tile = to_tile if to_tile else df.columns.difference(to_repeat).tolist()

    assert df.columns.difference(to_repeat + to_tile).empty, "all columns should be repeated or tiled"

    n = len(df)  # each repeated value is paired with every original row
    data = np.concatenate([np.tile(df[to_tile].values.T, n).T,
                           np.repeat(df[to_repeat].values, n, axis=0)], axis=1)

    return pd.DataFrame(data, columns=to_tile + to_repeat)[df.columns]


Usage:

>>> repeat(df, ['col1'])
   col1  col2  col3
0     1     4     7
1     1     5     8
2     1     6     9
3     2     4     7
4     2     5     8
5     2     6     9
6     3     4     7
7     3     5     8
8     3     6     9
Corralien

Another possible solution, which is based on itertools.product:

from itertools import product

pd.DataFrame([[x, y[0], y[1]] for x, y in 
              product(df['col1'], zip(df['col2'], df['col3']))], 
             columns=df.columns)

Output:

   col1  col2  col3
0     1     4     7
1     1     5     8
2     1     6     9
3     2     4     7
4     2     5     8
5     2     6     9
6     3     4     7
7     3     5     8
8     3     6     9
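The same idea extends to any number of tiled columns by iterating rows with itertuples instead of a hard-coded zip (a sketch; column names taken from the question):

```python
from itertools import product
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': [4, 5, 6],
                   'col3': [7, 8, 9]})

tile_cols = ['col2', 'col3']
# pair every col1 value with every row of the tiled columns
out = pd.DataFrame(
    [(x, *rest) for x, rest in
     product(df['col1'], df[tile_cols].itertuples(index=False, name=None))],
    columns=['col1'] + tile_cols,
)
```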
PaulS

One option is with complete from pyjanitor:

# pip install pyjanitor
import janitor 
import pandas as pd

df.complete('col1', ('col2','col3'))
   col1  col2  col3
0     1     4     7
1     1     5     8
2     1     6     9
3     2     4     7
4     2     5     8
5     2     6     9
6     3     4     7
7     3     5     8
8     3     6     9

complete is primarily for exposing missing rows - the output above just happens to be a convenient side effect. A more appropriate, albeit more verbose, option is expand_grid:

# pip install pyjanitor
import janitor as jn
import pandas as pd

others = {'df1':df.col1, 'df2':df[['col2','col3']]}
jn.expand_grid(others=others).droplevel(axis=1,level=0)
   col1  col2  col3
0     1     4     7
1     1     5     8
2     1     6     8
3     2     4     7
4     2     5     8
5     2     6     8
6     2     4     7
7     2     5     8
8     2     6     8
sammywemmy
    It's nice and short for the non-duplicates case but unfortunately doesn't handle the other case correctly. – mozway May 26 '23 at 09:00