I have a function that gets passed a pandas DataFrame, and for each row in that DataFrame I'd like to create N additional rows, each identical to the original row except for 2 column values.
What's the right way to do this, especially in a RAM-efficient manner?
My attempts so far have been to run `df.apply`, and then for each row call a function that returns a list of `pd.Series` objects, which I would then `append` to the original `DataFrame`. This hasn't worked out, though.
Here is an example I tried with some dummy code to replicate:
import pandas as pd

students = [('Jack', 34, 'Sydney', 'Australia'),
            ('Jill', 30, 'New York', 'USA')]
# Create a DataFrame object (the index must match the number of rows)
df = pd.DataFrame(students, columns=['Name', 'Age', 'City', 'Country'], index=['a', 'b'])
# function I will use to explode a single row into 3 new rows
def replicate(x):
    new_rows = []
    for j in range(3):
        y = x.copy(deep=True)
        y.Age = j
        new_rows.append(y)
    return new_rows
# Iterate over each row and append the results
# (broken: apply needs axis=1 to go row-wise, and append returns a
#  new DataFrame instead of modifying df in place)
df.apply(lambda x: df.append(replicate(x)), axis=1)
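For completeness, here is a variant I sketched that collects all the replicated rows first and calls `pd.concat` once at the end instead of appending repeatedly. It runs and produces the interleaved layout I want, but I don't know whether it's actually RAM-friendly, since `iterrows` copies every row into a `Series`:

```python
import pandas as pd

students = [('Jack', 34, 'Sydney', 'Australia'),
            ('Jill', 30, 'New York', 'USA')]
df = pd.DataFrame(students, columns=['Name', 'Age', 'City', 'Country'])

def replicate(x):
    # return 3 copies of the row, with Age replaced by 0, 1, 2
    new_rows = []
    for j in range(3):
        y = x.copy(deep=True)
        y.Age = j
        new_rows.append(y)
    return new_rows

# gather original row + its replicas as small frames, then concat once
pieces = []
for _, orig in df.iterrows():
    pieces.append(orig.to_frame().T)          # the original row
    pieces.append(pd.DataFrame(replicate(orig)))  # its 3 derived rows
result = pd.concat(pieces, ignore_index=True)
```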
For the above, I'd expect output like the following:
Jack, 34, Sydney, Australia
Jack, 0, Sydney, Australia
Jack, 1, Sydney, Australia
Jack, 2, Sydney, Australia
Jill, 30, New York, USA
Jill, 0, New York, USA
Jill, 1, New York, USA
Jill, 2, New York, USA
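Incidentally, I also wondered whether something vectorized would be more memory-friendly than per-row Python work. A sketch of what I mean, using my dummy data (the `N = 3` and the new Age values `0..N-1` are just placeholders for my real derivation):

```python
import numpy as np
import pandas as pd

students = [('Jack', 34, 'Sydney', 'Australia'),
            ('Jill', 30, 'New York', 'USA')]
df = pd.DataFrame(students, columns=['Name', 'Age', 'City', 'Country'])

N = 3  # number of derived rows per original row
# repeat every row N times in one shot
copies = df.loc[df.index.repeat(N)].copy()
# overwrite the column(s) that differ; here Age becomes 0..N-1 per row
copies['Age'] = np.tile(np.arange(N), len(df))
# interleave originals before their copies: stable sort on the shared index
out = pd.concat([df, copies]).sort_index(kind='stable').reset_index(drop=True)
```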
In the end, I'd like my DataFrame to have N times as many rows, where the new rows are derived from the original rows. I'd like to do this in a space-efficient manner, and that isn't happening right now. Any help is appreciated!