I'm trying to use pandas's groupby to add a modified sub-group DataFrame to a list. I thought a groupby works like a map-reduce operation: it splits the DataFrame into smaller frames, applies a function to each piece, and then collects the results to aggregate them.
import pandas as pd
testdf = pd.DataFrame({
    "a": [1, 2, 3, 4, 1, 2, 3],
    "b": [-1, -2, -3, -4, -1, -2, -3]
})
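Just to spell out the map-reduce picture I have in mind: with this testdf, a plain split-apply-combine aggregation looks like the sketch below (the sum is arbitrary and only for illustration).
# Split by 'a', apply a reduction to each piece, combine the per-group results.
print(testdf.groupby('a')['b'].sum())
# a
# 1   -2
# 2   -4
# 3   -6
# 4   -4
# Name: b, dtype: int64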
def foo(input_frame, ref_list):
    # Side effect only: append a copy of the passed sub-frame to the shared list.
    ref_list.append(input_frame.copy())
### Map-reduce style: adding to a list via apply
testlist = []
testdf.groupby('a').apply(lambda grp: foo(grp, testlist))  # each grp is a whole sub-frame, not a single row
test_concat_df = pd.concat(testlist)
print(test_concat_df)
# Output of test_concat_df: 9 rows, whereas the original dataframe has only 7 rows.
# a b
# 0 1 -1
# 4 1 -1
# 0 1 -1
# 4 1 -1
# 1 2 -2
# 5 2 -2
# 2 3 -3
# 6 3 -3
# 3 4 -4
In the code above I'm getting 2 extra rows (the rows of the first group, a == 1, appear twice). Why is this?
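To narrow down where the extra rows come from, one quick diagnostic (just a sketch reusing the same testdf; calls is an illustrative name, not part of my original code) is to record which group key the applied function receives on every call:
calls = []
testdf.groupby('a').apply(lambda grp: calls.append(grp['a'].iloc[0]))
print(calls)
# With the pandas version that produced the 9-row output above, I'd expect
# the first key to show up twice, e.g. [1, 1, 2, 3, 4].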
### Add to a list sequentially, in a blocking loop
testlist2 = []
for name, group in testdf.groupby('a'):
    foo(group, testlist2)
test_concat_df2 = pd.concat(testlist2)
print(test_concat_df2)
# Output of test_concat_df2: this matches the original dataframe, although, judging
# by the index values, the rows are not in their original order.
# a b
# 0 1 -1
# 4 1 -1
# 1 2 -2
# 5 2 -2
# 2 3 -3
# 6 3 -3
# 3 4 -4
If I collect the groups in an explicit sequential loop like this, everything is fine.
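For completeness, the same sequential collection can also be written without the helper function and without side effects (testlist3 and test_concat_df3 below are just illustrative names, not part of my original code):
# Collect a copy of each sub-frame directly, then combine.
testlist3 = [group.copy() for _, group in testdf.groupby('a')]
test_concat_df3 = pd.concat(testlist3)
print(test_concat_df3)
# Same 7 rows as test_concat_df2 above.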