I'm trying to use pandas' groupby to add a modified copy of each subgroup DataFrame to a list. I thought groupby works like a map-reduce operation: it divides the DataFrame into smaller frames, applies a function to each one, and then collects the function results to aggregate.

    import pandas as pd

    testdf = pd.DataFrame({
        "a": [1, 2, 3, 4, 1, 2, 3],
        "b": [-1, -2, -3, -4, -1, -2, -3],
    })

    def foo(input_frame, ref_list):
        # Side effect: append a copy of the group frame to the caller's list.
        ref_list.append(input_frame.copy())


    ### Map Reduce style adding to a list
    testlist = []
    testdf.groupby('a').apply(lambda group: foo(group, testlist))
    test_concat_df = pd.concat(testlist)

    print(test_concat_df)
    # test_concat_df
    # This results in 9 rows where the original dataframe has 7 rows.
    #    a   b
    # 0  1  -1
    # 4  1  -1
    # 0  1  -1
    # 4  1  -1
    # 1  2  -2
    # 5  2  -2
    # 2  3  -3
    # 6  3  -3
    # 3  4  -4

With the code above I get 2 extra rows: the group for `a == 1` (index 0 and 4) appears twice. Why is this?

    # Add to a list in a sequence, blocking
    testlist2 = []
    for name, group in testdf.groupby('a'):
        foo(group, testlist2)

    test_concat_df2 = pd.concat(testlist2)

    print(test_concat_df2)
    # test_concat_df2
    # This has the same 7 rows as the original dataframe. Because the rows are
    # grouped by 'a', the index values are not in their original order.
    #    a  b
    # 0  1 -1
    # 4  1 -1
    # 1  2 -2
    # 5  2 -2
    # 2  3 -3
    # 6  3 -3
    # 3  4 -4

If I append the groups one at a time in a plain loop like this, everything is fine.
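One possible explanation, offered as an assumption rather than a confirmed diagnosis: older pandas documentation warns that `groupby(...).apply` may call the function twice on the first group to choose between a fast and a slow code path, so a side effect such as `append` would take effect twice for that group. That would match the duplicated `a == 1` rows. A minimal sketch that avoids relying on side effects inside `apply` is below; the names `groups` and `combined` are illustrative only.

```python
import pandas as pd

testdf = pd.DataFrame({
    "a": [1, 2, 3, 4, 1, 2, 3],
    "b": [-1, -2, -3, -4, -1, -2, -3],
})

# Iterating over the GroupBy object visits each group exactly once,
# so each group frame is copied into the list a single time.
groups = [group.copy() for _, group in testdf.groupby("a")]

# Recombine the per-group frames; the row count matches the original.
combined = pd.concat(groups)

print(len(combined) == len(testdf))
```

If the per-group work must go through `apply`, returning the modified frame from the function and letting pandas combine the results is likely safer than appending to an outer list.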

Ed Z