
I have a Pandas Dataframe that looks similar to this

|Ind| C1 | C2 |....| Cn |
|-----------------------|
| 1 |val1| AE |....|time|
|-----------------------|
| 2 |val2| FB |....|time|
|-----------------------|
|...|....| .. |....| ...|
|-----------------------|
| n |valn| QK |....|time|

and I have to group it by column C2, do some filtering on each group, and store the results in a separate file for each group.

Grouped Dataframe:

Subset 1:

|Ind| C1 | C2 |....| Cn |
|-----------------------|
| 1 |val1| AE |....|time|
|-----------------------|
| 2 |val2| AE |....|time|
|-----------------------|
|...|....| .. |....| ...|
|-----------------------|
| n |valn| AE |....|time|

Subset 2:

|Ind| C1 | C2 |....| Cn |
|-----------------------|
| 1 |val1| FB |....|time|
|-----------------------|
| 2 |val2| FB |....|time|
|-----------------------|
|...|....| .. |....| ...|
|-----------------------|
| n |valn| FB |....|time|


and so on.

My current approach looks similar to this

def my_filter_function(self, df):
   # Parenthesize the comparison: & binds more tightly than !=
   result = df[df["C1"].notna() & (df["Cn"] != 'Some value')]
   result.to_csv(...)


df = pd.read_csv(...)

df.groupby("C2").apply(lambda x: self.my_filter_function(x))

My problem now is that pandas calls the function passed to apply twice on the first group, as mentioned here, here and in the docs, so the file for the first group is written twice. Is there any way to avoid this, or can you suggest another approach? Is it possible to keep the grouping after the apply call?

Regards

pichlbaer

2 Answers


Why not put the

to_csv(...)

call after

df = df.groupby("C2").apply(lambda x: self.my_filter_function(x))

instead of inside my_filter_function? The function would then only need to return the filtered result, and writing the files would no longer be affected by apply calling it twice on the first group.
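One way to sketch the "filter first, write afterwards" idea: because the filter in the question does not depend on the group, it can be applied once to the whole frame with a boolean mask instead of going through apply. This swaps apply for a plain mask, which may not be what the answer had in mind. The sample data is hypothetical; only the column names C1, C2 and Cn come from the question.

```python
import pandas as pd

# Hypothetical sample data; only the column names follow the question.
df = pd.DataFrame({
    "C1": [1.0, None, 3.0, 4.0],
    "C2": ["AE", "AE", "FB", "FB"],
    "Cn": ["keep", "keep", "Some value", "keep"],
})

# The filter is independent of the group, so apply it once to the whole frame.
mask = df["C1"].notna() & (df["Cn"] != "Some value")
filtered = df[mask]

# Group afterwards; the grouping is still available for writing each subset.
subsets = {name: group for name, group in filtered.groupby("C2")}
for name, group in subsets.items():
    print(name, len(group))
```

Each entry in subsets can then be written with its own to_csv call, so no file is produced more than once.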

Filipe Aleixo

You can loop over the groupby object to avoid processing the first group twice:

for name, group in df.groupby("C2"):
    result = group[group["C1"].notna() & (group["Cn"] != 'Some value')]
    result.to_csv(...)

Sample:

df = pd.DataFrame({
         'D':[1,3,5,7,1,0],
         'E':[5,3,6,9,2,4],
         'C2':list('aaabbb')
})

for name, group in df.groupby("C2"):
    print(group)

   D  E C2
0  1  5  a
1  3  3  a
2  5  6  a
   D  E C2
3  7  9  b
4  1  2  b
5  0  4  b
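
Building on the sample above, a minimal sketch of writing one file per group inside the loop. The temporary output directory and the group_<name>.csv naming scheme are assumptions made for illustration, not part of the question.

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame({
    "D": [1, 3, 5, 7, 1, 0],
    "E": [5, 3, 6, 9, 2, 4],
    "C2": list("aaabbb"),
})

# Hypothetical output location: a fresh temporary directory.
out_dir = Path(tempfile.mkdtemp())

written = []
for name, group in df.groupby("C2"):
    # Hypothetical naming scheme: one CSV per group key.
    path = out_dir / f"group_{name}.csv"
    group.to_csv(path, index=False)
    written.append(path)

print([p.name for p in written])
```

Each group is visited exactly once, so each file is written exactly once.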
jezrael
  • Hey...thanks for that answer. I know that I can do that...but I hate looping through stuff. But if there is no other way to accomplish this I'll have to do it – pichlbaer Jan 16 '19 at 11:24
  • @pichlbaer - `apply` is also a loop under the hood, and I have no other idea for avoiding it. Calling the function twice on the first group is by design in pandas :( – jezrael Jan 16 '19 at 11:26