
I have a dataframe df like this:

  x     y
110     0
110     0
110     1
111     1
111     0
112     1
113     1
114     0

When I filter the dataframe and apply operations like len and sum, everything works correctly, for example:

new = df.x.isin([110,111])
df[new]
len(df[new].y)  # 5
sum(df[new].y)  # 2

However, when I call the isin function inside a loop, it doesn't work correctly.

I have a second dataframe df0 like this:

col1  col2

a     110,111
b     113
c     114,1114
d     267,118
e     956

I want to iterate over df0 and, for each group gr of values listed in df0.col2, compute len and sum over the matching rows of df.x, as in this loop:

for i in df0.index:
    gr = df0.get_value(i, 'col2')
    new = df.x.isin([gr])
    df_size = len(df[new].y)
    df_sum = sum(df[new].y)

The issue is that for the group gr = 110,111 the element 111 is ignored, so df_size = 3 and df_sum = 1, when they should be 5 and 2.

Annalix
  • What is the need for a for loop? – anky Dec 13 '19 at 18:14
  • I have to iterate over each group in col2 of the dataframe df0 – Annalix Dec 13 '19 at 18:16
  • Okay, can you post df0 in the same format as df? Even better would be providing a dataframe constructor and the expected output – anky Dec 13 '19 at 18:17
  • What do you mean by the same format? This is the example dataframe I am already using. My question is why the isin function doesn't work in a loop – Annalix Dec 13 '19 at 18:25
  • Meaning, I am able to reproduce the first df using `pd.read_clipboard()`, but can't for the second one – anky Dec 13 '19 at 18:27
  • I don't need to reproduce the second dataframe, I already have it. Use pd.DataFrame({'col1': ..., 'col2': ...}) with a dictionary to reproduce what you see in the example – Annalix Dec 13 '19 at 18:32
  • Please take a look at [How to make good, reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) for help with making sample data that is easy to reproduce, to help us to help you more easily – G. Anderson Dec 13 '19 at 18:33
  • I have encountered issues where `in` does not work because of float numbers errors. Is that the same issue here? – Guimoute Dec 13 '19 at 18:34
  • @G.Anderson thanks, I have just explained how to make the dataframe in a previous comment. You can do it yourself. The problem of the topic is another one. – Annalix Dec 13 '19 at 18:40
  • @Guimoute there aren't any float numbers. It doesn't recognise a list of numbers like [110,111], only single numbers or the first one in the list, 110 – Annalix Dec 13 '19 at 18:42

1 Answer

Look at the first line of your first code sample:

new = df.x.isin([110,111])

The argument of isin is a list.

Then look at df.x.isin([gr]) in the second code sample. If gr is e.g. '111,112' (a string), then [gr] is ['111,112'], i.e. a list containing a single element. Wrapping gr in square brackets does not split it.
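A minimal sketch of the difference, assuming df holds the integer values shown in the question (the names wrapped, parts and values are only illustrative):

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({"x": [110, 110, 110, 111, 111, 112, 113, 114],
                   "y": [0, 0, 1, 1, 0, 1, 1, 0]})

gr = "110,111"                     # one cell of df0.col2: a single string
wrapped = [gr]                     # ['110,111'], a list with ONE element
parts = gr.split(",")              # ['110', '111'], two strings
values = [int(p) for p in parts]   # [110, 111], two integers

print(len(wrapped), len(values))
print(df.x.isin(wrapped).sum())    # the string '110,111' equals no integer in x
print(df.x.isin(values).sum())     # counts the rows where x is 110 or 111
```

Note that isin treats strings and integers as distinct, so the pieces have to be converted to int before they can match a numeric column.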

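One possible solution is to convert col2 as follows:

df0.col2 = df0.col2.str.split(',').apply(lambda parts: [int(p) for p in parts])

so that each element is a list of integers (not a single string). The int conversion is needed because str.split alone yields strings such as '110', which do not match the numeric values in df.x.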

Then change the second code sample to:

for _, row in df0.iterrows():
    new = df[df.x.isin(row.col2)]
    df_size = new.y.size
    df_sum = new.y.sum()
    print(row.col2, df_size, df_sum)

In the final version replace print with whatever you want to do with these variables.
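For reference, a self-contained sketch of the whole approach, with both dataframes rebuilt from the data shown in the question. It assumes df.x holds integers and that every entry of col2 is a comma-separated string of integers; the results dictionary is a hypothetical stand-in for whatever you do with the two values.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({"x": [110, 110, 110, 111, 111, 112, 113, 114],
                   "y": [0, 0, 1, 1, 0, 1, 1, 0]})
df0 = pd.DataFrame({"col1": ["a", "b", "c", "d", "e"],
                    "col2": ["110,111", "113", "114,1114", "267,118", "956"]})

# Turn each comma-separated string into a list of integers
df0["col2"] = df0["col2"].str.split(",").apply(lambda parts: [int(p) for p in parts])

results = {}  # hypothetical container: col1 label -> (size, sum)
for _, row in df0.iterrows():
    subset = df[df.x.isin(row.col2)]
    results[row.col1] = (len(subset), int(subset.y.sum()))

print(results)
```

Groups whose values do not occur in df.x (here d and e) simply give an empty subset, so both the size and the sum are 0.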

Valdi_Bo
  • Thanks, will have a look. The only thing is that iterrows is very slow, that's why I iterate with get_value and the gr variable. Do you think it is the same? – Annalix Dec 14 '19 at 21:35
  • *iterrows* should work quicker than your loop over the index. Note that in your loop you perform an "individual search" for the row with the current index, which takes some time. But when you use *iterrows*, you already have the current row retrieved (no need to search for it again). – Valdi_Bo Dec 15 '19 at 10:43
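
For readers on pandas 1.0 or later, where get_value is no longer available, here is a small sketch showing that the index-based loop (written with the .at accessor as the replacement) and the iterrows loop visit the same values. The two-row df0 is a shortened, made-up version of the one in the question, and nothing here measures speed.

```python
import pandas as pd

# Shortened sample of df0 from the question
df0 = pd.DataFrame({"col1": ["a", "b"], "col2": ["110,111", "113"]})

# Index-based loop: look each cell up by label with .at
via_at = [df0.at[i, "col2"] for i in df0.index]

# Row-based loop, as in the answer: the row is already in hand
via_iterrows = [row.col2 for _, row in df0.iterrows()]

print(via_at == via_iterrows)
```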