The Isin function within a loop doesn't work correctly

Question

I have a dataframe df like

When I filter the dataframe to perform operations like len and sum, everything works correctly, like here:

new = df.x.isin([110, 111])
df[new]
len(df[new].y)  # 5
sum(df[new].y)  # 2

However when I invoke the isin function inside a loop it doesn't work correctly.

I have a second dataframe df0 like

col1  col2

a     110,111
b     113
c     114,1114
d     267,118
e     956

and I want to iterate over df0 and compute len and sum for each group gr of df.x values listed in df0.col2, as I do in this loop:

for i in df0.index:
    gr = df0.get_value(i, 'col2')
    new = df.x.isin([gr])
    df_size = len(df[new].y)
    df_sum = sum(df[new].y)

The issue is that for the group gr = 110,111 the element 111 is ignored,

so df_size = 3 and df_sum = 1 when they should be 5 and 2.

I have to iterate over each groups of the col2 in the dataframe df0 — Annalix, Dec 13 '19 at 18:16
okay can you post the df0 in the same format as df, even better would be providing a dataframe constructor and expected output — anky, Dec 13 '19 at 18:17
What does "the same format" mean? These are the example dataframes I am already using. My question is why the isin function doesn't work in a loop — Annalix, Dec 13 '19 at 18:25
meaning , i am able to reproduce the first df using `pd.read_clipboard()` , but cant for the second one — anky, Dec 13 '19 at 18:27
I am not able to reproduce the second dataframe, I already have it. Use pd.DataFrame({'col1':...}, {col2:...}) dictionaries to reproduce the same that you see in the example — Annalix, Dec 13 '19 at 18:32
Please take a look at [How to make good, reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) for help with making sample data that is easy to reproduce, to help us to help you more easily — G. Anderson, Dec 13 '19 at 18:33
I have encountered issues where `in` does not work because of float numbers errors. Is that the same issue here? — Guimoute, Dec 13 '19 at 18:34
@G.Anderson thanks, I have just explained how to make the dataframe in a previous comment. You can do it yourself. The problem of the topic is another one. — Annalix, Dec 13 '19 at 18:40
@Guimoute there aren't any float numbers. It doesn't recognise a list of numbers like [110,111], only single numbers or the first one in the list, 110 — Annalix, Dec 13 '19 at 18:42

Valdi_Bo · Accepted Answer · 2019-12-14T17:47:09.083

1

Look at the first line of your first code sample:

new = df.x.isin([110,111])

The argument of isin is a list.

Then look at df.x.isin([gr]) in the second code sample and note that if gr is, e.g., '111,112' (a string), then [gr] is ['111,112'], i.e. a list containing a single element. Wrapping gr in square brackets does not split it.
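To make the difference concrete, here is a minimal plain-Python sketch. It assumes gr holds the string '110,111' taken from the first row of df0 in the question; the names wrapped, parts and numbers are only illustrative.

```python
gr = "110,111"          # assumed cell value: one string, not two numbers

wrapped = [gr]          # brackets wrap the string, they do not split it
print(wrapped)          # ['110,111']
print(len(wrapped))     # 1

parts = gr.split(",")   # splitting is what produces separate items
print(parts)            # ['110', '111']

numbers = [int(p) for p in parts]   # convert to int if df.x is numeric
print(numbers)          # [110, 111]
```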

One possible solution is to convert col2 as follows:

df0.col2 = df0.col2.str.split(',')

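so that each element is a list (not a string). Note that the items of each list are still strings such as '110'; if df.x holds integers, convert the items to int as well, otherwise isin will not match them.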
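The conversion can be sketched as follows. This assumes col2 holds comma-separated strings, as in the df0 shown in the question, and that df.x is an integer column, which is why each piece is also converted to int. The lambda is just one way to do that conversion.

```python
import pandas as pd

# df0 as shown in the question, with col2 assumed to be strings
df0 = pd.DataFrame({
    "col1": ["a", "b", "c", "d", "e"],
    "col2": ["110,111", "113", "114,1114", "267,118", "956"],
})

# Split each string on commas, then turn every piece into an int
df0["col2"] = df0["col2"].str.split(",").apply(
    lambda parts: [int(p) for p in parts]
)

print(df0["col2"].tolist())
# [[110, 111], [113], [114, 1114], [267, 118], [956]]
```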

Then change the second code sample to:

for _, row in df0.iterrows():
    new = df[df.x.isin(row.col2)]
    df_size = new.y.size
    df_sum = new.y.sum()
    print(row.col2, df_size, df_sum)

In the final version replace print with whatever you want to do with these variables.
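For instance, the results could be collected into a new dataframe instead of being printed. The sketch below is self-contained, so it uses a hypothetical stand-in for df (the original data is not shown in the question), chosen so that the group 110,111 gives 5 and 2 as in the question. It also assumes df.x is an integer column. The names results and summary are illustrative.

```python
import pandas as pd

# Hypothetical stand-in for df; the real data is not shown in the question
df = pd.DataFrame({
    "x": [110, 110, 110, 111, 111, 113, 114],
    "y": [1, 0, 0, 1, 0, 1, 0],
})

df0 = pd.DataFrame({
    "col1": ["a", "b", "c", "d", "e"],
    "col2": ["110,111", "113", "114,1114", "267,118", "956"],
})
# Lists of ints, so that isin can match the integer column df.x
df0["col2"] = df0["col2"].str.split(",").apply(
    lambda parts: [int(p) for p in parts]
)

results = []
for _, row in df0.iterrows():
    new = df[df.x.isin(row.col2)]          # rows of df belonging to this group
    results.append({
        "col1": row.col1,
        "df_size": new.y.size,
        "df_sum": new.y.sum(),
    })

summary = pd.DataFrame(results)
print(summary)   # group a: df_size 5, df_sum 2
```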

edited Dec 14 '19 at 17:47

answered Dec 14 '19 at 17:40

Valdi_Bo


Thanks, will have a look. The only thing is that iterrows is very slow, that's why I iterate with get_value and the gr variable. Do you think it is the same? – Annalix Dec 14 '19 at 21:35
*iterrows* should work quicker than your loop over the index. Note that in your loop you perform an "individual search" for a row with the current index, which takes some time. But when you use *iterrows*, you already have the current row retrieved (no need to search for it again). – Valdi_Bo Dec 15 '19 at 10:43
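If the row loop itself is the concern, one alternative that is not part of the answer above is to avoid the loop: explode col2 into one row per value, merge with df, and aggregate per group. This is only a sketch under assumptions: df is a hypothetical stand-in, df.x is assumed to be an integer column, and pandas 0.25 or later is needed for explode and named aggregation. Whether it is actually faster depends on the data.

```python
import pandas as pd

# Hypothetical stand-in for df; the real data is not shown in the question
df = pd.DataFrame({
    "x": [110, 110, 110, 111, 111, 113, 114],
    "y": [1, 0, 0, 1, 0, 1, 0],
})
df0 = pd.DataFrame({
    "col1": ["a", "b", "c", "d", "e"],
    "col2": ["110,111", "113", "114,1114", "267,118", "956"],
})

# One row per (group, value) pair, with the value as an int
groups = df0.assign(x=df0["col2"].str.split(",")).explode("x")
groups["x"] = groups["x"].astype(int)

# Left merge keeps groups that match nothing in df (their y is NaN)
merged = groups.merge(df, on="x", how="left")

# count ignores NaN, so groups without matches get size 0 and sum 0
summary = merged.groupby("col1")["y"].agg(df_size="count", df_sum="sum")
print(summary)
```

Because of the NaN values introduced by the left merge, df_sum comes back as a float column here.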
