I am trying to create a new dataframe by first splitting the original one in two:

df1 - contains only the rows of the original frame whose value in a selected column is in a given list

df2 - contains only the rows of the original frame whose value in the selected column is not in that list, with those values then replaced by a new given value.

The function should return a new dataframe that is the concatenation of df1 and df2.

This works fine:

import pandas as pd

l1 = ['a','b','c','d','a','b']
l2 = [1,2,3,4,5,6]
df = pd.DataFrame({'cat':l1,'val':l2})
print(df)

  cat  val
0   a    1
1   b    2
2   c    3
3   d    4
4   a    5
5   b    6

df['cat'] = df['cat'].apply(lambda x: 'other')
print(df)

     cat  val
0  other    1
1  other    2
2  other    3
3  other    4
4  other    5
5  other    6

Yet when I define a function:

def create_df(df, select, vals, other):
    df1 = df.loc[df[select].isin(vals)]
    df2 = df.loc[~df[select].isin(vals)]
    df2[select] = df2[select].apply(lambda x: other)
    result = pd.concat([df1, df2])
    return result

and call it:

df3 = create_df(df, 'cat', ['a','b'], 'xxx')
print(df3)

This produces what I actually need:

   cat  val
0    a    1
1    b    2
4    a    5
5    b    6
2  xxx    3
3  xxx    4

But for some reason, in this case I get a warning:

..\usr\conda\lib\site-packages\ipykernel\__main__.py:10: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

So how is this case (where I assign a value to a column inside a function) different from the first one, where I assign the value outside a function?

What is the right way to change a column's values?

dokondr
  • Possible duplicate of [How to deal with SettingWithCopyWarning in Pandas?](https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas) – Eric Aug 05 '17 at 20:32
  • Notwithstanding, the strange thing is that in my case this feature shows differently in different parts of the code: I get a warning in function definition but not in the main program. Why is that? – dokondr Aug 05 '17 at 21:13

1 Answer

There are several ways this code could be tightened up, but to make the warning go away it is enough to work on explicit copies of the two subsets and concatenate those:

def create_df(df, select, vals, other):
    mask = df[select].isin(vals)  # boolean index
    df1 = df[mask].copy()         # rows whose value is in vals
    df2 = df[~mask].copy()        # remaining rows, as an independent copy
    df2[select] = other           # plain assignment is sufficient, no apply needed
    result = pd.concat([df1, df2])
    return result
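
As for why the warning shows up in the function but not in the first snippet: as far as I can tell it has nothing to do with being inside a function. In the first snippet the column is assigned on `df` itself, while in the function it is assigned on `df2`, which was obtained by filtering another frame. A minimal sketch of the two situations, using the data from the question (the names `direct` and `subset` are only for illustration):

```python
import pandas as pd

df = pd.DataFrame({"cat": ["a", "b", "c", "d", "a", "b"],
                   "val": [1, 2, 3, 4, 5, 6]})

# Case 1: assign to a column of a frame that was not derived by filtering.
direct = df.copy()
direct["cat"] = "other"

# Case 2: assign to a filtered subset. Taking an explicit copy first makes
# it unambiguous that only the subset is modified, not the original frame.
subset = df.loc[~df["cat"].isin(["a", "b"])].copy()
subset["cat"] = "xxx"

print(subset)
print(df)  # the original frame still holds 'c' and 'd'
```

The explicit `.copy()` is the design choice here: the subset is an independent frame, so the later assignment cannot be mistaken for an attempt to write back into `df`.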

Alternative version:

l1 = ['a','b','c','d','a','b']
l2 = [1,2,3,4,5,6]
df = pd.DataFrame({'cat':l1,'val':l2})

# define a mask
mask = df['cat'].isin(list("ab"))

# concatenate masked and non-masked rows (invert a boolean mask with ~, not -)
df2 = pd.concat([df[mask], df[~mask]])

# change the non-masked values to 'xxx'
df2.loc[~mask, "cat"] = "xxx"

Outputs

   cat  val
0    a    1
1    b    2
4    a    5
5    b    6
2  xxx    3
3  xxx    4

Or as a function:

def create_df(df, filter_, isin_, value):

    # define a mask
    mask = df[filter_].isin(isin_)

    # concatenate masked and non-masked rows (pd.concat returns a new frame)
    df = pd.concat([df[mask], df[~mask]])

    # change the non-masked values to `value`
    df.loc[~mask, filter_] = value

    return df

df2 = create_df(df, 'cat', ['a','b'], 'xxx')
df2
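
One more variant, not part of the answer above and only a sketch: the replacement can also be done without splitting the frame at all, using `Series.where` to keep the listed values and swap in the new value everywhere else, followed by a stable sort on the inverted mask to get the same row order as the concatenation. The function name `create_df_where` is made up for this example.

```python
import pandas as pd

def create_df_where(df, select, vals, other):
    # Hypothetical helper name; not from the original post.
    mask = df[select].isin(vals)
    # Keep values where the mask is True, replace the rest with `other`.
    # assign() returns a new frame, so the input is left untouched.
    out = df.assign(**{select: df[select].where(mask, other)})
    # A stable sort on the inverted mask puts matching rows first and
    # preserves the original order within each group.
    order = (~mask).sort_values(kind="mergesort").index
    return out.loc[order]

df = pd.DataFrame({"cat": ["a", "b", "c", "d", "a", "b"],
                   "val": [1, 2, 3, 4, 5, 6]})
df3 = create_df_where(df, "cat", ["a", "b"], "xxx")
print(df3)
```

Since nothing is assigned to a filtered subset here, there is no slice for pandas to warn about.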
Anton vBR