Performing str.split on a column in a dataframe returns a SettingWithCopyWarning

Question

I am attempting to split a column and keep only the third item as the column value using the following

df1['gene_name'] = df1.loc[:,'gene_name'].str.split(';', expand=True)[2]

I have also tried these variations

df1['gene_name'] = df1.iloc[:,'gene_name'].str.split(';', expand=True)[2]

df1['gene_name'] = df1.loc[:,'gene_name'].str.split(';', expand=True)[2]

df1['gene_name'] = df1['gene_name'].str.split(';', expand=True)[2]

df1['gene_name'] = df1.gene_name.str.split(';', expand=True)[2]

But it always returns this warning

find_target_genes.py:19: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df1['gene_name'] = df1.loc[:,'gene_name'].str.split(';', expand=True)[2]

I have also tried using 4 (the column index) instead of gene_name, but this results in an error.

How can I make this work? I've looked through the documentation, but I don't think I fully understand it, since I can't figure out what's wrong.

Here is an example of 2 of the values I am trying to split (yes, each one is all in one column):

ID "A" ; version "B" ; name "C" ; source "D" ;  transcript "C"
ID "A1" ; version "B1" ; name "C1" ; source "D1" ;  transcript "C1"

I would like the column to say name "C" only and get rid of the rest

How about putting `df1 = df1.copy()` right after creation of `df1`? — Quang Hoang, Jun 26 '20 at 23:25
Does this answer your question? [How to deal with SettingWithCopyWarning in Pandas?](https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas) — Trenton McKinney, Jun 27 '20 at 01:14
I'm still not clear on how to use this and having trouble following the linked post. I tried putting this as a separate line above my code and adding `.copy()` on to the end of my code. Both ways returned the same error. — keenan, Jun 27 '20 at 03:56
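
A minimal sketch of the `.copy()` suggestion above, assuming `df1` was produced by filtering a larger DataFrame. The question does not show how `df1` was created, so `raw` and its `feature` column are hypothetical stand-ins:

```python
import pandas as pd

# Hypothetical source data: 'raw' stands in for whatever df1 was derived from.
raw = pd.DataFrame({
    "feature": ["gene", "exon", "gene"],
    "gene_name": [
        'ID "A" ; version "B" ; name "C" ; source "D" ; transcript "C"',
        'ID "X" ; version "Y" ; name "Z" ; source "W" ; transcript "Z"',
        'ID "A1" ; version "B1" ; name "C1" ; source "D1" ; transcript "C1"',
    ],
})

# Filtering yields a subset of raw; .copy() makes df1 an independent DataFrame,
# so later assignments to df1 are unambiguous.
df1 = raw[raw["feature"] == "gene"].copy()

# Keep only the third ';'-separated field and trim the surrounding spaces.
df1["gene_name"] = df1["gene_name"].str.split(";", expand=True)[2].str.strip()

print(df1["gene_name"].tolist())
```

The `.copy()` has to go on the line where `df1` is first created from the other DataFrame, not on the line that does the split.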

score 3 · Answer 1 · answered Jun 27 '20 at 00:34

The problem is not on the right side of the assignment but on the left. You are assigning through df1['gene_name'] instead of df1.loc[:,'gene_name'], which is what the User Guide recommends. With your form of assignment, "it’s very hard to predict whether it will return a view or a copy", and depending on the "memory layout of the array" the write can end up on a temporary copy instead of on df1. So you should be doing:

df1.loc[:,'gene_name'] = df1.loc[:,'gene_name'].str.split(';', expand=True)[2]
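
As a rough, self-contained check of the line above, here is a sketch that builds a DataFrame from the two sample rows in the question and applies `.loc` on both sides. The `.str.strip()` call is an addition, used only to trim the spaces around the `;` separators:

```python
import pandas as pd

# The two sample values from the question, in a single 'gene_name' column.
df1 = pd.DataFrame(
    [
        ['ID "A" ; version "B" ; name "C" ; source "D" ; transcript "C"'],
        ['ID "A1" ; version "B1" ; name "C1" ; source "D1" ; transcript "C1"'],
    ],
    columns=["gene_name"],
)

# expand=True returns one column per ';'-separated field, labelled 0, 1, 2, ...
parts = df1.loc[:, "gene_name"].str.split(";", expand=True)

# Write the third field back through .loc, with whitespace trimmed.
df1.loc[:, "gene_name"] = parts[2].str.strip()

print(df1["gene_name"].tolist())
```

If `df1` was itself created by slicing another DataFrame, the warning may still appear even with `.loc` on the left, depending on the pandas version.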

Fergui


None of the options gives me an error using pandas 1.0.3. Do you have a sample that reproduces the error so I can test it? – Fergui Jun 27 '20 at 14:41
I have edited the question to include an example of the column I am attempting to split. I did not include the other columns in the data set but I don't think they are relevant to the error. – keenan Jun 27 '20 at 21:42
I have tried without error `df1 = DataFrame([['ID \"A\" ; version \"B\" ; name \"C\" ; source \"D\" ; transcript \"C\"'],['ID \"A1\" ; version \"B1\" ; name \"C1\" ; source \"D1\" ; transcript \"C1\"']],columns=['gene_name'])` and `df1['gene_name'] = df1.loc[:,'gene_name'].str.split(';', expand=True)[2]` – Fergui Jun 28 '20 at 22:07
Thanks, Fergui. In my case, the warning disappeared when I used `df.loc` on both sides of the assignment, i.e. `df.loc[:,'column_name']` on the left and on the right. – HassanSh__3571619 Sep 01 '21 at 20:43

score 0 · Answer 2 · answered Jun 27 '20 at 00:17

I believe when you use .assign(), that warning goes away. See code below:

df.assign(gene_name = df['gene_name'].str.split(';').str[2])

rhug123


that is because ``assign`` creates a new dataframe – sammywemmy Jun 27 '20 at 00:44
Hmm, this runs fine but doesn't actually perform the split. Any idea why? – keenan Jun 27 '20 at 02:48
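
A likely explanation for the comment above: `assign()` returns a new DataFrame and leaves the original untouched, so the result has to be bound to a name. A minimal sketch using the sample rows from the question (`result` is an illustrative name, and the `.str.strip()` call is an addition to trim whitespace):

```python
import pandas as pd

df = pd.DataFrame({
    "gene_name": [
        'ID "A" ; version "B" ; name "C" ; source "D" ; transcript "C"',
        'ID "A1" ; version "B1" ; name "C1" ; source "D1" ; transcript "C1"',
    ]
})

# assign() does not modify df; it returns a new DataFrame with the column replaced.
result = df.assign(gene_name=df["gene_name"].str.split(";").str[2].str.strip())

print(df["gene_name"].tolist())      # original values, unchanged
print(result["gene_name"].tolist())  # only the third field of each row
```

Writing `df = df.assign(...)` instead rebinds `df` to the new DataFrame if the original is no longer needed.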

score 0 · Answer 3 · answered Sep 01 '21 at 20:46

In my case, the warning disappeared when I used df.loc on both sides of the assignment, i.e. df.loc[:,'column_name'] on the left and on the right.

HassanSh__3571619

