I am attempting to split a column and keep only the third item as the column value, using the following:

df1['gene_name'] = df1.loc[:,'gene_name'].str.split(';', expand=True)[2]

I have also tried these variations

df1['gene_name'] = df1.iloc[:,'gene_name'].str.split(';', expand=True)[2]

df1['gene_name'] = df1.loc[:,'gene_name'].str.split(';', expand=True)[2]

df1['gene_name'] = df1['gene_name'].str.split(';', expand=True)[2]

df1['gene_name'] = df1.gene_name.str.split(';', expand=True)[2]

But it always returns this warning

find_target_genes.py:19: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df1['gene_name'] = df1.loc[:,'gene_name'].str.split(';', expand=True)[2]

I have also tried using 4 (the column index) instead of gene_name, but this results in an error.

How can I make this work? I've looked through the documentation, but I don't think I fully understand it, since I can't figure out what's wrong.

Here is an example of two of the values I am trying to split (yes, each line is all in one column):

ID "A" ; version "B" ; name "C" ; source "D" ;  transcript "C"
ID "A1" ; version "B1" ; name "C1" ; source "D1" ;  transcript "C1"

I would like the column to contain only name "C" and drop the rest.

keenan
  • How about putting `df1 = df1.copy()` right after creation of `df1`? – Quang Hoang Jun 26 '20 at 23:25
  • I'm not sure I know what you mean – keenan Jun 27 '20 at 00:02
  • Does this answer your question? [How to deal with SettingWithCopyWarning in Pandas?](https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas) – Trenton McKinney Jun 27 '20 at 01:14
  • I'm still not clear on how to use this and having trouble following the linked post. I tried putting this as a separate line above my code and adding `.copy()` on to the end of my code. Both ways returned the same error. – keenan Jun 27 '20 at 03:56
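
To illustrate the `df1 = df1.copy()` suggestion from the comments, here is a minimal sketch. The question does not show how `df1` was created, so the parent frame `raw` and its `keep` column are hypothetical; the assumption is that `df1` was obtained by filtering or slicing another DataFrame, which is the usual source of this warning in pandas versions that emit it.

```python
import pandas as pd

# Hypothetical parent frame: the question does not show where df1 comes from.
raw = pd.DataFrame({
    "keep": [True, True, False],
    "gene_name": [
        'ID "A" ; version "B" ; name "C" ; source "D" ; transcript "C"',
        'ID "A1" ; version "B1" ; name "C1" ; source "D1" ; transcript "C1"',
        'ID "A2" ; version "B2" ; name "C2" ; source "D2" ; transcript "C2"',
    ],
})

# Take an explicit copy right where df1 is created, so that df1 is an
# independent DataFrame rather than a slice of raw.
df1 = raw[raw["keep"]].copy()

# Split on ';', keep the third piece, and strip the surrounding spaces.
df1["gene_name"] = df1["gene_name"].str.split(";", expand=True)[2].str.strip()

print(df1["gene_name"].tolist())  # ['name "C"', 'name "C1"']
```

The `.copy()` goes on the line that creates `df1`, not on the line that does the split; `raw` is left unchanged.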

3 Answers

The problem is not on the right-hand side of the assignment but on the left-hand side. You are assigning through df1['gene_name'] instead of df1.loc[:,'gene_name'], which is the form the User Guide recommends. With your assignment, "it’s very hard to predict whether it will return a view or a copy", and depending on the "memory layout of the array" the assignment may not behave as expected. So you should be doing:

df1.loc[:,'gene_name'] = df1.loc[:,'gene_name'].str.split(';', expand=True)[2]
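
As a self-contained sketch of that `.loc` form, using the two sample rows from the question (with the stray quote in the first row corrected). The added `.str.strip()` is not part of the line above; it is only there to drop the spaces left around each piece. Whether this alone silences the warning may depend on how `df1` was created, which the question does not show.

```python
import pandas as pd

# Sample data taken from the question; df1 is built directly here,
# so it is not a slice of another DataFrame.
df1 = pd.DataFrame({
    "gene_name": [
        'ID "A" ; version "B" ; name "C" ; source "D" ; transcript "C"',
        'ID "A1" ; version "B1" ; name "C1" ; source "D1" ; transcript "C1"',
    ]
})

# expand=True returns a DataFrame with one column per piece (0, 1, 2, ...).
parts = df1.loc[:, "gene_name"].str.split(";", expand=True)

# Use .loc on the left-hand side as well, keeping only the third piece.
df1.loc[:, "gene_name"] = parts[2].str.strip()

print(df1["gene_name"].tolist())  # ['name "C"', 'name "C1"']
```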
Fergui
  • None of the options gives me an error using pandas 1.0.3. Do you have a sample that reproduces the error, so I can test it? – Fergui Jun 27 '20 at 14:41
  • I have edited the question to include an example of the column I am attempting to split. I did not include the other columns in the data set but I don't think they are relevant to the error. – keenan Jun 27 '20 at 21:42
  • I have tried without error `df1 = DataFrame([['ID \"A\" ; version \"B\" ; name \"C\" ; source \"D\" ; transcript \"C\"'],['ID \"A1\" ; version \"B1\" ; name \"C1\" ; source \"D1\" ; transcript \"C1\"']],columns=['gene_name'])` and `df1['gene_name'] = df1.loc[:,'gene_name'].str.split(';', expand=True)[2]` – Fergui Jun 28 '20 at 22:07
  • Thanks, Fergui. In my case, the warning disappeared when I used df.loc on both sides of the assignment, i.e. df.loc[:,'column_name'] on the left and on the right. – HassanSh__3571619 Sep 01 '21 at 20:43

I believe when you use .assign(), that warning goes away. See code below:

df = df.assign(gene_name = df['gene_name'].str.split(';').str[2])
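
A short sketch of the `.assign()` approach on the sample rows from the question (with the stray quote in the first row corrected). The name `result` and the extra `.str.strip()` are illustrative additions. The point it shows is that `.assign()` returns a new DataFrame and leaves the original untouched, so the result has to be bound to a name to be kept.

```python
import pandas as pd

df = pd.DataFrame({
    "gene_name": [
        'ID "A" ; version "B" ; name "C" ; source "D" ; transcript "C"',
        'ID "A1" ; version "B1" ; name "C1" ; source "D1" ; transcript "C1"',
    ]
})

# .str.split(';') yields a list per row; .str[2] picks the third element.
# assign() builds a new DataFrame instead of writing into df.
result = df.assign(
    gene_name=df["gene_name"].str.split(";").str[2].str.strip()
)

print(result["gene_name"].tolist())  # ['name "C"', 'name "C1"']
print(df.loc[0, "gene_name"])        # original row is unchanged
```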
rhug123

In my case, the warning disappeared when I used df.loc on both sides of the assignment, i.e. df.loc[:,'column_name'] on the left and on the right.

HassanSh__3571619