137

i have a small sample data:

import pandas as pd

df = {'ID': [3009, 129, 119, 120, 121, 122, 130, 3014, 266, 849, 174, 844],
  'V': ['IGHV7-B*01', 'IGHV7-B*01', 'IGHV6-A*01', 'GHV6-A*01', 'IGHV6-A*01',
        'IGHV6-A*01', 'IGHV4-L*03', 'IGHV4-L*03', 'IGHV5-A*01', 'IGHV5-A*04',
        'IGHV6-A*02','IGHV6-A*02'],
  'Prob': [1, 1, 0.8, 0.8056, 0.9, 0.805, 1, 1, 0.997, 0.401, 1, 1]}

df = pd.DataFrame(df)

looks like

df    

Out[25]: 
      ID    Prob           V
0    3009  1.0000  IGHV7-B*01
1     129  1.0000  IGHV7-B*01
2     119  0.8000  IGHV6-A*01
3     120  0.8056  IGHV6-A*01
4     121  0.9000  IGHV6-A*01
5     122  0.8050  IGHV6-A*01
6     130  1.0000  IGHV4-L*03
7    3014  1.0000  IGHV4-L*03
8     266  0.9970  IGHV5-A*01
9     849  0.4010  IGHV5-A*04
10    174  1.0000  IGHV6-A*02
11    844  1.0000  IGHV6-A*02

I want to split the column 'V' by the '-' delimiter and move it to another column named 'allele'

    Out[25]: 
      ID    Prob      V    allele
0    3009  1.0000  IGHV7    B*01
1     129  1.0000  IGHV7    B*01
2     119  0.8000  IGHV6    A*01
3     120  0.8056  IGHV6    A*01
4     121  0.9000  IGHV6    A*01
5     122  0.8050  IGHV6    A*01
6     130  1.0000  IGHV4    L*03
7    3014  1.0000  IGHV4    L*03
8     266  0.9970  IGHV5    A*01
9     849  0.4010  IGHV5    A*04
10    174  1.0000  IGHV6    A*02
11    844  1.0000  IGHV6    A*02

the code i have tried so far is incomplete and didn't work:

df1 = pd.DataFrame()
df1[['V']] = pd.DataFrame([ x.split('-') for x in df['V'].tolist() ])

or

df.add(Series, axis='columns', level = None, fill_value = None)
newdata = df.DataFrame({'V':df['V'].iloc[::2].values, 
                        'Allele': df['V'].iloc[1::2].values})
Patrick Artner
  • 50,409
  • 9
  • 43
  • 69
Jessica
  • 2,923
  • 8
  • 25
  • 46

3 Answers3

219

Use vectoried str.split with expand=True:

In [42]:
df[['V','allele']] = df['V'].str.split('-',expand=True)
df

Out[42]:
      ID    Prob      V allele
0   3009  1.0000  IGHV7   B*01
1    129  1.0000  IGHV7   B*01
2    119  0.8000  IGHV6   A*01
3    120  0.8056   GHV6   A*01
4    121  0.9000  IGHV6   A*01
5    122  0.8050  IGHV6   A*01
6    130  1.0000  IGHV4   L*03
7   3014  1.0000  IGHV4   L*03
8    266  0.9970  IGHV5   A*01
9    849  0.4010  IGHV5   A*04
10   174  1.0000  IGHV6   A*02
11   844  1.0000  IGHV6   A*02
EdChum
  • 376,765
  • 198
  • 813
  • 562
  • 3
    I tried this, but I wanted to separate my column into three columns. df[['ColumnTable','Database','Schema','Table']] = df['ColumnTable'].str.split('.',expand=True) print(df) I received an error message: File "C:\Users\xxx\Anaconda3\lib\site-packages\pandas\core\frame.py", line 3189, in _setitem_array raise ValueError("Columns must be same length as key") ValueError: Columns must be same length as key – DataGirl Jun 29 '21 at 13:34
  • 3
    @DataGirl I had the same problem. Also tried a list comprehension. Root of the problem was my values consisted of tuples. Converting them to strings solved everything. So `df[["x", "y"]] = df["x, y"].astype("string").str.split(", ", expand=True)` – marianoju Jan 28 '22 at 12:56
  • @marianoju - I'll give that a try. – DataGirl Feb 03 '22 at 20:06
  • Is there a way to handle the case where the value of the column is NaN (considering we want to keep NaN values)? – Jasmine Latendresse Feb 08 '22 at 17:44
  • Using this as written in my code gave me the setting with copy warning. df[['V','allele']] = df['V'].str.split('-',expand=True).copy() worked for me. – Liz Oct 24 '22 at 13:28
43

For storing data into a new dataframe use the same approach, just with the new dataframe:

tmpDF = pd.DataFrame(columns=['A','B'])
tmpDF[['A','B']] = df['V'].str.split('-', expand=True)

Eventually (and more usefull for my purposes) if you would need get only a part of the string value (i.e. text before '-'), you could use .str.split(...).str[idx] like:

df['V'] = df['V'].str.split('-').str[0]
df
    ID      V       Prob
0   3009    IGHV7   1.0000
1   129     IGHV7   1.0000
2   119     IGHV6   0.8000
3   120     GHV6    0.8056

- splits 'V' values into list according to separator '-' and stores 1st item back to the column

Lukas
  • 2,034
  • 19
  • 27
  • If `df['V'].str.split('-')` returns a list, why do we need to use a string accessor in order to index the list components? My assumption was `df['V'].str.split('-')[0]` would be all that is needed. – lakeside Jan 24 '23 at 19:29
6

Use the below:

df['allele'] = [x.split('-')[-1] for x in df['V']]

The above first part retains any values after the '-' sign

df['V'] = [x.split('-')[-0] for x in df['V']]

The above second part retains any values before the '-' sign and automatically replaces the main column

df.head(3)
Ariel Jumba
  • 61
  • 1
  • 2
  • I see this is your first contribution, thank you for your answer. If you could expand it with an explanation of how your approach works and why you chose this method that would greatly improve it. Currently it's unlikely to get much appreciation or might even just be deleted. – Bracken Oct 29 '21 at 15:24
  • This worked for me. But, I wonder if there is a more efficient way of doing this and avoiding loop. – Janzaib M Baloch Mar 02 '23 at 10:44