I have a column in my CSV file that I would like to search for a list of strings, and then add a new 0/1 column: 1 if any value from the list is present, otherwise 0.

I have two lists:

  1. 'UC', 'iCD', 'Chrons disease', 'Chrons', 'IBD', 'Ulcerative colitis', 'PMC', 'P80', 'Chron disease'
  2. Donor, healthy, non-IBD, Control.

My column also has NA values.

So far I have this, in which I was just trying to match the list of strings:

import csv
import pandas as pd

with open('biosample.csv') as csvfile:
    df = pd.read_csv('biosample.csv', delimiter = ',', dtype= 'unicode',
                     error_bad_lines=False)
    df1 = df.set_index(['Sample_Info'])
print(df1.loc['UC', 'iCD', 'Chrons disease', 'Chrons', 'IBD', 'Ulcerative colitis',
              'PMC', 'P80', 'Chron disease'])

Running this, I get multiple errors, such as ones raised in _has_valid_type_error and has_valid_type_error.

I have gone through the already posted questions, but none of them mention this kind of error.

MaxU - stand with Ukraine
K.S

2 Answers

Demo:

In [84]: df
Out[84]:
   a   b    c    new
0  1  11  aaa   True
1  2  22  bbb  False
2  3  33  ccc   True
3  4  44  ddd  False

In [85]: lst = ['aaa','ccc','xxx']

In [86]: df['new'] = df['c'].isin(lst).astype(np.int8)

In [87]: df
Out[87]:
   a   b    c  new
0  1  11  aaa    1
1  2  22  bbb    0
2  3  33  ccc    1
3  4  44  ddd    0

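PS: you don't need to use the csv module at all:

# dtype=str reads every column as text ('unicode' is not a valid encoding name).
# on_bad_lines='skip' needs pandas >= 1.3; older versions used error_bad_lines=False.
df = pd.read_csv(r'/path/to/biosample.csv', delimiter=',',
                 dtype=str, on_bad_lines='skip',
                 index_col='Sample_Info')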
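
Note that isin compares whole cell values, so it will not match cells that hold several terms at once. A minimal sketch of substring matching with Series.str.contains might look like the following. The sample frame, the shortened term list, and the column name has_disease are made up for illustration, and case-insensitive matching is an assumption rather than something stated in the question.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical sample data, loosely modelled on the values described in the question
df = pd.DataFrame({'Sample_Info': ['UC,UC', 'Healthy,normal', np.nan,
                                   'remission,Ulcerative Colitis,pathological']})

# Shortened, illustrative term list
terms = ['UC', 'Ulcerative colitis', 'Chrons disease']

# Build one alternation pattern; re.escape guards against regex metacharacters
pattern = '|'.join(re.escape(t) for t in terms)

# na=False maps missing cells to False, so they end up as 0
df['has_disease'] = (
    df['Sample_Info'].str.contains(pattern, case=False, na=False).astype(np.int8)
)
print(df)
```

Be aware that plain substring matching is loose: a short term such as 'UC' can match inside a longer word, and 'IBD' would also match 'no IBD'.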
MaxU - stand with Ukraine
  • Hi MaxU, this is not helping me because the column in my CSV sheet has values like this: HC,HC UC,UC CC,CC UC,UC UC,UC LC,LC CD,ICD_r no IBS,No,no IBD Donor no IBS,Yes,Ulcerative colitis no IBS,Yes,Ulcerative colitis no IBS,No,no IBD NA remission,Crohn Disease,pathological NA remission,Ulcerative Colitis,pathological NA remission,Ulcerative Colitis,pathological – K.S Nov 02 '17 at 13:09
  • @K.S, can you post a small reproducible data set and your desired data set. Please read [how to make good reproducible pandas examples](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) and edit your post correspondingly. – MaxU - stand with Ukraine Nov 02 '17 at 14:03
  • My column contains multiple strings separated by commas. The df is taking only the first value. For example, cell JN1 has the value "UC,UC", JN2 "remission,No IBS, No IBD", JN3 "Healthy,normal", and JN4 "Chrons Disease, IBD". I had these values in multiple columns, which I combined into one column using this: new = df1.apply(lambda x: ','.join(x.dropna()), axis=1). Now I want a new column with a corresponding value of 0 for Chrons disease and 1 for healthy. – K.S Nov 02 '17 at 16:47

You don't need to use the csv module when loading a DataFrame from a CSV file.

As you mentioned, a new column should be added to the DataFrame.

The code for checking whether a value contains an item from the first list could look like this:

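import pandas as pd

# Note the comma after 'UC': adjacent literals ('UC''iCD') would be joined into 'UCiCD'
list1 = ['UC', 'iCD', 'Chrons disease', 'Chrons', 'IBD', 'Ulcerative colitis', 'PMC', 'P80', 'Chron disease']
list2 = ['Donor', 'healthy', 'non-IBD', 'Control']

def check_list(value, list2check):
    # Missing values (NaN) are not strings, so treat them as "no match"
    if not isinstance(value, str):
        return 0
    # 1 if any list item occurs as a substring of the cell value, else 0
    return int(any(item in value for item in list2check))

# on_bad_lines='skip' needs pandas >= 1.3; older versions used error_bad_lines=False
df = pd.read_csv('biosample.csv', delimiter=',', dtype=str, on_bad_lines='skip')
df['sample_from_list1'] = df['Sample_Info'].apply(lambda v: check_list(v, list1))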
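
A substring check like `x in value` also matches 'IBD' inside 'no IBD' or 'non-IBD'. One possible way around this, sketched here under assumptions, is to split each cell on commas and compare whole tokens case-insensitively against each list. The helper name has_token, the sample rows, and the column names in_list1 and in_list2 are hypothetical and not part of the original answer.

```python
import numpy as np
import pandas as pd

list1 = ['UC', 'iCD', 'Chrons disease', 'Chrons', 'IBD', 'Ulcerative colitis',
         'PMC', 'P80', 'Chron disease']
list2 = ['Donor', 'healthy', 'non-IBD', 'Control']

def has_token(value, terms):
    """Return 1 if any comma-separated token in value equals a term, else 0."""
    if not isinstance(value, str):  # NaN / missing cells count as no match
        return 0
    wanted = {t.strip().lower() for t in terms}
    tokens = {tok.strip().lower() for tok in value.split(',')}
    return int(bool(tokens & wanted))

# Hypothetical rows in the shape described in the question's comments
df = pd.DataFrame({'Sample_Info': ['UC,UC', 'no IBS,No,no IBD', 'Healthy,normal',
                                   'Chrons Disease, IBD', np.nan]})

# Extra keyword arguments to Series.apply are forwarded to the function
df['in_list1'] = df['Sample_Info'].apply(has_token, terms=list1)
df['in_list2'] = df['Sample_Info'].apply(has_token, terms=list2)
print(df)
```

With whole-token matching, the row 'no IBS,No,no IBD' matches neither list, because 'no IBD' is not the same token as 'non-IBD'. The lists would need to be extended with the spellings that actually occur in the data, such as 'Crohn Disease'.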
Denis Shatov
  • Hi MaxU, this is not helping me because the column in my CSV sheet has values like this: Sample_Info HC,HC UC,UC CC,CC UC,UC UC,UC LC,LC CD,ICD_r no IBS,No,no IBD Donor no IBS,Yes,Ulcerative colitis no IBS,Yes,Ulcerative colitis no IBS,No,no IBD NA remission,Crohn Disease,pathological NA remission,Ulcerative Colitis,pathological NA remission,Ulcerative Colitis,pathological – K.S Nov 02 '17 at 13:15