
I'm new to using pandas and am writing a script where I read in a dataframe and then do some computation on some of the columns.

Sometimes I will have the column called "Met":

df = pd.read_csv(File, 
  sep='\t', 
  compression='gzip', 
  header=0, 
  names=["Chrom", "Site", "coverage", "Met"]
)

Other times I will have:

df = pd.read_csv(File, 
  sep='\t', 
  compression='gzip', 
  header=0, 
  names=["Chrom", "Site", "coverage", "freqC"]
)

I need to do some computation with the "Met" column, so if it isn't present I need to calculate it using:

df['Met'] = df['freqC'] * df['coverage'] 

Is there a way to check whether the "Met" column is present in the dataframe and, if not, add it?

David Medinets
user2165857

4 Answers


Use the in operator, which tests a DataFrame's column labels, and add the column only when it is missing:

if 'Met' not in df:
    df['Met'] = df['freqC'] * df['coverage'] 
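
As a minimal, self-contained sketch of the check above: the two small frames below are made-up stand-ins for the two file layouts in the question, and the helper name ensure_met is hypothetical.

```python
import pandas as pd

def ensure_met(df):
    """Add 'Met' (freqC * coverage) in place if the column is missing."""
    if 'Met' not in df:  # `in` on a DataFrame tests its column labels
        df['Met'] = df['freqC'] * df['coverage']
    return df

# Made-up stand-in for a file that already has Met
with_met = pd.DataFrame(
    {"Chrom": ["chr1"], "Site": [100], "coverage": [10], "Met": [4]}
)
# Made-up stand-in for a file that has freqC instead
with_freq = pd.DataFrame(
    {"Chrom": ["chr1"], "Site": [100], "coverage": [10], "freqC": [0.5]}
)

ensure_met(with_met)   # Met already present, frame left as it is
ensure_met(with_freq)  # Met derived from freqC and coverage
```

Because the product is only computed inside the if branch, the frame that lacks freqC never touches that column.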
YS-L
  • See https://stackoverflow.com/a/62449676/14555505 for how to add in multiple columns iff they don't exist – beyarkay Jul 10 '23 at 08:45
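
For several columns, one possible approach (a sketch, not necessarily what the linked answer does) is to loop over a mapping from column name to a function that derives it. The "Unmet" column here is purely hypothetical and only illustrates a second derived column.

```python
import pandas as pd

# Hypothetical mapping: column name -> function deriving it from the frame.
# Order matters here because "Unmet" depends on "Met".
derived = {
    "Met": lambda d: d["freqC"] * d["coverage"],
    "Unmet": lambda d: d["coverage"] - d["Met"],
}

# Made-up sample data
df = pd.DataFrame({"coverage": [10, 20], "freqC": [0.5, 0.25]})

for name, compute in derived.items():
    if name not in df.columns:  # only add columns that do not exist yet
        df[name] = compute(df)
```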

To add a column conditionally inside a method chain, consider using pipe() with a lambda:

df.pipe(lambda d: (
    d.assign(Met=d['freqC'] * d['coverage'])
    if 'Met' not in d else d
))
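
To show how this fits into a longer chain, here is a small sketch with made-up data. The named function add_met_if_missing is a hypothetical replacement for the lambda, and the query and reset_index steps are only there to illustrate chaining.

```python
import pandas as pd

def add_met_if_missing(d):
    """Return d unchanged if it has 'Met', else a copy with 'Met' added."""
    if 'Met' in d:
        return d
    return d.assign(Met=d['freqC'] * d['coverage'])

# Made-up sample data without a Met column
raw = pd.DataFrame(
    {"Site": [1, 2, 3], "coverage": [10, 0, 8], "freqC": [0.5, 0.0, 0.25]}
)

result = (
    raw
    .pipe(add_met_if_missing)   # conditional step inside the chain
    .query("coverage > 0")      # drop sites without coverage
    .reset_index(drop=True)
)
```

Since assign returns a new DataFrame, raw itself is left without a Met column.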

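If you are creating the dataframe from scratch, you can create the missing columns without a loop by passing the full list of column names to the pd.DataFrame() call. For dict-like input, any listed column that is not present in the data is created and filled with NaN: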

cols = ['column 1','column 2','column 3','column 4','column 5']
df = pd.DataFrame(list_or_dict, index=['a',], columns=cols)
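
A short sketch of that behaviour using the question's column names, with a made-up dict as input: the Met column does not appear in the data, so it comes back as all NaN and can then be filled in.

```python
import pandas as pd

cols = ["Chrom", "Site", "coverage", "freqC", "Met"]

# Made-up input that has no "Met" key
data = {
    "Chrom": ["chr1", "chr1"],
    "Site": [100, 200],
    "coverage": [10, 20],
    "freqC": [0.5, 0.25],
}

df = pd.DataFrame(data, columns=cols)

# "Met" exists now, but every value is missing
met_was_empty = df["Met"].isna().all()

# Fill it in from the other two columns
df["Met"] = df["freqC"] * df["coverage"]
```

Note that this only guarantees the column exists. The values still have to be computed afterwards.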
autonopy

Alternatively you can use get:

df['Met'] = df.get('Met', df['freqC'] * df['coverage'])    

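If the column Met exists, its values are kept. Otherwise the product of freqC and coverage is used. Note that the default argument is evaluated before get is called, so this line needs freqC to be present even when Met already exists.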

rachwa
  • I think this solution is correct, but it's not as efficient as the others because the assignment and the product are always performed. – zk82 Oct 04 '22 at 10:00
  • **EDIT**: In fact it may fail always since the DataFrame either has `Met` or `freqC` but not both so in order to be correct you should do something like `df['Met'] = df.get('Met', df.get('freqC') * df['coverage'])` (notice the new get for `freqC`) – zk82 Oct 04 '22 at 10:07
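
One way to keep the single-assignment style while avoiding the eager evaluation discussed in the comments is a conditional expression, which only evaluates the branch it takes. This is a sketch with made-up data, and the helper name met_or_compute is hypothetical.

```python
import pandas as pd

def met_or_compute(df):
    # Only one branch is evaluated, so freqC is never
    # looked up when Met is already present.
    return df['Met'] if 'Met' in df else df['freqC'] * df['coverage']

# Made-up frames: one with Met only, one with freqC only
only_met = pd.DataFrame({"coverage": [10], "Met": [4]})
only_freq = pd.DataFrame({"coverage": [10], "freqC": [0.5]})

only_met['Met'] = met_or_compute(only_met)    # works although freqC is absent
only_freq['Met'] = met_or_compute(only_freq)  # Met derived from freqC
```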