How to add column to a data frame?

Question

I have the following code:

db_fields = ("id", "email", "status", "source")
df = DataFrame(results)
for col in db_fields:
    if col not in df.columns:
          COLUMN IS MISSING - COMMAND TO ADD COLUMN

If for example status column is missing it needs to be added to the data frame with nothing as value so when I export the df to csv I will always have the same schema of fields.

I know that to remove column I should do:

df = df.drop(col, 1)

But I don't know what is the best way to add column with empty value.

Possible duplicate of [Adding new column to existing DataFrame in Python pandas](https://stackoverflow.com/questions/12555323/adding-new-column-to-existing-dataframe-in-python-pandas) — qdread, Nov 26 '18 at 14:28

Mohit Motwani · Answer 1 · 2018-11-26T13:01:32.487

1

This method will added status column with Null values:

import numpy as np
df['status'] = np.nan

Alternatively:

df['status'] = None

So:

db_fields = ("id", "email", "status", "source")
for col in db_fields:
    if col not in df.columns:
        df[col] = None

edited Nov 26 '18 at 13:01

answered Nov 26 '18 at 12:54

Mohit Motwani

4,662
3
17
45

1

Would df[col] = np.nan work? and why do I need numpy for the none value? – Luis Nov 26 '18 at 12:54

jezrael · Accepted Answer · 2018-11-26T14:08:10.433

1

You can create array of non exist columns and create new one with assign and dictionary:

df = pd.DataFrame({'id': ['a1','a2', 'b1'],
                  'a': ['a1','a2', 'b1'],
                  'source': ['a1','a2', 'b1']})
print (df)
   id   a source
0  a1  a1     a1
1  a2  a2     a2
2  b1  b1     b1

db_fields = ("id", "email", "status", "source")

#get missing columns
diff = np.setdiff1d(np.array(db_fields), df.columns)
print (diff)
['email' 'status']

#get original columns not existed in db_fields
diff1 = np.setdiff1d(df.columns, np.array(db_fields)).tolist()
print (diff1)
['a']

#add missing columns with change order
d = dict.fromkeys(diff, np.nan)
df = df.assign(**d)[diff1 + list(db_fields)]
print (df)
    a  id  email  status source
0  a1  a1    NaN     NaN     a1
1  a2  a2    NaN     NaN     a2
2  b1  b1    NaN     NaN     b1

#if necessary first db_fields
df = df.assign(**d)[list(db_fields) + diff1]
print (df)
   id  email  status source   a
0  a1    NaN     NaN     a1  a1
1  a2    NaN     NaN     a2  a2
2  b1    NaN     NaN     b1  b1

edited Nov 26 '18 at 14:08

answered Nov 26 '18 at 12:57

jezrael

822,522
95
1,334
1,252

Since I'm doing this in loop for 5 diffrent dataframes. How can I make sure that columns will be in same order? If something was added it adds in the end.. so I can get two dataframe with same columns but in diffrent order. – Luis Nov 26 '18 at 13:56
@Luis - How working `df.assign(**d).sort_index(axis=1)` ? – jezrael Nov 26 '18 at 13:57
@Luis - Solution was modify - all columns defined in `db_fields` are in the end of data or in first positions – jezrael Nov 26 '18 at 14:10
I did df.reindex(db_fields) but I guess yours method works too. – Luis Nov 26 '18 at 14:19
1

@Luis - yes, if want only columns from `db_fields` then `df = df.assign(**d)[list(db_fields)]` is another solution or `df.reindex(db_fields, axis=1) ` – jezrael Nov 26 '18 at 14:21

score 1 · Answer 3 · answered Nov 26 '18 at 14:00

Here you have it, plain and simple, in just one line:

import numpy as np
db_fields = ("id", "email", "status", "source")
df = DataFrame(results)
for col in db_fields:
    if col not in df.columns:
        # Add the column
        df[col] = np.nan

BTW: You can also drop a column using df.drop(inplace=True).

How to add column to a data frame?

3 Answers3