
I have an Excel table that I want to transfer into a dataframe matching our project's standard, which has 22 columns. The original Excel table, however, only has 13 columns, so I am trying to add the missing ones to the dataframe I have read from the file.

However, this has caused several challenges:

  1. When assigning an empty list [] to the dataframe, I get an error that the length of the added column does not match the original dataframe, which has roughly 9000 rows.

  2. When assigning np.nan to the dataframe, creating the combined dataframe with all required columns works perfectly:

f_unique.loc[:, "additional_info"] = np.nan

But having np.nan in my data causes issues later in my script when I flatten the cell data, as all other cells contain lists.

So I tried to replace np.nan with a list containing the string "n/a":

grouped_df = grouped_df.replace(np.nan, ["n/a"])

However, this gives me the following error:

TypeError: Invalid "to_replace" type: 'float'

Is there a way in which I can assign 9000 x ["n/a"] to each new column in my dataframe directly? That would most likely solve the issue.
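For reference, here is a minimal sketch (with a small toy dataframe standing in for the real f_unique) that reproduces the behaviour described above:

import numpy as np
import pandas as pd

# Toy stand-in for the dataframe read from Excel (the real one has roughly 9000 rows).
f_unique = pd.DataFrame({"event_start": ["2023-01-01", "2023-01-02", "2023-01-03"]})

# 1. Assigning an empty list fails: its length (0) does not match the index.
try:
    f_unique["additional_info"] = []
except ValueError as err:
    print(err)  # Length of values (0) does not match length of index (3)

# 2. Assigning np.nan works, but leaves float NaNs among otherwise list-valued cells.
f_unique.loc[:, "additional_info"] = np.nan

# 3. Replacing the NaNs with a list raises the TypeError quoted above.
try:
    f_unique = f_unique.replace(np.nan, ["n/a"])
except TypeError as err:
    print(err)  # Invalid "to_replace" type: 'float'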

OnceUponATime
    why don't you just `reindex` with the list of wanted columns? `df.reindex(columns=list_of_cols)`? – mozway Jun 26 '23 at 08:36
  • I tried that, but it only gave me the original 13 columns. But good to know it OUGHT to work. – OnceUponATime Jun 26 '23 at 08:44
  • Better provide a [minimal reproducible example](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples), there might be a tiny detail that we could spot easily ;) – mozway Jun 26 '23 at 08:46
  • @OnceUponATime - Seems you need `df.reindex(df.columns.union(new_cols, sort=False), axis=1, fill_value=["n/a"])` – jezrael Jun 26 '23 at 08:48
  • I found the issue! I was reusing the same df variable name after reindexing, but when assigning to a new variable, it worked: `df2 = f_unique.reindex(columns=column_names, fill_value="n/a")` – OnceUponATime Jun 26 '23 at 08:50
  • @OnceUponATime it shouldn't have been an issue, using a new name should map to the new data. – mozway Jun 26 '23 at 08:57

3 Answers


Use DataFrame.reindex with Index.union, and fill the new columns with a list via fill_value:

import pandas as pd

df = pd.DataFrame({'a': range(3)})

new_cols = ['additional_info','new']

df = df.reindex(df.columns.union(new_cols, sort=False), axis=1, fill_value=[])
print (df)
   a additional_info new
0  0              []  []
1  1              []  []
2  2              []  []

df = df.reindex(df.columns.union(new_cols, sort=False), axis=1, fill_value=["n/a"])
print (df)
   a additional_info    new
0  0           [n/a]  [n/a]
1  1           [n/a]  [n/a]
2  2           [n/a]  [n/a]
jezrael

Just reindex:

out = df.reindex(columns=list_of_cols)

If you want a list as the default value (which you should really avoid):

out = df.reindex(columns=list_of_cols, fill_value=['n/a'])
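As a hedged sketch of one way to avoid a mutable default entirely (df and list_of_cols are stand-ins, as above): reindex first, then give every new column one fresh ["n/a"] list per row, so no single list object is shared between cells.

import pandas as pd

df = pd.DataFrame({"a": range(3)})              # stand-in for the Excel dataframe
list_of_cols = ["a", "additional_info", "new"]  # stand-in for the 22 target columns

out = df.reindex(columns=list_of_cols)
new_cols = [c for c in list_of_cols if c not in df.columns]
for col in new_cols:
    # Build one fresh list per row instead of reusing a single default object.
    out[col] = [["n/a"] for _ in range(len(out))]
print(out)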
mozway

At the end of the day, reindexing (see the answers above) in combination with assigning the same list multiple times to a dataframe column worked best for my use case:

df2 = f_unique.reindex(columns=column_names, fill_value="n/a")

# populate some of the empty columns with data

df2.loc[:, "event_end"] = df2["event_start"]
df2.loc[:, "event_type"] = ["Funktionsausübung"] * 31414

Adding a list multiple times is certainly not the most elegant solution, but it did the trick.
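One small, hedged refinement to the last line of the snippet above: deriving the repeat count from the dataframe itself avoids hard-coding the row count.

# Same idea, but the repeat count follows the dataframe's length automatically.
df2.loc[:, "event_type"] = ["Funktionsausübung"] * len(df2)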

OnceUponATime