I have six CSV files for six different years and I'd like to combine them into a single dataframe, with the column headers appropriately labelled.

Each raw CSV file looks like this (e.g. 2010.csv):

state,gender,population
FL,m,2161612
FL,f,2661614
TX,m,3153523
TX,f,3453523
...

And this is the structure I'd like to end up with:

state    gender    population_2010   population_2012   population_2014  .....
FL       m         2161612           xxxxxxx           xxxxxxx          .....
FL       f         2661614           xxxxxxx           xxxxxxx          .....
TX       m         3153523           xxxxxxx           xxxxxxx          .....
TX       f         3453523           xxxxxxx           xxxxxxx          .....

How can I do this efficiently? Currently I have this:

df_2010 = pd.read_csv("2010.csv")
df_2012 = pd.read_csv("2012.csv")
...

temp = df_2010.merge(df_2012, on=("state", "gender"), how="outer", suffixes=("_2010", "_2012"))
temp1 = temp.merge(df_2014, on=("state", "gender"), how="outer", suffixes=(None, "_2014"))
... repeat five more times to get the final dataframe

But I feel there must be a better way.

Richard
  • Disclaimer: I have barely dabbled in pandas, so the following might be wrong and should be taken with a grain of salt, but [this](https://stackoverflow.com/questions/20906474/import-multiple-csv-files-into-pandas-and-concatenate-into-one-dataframe) should have the answer you're looking for, which uses `pd.concat()`. – Dương Tiểu Đồng Sep 17 '21 at 16:30

1 Answer

Try pd.concat along axis=1 after setting state and gender as the index, so the rows from each file line up on those two keys:

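import pandas as pd

files = ['2010.csv', '2012.csv']  # add the remaining years here
# index each file by (state, gender) and tag its columns with the year taken from the file name
out = pd.concat((pd.read_csv(file).set_index(['state', 'gender'])
        .add_suffix("_" + file.split(".")[0]) for file in files), axis=1)
out = out.reset_index()  # finally reset the index if needed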

Note that if the entries are full paths rather than bare file names, file.split(".")[0] will keep the directory part, so you may need to replace it with something like os.path.splitext(os.path.basename(file))[0] to grab the file name without its extension.
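For files that live in a directory, the same idea can be wrapped up roughly as follows. This is only a sketch: the function name `combine_yearly_csvs` and its `keys` parameter are made up for illustration, and it uses `pathlib.Path.stem` instead of `os.path` to get the year out of the file name. It assumes every file has the same key columns and that each (state, gender) pair appears at most once per file.

```python
from pathlib import Path

import pandas as pd


def combine_yearly_csvs(paths, keys=("state", "gender")):
    """Join CSV files side by side on `keys`.

    Every non-key column gets the file's stem as a suffix,
    e.g. "data/2010.csv" turns "population" into "population_2010".
    (Hypothetical helper, not part of pandas.)
    """
    frames = [
        pd.read_csv(p)
        .set_index(list(keys))           # align rows on the key columns
        .add_suffix(f"_{Path(p).stem}")  # stem = file name without directory or extension
        for p in paths
    ]
    # axis=1 places the frames next to each other, matching rows by index
    return pd.concat(frames, axis=1).reset_index()


# Usage, assuming the yearly files sit in a folder called "data":
# out = combine_yearly_csvs(sorted(Path("data").glob("*.csv")))
```

Sorting the paths keeps the year columns in chronological order, since the column order follows the order in which the files are passed in.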

anky
  • Thank you! That's very close, but doesn't rename the column headers. – Richard Sep 17 '21 at 16:47
  • @Richard I typed it out manually, but add_suffix should do the trick. Try `pd.DataFrame({"population":[1,2,3]}).add_suffix("_2010")` to see how it works. You can adjust the filename split accordingly. – anky Sep 17 '21 at 16:48