1

I have two CSV files with different number of columns and rows. The first CSV file has M columns and N rows, the second has H columns and G rows. Some of the columns have the same name.

I'd like to combine the two into data frame with following properties:

  • N+G rows
  • Union of (M, H) columns
  • if column A is element of first CSV file but not of second, the data frame should contain the same values in first N entries of A as in first CSV, and for the rest (since there is no A data in second CSV) should be NA.

Here is an example:

CSV1
City, Population,
Zagreb, 700000,
Rijeka, 142000

CSV2
City, Area,
Split, 200.00
Osijek, 171.00
Dubrovnik, 143.35

I'd like build a data frame that looks like this:

City      Population   Area
Zagreb    700000       NA
Rijeka    142000       NA
Split     NA           200.00
Osijek    NA           171.00
Dubrovnik NA           143.35

Also what if instead two CSV files I had two data frames and wanted to do the same, for example if I loaded first csv to df1 and second one in df2 and then wanted to make a merge to df3 that would look like example above.

enedene
  • 3,525
  • 6
  • 34
  • 41

1 Answers1

4

Why not try the concat function:

In [25]: df1
Out[25]: 
     City  Population
0  Zagreb      700000
1  Rijeka      142000

In [26]: df2
Out[26]: 
        City    Area
0      Split  200.00
1     Osijek  171.00
2  Dubrovnik  143.35

In [27]: pd.concat([df1,df2])
Out[27]: 
     Area       City  Population
0     NaN     Zagreb      700000
1     NaN     Rijeka      142000
0  200.00      Split         NaN
1  171.00     Osijek         NaN
2  143.35  Dubrovnik         NaN

In [28]: pd.concat([df1,df2], ignore_index=True)
Out[28]: 
     Area       City  Population
0     NaN     Zagreb      700000
1     NaN     Rijeka      142000
2  200.00      Split         NaN
3  171.00     Osijek         NaN
4  143.35  Dubrovnik         NaN

Note: the concat does have some additional options if you have slightly different requirements.

Andy Hayden
  • 359,921
  • 101
  • 625
  • 535
  • @enedene I have a large no. of Columns to concat i dont wanna write name of each of these how can i do this? –  Jun 07 '16 at 19:07
  • @Danish I don't understand the question. please consider asking a separate question (that way more people can see it too). – Andy Hayden Jun 07 '16 at 19:43
  • yes i can ask a seperate question but this one one is quite simple. I am just asking that lets say I have 200 columns on which i have to concat 2 data frame so rather than writing * pd.concat([mdf1[['user', 'tag1', 'tag2', 'tag3', 'tag4' ]].groupby(['user']).agg(sum) * I can't write name of individually columns so there must be some way for which it takes all the columns after 'tag1' in dataframe. I hope you got my question @AndyHayden –  Jun 08 '16 at 04:29
  • @Danish it takes a list, so you can do a list comprehension or something like that. ps. you can use .sum() rather than .agg(sum) – Andy Hayden Jun 08 '16 at 05:46
  • @AndyHayden I have already tried putting a list of columns in the argument but it is giving me error. Ideally it shouldn't? and yes thank you for .sum() –  Jun 08 '16 at 05:52
  • @Danish It shouldn't be an error you do df[[col for col in df if col != "foo"]] or whatever. You'll definitely get a better/faster response and from more people if you post a question (it can also have the full problem/error etc.) :) – Andy Hayden Jun 08 '16 at 16:19