2

I have several data files (from excel) converted to pandas format and contained in a dictionary. Here is how i obtain the dataframes:

dataList = files

nameList = []
for raster in dataList:
        path_list = raster.split(os.sep)
        name = path_list[5][:-4]
        nameList.append(name)

dataDct = {}
for k, v in zip(nameList,dataList):
    dataDct[k] = read_excel(v).rename(columns={'Main Value': 'Main Value '+k, 'Auxiliary Value': 'Auxiliary Value '+k})

I combine them using: (the dictionary being dataDct )

concat(dataDct.values(), join='outer', ignore_index=False).to_excel(writer, sheet_name='values').to_excel(writer, sheet_name='values')
writer.save()

This outputs a file wich only puts the files one after the other in a long file, disregarding the common fields they share... So it's difficult to use it for further analysis.

Here is a sample of the dataframes contained in the dataDct:

IN [2]:
value(HIB)
      1  CODE        VALUE_HIB       AUX_VAL_HIB
      2  F.F         00000    
      3  0.0.1   
      4  0.0.2       06-02-2016
      5  C.6         XYZ-21555FFF
      6  3.8.0*1     45000GHZ        01.01.2016

   Value (HIC)
      1  CODE        VALUE_HIC       AUX_VAL_HIC
      2  F.F         00000           111111
      3  0.0.1   
      4  0.0.3       06-02-2016
      5  C.6         XYZ-216666FFF
      6  3.9.0*1     65000GHZ        01.02.2016

   Value (HID)
      1   CODE        VALUE_HID       AUX_VAL_HID
      2  F.F         00000           0101010
      3  A.1.1       85 GHZ
      4  V.1.1       06-02-2016
      5  C.6         XYZ-21776FFF
      6  3.9.0*1     3000GHZ        01.02.2016

Expected output would be:

OUT[2]:
1  CODE    VALUE_HIB   AUX_VAL_HIB  VALUE_HIC   AUX_VAL_HIC  VALUE_HID   AUX_VAL_HID
2  F.F     00000                    00000       111111       00000      01010101
3  0.0.1   
4  0.0.2   06-02-2016
5  0.0.3                             06-02-2016
6  A.1.1                                                      85ghz
7  C.6     XYZ-21555FFF             XYZ-216666FFF            XYZ-21776FFF
8  V.1.1                                                      06-02-2016
9  3.8.0*1 45000GHZ     01.01.2016
10  3.9.0*1                          65000GHZ     01.02.2016   3000GHZ          01.02.2016

The idea being that the data is aligned and joined... I have tried joining on the axis:

concat(dataDct.values(), join='outer', ignore_index=False, axis=1).to_excel(writer, sheet_name='values')

The data is better displayed but is not joined on similar values :( Only joined on DAATAFRAMES displaying one after another... here is the example output of the error:

1  CODE        VALUE_HIB       AUX_VAL_HIB          VALUE_HIC       AUX_VAL_HIC
      2  F.F         00000    
      3  0.0.1   
      4  0.0.2       06-02-2016
      5  C.6         XYZ-21555FFF
      6  3.8.0*1     45000GHZ        01.01.2016



        F.F                                           00000           111111
        0.0.1   
       0.0.3                                          06-02-2016
        C.6                                                           XYZ-216666FFF
       3.9.0*1                                         65000GHZ        01.02.2016

Any ideas on what would be the correct syntax to combine the dictionary and obtain the desired output?

did12345
  • 57
  • 1
  • 8
  • Could you provide a minimal example, for instance with two small dataframes in `dataDct`, and also show what result you would expect from combining them? – IanS Feb 24 '16 at 04:15
  • Just edited the code with clear examples of the data and output desired – did12345 Feb 25 '16 at 20:19
  • So each of the values in `dataDct` are `DataFrames`? – Kartik Feb 25 '16 at 21:10

2 Answers2

0

Umm, I think you are concatenating on the wrong axis. Your concatenation is happening along the columns, so each DataFrame is being put under another DataFrame. I think you are looking for concatenation along the rows, where each DataFrame is aligned and merged with the other DataFrames by row, so that you get a wide table, rather than a long table.

Try:

concat(dataDct.values(), axis=1, join='outer', join_axes='CODE', ignore_index=False).to_excel(writer, sheet_name='values')
writer.save()

Let me know if it doesn't work. If this is not the answer you are looking for, please edit your question to provide a Minimal Complete and Verifiable Example. Also, please adhere to the Pandas question guidelines.

Community
  • 1
  • 1
Kartik
  • 8,347
  • 39
  • 73
  • just edited the question to make it clearer, the code worked but sadly did not join the similar values, would the indexing on the dataframes be a part of this? – did12345 Feb 25 '16 at 20:18
  • I hope that my edition is correctly edited to clearly define the issue – did12345 Feb 25 '16 at 20:19
  • Each of the values in `dataDct` are `DataFrames`, correct? If so, pass `join_axes='CODE'` as in my edited answer. – Kartik Feb 25 '16 at 21:10
  • i got the following error: `AssertionError: length of join_axes must not be equal to 1` – did12345 Feb 26 '16 at 02:00
0

What worked for me was setting the column CODE as index:

for key in dataDct:
    dataDct[key].set_index('CODE', inplace=True)

Then the following worked as expected:

pd.concat(dataDct.values(), axis=1)

If you have NaN values you can remove them by appending .fillna("").


Edit: Based on your comment it seems that you have duplicate values in the CODE column. One solution would be to drop the duplicates:

for key in dataDct:
    dataDct[key].set_index('CODE', inplace=True)
    dataDct[key].drop_duplicates(inplace=True)
IanS
  • 15,771
  • 9
  • 60
  • 84
  • now i get this error: `ValueError: Shape of passed values is (6, 536), indices imply (6, 490)` :( – did12345 Feb 26 '16 at 18:12
  • Can you check if you have duplicate values in your `CODE` column? Or `NaN` values? If so, you'd have to filter them out first. – IanS Feb 26 '16 at 18:17
  • indeed there are duplicate values after the concatenation is done... so i'd be using the function `drop_duplicates` i suppose, – did12345 Feb 26 '16 at 22:10
  • You'd have to post example data that reproduces the error, otherwise it's just guessing. – IanS Feb 29 '16 at 05:18
  • just added the actual output – did12345 Mar 15 '16 at 00:16
  • @did12345: I wanted input that reproduces your error. That being said I'm pretty sure it comes from duplicates, so I've edited my answer accordingly. – IanS Mar 15 '16 at 10:35