2

Actually there are many parts in this question. I have solved some parts by viewing the answer of other stack overflow questions. However, there is still one tiny problem not solved. The output csv file tends to order alphabetically or in other order, which is not what I want.

As an example, I'm going to concatenate two csv files vertically. The two csv look like the following.

    a.csv
    B, A, C, E
    1, 1, 1, 1

    b.csv
    B, A, D, C
    2, 2, 2, 2

The result I'd like to get is

    c.csv
    B, A, D, C, E
    1, 1,  , 1, 1
    2, 2, 2, 2,   

First, I read them into pandas data frames.

    a = pd.read_csv("a.csv")
    b = pd.read_csv("b.csv")

Then concatenate them and write to csv by

    c = pd.concat([a, b], join='outer')
    c.to_csv("c.csv", index=False)

The output csv looks like

    c.csv
    A, C, D, B, E
    1, 1,  , 1, 1
    2, 2, 2,  , 2 

Is there any way to solve the problem? I once thought of something like the code from the answer Preserving column order in Python Pandas DataFrame

    df.to_csv("dfTest.txt","\t",header=True,cols=["b","a","c"], engine='python')

However, there are hundreds of columns in my csv file, I can't manually write down the order of column names. And for each group of files, the column names are different. I tried

    set(a.columns.values).union(list(b.columns.values))

It also doesn't work, because set will disorder the list.

Community
  • 1
  • 1
Chenlu
  • 449
  • 1
  • 6
  • 19

2 Answers2

1

Build up an output order which you can then supply to c.to_csv(...), eg:

from collections import OrderedDict
out_order = OrderedDict.fromkeys(a.columns)
out_order.update(OrderedDict.fromkeys(b.columns))
out_order = list(out_order)
# ['B', 'A', 'C', 'E', 'D']

c.to_csv("c.csv", index=False, columns=out_order)
Jon Clements
  • 138,671
  • 33
  • 247
  • 280
  • Thanks. This is almost what I want. Is there anyway to make the final out_order ['B', 'A', 'D', 'C', 'E'], which doesn't simply put the extra columns from b at the end of c.csv? – Chenlu Aug 15 '16 at 13:20
0

You almost have it with a.columns

col_names = a.columns.tolist() # list of column names
sorted_cols = sorted(col_names)

df.to_csv("dfTest.txt","\t",header=True,cols=sorted_cols, engine='python')

In one line:

df.to_csv("dfTest.txt","\t",
          header=True,
          cols=sorted(a.columns.tolist()),
          engine='python')
rwester
  • 154
  • 9