how to concatenate two csv files and keep the original order of columns in python?

Question

Actually there are many parts in this question. I have solved some parts by viewing the answer of other stack overflow questions. However, there is still one tiny problem not solved. The output csv file tends to order alphabetically or in other order, which is not what I want.

As an example, I'm going to concatenate two csv files vertically. The two csv look like the following.

    a.csv
    B, A, C, E
    1, 1, 1, 1

    b.csv
    B, A, D, C
    2, 2, 2, 2

The result I'd like to get is

    c.csv
    B, A, D, C, E
    1, 1,  , 1, 1
    2, 2, 2, 2,

First, I read them into pandas data frames.

    a = pd.read_csv("a.csv")
    b = pd.read_csv("b.csv")

Then concatenate them and write to csv by

    c = pd.concat([a, b], join='outer')
    c.to_csv("c.csv", index=False)

The output csv looks like

    c.csv
    A, C, D, B, E
    1, 1,  , 1, 1
    2, 2, 2,  , 2

Is there any way to solve the problem? I once thought of something like the code from the answer Preserving column order in Python Pandas DataFrame

    df.to_csv("dfTest.txt","\t",header=True,cols=["b","a","c"], engine='python')

However, there are hundreds of columns in my csv file, I can't manually write down the order of column names. And for each group of files, the column names are different. I tried

    set(a.columns.values).union(list(b.columns.values))

It also doesn't work, because set will disorder the list.

You could read the column names from the first file, turn it into a string list and put it into the `.to_csv` argument `cols` . — meetaig, Aug 12 '16 at 20:51
yes. I used pandas, because I need to automatically match the column names of the two files. — Chenlu, Aug 12 '16 at 20:51
@ meetaig However, the columns in the first file is not all the columns I need for the output. — Chenlu, Aug 12 '16 at 20:52
What do you expect if `a.csv` had columns `B, A, C, E` and `b.csv` had columns `B, A, C, D`? What would be the order then? — piRSquared, Aug 12 '16 at 20:59
@ piRSquared In this case, both B, A, C, D, E and B, A, C, E, D are acceptable to me. — Chenlu, Aug 15 '16 at 13:14

score 1 · Answer 1 · answered Aug 12 '16 at 20:58

1

Build up an output order which you can then supply to c.to_csv(...), eg:

from collections import OrderedDict
out_order = OrderedDict.fromkeys(a.columns)
out_order.update(OrderedDict.fromkeys(b.columns))
out_order = list(out_order)
# ['B', 'A', 'C', 'E', 'D']

c.to_csv("c.csv", index=False, columns=out_order)

answered Aug 12 '16 at 20:58

Jon Clements

138,671
33
247
280

Thanks. This is almost what I want. Is there anyway to make the final out_order ['B', 'A', 'D', 'C', 'E'], which doesn't simply put the extra columns from b at the end of c.csv? – Chenlu Aug 15 '16 at 13:20

score 0 · Answer 2 · answered Aug 12 '16 at 20:57

You almost have it with a.columns

col_names = a.columns.tolist() # list of column names
sorted_cols = sorted(col_names)

df.to_csv("dfTest.txt","\t",header=True,cols=sorted_cols, engine='python')

In one line:

df.to_csv("dfTest.txt","\t",
          header=True,
          cols=sorted(a.columns.tolist()),
          engine='python')

how to concatenate two csv files and keep the original order of columns in python?

2 Answers2