
I have multiple CSV files with the same number of columns but a different column order in each. I want to merge them, removing duplicates. All of the other solutions here don't consider column order, so the merged output is incorrect. How can I do this from the Windows command line (e.g. logparser) or in bash?

A Python script to achieve this would also do.

stackit
    There is some ambiguity in just saying you want "to merge them removing duplicates" while you also "consider column order". Examples of input files and desired output would help. – John1024 Apr 25 '14 at 05:27

3 Answers


The following script works properly if:

  • the CSV files aren't too big (i.e. they can be loaded into memory)
  • the first row of each CSV contains the column names

You only have to fill in files and final_headers.

import csv

files = ['c1.csv', 'c2.csv', 'c3.csv']
final_headers = ['col1', 'col2', 'col3']

merged_rows = set()
for f in files:
    with open(f, newline='') as csv_in:
        csvreader = csv.reader(csv_in, delimiter=',')
        # Map each header name to its column index in this file
        headers = dict((h, i) for i, h in enumerate(next(csvreader)))
        # Reorder every row to match final_headers; the set drops duplicates
        for row in csvreader:
            merged_rows.add(tuple(row[headers[x]] for x in final_headers))
with open('output.csv', 'w', newline='') as csv_out:
    csvwriter = csv.writer(csv_out, delimiter=',')
    csvwriter.writerows(merged_rows)
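
Note that the script as written does not put a header row in output.csv; if you want one, add csvwriter.writerow(final_headers) before the writerows call. Also, because a set is unordered, the merged rows come out in no particular order.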
franz.fonta
  • how does a named tuple help in detecting duplicates? What if there is some noise in the files? – stackit Apr 25 '14 at 17:42
  • @Ali I made a couple of changes (actually the named tuple wasn't necessary). The duplicates are "detected" by the [set](https://docs.python.org/2/tutorial/datastructures.html#sets), which doesn't allow duplicate elements. Are you expecting noise in the header names? – franz.fonta Apr 25 '14 at 20:43
  • Ya, the set does not allow duplicates, but does the tuple as a whole behave like a string? – stackit Apr 26 '14 at 04:20
  • @Ali I'm not sure I understand what you mean. The tuple is hashable (that's why you can use it as a set value), and two tuples containing the same values are considered equal. With regard to this case, what other tuple behavior leaves you in doubt? – franz.fonta Apr 26 '14 at 06:45

csvkit's csvjoin can do that.

csvjoin -c "Column 1,Column 2" --outer file1.csv file2.csv
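
If duplicate rows survive the join, one way to drop them (a sketch, assuming plain awk is available and that no field contains embedded newlines) is to keep the header line plus the first occurrence of every other line:

csvjoin -c "Column 1,Column 2" --outer file1.csv file2.csv | awk 'NR==1 || !seen[$0]++' > merged.csv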
user4769588

Personally, I would separate the two tasks of merging files and removing duplicates. I would also recommend using a database instead of CSV files if that's an option, since managing columns in a database is easier.
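
If a database is an option, here is a minimal sketch of that route using Python's built-in sqlite3 module (the file, table, and column names are illustrative assumptions, not part of the question):

import csv
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE merged (col1 TEXT, col2 TEXT, col3 TEXT)')
for path in ['c1.csv', 'c2.csv', 'c3.csv']:
    with open(path, newline='') as f:
        # DictReader keys each field by header name, so the column
        # order inside each file no longer matters
        reader = csv.DictReader(f)
        conn.executemany('INSERT INTO merged VALUES (?, ?, ?)',
                         ((r['col1'], r['col2'], r['col3']) for r in reader))
# SELECT DISTINCT does the duplicate removal
with open('output.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerow(['col1', 'col2', 'col3'])
    writer.writerows(conn.execute('SELECT DISTINCT * FROM merged'))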

If you stick with plain CSV files, here is an example using Python's csv library, which is easy to use.

import csv

# srcPath and destPath are placeholders for your input and output paths
with open(srcPath, 'r', newline='') as srcCSV:
    csvReader = csv.reader(srcCSV, delimiter=',')

    with open(destPath, 'w', newline='') as destCSV:
        csvWriter = csv.writer(destCSV, delimiter=',')

        for record in csvReader:
            # writerow takes a single list; index it in whatever order you need
            csvWriter.writerow([record[1], record[3], record[2]])  # ..., record[n]

This allows you to rewrite the columns in any order you choose. The destination CSV could be an existing file that you extend, or a new one with a better format. Using the csv library helps prevent the quoting and escaping mistakes that hand-rolled string splitting tends to introduce.

Once the data is consolidated, you could use the same library to iterate over the single data file to identify records that are identical.
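
A minimal sketch of that duplicate pass ('merged.csv' and 'deduped.csv' are placeholder names; two records count as identical when every field matches):

import csv

seen = set()
with open('merged.csv', newline='') as src, \
        open('deduped.csv', 'w', newline='') as dest:
    csvReader = csv.reader(src, delimiter=',')
    csvWriter = csv.writer(dest, delimiter=',')
    for record in csvReader:
        key = tuple(record)
        if key not in seen:          # keep only the first occurrence
            seen.add(key)
            csvWriter.writerow(record)

The seen set grows with the number of unique rows, so for very large files you could store a fixed-size digest of each row (e.g. from hashlib) instead of the row itself.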

Note: this method reads and writes one line at a time, so it can process files of any size. I used it to consolidate 221 million records from files as large as 6 GB each.

vastlysuperiorman