How to combine 2 csv files with common column value, but both files have different number of lines

Question

file1.csv contains 2 columns: c11;c12
file2.csv contains 2 columns: c21;c22
Common column: c11, c21

Example:

f1.csv

a;text_a            
b;text_b            
f;text_f            
x;text_x

f2.csv

a;path_a
c;path_c
d;path_d
k;path_k
l;path_l
m:path_m

Output f1+f2:

a;text_a;path_a
b;text_b,''
c;'';path_c
d;'';path_d
f;text_f;''
k;'';path_k
l;'';path_l
m;'';path_m
x;text_x;''

How to realize it using python?

If you need just this, have a look at the command-line `join` tool: http://linux.die.net/man/1/join — eumiro, Aug 27 '12 at 10:41
Thank you for the suggestion, but an example how to use join command for this case is very welcome — user1042891, Aug 27 '12 at 13:30

score 4 · Answer 1 · answered Aug 27 '12 at 11:15

This is quite easily done with the csv module:

import csv

with open('file1.csv') as f:
    r = csv.reader(f, delimiter=';')
    dict1 = {row[0]: row[1] for row in r}

with open('file2.csv') as f:
    r = csv.reader(f, delimiter=';')
    dict2 = {row[0]: row[1] for row in r}

keys = set(dict1.keys() + dict2.keys())
with open('output.csv', 'wb') as f:
    w = csv.writer(f, delimiter=';')
    w.writerows([[key, dict1.get(key, "''"), dict2.get(key, "''")]
                 for key in keys])

Thank you for the response. I have an additional question. if file2.csv has 3 columns i.s.o 2 columns, the other conditions are same. Has this a great impact on the code? — user1042891, Aug 27 '12 at 12:32

score 1 · Answer 2 · answered Apr 22 '15 at 00:46

For merging multiple files (even > 2) based on one or more common columns, one of the best and efficient approaches in python would be to use "brewery". You could even specify what fields need to be considered for merging and what fields need to be saved.

import brewery
from brewery
import ds
import sys

sources = [
    {"file": "grants_2008.csv",
     "fields": ["receiver", "amount", "date"]},
    {"file": "grants_2009.csv",
     "fields": ["id", "receiver", "amount", "contract_number", "date"]},
    {"file": "grants_2010.csv",
     "fields": ["receiver", "subject", "requested_amount", "amount", "date"]}
]

Create list of all fields and add filename to store information about origin of data records.Go through source definitions and collect the fields:

for source in sources:
    for field in source["fields"]:
        if field not in all_fields:

out = ds.CSVDataTarget("merged.csv")
out.fields = brewery.FieldList(all_fields)
out.initialize()

for source in sources:

    path = source["file"]

# Initialize data source: skip reading of headers
# use XLSDataSource for XLS files
# We ignore the fields in the header, because we have set-up fields
# previously. We need to skip the header row.

    src = ds.CSVDataSource(path,read_header=False,skip_rows=1)

    src.fields = ds.FieldList(source["fields"])

    src.initialize()


    for record in src.records():

   # Add file reference into ouput - to know where the row comes from
    record["file"] = path

        out.append(record)

# Close the source stream

    src.finalize()


cat merged.csv | brewery pipe pretty_printer

How to combine 2 csv files with common column value, but both files have different number of lines

2 Answers2

Linked