
I want to merge 2 CSV files using some scripting language (like bash or Python).

1st.csv (this data is from a MySQL query)

member_id,name,email,desc
03141,ej,ej@domain.com,cool
00002,jes,jes@domain.com,good
00002,charmie,charm@domain.com,sweet

2nd.csv (from a MongoDB query)

id,address,create_date
00002,someCity,20150825
00003,newCity,20140102
11111,,20150808

The examples are not the actual data, though I know that some of the member_id values from MySQL and the id values from MongoDB are the same. (And I wish my output to be something like this.)

desiredoutput.csv

member_id,name,email,desc,address,create_date
03141,ej,ej@domain.com,cool,,
00002,jes,jes@domain.com,good,someCity,20150825
00002,charmie,charm@domain.com,sweet,,
11111,,,,,20150808

Help will be much appreciated. Thanks in advance.

#########################################################################

#!/usr/bin/python
import csv
import itertools as IT

filenames = ['1st.csv', '2nd.csv']
handles = [open(filename, 'rb') for filename in filenames]
readers = [csv.reader(f, delimiter=',') for f in handles]

with open('desiredoutput.csv', 'wb') as h:
    writer = csv.writer(h, delimiter=',', lineterminator='\n')
    for rows in IT.izip_longest(*readers, fillvalue=['']*2):
        combined_row = []
        for row in rows:
            row = row[:1] # column where I know there is identical data
            if len(row) == 1:
                combined_row.extend(row)
            else:
                combined_row.extend(['']*1)
        writer.writerow(combined_row)

for f in handles:
    f.close()

#########################################################################

I just read and tried (manipulating) this code from this site too.

chareon
  • Have you done anything before asking? – itwasntme Aug 26 '15 at 01:07
  • Yes, I have tried merging using bash; what happens is it merges but creates duplicates rather than merging into a single record. I'm new to scripting, please help – chareon Aug 26 '15 at 01:10
  • Use regular expression to find matching pattern for header in csv. – itwasntme Aug 26 '15 at 01:17
  • @teahaa please edit your question to include the code you tried, I'm sure many here would be happy to help you debug it and fix where it's going wrong, but it's easier to help someone when you can see what's going wrong – Eric Renouf Aug 26 '15 at 01:29
  • @EricRenouf thanks man, I have updated my question. I'm new here, still eagerly learning new things – chareon Aug 26 '15 at 01:38
  • @JonatasCD hmm, the values in my files are pretty much unsorted and some values are not present in each file (the first column, specifically). – chareon Aug 26 '15 at 02:32

2 Answers


Since you haven't posted an attempt, I'll give you a general answer (using Python) to get you started.

  • Create a dict, d
  • Iterate over all the rows of the first file, convert each row into a list and store it in d using member_id as the key and the list as the value.
  • Iterate over all the rows of the second file, convert each row into a list leaving out the id column and update the list under d[id] with the new list if d[id] exists, otherwise store the new list under d[id].
  • Finally, iterate over the values in d and print them out comma separated to a file (see the sketch after this list).
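
A minimal sketch of those steps might look like this (note it keeps only one row per member_id and treats the header lines as ordinary data, so take it as a starting point rather than a finished script):

import csv

d = {}

# 1st pass: index the mysql rows by member_id.
with open("1st.csv", newline="") as f:
    for row in csv.reader(f):
        d[row[0]] = row

# 2nd pass: extend matching rows with address/create_date, or store a
# stub row when the id only exists in 2nd.csv.
with open("2nd.csv", newline="") as f:
    for row in csv.reader(f):
        d.setdefault(row[0], [row[0], "", "", ""]).extend(row[1:])

# Write everything out, padding rows that never got the extra columns.
with open("out.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for row in d.values():
        writer.writerow(row + [""] * (6 - len(row)))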

Edit

In your attempt, you are trying to use izip_longest to iterate over the rows of both files at the same time. But this would work only if there were an equal number of rows in both files and they were in the same order.
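
For illustration, here is a toy example (made-up row lists standing in for the two csv readers) showing that the pairing is purely positional, never keyed on the id:

from itertools import zip_longest  # izip_longest in Python 2

rows1 = [["03141", "ej"], ["00002", "jes"], ["00002", "charmie"]]
rows2 = [["00002", "someCity"], ["00003", "newCity"], ["11111", ""]]

# 03141 lines up with 00002 simply because both come first; the ids are
# never compared to each other.
for left, right in zip_longest(rows1, rows2, fillvalue=[""] * 2):
    print(left, right)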

Anyhow, here is one way of doing it.

Note: This is using the Python 3.4+ csv module. For 2.7 it might look a little different.

import csv

d = {}

# Index the first file by member_id; keeping a list of rows per key
# preserves duplicate member_ids.
with open("file1.csv", newline="") as f:
  for row in csv.reader(f):
    d.setdefault(row[0], []).append(row + [""] * 2)

# Fill address/create_date into every row with a matching id, or add a
# stub row when the id only appears in the second file.
with open("file2.csv", newline="") as f:
  for row in csv.reader(f):
    rows = d.setdefault(row[0], [[row[0], "", "", "", "", ""]])
    for old_row in rows:
      old_row[4:] = row[1:]

with open("out.csv", "w", newline="") as f:
  writer = csv.writer(f)
  for rows in d.values():
    writer.writerows(rows)
Jashandeep Sohi
  • thanks man, somehow I got the algorithm. Can I have some code I can play and study with? It would be a great aid for me – chareon Aug 26 '15 at 01:45
  • on this code >>> d.setdefault(row[0], [row[0], "", "", ""]).extend(row[1:]) <<< what should I do if I want to retain duplicates in my member_id? thanks man – chareon Aug 26 '15 at 07:09
  • I don't quite understand the question. You have duplicate `member_id`s? – Jashandeep Sohi Aug 26 '15 at 08:01
  • 1st.csv has duplicates in member_id (meaning, for example, that a member_id is common to several rows); I just removed its own ID column (which is unique but has no relation at all to 2nd.csv). The id in 2nd.csv is unique and has the same values as member_id in 1st.csv – chareon Aug 26 '15 at 08:34
  • Oh, didn't see that. That changes things... – Jashandeep Sohi Aug 26 '15 at 08:46
  • I already fixed the problem.. you're a great man, mate! hahaha thanks! But if you want to improve the code, it will be good for other readers. Thanks for helping – chareon Aug 26 '15 at 08:50

Here is a suggestion using pandas that I got from this answer and the pandas docs about merging.

import pandas as pd
first = pd.read_csv('1st.csv')
second = pd.read_csv('2nd.csv')
merged = pd.concat([first, second], axis=1)

This will output:

member_id  name     email             desc   id     address   create_date
3141       ej       ej@domain.com     cool   2      someCity  20150825
2          jes      jes@domain.com    good   3      newCity   20140102
2          charmie  charm@domain.com  sweet  11111  NaN       20150808
Jonatas CD
  • actually what you did is just concatenating them side by side; what I really need is to merge the information where member_id = id, including information with respect to column and row. btw thanks – chareon Aug 26 '15 at 04:09
  • You want to use an outer [`join`](http://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging). So `pd.merge(first, second, how="outer", left_on="member_id", right_on="id")` should do it. – Jashandeep Sohi Aug 26 '15 at 08:27
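
For readers following up on that comment, here is a sketch of the outer join (the column names come from the sample files above; dtype=str is an added assumption so the leading zeros in the ids survive):

import pandas as pd

# dtype=str keeps ids like "00002" from being parsed as integers.
first = pd.read_csv("1st.csv", dtype=str)
second = pd.read_csv("2nd.csv", dtype=str)

# An outer join keeps rows that appear in only one of the two files.
merged = pd.merge(first, second, how="outer", left_on="member_id", right_on="id")
merged.to_csv("desiredoutput.csv", index=False)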