Python 2.7 - merge two CSV files without headers and with two delimiters in the first file

Question

I have one csv test1.csv (I do not have headers in it!!!). I also have as you can see delimiter with pipe but also with exactly one tab after the eight column.

ug|s|b|city|bg|1|94|ON-05-0216  9.72|28|288
ug|s|b|city|bg|1|94|ON-05-0217  9.72|28|288

I have second file test2.csv with only delimiter pipe

ON-05-0216|100|50
ON-05-0180|244|152
ON-05-0219|269|146

So because only one value (ON-05-0216) is being matched from the eight column from the first file and first column from the second file it means that I should have only one value in output file, but with addition of SUM column from the second and third column from second file (100+50).

So the final result is the following:

ug|s|b|city|bg|1|94|ON-05-0216  Total=150|9.72|28|288

or

ug|s|b|city|bg|1|94|ON-05-0216|Total=150    9.72|28|288

whatever is easier.

I though that the best way to use is with pandas. But I stuck with taking multiple delimiters from the first file and how to match columns without column names, so not sure how to continue further.

import pandas as pd

a = pd.read_csv("test1.csv", header=None)
b = pd.read_csv("test2.csv", header=None)
merged = a.merge(b,)
merged.to_csv("output.csv", index=False)

Thank you in advance

Hi yes. One tab. ug|s|b|city|bg|1|94|ON-05-0216 Total=150|9.72|28|288 — Veljko, Dec 21 '18 at 10:25

meW · Accepted Answer · 2018-12-21T14:18:19.697

1

Use:

# Reading files
df1 = pd.read_csv('file1.csv', header=None, sep='|')
df2 = pd.read_csv('file2.csv', header=None, sep='|')

# splitting file on tab and concatenating with rest 
ndf = pd.concat([df1.iloc[:,:7], df1[7].str.split('\t', expand=True), df1.iloc[:,8:]], axis=1)
ndf.columns = np.arange(11)

# adding values from df2 and bringing in format Total=sum
df2.columns = ['c1', 'c2', 'c3']
tot = df2.eval('c2+c3').apply(lambda x: 'Total='+str(x))

# Finding which rows needs to be retained
idx_1 = ndf.iloc[:,7].str.split('-',expand=True).iloc[:,2]
idx_2 = df2.c1.str.split('-',expand=True).iloc[:,2]
idx = idx_1.isin(idx_2)      # Updated
ndf = ndf[idx].reset_index(drop=True)
tot = tot[idx].reset_index(drop=True)

# concatenating both CSV together and writing output csv
ndf.iloc[:,7] = ndf.iloc[:,7].map(str) + chr(9) + tot
pd.concat([ndf.iloc[:,:8],ndf.iloc[:,8:]], axis=1).to_csv('out.csv', sep='|', header=None, index=None)

# OUTPUT
# ug|s|b|city|bg|1|94|ON-05-0216    Total=150|9.72|28|288

edited Dec 21 '18 at 14:18

answered Dec 21 '18 at 10:34

meW

3,832
7
27

Hi. There is one mistake. I do not have ON-05-0217. As I mentioned I should not get in result row ON-05-0217 because it is not being matched with any row in the second file but I am getting as your result also Total=396 and it is wrong because that is for ON-05-0180 value. I hope I did not confuse you? – Veljko Dec 21 '18 at 10:45
Thank you. Please if it is not too complicated, it will mean a lot to me. – Veljko Dec 21 '18 at 10:48
idx = idx_1 == idx_2 ValueError: Can only compare identically-labeled Series objects – Veljko Dec 21 '18 at 12:35
Check if you're getting the last values like 216, 217 in idx1 and 216, 180 in idx2 – meW Dec 21 '18 at 12:38
print (idx_1) 0 0216 1 0217 Name: 2, dtype: object – Veljko Dec 21 '18 at 12:41
print idx_2 0 0216 1 0180 2 0219 Name: 2, dtype: object – Veljko Dec 21 '18 at 12:42
OK I got your problem. There are 2 rows in file1 but 3 rows in file2. I've considered only first two from file2. You can do that too, after reading file2 do `df2=df2.iloc[:len(df1),:]` – meW Dec 21 '18 at 12:46
you are pure genius! I marked answer. Can you explain shortly what np.arange(11), iloc and ndf and tot work? – Veljko Dec 21 '18 at 13:13
I noticed one more issue. If row in the first file is not on the same row in the second file it will not get into the output :( for example like this: file1.csv ug|sbb.serbia|BGD|city|bg-bb-c-1.sbb.rs|1|94.D9.3C.00.00.8D|ON-05-0100 9.72|28|288 ug|sbb.serbia|BGD|city|bg-bb-c-1.sbb.rs|1|94.D9.3C.00.00.8D|ON-05-0101 9.72|28|288 ug|sbb.serbia|BGD|city|bg-bb-c-2.sbb.rs|1|94.D9.3C.00.00.8D|ON-05-0105 9.72|28|288 – Veljko Dec 21 '18 at 13:21
file2.csv ON-05-0100|50|50 ON-05-0101|100|50 ON-05-0102|200|50 ON-05-0105|200|50 – Veljko Dec 21 '18 at 13:21
I'll add outputs to each of the variables that will help you out understanding your asked terms. Plus, to answer your last comment, though it wasn't mentioned earlier, you can use `isin() ` to check if idx2 is in idx1 or vice versa as per you like. – meW Dec 21 '18 at 13:25
Thank you. And sorry on which exact places I should add isin() check? Can you please add that part in code? sorry for additional question – Veljko Dec 21 '18 at 13:38
@Dejan I've updated the answer. Now even if rows mismatch it'll still give output. – meW Dec 21 '18 at 14:19

score 0 · Answer 2 · answered Dec 21 '18 at 10:17

You can use pipe as a delimeter when reading csv pd.read_csv(... sep='|'), and only split the tab separated columns later on by using this example here.

When merging two dataframes, you must have a common column that you will merge on. You could use them as index for easier appending after you do the neccessary math on separate dataframes.

Python 2.7 - merge two CSV files without headers and with two delimiters in the first file

2 Answers2