0

I have one csv test1.csv (I do not have headers in it!!!). I also have as you can see delimiter with pipe but also with exactly one tab after the eight column.

ug|s|b|city|bg|1|94|ON-05-0216  9.72|28|288
ug|s|b|city|bg|1|94|ON-05-0217  9.72|28|288

I have second file test2.csv with only delimiter pipe

ON-05-0216|100|50
ON-05-0180|244|152
ON-05-0219|269|146

So because only one value (ON-05-0216) is being matched from the eight column from the first file and first column from the second file it means that I should have only one value in output file, but with addition of SUM column from the second and third column from second file (100+50).

So the final result is the following:

ug|s|b|city|bg|1|94|ON-05-0216  Total=150|9.72|28|288

or

ug|s|b|city|bg|1|94|ON-05-0216|Total=150    9.72|28|288

whatever is easier.

I though that the best way to use is with pandas. But I stuck with taking multiple delimiters from the first file and how to match columns without column names, so not sure how to continue further.

import pandas as pd

a = pd.read_csv("test1.csv", header=None)
b = pd.read_csv("test2.csv", header=None)
merged = a.merge(b,)
merged.to_csv("output.csv", index=False)

Thank you in advance

Veljko
  • 1,708
  • 12
  • 40
  • 80

2 Answers2

1

Use:

# Reading files
df1 = pd.read_csv('file1.csv', header=None, sep='|')
df2 = pd.read_csv('file2.csv', header=None, sep='|')

# splitting file on tab and concatenating with rest 
ndf = pd.concat([df1.iloc[:,:7], df1[7].str.split('\t', expand=True), df1.iloc[:,8:]], axis=1)
ndf.columns = np.arange(11)

# adding values from df2 and bringing in format Total=sum
df2.columns = ['c1', 'c2', 'c3']
tot = df2.eval('c2+c3').apply(lambda x: 'Total='+str(x))

# Finding which rows needs to be retained
idx_1 = ndf.iloc[:,7].str.split('-',expand=True).iloc[:,2]
idx_2 = df2.c1.str.split('-',expand=True).iloc[:,2]
idx = idx_1.isin(idx_2)      # Updated
ndf = ndf[idx].reset_index(drop=True)
tot = tot[idx].reset_index(drop=True)

# concatenating both CSV together and writing output csv
ndf.iloc[:,7] = ndf.iloc[:,7].map(str) + chr(9) + tot
pd.concat([ndf.iloc[:,:8],ndf.iloc[:,8:]], axis=1).to_csv('out.csv', sep='|', header=None, index=None)

# OUTPUT
# ug|s|b|city|bg|1|94|ON-05-0216    Total=150|9.72|28|288
meW
  • 3,832
  • 7
  • 27
  • Hi. There is one mistake. I do not have ON-05-0217. As I mentioned I should not get in result row ON-05-0217 because it is not being matched with any row in the second file but I am getting as your result also Total=396 and it is wrong because that is for ON-05-0180 value. I hope I did not confuse you? – Veljko Dec 21 '18 at 10:45
  • Thank you. Please if it is not too complicated, it will mean a lot to me. – Veljko Dec 21 '18 at 10:48
  • idx = idx_1 == idx_2 ValueError: Can only compare identically-labeled Series objects – Veljko Dec 21 '18 at 12:35
  • Check if you're getting the last values like 216, 217 in idx1 and 216, 180 in idx2 – meW Dec 21 '18 at 12:38
  • print (idx_1) 0 0216 1 0217 Name: 2, dtype: object – Veljko Dec 21 '18 at 12:41
  • print idx_2 0 0216 1 0180 2 0219 Name: 2, dtype: object – Veljko Dec 21 '18 at 12:42
  • OK I got your problem. There are 2 rows in file1 but 3 rows in file2. I've considered only first two from file2. You can do that too, after reading file2 do `df2=df2.iloc[:len(df1),:]` – meW Dec 21 '18 at 12:46
  • you are pure genius! I marked answer. Can you explain shortly what np.arange(11), iloc and ndf and tot work? – Veljko Dec 21 '18 at 13:13
  • I noticed one more issue. If row in the first file is not on the same row in the second file it will not get into the output :( for example like this: file1.csv ug|sbb.serbia|BGD|city|bg-bb-c-1.sbb.rs|1|94.D9.3C.00.00.8D|ON-05-0100 9.72|28|288 ug|sbb.serbia|BGD|city|bg-bb-c-1.sbb.rs|1|94.D9.3C.00.00.8D|ON-05-0101 9.72|28|288 ug|sbb.serbia|BGD|city|bg-bb-c-2.sbb.rs|1|94.D9.3C.00.00.8D|ON-05-0105 9.72|28|288 – Veljko Dec 21 '18 at 13:21
  • file2.csv ON-05-0100|50|50 ON-05-0101|100|50 ON-05-0102|200|50 ON-05-0105|200|50 – Veljko Dec 21 '18 at 13:21
  • I'll add outputs to each of the variables that will help you out understanding your asked terms. Plus, to answer your last comment, though it wasn't mentioned earlier, you can use `isin() ` to check if idx2 is in idx1 or vice versa as per you like. – meW Dec 21 '18 at 13:25
  • Thank you. And sorry on which exact places I should add isin() check? Can you please add that part in code? sorry for additional question – Veljko Dec 21 '18 at 13:38
  • @Dejan I've updated the answer. Now even if rows mismatch it'll still give output. – meW Dec 21 '18 at 14:19
0

You can use pipe as a delimeter when reading csv pd.read_csv(... sep='|'), and only split the tab separated columns later on by using this example here.

When merging two dataframes, you must have a common column that you will merge on. You could use them as index for easier appending after you do the neccessary math on separate dataframes.

Marjan Moderc
  • 2,747
  • 23
  • 44