2

*New to Python.

I'm trying to merge multiple text files into 1 csv; example below -

filename.csv

Alpha

0
0.1
0.15
0.2
0.25
0.3

text1.txt

Alpha,Beta
0,10
0.2,20
0.3,30

text2.txt

Alpha,Charlie
0.1,5
0.15,15

text3.txt

Alpha,Delta
0.1,10
0.15,20
0.2,50
0.3,10

Desired output in the csv file: -

filename.csv

Alpha  Beta  Charlie  Delta
  0     10     0        0
  0.1    0     5        10
  0.15   0     15       20
  0.2   20     0        50
  0.25   0     0        0
  0.3   30     0        10

The code I've been working with and others that were provided give me an answer similar to what is at the bottom of the page

def mergeData(indir="Dir Path", outdir="Dir Path"):
    dfs = []
    os.chdir(indir)
    fileList=glob.glob("*.txt")
    for filename in fileList:
        left= "/Path/Final.csv"
        right = filename
        output = "/Path/finalMerged.csv"
        leftDf = pandas.read_csv(left)
        rightDf = pandas.read_csv(right)
        mergedDf = pandas.merge(leftDf,rightDf,how='inner',on="Alpha", sort=True)
        dfs.append(mergedDf)
    outputDf = pandas.concat(dfs, ignore_index=True)
    outputDf = pandas.merge(leftDf, outputDf, how='inner', on='Alpha', sort=True, copy=False).fillna(0)
    print (outputDf)

    outputDf.to_csv(output, index=0)

mergeData()

The answer I get however is instead of the desired result: -

Alpha  Beta  Charlie  Delta
  0     10     0        0
  0.1    0     5        0
  0.1    0     0        10
  0.15   0     15       0
  0.15   0     0        20
  0.2   20     0        0
  0.2    0     0        50
  0.25   0     0        0
  0.3   30     0        0
  0.3    0     0        10
Max
  • 43
  • 7

2 Answers2

0

IIUC you can create list of all DataFrames - dfs, in loop append mergedDf and last concat all DataFrames to one:

import pandas
import glob
import os

def mergeData(indir="dir/path", outdir="dir/path"):
    dfs = []
    os.chdir(indir)
    fileList=glob.glob("*.txt")
    for filename in fileList:
        left= "/path/filename.csv"
        right = filename
        output = "/path/filename.csv"
        leftDf = pandas.read_csv(left)
        rightDf = pandas.read_csv(right)
        mergedDf = pandas.merge(leftDf,rightDf,how='right',on="Alpha", sort=True)
        dfs.append(mergedDf)
    outputDf = pandas.concat(dfs, ignore_index=True)
    #add missing rows from leftDf (in sample Alpha - 0.25) 
    #fill NaN values by 0
    outputDf = pandas.merge(leftDf,outputDf,how='left',on="Alpha", sort=True).fillna(0)
    #columns are converted to int
    outputDf[['Beta', 'Charlie']] = outputDf[['Beta', 'Charlie']].astype(int) 
    print (outputDf)

    outputDf.to_csv(output, index=0)

mergeData()

   Alpha  Beta  Charlie
0   0.00    10        0
1   0.10     0        5
2   0.15     0       15
3   0.20    20        0
4   0.25     0        0
5   0.30    30        0

EDIT:

Problem is you change parameter how='left' in second merge to how='inner':

def mergeData(indir="Dir Path", outdir="Dir Path"):
    dfs = []
    os.chdir(indir)
    fileList=glob.glob("*.txt")
    for filename in fileList:
        left= "/Path/Final.csv"
        right = filename
        output = "/Path/finalMerged.csv"
        leftDf = pandas.read_csv(left)
        rightDf = pandas.read_csv(right)
        mergedDf = pandas.merge(leftDf,rightDf,how='inner',on="Alpha", sort=True)
        dfs.append(mergedDf)
    outputDf = pandas.concat(dfs, ignore_index=True)
    #need left join, not inner
    outputDf = pandas.merge(leftDf, outputDf, how='left', on='Alpha', sort=True, copy=False)
                     .fillna(0)
    print (outputDf)

    outputDf.to_csv(output, index=0)

mergeData()
   Alpha  Beta  Charlie  Delta
0   0.00  10.0      0.0    0.0
1   0.10   0.0      5.0    0.0
2   0.10   0.0      0.0   10.0
3   0.15   0.0     15.0    0.0
4   0.15   0.0      0.0   20.0
5   0.20  20.0      0.0    0.0
6   0.20   0.0      0.0   50.0
7   0.25   0.0      0.0    0.0
8   0.30  30.0      0.0    0.0
9   0.30   0.0      0.0   10.0
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • That did work. The issue is that its appending the way below. – Max Sep 05 '16 at 07:47
  • I dont understand. What is problem? – jezrael Sep 05 '16 at 07:50
  • That did work. It is how ever concatenating the data as below Alpha Beta Charlie 0 10 0 0.1 0 0 0.15 0 0 0.2 20 0 0 0 0 0.1 0 5 0.15 0 15 0.2 0 0 – Max Sep 05 '16 at 07:56
  • I little changed solution - do you use last version? I add comments to code what what changed. – jezrael Sep 05 '16 at 07:58
  • Got it. Thank you. – Max Sep 05 '16 at 08:03
  • Yea I just had to comment out outputDf[['Beta', 'Charlie']] = outputDf[['Beta', 'Charlie']].astype(int) Working with close to 2000 files. – Max Sep 05 '16 at 08:04
  • Yes, it can be removed without problem. Thank you for accepting! – jezrael Sep 05 '16 at 08:05
  • Jezrael, as soon as I start to merge multiple files, its appending and adding rows of the same alpha value still for each file. For example, if two files have values against 0.1, it'll add it the 0.1 row values twice, as well as for each other file (values of 0). I'm trying using the join method instead, which is a little better, only doing the former, but still have multiple rows for each alpha point. – Max Sep 05 '16 at 18:34
  • Ok, what happen if duplicates between files? Can you add some sample and desired output? Or better create new question? – jezrael Sep 06 '16 at 05:43
  • Updated the question to show the result and desired output. – Max Sep 06 '16 at 18:05
  • Even with inner, its the same issue. I was playing around to see if the result changes whether I had inner or left/right/outer. its the same for that merge. If you see the result you posted its still adding a new row for the same alpha value for different files. – Max Sep 07 '16 at 05:38
  • Interesting, for me it works nice. Second `merge` with `left` only add all rows, which are not in all looping files (in sample `0.25`). So there is problem? Or something different? What is your pandas version? I test it in `pandas 0.18.1`. – jezrael Sep 07 '16 at 08:17
  • I'm not sure what it is, I've been combing over it for a couple of days now. The only difference is the number of files I'm using (5, instead of 3 I've listed here and more rows). using 0.18.1 as well. – Max Sep 07 '16 at 14:56
  • So it works with sample `3` files, but not with real `txt`? – jezrael Sep 07 '16 at 14:57
  • It is weird. Pandas versions is same? If data are not confidental, you can send it by email in my profile. – jezrael Sep 07 '16 at 18:57
  • it works if you the files dont have the same "alpha" value. If they do, then the same alpha value gets appended in the next row for each file. All this when reading in text files – Max Sep 07 '16 at 19:01
  • Yes, I can send it over. – Max Sep 07 '16 at 19:01
  • And please send me some sample of desired output for checking if it works right. Thanks. – jezrael Sep 07 '16 at 19:03
0
import pandas as pd
data1 = pd.read_csv('samp1.csv',sep=',')
data2 = pd.read_csv('samp2.csv',sep=',')
data3 = pd.read_csv('samp3.csv',sep=',')
df1 = pd.DataFrame({'Alpha':data1.Alpha})
df2 = pd.DataFrame({'Alpha':data2.Alpha,'Beta':data2.Beta})
df3 = pd.DataFrame({'Alpha':data3.Alpha,'Charlie':data3.Charlie})
mergedDf = pd.merge(df1, df2, how='outer', on ='Alpha',sort=False)
mergedDf1 = pd.merge(mergedDf, df3, how='outer', on ='Alpha',sort=False)
a = pd.DataFrame(mergedDf1)
print(a.drop_duplicates())

output:
  Alpha  Beta  Charlie
0   0.00  10.0      NaN
1   0.10   NaN      5.0
2   0.15   NaN     15.0
3   0.20  20.0      NaN
4   0.25   NaN      NaN
5   0.30  30.0      NaN
SUNITHA K
  • 150
  • 1
  • 3