2

I am trying merge multiple files based on a key ('r_id') and rename the column names in the output with the name of the files. I could able to do every thing except renaming the output with the file names. I have the following error probably caused by the old version of Pandas. Does any one know how to fix this with out updating pandas to new version?

Error

    Traceback (most recent call last):                                       
      File "multijoin_2.py", line 19, in <module>                            
        result = merge_files(files).reset_index()                            
      File "multijoin_2.py", line 11, in merge_files                         
        pd.read_csv(f, sep='\t', usecols=['r_id', 'exp'])          
      File "/users/xxx/anaconda/lib/python2.7/site-packages/pandas/core/frame.py", line 2007, in __getattr__
        (type(self).__name__, name))                                                                             
AttributeError: 'DataFrame' object has no attribute 'assign'  

Input

$ cat test1

r_id       g_id exp
r1      g1      20
r2      g1      30
r3      g1      1
r4      g1      3

$ cat test2

r_id       gid exp
r1      g2      20
r2      g2      30
r3      g2      1
r4      g2      3

$ cat test3

r_id       g_id exp
r1      g3      30
r2      g3      40
r3      g3      11
r4      g3      32

Desired Ouput

  r_id  test3  test2  test1
0        r1        30        20        20
1        r2        40        30        30
2        r3        11         1         1
3        r4        32         3         3

Working code (except column naming)

import os
import glob
import pandas as pd

files = glob.glob(r'/path/test*')

def merge_files(files, **kwargs):
    dfs = []
    for f in files:
        dfs.append(
            pd.read_csv(f, sep='\t', usecols=['r_id', 'exp'])
              #.assign(col=0)
              .rename(columns={'col_name':os.path.splitext(os.path.basename(f))[0]})
              .set_index(['repeat_id'])
        )
    return pd.concat(dfs, axis=1)


result = merge_files(files).reset_index()
print(result)
user1703276
  • 353
  • 1
  • 4
  • 14

1 Answers1

2

You need change exp as column name for rename:

def merge_files(files, **kwargs):
    dfs = []
    for f in files:
        dfs.append(
            pd.read_csv(f, sep='\t', usecols=['r_id', 'exp'], index_col=['r_id'])
              .rename(columns={'exp':os.path.splitext(os.path.basename(f))[0]})
        )
    return pd.concat(dfs, axis=1)

result = merge_files(files).reset_index()
print(result)
  r_id  test1  test2  test3
0   r1     20     20     30
1   r2     30     30     40
2   r3      1      1     11
3   r4      3      3     32
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • I am afraid, your code is giving the same output as my script in the question. It is not renaming the column names using the file names. @jezrael – user1703276 Jun 01 '17 at 10:53
  • Do you change `col_name` to `exp` ? – jezrael Jun 01 '17 at 10:56
  • Sure done. One more thing, what if the number of test files are 10k (test1 to test10000) and each file has 200k rows (r1 to r200000)? @jezrael – user1703276 Jun 01 '17 at 13:50
  • Then better is some alternative, e.g. dask. Also you can check [this](https://stackoverflow.com/questions/14262433/large-data-work-flows-using-pandas). – jezrael Jun 01 '17 at 13:53