
I have multiple files (20) in a directory, each with 2 columns, for example:

transcript_id value
ENMUST001     2
ENMUST003     3
ENMUST004     5

The number of rows differs in each file. What I would like to do is merge all 20 files into one big matrix like this:

transcript_id value_file1 value_file2....value_file20
ENMUST001     2  3 
ENMUST003     3  4
ENMUST004     5  0

Collect all the IDs from the transcript_id column and the corresponding values from each file (using the filename as the column name), and if there is no value, use 0.

I tried to do this using pandas:

import os
import glob
import pandas as pd
path = 'pathtofiles'
transFiles = glob.glob(path + "*.tsv")
df_files = []
for file in transFiles:
    df = pd.read_csv(file, sep='\t')
    df.set_index('transcript_id')
    df_files.append(df)
df_combine = pd.concat(df_files, axis=1).fillna(0) 

Error:
ValueError: No objects to concatenate

Wondering if a non-pandas way is a better approach? Any pseudo code is appreciated.

Edit

Output requested in the comments:

df = df.set_index('transcript_id')
print(df.shape)

    (921, 1)
    (1414, 1)
    (659, 1)
    (696, 1)
    (313, 1)
print(df.index.is_unique)

    (921, 1)
    False
    (1414, 1)
    False
    (659, 1)
    False
    (696, 1)
    False
    (313, 1)
    False
df = df.drop_duplicates(inplace=True)
df_files.append(df)
df_combine = pd.concat(df_files, axis=1).fillna(0)

New error
ValueError: All objects passed were None

Shapes before and after dropping duplicates:

before:  (921, 1)
after:  (914, 1)
before:  (1414, 1)
after:  (1410, 1)
before:  (659, 1)
after:  (658, 1)
before:  (696, 1)
after:  (694, 1)
before:  (313, 1)
after:  (312, 1)
sid

1 Answer


The default behaviour for `set_index` is `inplace=False`, so try replacing `df.set_index('transcript_id')` with `df = df.set_index('transcript_id')`. You can also remove duplicate values in the index using `df = df[~df.index.duplicated(keep='first')]`:

import os
import glob
import pandas as pd

path = 'pathtofiles'
transFiles = glob.glob(path + "*.tsv")
df_files = []
for file in transFiles:
    df = pd.read_csv(file, sep='\t')
    df = df.set_index('transcript_id') # set index
    df = df[~df.index.duplicated(keep='first')] # remove duplicates
    df.columns = [os.path.split(file)[-1]] # set column name to filename
    df_files.append(df)
df_combine = pd.concat(df_files, axis=1).fillna(0) 
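
Note that `pd.concat(df_files, axis=1)` does an outer join on the index, so the result contains the union of all transcript IDs across the files, and `fillna(0)` fills in the 0 for IDs that are missing from a given file, which is the output the question asks for. If you still see `ValueError: No objects to concatenate`, it usually means `df_files` is empty because the glob pattern matched no files; `glob.glob(os.path.join(path, "*.tsv"))` is a safer pattern than `path + "*.tsv"` when `path` has no trailing separator.
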
vahndi
  • Hi, thank you for the reply. I found this error: `ValueError: Shape of passed values is (5, 3063), indices imply (5, 3056)` when I changed it. I also tried `df_combine = pd.concat(df_files, axis=0).fillna(0)` and I get only two columns as output – sid Aug 05 '17 at 17:02
  • Can you add a line `print(df.shape)` after `df = df.set_index('transcript_id')` and tell me what you get? – vahndi Aug 05 '17 at 17:10
  • Hi, I put the `df.shape` output in my question – sid Aug 05 '17 at 17:13
  • Sounds like you might have duplicate indexes. Try adding `print(df.index.is_unique)`. If it is `False` for any df then you can do `df.drop_duplicates(inplace=True)`. – vahndi Aug 05 '17 at 17:23
  • Updated the output in my question. Sorry, I don't know why I get this error – sid Aug 05 '17 at 17:31
  • No worries - I might have led you astray there. Try `df = df[~df.index.duplicated(keep='first')]` to remove duplicate index entries (from https://stackoverflow.com/questions/13035764/remove-rows-with-duplicate-indices-pandas-dataframe-and-timeseries/34297689#34297689) – vahndi Aug 05 '17 at 17:38
  • Added in the answer above – vahndi Aug 05 '17 at 17:53
  • I accepted your answer as it is correct. Many thanks for replying on a weekend. – sid Aug 05 '17 at 18:01
  • Not a problem!! – vahndi Aug 05 '17 at 18:11
  • I had a question: when I total all the rows before the merge it is `4003`, whereas the total after the merge is `3056`. Am I losing some rows, or can I see why the rows are missing? – sid Aug 05 '17 at 19:50
  • Hard to say without having the data. Try printing the number of rows in each dataframe before and after dropping duplicates. You should also consider whether you really want to drop duplicates or whether to perform an aggregation operation like sum or mean using the groupby function instead of just setting the index and dropping duplicates. – vahndi Aug 05 '17 at 20:11
  • You're right, I am finding duplicates (see the question, I printed the shapes before and after the drop method). I do not want to drop them, this is tricky. Any method I can use? – sid Aug 05 '17 at 21:38
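
If the duplicated transcript IDs should be kept rather than dropped, one option (as suggested in the last comment above) is to aggregate them before concatenating. A minimal sketch, assuming that summing the values of duplicate IDs is the right behaviour for this data (swap `sum` for `mean` or another aggregation if it is not):

import os
import glob
import pandas as pd

path = 'pathtofiles'
transFiles = glob.glob(path + "*.tsv")
df_files = []
for file in transFiles:
    df = pd.read_csv(file, sep='\t')
    df = df.groupby('transcript_id').sum()  # combine duplicate ids instead of dropping them
    df.columns = [os.path.split(file)[-1]]  # set column name to filename
    df_files.append(df)
df_combine = pd.concat(df_files, axis=1).fillna(0)

Note that the merged frame still has one row per unique transcript_id across all files, so its row count will be smaller than the sum of the per-file row counts whenever an ID appears in more than one file; that on its own does not mean data is being lost.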