Have multiple files(20) in a directory with 2 columns, for eg
transcript_id value
ENMUST001 2
ENMUST003 3
ENMUST004 5
number of rows differ in each file what I would like to do is merge all the 20 files in one huge matrix in this way
transcript_id value_file1 value_file2....value_file20
ENMUST001 2 3
ENMUST003 3 4
ENMUST004 5 0
Collect all the ids from transcript_id column and corresponding values from each file(filename as column name) and if there is no value use 0.
I tried to do this using pandas,
import os
import glob
import pandas as pd
path = 'pathtofiles'
transFiles = glob.glob(path + "*.tsv")
df_files = []
for file in transFiles:
df = pd.read_csv(file, sep='\t')
df.set_index('transcript_id')
df_files.append(df)
df_combine = pd.concat(df_files, axis=1).fillna(0)
Error:
ValueError: No objects to concatenate
Wondering if non pandas way is a better approach? any pseudo code is appreciated.
edit
output found
df.set_index('transcript_id')
print (df.shape)
(921, 1)
(1414, 1)
(659, 1)
(696, 1)
(313, 1)
print (df.is_unique)
(921, 1)
False
(1414, 1)
False
(659, 1)
False
(696, 1)
False
(313, 1)
False
df = df.drop_duplicates(inplace=True)
df_files.append(df)
df_combine = pd.concat(df_files, axis=1).fillna(0)
New error
ValueError: All objects passed were None
duplicate print
before: (921, 1)
after: (914, 1)
before: (1414, 1)
after: (1410, 1)
before: (659, 1)
after: (658, 1)
before: (696, 1)
after: (694, 1)
before: (313, 1)
after: (312, 1)