I have multiple files with the following naming convention:
ENCSR000EQO_0_0.txt
ENCSR000DIA_0_0.txt
ENCSR000DIA_1_1.txt
ENCSR000DIA_2_1.txt
ENCSR000DIM_0_0.txt
ENCSR000DIM_1_1.txt
ENCSR000AIB_0_0.txt
ENCSR000AIB_1_1.txt
ENCSR000AIB_2_1.txt
ENCSR000AIB_3_1.txt
I want to combine them into pandas DataFrames according to the file name prefix (the experiment ID before the first underscore), so that I end up with 4 DataFrames. Then, for each of these 4, I want to group by the GeneName column, since the same gene appears multiple times.
All files have the same columns in the same order. I can merge all 10 together at once, but I couldn't figure out how to merge them by name.
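import os
import numpy as np
import pandas as pd

path = '/renamed/'
print(os.listdir(path))
df_merge = None  # placeholder for the merged result (not used yet)
for fname in os.listdir(path):
    if fname.endswith('.txt'):
        df = pd.read_csv(path + fname, sep='\t', header=0)
        # Rename the columns; every file has the same columns in this order
        df.columns = ['ID ', 'Chr', 'Start', 'End', 'Strand', 'Peak Score', 'Focus Ratio/Region Size',
                      'Ann', 'DetAnn', 'Distance', 'PromoterID', 'EID',
                      'Unigene', 'Refseq', 'Ensembl', 'GeneName', 'GeneAlias',
                      'GeneDescription', 'GeneType']
        # Average per gene (this happens per file, not per experiment)
        df = df.groupby('GeneName').agg(np.mean)
        print(df)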
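A minimal sketch of one way the grouping by name could work, assuming the experiment ID is everything before the first underscore in the file name and that each file has a header row containing a GeneName column. The helper names experiment_id and merge_by_experiment are made up for illustration and are not part of pandas:

```python
import os
from collections import defaultdict

import pandas as pd


def experiment_id(fname):
    """Return the part of the file name before the first underscore,
    e.g. 'ENCSR000DIA_1_1.txt' -> 'ENCSR000DIA' (assumed naming rule)."""
    return fname.split('_', 1)[0]


def merge_by_experiment(path, key='GeneName'):
    """Concatenate files that share an experiment ID, then average per gene."""
    # Collect one list of DataFrames per experiment ID
    frames = defaultdict(list)
    for fname in sorted(os.listdir(path)):
        if fname.endswith('.txt'):
            df = pd.read_csv(os.path.join(path, fname), sep='\t', header=0)
            frames[experiment_id(fname)].append(df)

    merged = {}
    for exp_id, dfs in frames.items():
        # Stack the rows of all files belonging to this experiment
        stacked = pd.concat(dfs, ignore_index=True)
        # numeric_only=True leaves out text columns such as Chr or Strand
        merged[exp_id] = stacked.groupby(key).mean(numeric_only=True)
    return merged
```

Called as merge_by_experiment('/renamed/') on the ten files listed above, this should return a dict with four keys (ENCSR000EQO, ENCSR000DIA, ENCSR000DIM, ENCSR000AIB), each holding one DataFrame indexed by GeneName.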
Thank you for any input.