I am trying to write a script that takes, as input, the genes observed in any genome, counts the number of times each gene is observed, and then combines all the input counts into a data frame with unique genomes in the columns, unique genes in the rows, and the count of each gene in the matrix. I was able to write such a script for a simple two-genome example. In the attached tar.gz file (tony_script.tar.gz), OUT.txt and OUT2.txt are the gene lists from two different genomes, and the script that works on them is called "gene_table.py". My script does what I want, but if I extend it to hundreds of genomes (or text files), it becomes difficult, or impractical, to load each text file individually. I have figured out how to read several text files into Python (glob), open them, and read them, but I am having difficulty figuring out how to store their contents independently and then perform the counting and data-frame steps; a rough sketch of what I have in mind follows the script below.
#simple example for creating a genome-protein coding gene table
#using python dictionaries
#start with just two genomes, reading in each separately
#reads in genes from sorted gene lists, counts genes, stores in
#dictionary, combines dictionaries into table
import pandas as pd
from collections import Counter

#read the gene list for genome 1 and strip newlines
with open("OUT.txt") as f:
    content = f.readlines()
content = [x.strip() for x in content]
#count genes in first genome
genome_1 = Counter(content)
#open genome 2
with open("OUT2.txt") as f2:
content2 = f2.readlines()
content2 = [x.strip() for x in content2]
from collections import Counter
genome_2 = Counter(content2)
mydicts = [genome_1, genome_2]
#make the dataframe
df = pd.concat([pd.Series(d) for d in mydicts],
               axis=1).fillna(0).T
df.index = ['genome_1', 'genome_2']
print(df)
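
To make the question concrete, here is a rough, untested sketch of the kind of generalization I am after. It assumes the gene lists all sit in the working directory, match a pattern like "OUT*.txt" (just a placeholder), and contain one gene per line, and it uses the file name as the genome label:

#minimal sketch: read every matching gene list, count genes per file,
#and combine the Counters into one genome x gene count table
import glob
import os
from collections import Counter
import pandas as pd

counts = {}  #genome label -> Counter mapping gene -> count
for path in sorted(glob.glob("OUT*.txt")):
    with open(path) as handle:
        genes = [line.strip() for line in handle if line.strip()]
    #use the file name (without extension) as the genome label
    genome = os.path.splitext(os.path.basename(path))[0]
    counts[genome] = Counter(genes)

#dict of Counters -> DataFrame: one column per genome, one row per
#unique gene, genes missing from a genome filled with 0
df = pd.DataFrame(counts).fillna(0).astype(int)
print(df)

Because the table is built from a dictionary of Counters, the genomes end up in the columns and the unique genes in the rows; adding .T would match the orientation of the two-genome script above.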