I am trying to write a script that takes, as input, the genes observed in each genome, counts the number of times each gene is observed, and then combines all the counts into a data frame with unique genomes in columns and unique genes in rows, where each cell holds the count of that gene in that genome. I was able to write a script that does this for a simple two-genome example. In the attached *gz file (tony_script.tar.gz), OUT.txt and OUT2.txt are the gene lists from two different genomes. The script is called "gene_table.py". It does what I want, but if I extend it to hundreds of genomes (or text files), it becomes impractical to load each text file by hand. I am having difficulty figuring out how to do this over numerous text files. I figured out how to read several text files into Python (with glob), open them, and read them.

But how do I store their contents independently and then perform the counting and data frame steps?

 # Simple example for creating a genome/protein-coding gene table
 # using Python dictionaries.
 # Start with just two genomes, reading in each separately:
 # read genes from sorted gene lists, count genes, store the counts in
 # dictionaries, and combine the dictionaries into a table.
 from collections import Counter

 import pandas as pd

 # open genome 1
 with open("OUT.txt") as f:
     content = f.readlines()
 content = [x.strip() for x in content]
 # count genes in first genome
 genome_1 = Counter(content)

 # open genome 2
 with open("OUT2.txt") as f2:
     content2 = f2.readlines()
 content2 = [x.strip() for x in content2]
 # count genes in second genome
 genome_2 = Counter(content2)

 mydicts = [genome_1, genome_2]
 # make the dataframe: one row per genome, one column per gene
 df = pd.concat([pd.Series(d) for d in mydicts], axis=1).fillna(0).T
 df.index = ['genome_1', 'genome_2']
 print(df)
Roman Pokrovskij
tonyMane
  • See https://stackoverflow.com/q/1373164/3001761; **TL;DR**, use a dictionary or other mapping type. – jonrsharpe Jan 03 '19 at 20:51
  • Sorry, new to this. The above text files contain gene names (for example, hypothetical protein, amoA), one per line. – tonyMane Jan 03 '19 at 20:52
  • How are you extracting the data into the OUT.txt and OUT2.txt files? Did you consider keeping each gene list in memory (in a variable or some other data type) temporarily, performing your counting and manipulation there, and then updating the variable with the next gene list once done? This could save you the overhead of maintaining, reading, and writing files. –  Jan 03 '19 at 21:16
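
Following the dictionary suggestion in the comments, one possible sketch is to keep a single dictionary that maps each genome name to its `Counter`, so no per-file variable is needed. The helper names `count_genes` and `build_gene_table` are hypothetical, and the sketch assumes one gene name per line, that blank lines should be ignored, and that the file name without its extension is an acceptable genome label (files sharing a stem would overwrite each other).

```python
import glob
from collections import Counter
from pathlib import Path

import pandas as pd


def count_genes(path):
    """Count how often each gene name (one per line) occurs in one file."""
    with open(path) as handle:
        # Strip newlines and skip blank lines so they are not counted as genes.
        return Counter(line.strip() for line in handle if line.strip())


def build_gene_table(pattern):
    """Return a DataFrame with genes as rows and genomes as columns."""
    # Key each Counter by the file name without its extension (e.g. "OUT").
    counts = {
        Path(path).stem: pd.Series(count_genes(path))
        for path in sorted(glob.glob(pattern))
    }
    # One column per genome; a gene absent from a genome gets a count of 0.
    return pd.DataFrame(counts).fillna(0).astype(int)
```

Calling `build_gene_table("*.txt")` in the directory holding the gene lists would then give one column per file. Append `.T` to the result if genomes should be rows instead, as in the two-genome script above.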

0 Answers