0

I am currently working with databases for a bioinfomatics class and I am having trouble formatting my SQL output. My python script queries for these tuples. Instead of my output here:

CENPVP2 441495  9606    NR_033773.1 None    NC_000023.11    None
CENPVP2 441495  9606    NR_033773.1 None    NT_011630.15    None
CENPVP2 441495  9606    None    None    NG_022599.1 None
CT47A11 255313  9606    NM_173571.2 NP_775842.2 NC_000023.11    12477932
CT47A11 255313  9606    NM_173571.2 NP_775842.2 NC_000023.11    16382448
CT47A11 255313  9606    NM_173571.2 NP_775842.2 NC_000023.11    18976975
CT47A11 255313  9606    NM_173571.2 NP_775842.2 NG_027735.1 12477932
CT47A11 255313  9606    NM_173571.2 NP_775842.2 NG_027735.1 16382448
CT47A11 255313  9606    NM_173571.2 NP_775842.2 NG_027735.1 18976975
CT47A11 255313  9606    NM_173571.2 NP_775842.2 NT_011786.17    12477932
CT47A11 255313  9606    NM_173571.2 NP_775842.2 NT_011786.17    16382448
CT47A11 255313  9606    NM_173571.2 NP_775842.2 NT_011786.17    18976975
CT47A11 255313  9606    None    None    NG_027735.1 12477932
CT47A11 255313  9606    None    None    NG_027735.1 16382448
CT47A11 255313  9606    None    None    NG_027735.1 18976975

Where each field is separated by a tab character I need this, where each field is still separated by a tab character, but now there are multi-valued fields where the values are separated by pipes and null values are represented by dashes:

CENPVP2 441495 9606 NR_033773.1 - NC_000023.11|NG_022599.1|NT_011630.15 -
CT47A11 255313 9606 NM_173571.2 NP_775842.2 NC_000023.11|NG_027735.1|NT_011786.17 12477932|16382448|18976975

What is the best way to format my output to match the second table?

Here is my python script:

   import sys
   import getopt
   import psycopg2

def writerows(row, outFile):
    outFile.write("%s\t" % row[1])
    outFile.write("%s\t" % row[0])
    outFile.write("%s\t" % row[2])
    outFile.write("%s\t" % row[3])
    outFile.write("%s\t" % row[5])
    outFile.write("%s\t" % row[4])
    outFile.write("%s\n" % row[6])

def usage(err):
    print("I will handle this later")

def main():
    inFile = sys.stdin
    outFile = sys.stdout

try:
    opts, args = getopt.getopt(sys.argv[1:], "i:o:")
except getopt.GetoptError as err:
    usage(err)
    sys.exit(2)
for (opt, arg) in opts:
    if(opt == "-i"):
        inFile = open(arg, "r")
    if(opt == "-o"):
        outFile = open(arg, "w")

line = inFile.readline()
line = line.replace("\n", "")
conn = psycopg2.connect("dbname=********* user=*********** "
                        "password=********** host=localhost")
cursor = conn.cursor()
    while (line):
    cursor.execute("SELECT DISTINCT geneinfo.gene_id, geneinfo.symbol, "
                   "geneinfo.tax_id, gene2refseq.rna_accession, "
                   "gene2refseq.gen_accession, "
                   "gene2refseq.pro_accession, gene2pubmed.pubmed_id FROM "
                   "geneinfo LEFT JOIN gene2refseq ON "
                   "geneinfo.gene_id = gene2refseq.gene_id LEFT JOIN "
                   "gene2pubmed ON geneinfo.gene_id = gene2pubmed.gene_id "
                   "WHERE geneinfo.symbol ILIKE '"+line+"' OR "
                   "geneinfo.synonyms ILIKE '%"+line+"%' ORDER BY "
                   "geneinfo.symbol ASC, geneinfo.tax_id ASC;")
    result = cursor.fetchone()

    if result:
        while result:
            writerows(result, outFile)
            result = cursor.fetchone()
    else:
        outFile.write("\n")

    line = inFile.readline()
    line = line.replace("\n", "")

cursor.close()
conn.close()

   if (__name__=='__main__'):
    main()
Philip Kirkbride
  • 21,381
  • 38
  • 125
  • 225
aschenk04
  • 21
  • 3
  • Where are the tuples? Please show how data is derived? From database? From text file? From pandas? A working script would help for reproducibility. – Parfait Mar 23 '17 at 23:32
  • Data is derived from databases, but the specific gene symbol queries are from a text file, hence the "opts" arguments. – aschenk04 Mar 24 '17 at 00:34
  • Adjust your SQL. Look into [group concat methods](http://stackoverflow.com/questions/2560946/postgresql-group-concat-equivalent) – Parfait Mar 24 '17 at 00:49
  • Thank you, Parfait! I will try that. – aschenk04 Mar 24 '17 at 01:13

1 Answers1

0

What Parfait said in the comments. Query would look something like:

SELECT DISTINCT gi.gene_id, gi.symbol, gi.tax_id,
    rs.rna_accession, rs.gen_accession,
    array_to_string(array_agg(rs.pro_accession), '|')),
    array_to_string(array_agg(pm.pubmed_id), '|'))
FROM geneinfo gi
LEFT JOIN gene2refseq rs ON gi.gene_id=rs.gene_id
LEFT JOIN gene2pubmed pm ON gi.gene_id=pm.gene_id
WHERE gi.symbol ILIKE 'SOMESTUFF' OR
gi.synonyms ILIKE 'OTHERSTUFF'
GROUP BY 1,2,3,4,5
ORDER BY gi.symbol ASC, gi.tax_id ASC;
systemjack
  • 2,815
  • 17
  • 26