
I have the following code, which parses a VCF (Variant Call Format) file:

Python code:

import vcf
import psycopg2
datalist  = []
def create_snp_tables() :
    drop_table_query = 'DROP TABLE IF EXISTS sampletable;'
    cursor.execute(drop_table_query)
    create_value_table_query = '''CREATE TABLE IF NOT EXISTS sampletable (as_ID INT, as_NM TEXT, as_DT_ID INT, as_DT_NM TEXT, VCF_ID TEXT, SAMPLE_ID TEXT, VARIANT_ID TEXT, as_DT_LINE_SEQ INT, DATE_START DATE, DATE_END DATE, as_DT_VAL_SEQ INT, as_DT_VA_NM TEXT, as_DT_VALUE TEXT); '''
    cursor.execute(create_value_table_query)
    conn.commit()
def createtupplelist(as_id, vcf_id,as_dt_nm, sample_id, as_dt_va_nm, as_dt_value, as_va_seq, as_dt_va_line_seq):
    variant_id= 'variant_id'
    as_nm = 'as_name'
    datalist.append("({0},'{1}','{2}','{3}','{4}','{5}','{6}','{7}','{8}','{9}','{10}')".format(as_id,str(as_nm),'1',as_dt_nm,vcf_id,sample_id,variant_id,as_dt_va_line_seq,as_va_seq,as_dt_va_nm,as_dt_value))
    if len(datalist)==20:
        insertdata()
def insertdata():
    global datalist 
    iter_datalist=iter(datalist) 
    args_str = ','.join(iter_datalist)
    cursor.execute("INSERT INTO sampletable(as_ID,as_NM,as_DT_ID,as_DT_NM,VCF_ID,SAMPLE_ID,Variant_ID,as_DT_LINE_SEQ,as_DT_VAL_SEQ,as_DT_VA_NM,as_DT_VALUE) VALUES "+args_str)
    print("inserted")
    conn.commit()
    datalist=[]
#read vcf file using pyvcf library
file_name  = 'sample.vcf'
vcf_reader = vcf.Reader(open(file_name, 'r'))
conn = psycopg2.connect(host="localhost",database="mydb", user="postgres", password="pgAdmin")
cursor = conn.cursor()
create_snp_tables()
line_index = 0
as_dt_variant = 'Variant'
index = 0
for record in vcf_reader :
    index=index+1
    line_index += 1
    sample_name = ''
    variable_value = record.CHROM
    variable_name = "CHROM"
    createtupplelist('1', file_name,  as_dt_variant, sample_name, variable_name, variable_value,  str(index), str(line_index))
if datalist:
    insertdata()  # flush the final partial batch (fewer than 20 rows)

This is a sample file which I pass through the script:

Sample VCF File:

    #CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  BA12878.40x.S7508
    chr1    10069   .   A   AC  136.17  RF  AC=1;AN=2;CRF=0.817;DP=7;GC=0.505;MQ=70.3709;MQ0=770;NS=7;QD=1.945;STR_LENGTH=90;STR_PERIOD=6   GT:GQ:DP:MQ:PS:PQ:AD:ADP:AF:ARF:BQ:FRF:MC:MF:SB:RFQUAL:FT   1|0:136:70:36:10069:99:309:944:0.327:0.046:.:0.97:164:0.165:0.000:2.45:RF
    chr1    10069   .   A   AC  136.17  RF  AC=1;AN=2;CRF=0.817;DP=7;GC=0.505;MQ=70.3709;MQ0=770;NS=7;QD=1.945;STR_LENGTH=90;STR_PERIOD=6   GT:GQ:DP:MQ:PS:PQ:AD:ADP:AF:ARF:BQ:FRF:MC:MF:SB:RFQUAL:FT   1|0:136:70:36:10069:99:309:944:0.327:0.046:.:0.97:164:0.165:0.000:2.45:RF
    chr1    10069   .   A   AC  136.17  RF  AC=1;AN=2;CRF=0.817;DP=7;GC=0.505;MQ=70.3709;MQ0=770;NS=7;QD=1.945;STR_LENGTH=90;STR_PERIOD=6   GT:GQ:DP:MQ:PS:PQ:AD:ADP:AF:ARF:BQ:FRF:MC:MF:SB:RFQUAL:FT   1|0:136:70:36:10069:99:309:944:0.327:0.046:.:0.97:164:0.165:0.000:2.45:RF
    chr1    10069   .   A   AC  136.17  RF  AC=1;AN=2;CRF=0.817;DP=7;GC=0.505;MQ=70.3709;MQ0=770;NS=7;QD=1.945;STR_LENGTH=90;STR_PERIOD=6   GT:GQ:DP:MQ:PS:PQ:AD:ADP:AF:ARF:BQ:FRF:MC:MF:SB:RFQUAL:FT   1|0:136:70:36:10069:99:309:944:0.327:0.046:.:0.97:164:0.165:0.000:2.45:RF
    chr1    10069   .   A   AC  136.17  RF  AC=1;AN=2;CRF=0.817;DP=7;GC=0.505;MQ=70.3709;MQ0=770;NS=7;QD=1.945;STR_LENGTH=90;STR_PERIOD=6   GT:GQ:DP:MQ:PS:PQ:AD:ADP:AF:ARF:BQ:FRF:MC:MF:SB:RFQUAL:FT   1|0:136:70:36:10069:99:309:944:0.327:0.046:.:0.97:164:0.165:0.000:2.45:RF

Output in my Postgres table, sampletable:

as_id   as_nm       as_dt_id    as_dt_nm    vcf_id      sample_id   variant_id  as_dt_line_seq  date_start  date_end    as_dt_val_seq   as_dt_va_nm     as_dt_value

1       as_name     2           Variant     sample.vcf  ""          variant_id  1               null        null        1               CHROM           chr1
1       as_name     2           Variant     sample.vcf  ""          variant_id  1               null        null        1               POS             10069
1       as_name     2           Variant     sample.vcf  ""          variant_id  1               null        null        1               ID              None
1       as_name     2           Variant     sample.vcf  ""          variant_id  1               null        null        1               REF             A
1       as_name     2           Variant     sample.vcf  ""          variant_id  1               null        null        1               ALT             AC
1       as_name     2           Variant     sample.vcf  ""          variant_id  1               null        null        1               QUAL            136.17
1       as_name     2           Variant     sample.vcf  ""          variant_id  1               null        null        1               FILTER          RF

My Python code is running very slowly: it inserts about 1,000 records in 5 minutes, and I have more than 5 million records to load.

I am looking for help optimizing the Python code so that the inserts run faster. Please suggest.

Timur Shtatland
dang
  • Please indent your code properly. – PiCTo Dec 05 '19 at 17:34
  • @PiCTo done, please check – dang Dec 05 '19 at 20:13
  • Your code still isn't properly indented. Unlike C-like languages, indentation is an inherent part of the syntax of Python. Therefore, the behaviour of your code relies on it, hence the importance of proper formatting. Please fix it, include all relevant code and remove any code that doesn't directly concern your question. – PiCTo Dec 05 '19 at 21:15
  • Also, you state that you have written this code to parse VCF files. Have you considered using an [existing Python package](https://pyvcf.readthedocs.io/en/latest/) to do so instead of re-inventing the wheel? – PiCTo Dec 05 '19 at 21:16
  • @PiCTo I have used PyVCF to build this. – dang Dec 06 '19 at 03:01
  • This is not clearly visible from your code, as it doesn't include the `import` statement, while what you shared has imported the PyVCF classes directly into its namespace. Do you expect every one of us to know that `Header` means `pyvcf.header`? We won't be able to help you if you don't format your code properly and include all relevant parts of it. – PiCTo Dec 06 '19 at 08:34
  • @PiCTo sorry about that. I have included the entire script so it is easy to understand. – dang Dec 06 '19 at 13:56
  • Scrap the code and tell us what you need to do: what data is taken from that VCF file and what needs to be saved in the DB. You already showed example input; now show us what needs to be taken from that input and what the output is, i.e. how it is saved to the DB. This code is unreadable and broken. – BrainDead Dec 07 '19 at 19:38
  • @AlbertoPoljak let me prepare output and post it. What is broken in my current code? – dang Dec 07 '19 at 19:48
  • `sample_name` is fixed for that input and is always going to be `BA12878.40x.S7508`. Is this by design? You're using a weird for loop and all that trouble just to get something that's constant. – BrainDead Dec 07 '19 at 19:53
  • It would depend on the input vcf file header. If you check the vcf file, the last column header is BA12878.40x.S7508, that's why it is the input. If it changes in the incoming file, then it will change in the output. – dang Dec 07 '19 at 19:56
  • Hello @dang, please explain *what* takes time in your code. We cannot provide any architecture redesign or algorithm transformations without a precise knowledge of what needs to be optimized. Please also ensure that your code is [MCVE](https://stackoverflow.com/help/minimal-reproducible-example). – Tim Dec 08 '19 at 13:23
  • An MCVE *implies* that you provide the expected output of your program. – Tim Dec 08 '19 at 13:24
  • @Tim I have included the expected output in the program. – dang Dec 09 '19 at 12:59
  • Thanks. Please tell us where your program is slow. Is it either in parsing the file or inserting in the database? Please benchmark your program with `python -m cProfile -s tottime -o profile.out ` then give the result of `python -m pstats profile.out`. – Tim Dec 09 '19 at 16:48
  • What I can see from here is that the parsing operations run quickly (about 1,000,000 lines per second on my computer, tested with 5000 lines of your sample file). You say that your program runs at about 3.333 insertions per second. Since the parsing represents only 0.00033 % of the execution time of your program, you should consider optimizing the *database insertion* instead of the parsing algorithm. – Tim Dec 09 '19 at 17:19

2 Answers


`INSERT INTO xxx (column list) VALUES (value list)`, executed one row at a time, is not optimal. I replaced it with the multi-row syntax `INSERT INTO xxx VALUES (...), (...), ...`, built from a list of pre-formatted value strings.

It improved performance drastically: from 10 minutes to under 15 seconds.

(I did this in Python 3.7.5.)

# each list entry has to be a pre-formatted value string
datalist.append("({0},'{1}',{2},'{3}','{4}','{5}','{6}',...)".format(...))

def insertdata():
    global datalist
    args_str = ','.join(datalist)
    cursor.execute("INSERT INTO sampletable VALUES " + args_str)
    conn.commit()
    datalist = []
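Hand-building quoted value strings like this is easy to get wrong (quoting bugs, SQL injection). psycopg2 ships a helper, `psycopg2.extras.execute_values`, that generates the same multi-row VALUES statement from plain Python tuples. A minimal sketch, assuming the question's `sampletable` and a live psycopg2 connection; `chunks` and `insert_rows` are hypothetical helper names, and committing per batch is a design choice so one bad batch doesn't roll back millions of rows:

```python
from itertools import islice

def chunks(iterable, size):
    """Yield successive lists of at most `size` rows from any iterable."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

def insert_rows(conn, rows, batch_size=1000):
    """Insert row tuples in batches, committing after each batch."""
    # Imported here so the pure helper above stays usable without psycopg2.
    from psycopg2.extras import execute_values
    with conn.cursor() as cursor:
        for batch in chunks(rows, batch_size):
            # execute_values expands the single %s into a multi-row
            # VALUES list and handles quoting/escaping itself.
            execute_values(
                cursor,
                "INSERT INTO sampletable (as_ID, as_NM, as_DT_ID, as_DT_NM, "
                "VCF_ID, SAMPLE_ID, Variant_ID, as_DT_LINE_SEQ, as_DT_VAL_SEQ, "
                "as_DT_VA_NM, as_DT_VALUE) VALUES %s",
                batch,
            )
            conn.commit()
```

Because `chunks` consumes any iterable lazily, `rows` can be a generator over the VCF reader, so the 5 million rows never need to sit in memory at once.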
VoMimbre
  • I am not sure if I understood this. Are you able to share the answer specifically to my code? – dang Dec 12 '19 at 11:01
  • The performance bottleneck is probably in your one-by-one `insert into xxx values (a,b,c)` statement. Try changing it to `"INSERT INTO sampletable VALUES " + args_str`, which is much more efficient. You need to transform your insertdata() function this way: `def insertdata(): global datalist; iter_datalist = iter(datalist); args_str = ','.join(iter_datalist); cursor.execute("INSERT INTO sampletable VALUES " + args_str); conn.commit(); datalist = []` – VoMimbre Dec 12 '19 at 11:08
  • I already did that, if you check my code: `cursor.executemany("INSERT INTO sampletable(as_ID,as_NM,as_at_ID,as_at_NM,VCF_ID,SAMPLE_ID,Variant_ID,as_at_LINE_SEQ,as_at_VAL_SEQ,as_at_VA_NM,as_at_VALUE) VALUES (%s,%s,%s,%s, %s,%s,%s,%s,%s,%s,%s);", datalist)` – dang Dec 12 '19 at 11:10
  • The requirement is to fill all table fields in the right order. – VoMimbre Dec 12 '19 at 11:13
  • It is giving me an error at `args_str = ','.join(iter_datalist)`: TypeError: sequence item 0: expected string, tuple found – dang Dec 12 '19 at 11:27
  • Change your append from `datalist.append((as_id,str(as_nm),as_at_id,as_at_nm,vcf_id,sample_id,variant_id,as_at_va_line_seq,as_va_seq,as_at_va_nm,variable_value))` to a formatted string. My append looks like this: `comment_queries.append("({0},'{1}',{2},'{3}','{4}','{5}','{6}',$${7}$$,'{8}','{9}','{10}')".format(issue_id,issue_key,comment_id,author_name,author_key,author_emailAddress,author_displayName,comment_body,author_active,comment_created,comment_updated))` – VoMimbre Dec 12 '19 at 11:34
  • Let me try this and get back to you shortly. Can you tell me what the difference between the two inserts is, and why one is better than the other? – dang Dec 12 '19 at 11:52
  • I think the difference is in the multiple-insert syntax. I based my idea on http://www.postgresqltutorial.com/postgresql-insert/ and https://stackoverflow.com/questions/20815028/how-do-i-insert-multiple-values-into-a-postgres-table-at-once/20820950 After moving my dictionary values to a string list, it became possible to handle various data types: integer, varchar, text, date. – VoMimbre Dec 12 '19 at 13:07
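To make the difference discussed in this thread concrete: `executemany` issues one INSERT per row (one server round trip each), while a multi-row statement carries many VALUES tuples in a single round trip. A sketch of the string-building step with a hypothetical helper; it assumes each entry is already a safely quoted `(...)` string, as in the answer above (in new code, a parameterized helper such as `psycopg2.extras.execute_values` is safer):

```python
def build_multirow_insert(table, value_strings):
    """Join pre-formatted "(v1,'v2',...)" strings into one multi-row INSERT.

    value_strings must already be correctly quoted/escaped SQL literals;
    nothing is validated or escaped here.
    """
    return "INSERT INTO {} VALUES {}".format(table, ",".join(value_strings))
```

For example, `build_multirow_insert("sampletable", ["(1,'a')", "(2,'b')"])` produces a single statement inserting both rows at once, which is what cuts the per-row round-trip overhead.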

Some ideas:

Try to work with transactions to speed up database operations. Each UPDATE statement must scan the entire table to find the row(s) that match the name; an index on the name column would prevent this and make the search much faster.

Avoid unwanted loops: https://www.monitis.com/blog/7-ways-to-improve-your-python-performance/

Try to use generators.

Optimized I/O operations in Python: https://towardsdatascience.com/optimized-i-o-operations-in-python-194f856210e0

Can you work from a numpy array? numpy + numba.

Maybe you can write short pieces of code in C/C++/Rust and import them with cffi. Have a look here: https://medium.com/@JasonWyatt/squeezing-performance-from-sqlite-insertions-971aff98eef2

I don't know whether SQLAlchemy slows operations down, but I find it a smart way to work with a database. Also look at whether you can use PyPy or Nuitka.

Also: http://alimanfoo.github.io/2017/06/14/read-vcf.html I hope you will find something interesting there.
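Beyond multi-row INSERTs, Postgres's fastest bulk-load path is COPY, which psycopg2 exposes as `cursor.copy_expert`. A minimal sketch; `rows_to_copy_buffer` is a hypothetical helper, and it assumes values contain no tabs, newlines, or backslashes (those would need escaping under COPY's text format):

```python
import io

def rows_to_copy_buffer(rows):
    """Serialize row tuples to the tab-separated text format COPY expects.

    None is written as \\N, the COPY text-format null marker. Values are
    assumed to be free of tabs, newlines, and backslashes.
    """
    buf = io.StringIO()
    for row in rows:
        buf.write("\t".join(r"\N" if v is None else str(v) for v in row))
        buf.write("\n")
    buf.seek(0)  # rewind so COPY reads from the start
    return buf

# Hypothetical usage with the question's psycopg2 connection:
#
# buf = rows_to_copy_buffer(rows)
# cursor.copy_expert(
#     "COPY sampletable (as_id, as_nm, as_dt_id, as_dt_nm, vcf_id, "
#     "sample_id, variant_id, as_dt_line_seq, as_dt_val_seq, as_dt_va_nm, "
#     "as_dt_value) FROM STDIN",
#     buf)
# conn.commit()
```

COPY avoids per-statement parsing and planning entirely, which is why it usually beats even batched INSERTs for millions of rows.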

laticoda