
I am trying to use pandas to generate a large DataFrame for my data analysis. The data looks something like this:

RNAME Start End Count
Chr1   1     3    1
Chr1   2     5    1
Chr1   4     6    1
Chr1   5     9    2
Chr1   2     5    1
...

I found that if I raise the upper limit on the number of lines to around 10^7, the program seems to run forever and never finishes the task. So I added timing checks to the code and found that, as the total number of lines already added grows, the time needed to add the same number of lines increases dramatically. Could anyone help me solve this problem? Here is the code:

import pandas as pd
import time

start1 = time.perf_counter()
step_size = 5
new_df = pd.DataFrame(columns=['RNAME', 'start', 'end', 'central'])
start = 0
windowsize = 100
i = 0
while i <= 200000:
    if i == 0:
        t1 = time.perf_counter()
    if i == 10000:
        t2 = time.perf_counter()
    if i == 20000:
        t3 = time.perf_counter()
    if i == 30000:
        t4 = time.perf_counter()
    if i == 40000:
        t5 = time.perf_counter()
    if i == 50000:
        t6 = time.perf_counter()
    if i == 60000:
        t7 = time.perf_counter()
    if i == 70000:
        t8 = time.perf_counter()
    if i == 80000:
        t9 = time.perf_counter()
    if i == 90000:
        t10 = time.perf_counter()
    if i == 100000:
        t11 = time.perf_counter()
    if i == 110000:
        t12 = time.perf_counter()
    if i == 120000:
        t13 = time.perf_counter()
    if i == 130000:
        t14 = time.perf_counter()
    if i == 140000:
        t15 = time.perf_counter()
    if i == 150000:
        t16 = time.perf_counter()
    if i == 160000:
        t17 = time.perf_counter()
    if i == 170000:
        t18 = time.perf_counter()
    if i == 180000:
        t19 = time.perf_counter()
    if i == 190000:
        t20 = time.perf_counter()
    df = pd.DataFrame([['chr1', start+i, i + start + windowsize - 1, i + start + round(windowsize/2)-1]], columns=['RNAME', 'start', 'end', 'central'])
    i += step_size
    new_df = pd.concat([new_df, df], sort=False)
new_df.reset_index(inplace=True, drop=True)
new_df.to_csv(f'chr1_{start}_window_{windowsize}_step_{step_size}.bed', header=False, index=False, sep='\t')
end1 = time.perf_counter()
print(f'process finished in {round(end1 - start1, 2)} second(s)') 
print(f'the first 10000 lines finished in {round(t2-t1, 2)} secs')
print(f'the second 10000 lines finished in {round(t3-t2, 2)} secs')
print(f'the third 10000 lines finished in {round(t4-t3, 2)} secs')
print(f'the fourth 10000 lines finished in {round(t5-t4, 2)} secs')
print(f'the fifth 10000 lines finished in {round(t6-t5, 2)} secs')
print(f'the sixth 10000 lines finished in {round(t7-t6, 2)} secs')
print(f'the seventh 10000 lines finished in {round(t8-t7, 2)} secs')
print(f'the eighth 10000 lines finished in {round(t9-t8, 2)} secs')
print(f'the ninth 10000 lines finished in {round(t10-t9, 2)} secs')
print(f'the tenth 10000 lines finished in {round(t11-t10, 2)} secs')
print(f'the eleventh 10000 lines finished in {round(t12-t11, 2)} secs')
print(f'the twelfth 10000 lines finished in {round(t13-t12, 2)} secs')
print(f'the thirteenth 10000 lines finished in {round(t14-t13, 2)} secs')
print(f'the fourteenth 10000 lines finished in {round(t15-t14, 2)} secs')
print(f'the fifteenth 10000 lines finished in {round(t16-t15, 2)} secs')
print(f'the sixteenth 10000 lines finished in {round(t17-t16, 2)} secs')
print(f'the seventeenth 10000 lines finished in {round(t18-t17, 2)} secs')
print(f'the eighteenth 10000 lines finished in {round(t19-t18, 2)} secs')
print(f'the nineteenth 10000 lines finished in {round(t20-t19, 2)} secs')

And here are the results:

process finished in 204.99 second(s)
the first 10000 lines finished in 2.6 secs
the second 10000 lines finished in 3.33 secs
the third 10000 lines finished in 4.06 secs
the fourth 10000 lines finished in 4.86 secs
the fifth 10000 lines finished in 5.64 secs
the sixth 10000 lines finished in 6.5 secs
the seventh 10000 lines finished in 7.28 secs
the eighth 10000 lines finished in 8.14 secs
the ninth 10000 lines finished in 8.81 secs
the tenth 10000 lines finished in 9.71 secs
the eleventh 10000 lines finished in 10.4 secs
the twelfth 10000 lines finished in 11.56 secs
the thirteenth 10000 lines finished in 12.25 secs
the fourteenth 10000 lines finished in 13.18 secs
the fifteenth 10000 lines finished in 13.96 secs
the sixteenth 10000 lines finished in 14.76 secs
the seventeenth 10000 lines finished in 15.83 secs
the eighteenth 10000 lines finished in 16.66 secs
the nineteenth 10000 lines finished in 17.24 secs
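
The per-batch times above rise by a roughly constant amount from one batch to the next. That is what one would expect if every pd.concat call copies all the rows accumulated so far, which makes the loop quadratic overall. Below is a minimal sketch of two common workarounds, assuming the goal is only the window table produced by the code above: collect plain rows in a list and build the DataFrame once, or compute the columns directly with NumPy. The function names and the last_offset parameter are hypothetical, introduced here for illustration only.

```python
import numpy as np
import pandas as pd

COLUMNS = ['RNAME', 'start', 'end', 'central']


def build_windows_from_rows(chrom, start, last_offset, windowsize, step_size):
    """Collect plain Python lists, then construct the DataFrame a single time."""
    half = round(windowsize / 2)
    rows = [
        [chrom, start + i, start + i + windowsize - 1, start + i + half - 1]
        for i in range(0, last_offset + 1, step_size)  # last_offset is inclusive
    ]
    return pd.DataFrame(rows, columns=COLUMNS)


def build_windows_vectorized(chrom, start, last_offset, windowsize, step_size):
    """Compute each column as a NumPy array; no per-row Python work."""
    offsets = np.arange(0, last_offset + 1, step_size)
    half = round(windowsize / 2)
    return pd.DataFrame({
        'RNAME': chrom,  # scalar is broadcast to every row
        'start': start + offsets,
        'end': start + offsets + windowsize - 1,
        'central': start + offsets + half - 1,
    })


# Same parameters as the loop in the question (i = 0, 5, ..., 200000).
windows = build_windows_vectorized('chr1', 0, 200000, 100, 5)
print(len(windows))
```

Either result can be written out with the same to_csv call used in the original code. No timings are claimed for this sketch; it only avoids the repeated copying.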
Jonathon Reinhart
Jiang Xu
Hey Jonathon, thank you very much for your comment. It really solved a big puzzle that has been bothering me for quite some time. Before, I just took it for granted that Python is slow. Now everything is crystal clear! – Jiang Xu Feb 09 '20 at 02:14
