3

I have a very long string, almost a megabyte long, that I need to write to a text file. The regular

file = open("file.txt","w")
file.write(string)
file.close()

works, but it is too slow. Is there a way I can write faster?

I am trying to write a number with several million digits to a text file; the number is on the order of math.factorial(67867957)

This is what shows on profiling:

    203 function calls (198 primitive calls) in 0.001 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 re.py:217(compile)
        1    0.000    0.000    0.000    0.000 re.py:273(_compile)
        1    0.000    0.000    0.000    0.000 sre_compile.py:172(_compile_charset)
        1    0.000    0.000    0.000    0.000 sre_compile.py:201(_optimize_charset)
        4    0.000    0.000    0.000    0.000 sre_compile.py:25(_identityfunction)
      3/1    0.000    0.000    0.000    0.000 sre_compile.py:33(_compile)
        1    0.000    0.000    0.000    0.000 sre_compile.py:341(_compile_info)
        2    0.000    0.000    0.000    0.000 sre_compile.py:442(isstring)
        1    0.000    0.000    0.000    0.000 sre_compile.py:445(_code)
        1    0.000    0.000    0.000    0.000 sre_compile.py:460(compile)
        5    0.000    0.000    0.000    0.000 sre_parse.py:126(__len__)
       12    0.000    0.000    0.000    0.000 sre_parse.py:130(__getitem__)
        7    0.000    0.000    0.000    0.000 sre_parse.py:138(append)
      3/1    0.000    0.000    0.000    0.000 sre_parse.py:140(getwidth)
        1    0.000    0.000    0.000    0.000 sre_parse.py:178(__init__)
       10    0.000    0.000    0.000    0.000 sre_parse.py:183(__next)
        2    0.000    0.000    0.000    0.000 sre_parse.py:202(match)
        8    0.000    0.000    0.000    0.000 sre_parse.py:208(get)
        1    0.000    0.000    0.000    0.000 sre_parse.py:351(_parse_sub)
        2    0.000    0.000    0.000    0.000 sre_parse.py:429(_parse)
        1    0.000    0.000    0.000    0.000 sre_parse.py:67(__init__)
        1    0.000    0.000    0.000    0.000 sre_parse.py:726(fix_flags)
        1    0.000    0.000    0.000    0.000 sre_parse.py:738(parse)
        3    0.000    0.000    0.000    0.000 sre_parse.py:90(__init__)
        1    0.000    0.000    0.000    0.000 {built-in method compile}
        1    0.001    0.001    0.001    0.001 {built-in method exec}
       17    0.000    0.000    0.000    0.000 {built-in method isinstance}
    39/38    0.000    0.000    0.000    0.000 {built-in method len}
        2    0.000    0.000    0.000    0.000 {built-in method max}
        8    0.000    0.000    0.000    0.000 {built-in method min}
        6    0.000    0.000    0.000    0.000 {built-in method ord}
       48    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        5    0.000    0.000    0.000    0.000 {method 'find' of 'bytearray' objects}
        1    0.000    0.000    0.000    0.000 {method 'items' of 'dict' objects}
Joran Beasley
João Areias
  • Have you considered encoding it yourself first, and then writing it in binary mode? – Ignacio Vazquez-Abrams Feb 09 '15 at 22:23
  • megabyte is not "huge". Are you sure your disk can work faster than Python writes? Could you provide a standalone benchmark, e.g., `python3 -c 'open("file", "w").write("a"*1000000)'`? How long does it take on your computer? What is the desired time? – jfs Feb 09 '15 at 22:27
  • Well, I know it's not huge, but it's large enough to take a long time on my laptop. I don't know what I'm doing wrong, but it's really slow, taking hours to write. But then again, I'm just a beginner and probably am doing something wrong here – João Areias Feb 09 '15 at 22:30
  • lol there is no way it should take hours to write 1 MB ... it should take at most a few seconds (and that's being generous) ... as @J.F.Sebastian mentions, please profile it with something simple... – Joran Beasley Feb 09 '15 at 22:30
  • If you have a computer that can run a modern web browser, then you have a computer that can write 1MB to disk quickly enough. If your program is taking a long time to run, then something else is the bottleneck. Find out what that bottleneck is. –  Feb 09 '15 at 22:33
  • Well... Where are you getting this String from.... ? Don't tell me that it is from some HTTP API call. – sarveshseri Feb 09 '15 at 22:34
  • as @JoranBeasley mentioned it can't be that long... must be another problem. maybe the so called string isn't what you expect? – alonisser Feb 09 '15 at 22:34
  • I know, that's what I think is weird; my laptop is not that slow. I don't know if it's something in my code, but my code uses the regular way I showed before. I don't know if the problem is that it's just one variable that is that long; all I know is that it is taking too long and using almost all my RAM – João Areias Feb 09 '15 at 22:35
  • I'm calculating a primorial of a large prime (the product of all primes less than or equal to a number); that's where the string comes from. Could it be that it's taking so long not to write but to convert from integer to string? – João Areias Feb 09 '15 at 22:36
  • I suspect the calculation is what takes a long time ... not the writing to disk (isnt that essentially RSA encryption cracking ... which takes until the heat death of the universe?) the conversion to string is also certainly not what takes a long time – Joran Beasley Feb 09 '15 at 22:37
  • Well, the calculation is already made; I made it print every time it does a multiplication so I can keep track of it, and the calculation for the prime I input is done (yes, I want to use that later for research in security and RSA cracking, but no, I'm not cracking right now) – João Areias Feb 09 '15 at 22:41
  • I guarantee the calculation is what is taking forever ... not the conversion to string, nor the writing to disk ... you can verify this easily by saving the calculation in a variable and printing `len(str(my_var))` and how long it took ... and then writing `str(my_var)` to a file – Joran Beasley Feb 09 '15 at 22:42
  • I am keeping track of every calculation and I can guarantee that the calculation is over; my output is a really long number which I want to write to a file, and that is what is taking forever. The calculation did take a very long time, but I made the computer print the loop counter every time it went through the loop, and the last number printed was the last number it had to go through – João Areias Feb 09 '15 at 22:44
  • What does your profiling show? https://docs.python.org/2/library/profile.html –  Feb 09 '15 at 22:48
  • All I see is `0.001 seconds`. – sarveshseri Feb 09 '15 at 23:03
  • @tristan: `profile` measures CPU time. It is probably not appropriate to measure I/O performance. – jfs Feb 09 '15 at 23:07
  • note: `/usr/bin/time python3 -c 'open("/tmp/file", "w").write("a"*500*1000000)'` takes under a second on my machine. The OS probably uses a file cache in memory. `ls -l /tmp/file` confirms a 500M file. – jfs Feb 09 '15 at 23:08
  • Maybe the conversion to string would be quicker if you used the `Decimal` module for the calculations? – Mark Ransom Feb 09 '15 at 23:10
  • Seriously ... can you calculate `math.factorial(67867957)` on your machine? Isn't `factorial(20)` the last factorial to fit in 64 bits? And `34!` is 128-bit – sarveshseri Feb 09 '15 at 23:15
  • It's not factorial, it's primorial but grows almost as fast, but yes, the calculation is already done – João Areias Feb 09 '15 at 23:16
  • even though it's a lousy answer, I'm leaving my answer for now ... – Joran Beasley Feb 09 '15 at 23:17
  • @SarveshKumarSingh Python 3 doesn't have fixed-size integers - everything is a bignum. It might take a damned long time, but yes, you can calculate `math.factorial(67867957)`. – senshin Feb 09 '15 at 23:19
  • @senshin try it on your machine. I don't think every machine will agree. – sarveshseri Feb 09 '15 at 23:20
  • Thanks @JoranBeasley for the help. I'm not sure there is a way to go faster, I guess I will just have to wait and get over it. – João Areias Feb 09 '15 at 23:20
  • have you tried `pickle.dump(n, file)` or `n.to_bytes((n.bit_length() + 7) // 8, 'big')` instead of `str(n)`? In general, `gmpy2` may be faster for large integers (millions of digits). – jfs Feb 09 '15 at 23:25
  • @J.F.Sebastian no, I haven't will give it a try – João Areias Feb 09 '15 at 23:28
  • `/usr/bin/time python -c 'import gmpy2; open("/tmp/file", "w").write(str(gmpy2.fac(67867957)))'` takes less than 10 minutes on my machine. `/tmp/file` contains a 500M-digit number – jfs Feb 09 '15 at 23:32
  • @J.F.Sebastian that sounds like the beginning of an answer to me. – Mark Ransom Feb 09 '15 at 23:37
  • @J.F.Sebastian dang, you should really put that as an answer ... now that this question has been more fully fleshed out, it may actually be useful – Joran Beasley Feb 09 '15 at 23:37
  • FWIW, the maintainer of [gmpy2](https://pypi.python.org/pypi/gmpy2), Case Van Horsen, is an [SO member](http://stackoverflow.com/users/224574/casevh). – PM 2Ring Feb 10 '15 at 06:33
  • it seems [`str(n)` is a quadratic operation in Python](http://bugs.python.org/issue3451#msg84704). `python3 -c 'import math; open("/tmp/file2", "w").write(str(math.factorial(67867957)))'` is still running after 12 hours. For comparison: `pickle.dump(math.factorial(67867957), open("/tmp/file3.pickle", "wb"))` took 6 hours (`pickle.dump` alone takes less than a second). – jfs Feb 10 '15 at 12:47
  • As answered by @J.F.Sebastian, the fundamental issue is that `str(long)` has quadratic running time. I am biased since I maintain `gmpy2`, but if you plan to work with such huge numbers, you really should be using `gmpy2`. BTW, the current development version (2.1.x) includes the `primorial` function. `gmpy2.primorial(67867957)` takes about 3.5 seconds. – casevh Feb 13 '15 at 08:07

2 Answers

5

Your issue is that str(long) is very slow for large integers (millions of digits) in Python. It is a quadratic operation in the number of digits, i.e., for ~1e8 digits it may require ~1e16 operations to convert the integer to a decimal string.
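The quadratic growth is easy to observe at a small scale. Below is a minimal sketch (the digit counts and file-free setup are arbitrary choices for illustration; note that Python 3.11+ caps int-to-str conversion at 4300 digits by default, so the limit is lifted first, and very recent CPython versions ship a faster conversion for huge integers, which weakens the effect):

```python
import sys
import time

# Python 3.11+ refuses int-to-str conversions beyond 4300 digits by default;
# lift the limit so this demo also runs there (0 disables the limit).
if hasattr(sys, "set_int_max_str_digits"):
    sys.set_int_max_str_digits(0)

def time_str(num_digits):
    """Time str() on an integer with num_digits + 1 decimal digits."""
    n = 10 ** num_digits
    start = time.perf_counter()
    s = str(n)
    return time.perf_counter() - start, len(s)

t1, d1 = time_str(50_000)
t2, d2 = time_str(200_000)
print(d1, d2)  # 50001 200001
print(t1, t2)  # the second conversion is markedly slower, not just 4x the digits
```

With 4x the digits, a quadratic algorithm needs roughly 16x the time, which is why half a billion digits becomes intractable.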

Writing 500MB to a file should not take hours, e.g.:

$ python3 -c 'open("file", "w").write("a"*500*1000000)'

returns almost immediately. ls -l file confirms that the file is created and it has the expected size.

Calculating math.factorial(67867957) (the result has ~500M digits) may take several hours, but saving it using pickle is comparatively instantaneous:

import math
import pickle

n = math.factorial(67867957) # takes a long time
with open("file.pickle", "wb") as file:
    pickle.dump(n, file) # very fast (comparatively)

To load it back using n = pickle.load(open('file.pickle', 'rb')) takes less than a second.
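As an aside, one of the comments suggests `n.to_bytes(...)` as another compact binary format; like pickle, it skips the decimal conversion entirely. A small sketch of the round trip (shown with a small number and an arbitrary file name; it works the same for huge integers):

```python
import math

n = math.factorial(100)  # stand-in for a genuinely huge integer

# Serialize as raw big-endian bytes; no decimal conversion is involved.
data = n.to_bytes((n.bit_length() + 7) // 8, "big")
with open("file.bin", "wb") as f:
    f.write(data)

# Read it back.
with open("file.bin", "rb") as f:
    m = int.from_bytes(f.read(), "big")

assert m == n  # lossless round trip
```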

str(n) is still running (after 50 hours) on my machine.

To get the decimal representation fast, you could use gmpy2:

$ python -c'import gmpy2;open("file.gmpy2", "w").write(str(gmpy2.fac(67867957)))'

It takes less than 10 minutes on my machine.
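For reference, here is a hedged script-form sketch of the same idea; gmpy2 is a third-party package (`pip install gmpy2`), so this falls back to the standard library when it is not installed, and the small argument 1000 is just for illustration:

```python
# gmpy2 wraps GMP, whose binary-to-decimal conversion is subquadratic,
# unlike CPython's classic str(int).
try:
    import gmpy2
    s = str(gmpy2.fac(1000))
except ImportError:  # fall back so the sketch still runs without gmpy2
    import math
    s = str(math.factorial(1000))

print(len(s))  # 2568: the number of decimal digits in 1000!
```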

jfs
2

OK, this is not really an answer; it is more to show that your reasoning about the delay is wrong.

First, test the write speed of a big string:

import timeit

def write_big_str(n_bytes=1000000):
    with open("test_file.txt", "w") as f:  # text mode, since we write a str
        f.write("a" * n_bytes)

print(timeit.timeit("write_big_str()", "from __main__ import write_big_str", number=100))

you should see a fairly respectable speed (and that's repeating it 100 times)
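As jfs notes in the comments, `timeit.timeit` also accepts a callable directly, which avoids the `from __main__ import ...` setup string; a sketch using the same file name and size as above:

```python
import timeit

def write_big_str(n_bytes=1000000):
    # write a 1 MB string of 'a' characters in text mode
    with open("test_file.txt", "w") as f:
        f.write("a" * n_bytes)

# Pass the function object itself; no setup string needed.
per_call = timeit.timeit(write_big_str, number=10) / 10
print(per_call)  # seconds per 1 MB write
```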

Next, we will see how long it takes to convert a very big number to a str:

import timeit, math

n = math.factorial(200000)
print(timeit.timeit("str(n)", "from __main__ import n", number=1))

it will probably take ~10 seconds (and that is a million-digit number), which, granted, is slow ... but not hours slow (OK, it's pretty slow to convert to a string :P ... but it still shouldn't take hours) (well, it took more like 243 seconds on my box, I guess :P)

Joran Beasley
  • Here is the thing: 200000! looks tiny compared to the number I'm writing. My number is a bit smaller than 67867957! Python had no problem writing 49979687# (# is the symbol for primorial), which is approximately 49979687! – João Areias Feb 09 '15 at 23:02
  • ahh, now we are starting to get helpful data ... indeed, the string conversion may take a looooong time ... also, that has a lot more than a million digits ... – Joran Beasley Feb 09 '15 at 23:02
  • @JoãoAreias 67867957! is about 500 million decimal digits in length, i.e. we're talking about ~500 MB, not 1 MB. – senshin Feb 09 '15 at 23:04
  • Oh, sorry, my mistake. I read it wrong when I was going to write here; it's not 1 MB, it's more like 100. The primorial is still smaller than the factorial (even though still huge). Is there any way of speeding up the process, or do I just have to wait and deal with it? – João Areias Feb 09 '15 at 23:06
  • oh, well, yeah, a 100-million-digit number will take a while to convert to a string ... – Joran Beasley Feb 09 '15 at 23:07
  • unrelated: you could pass a function object itself e.g., `timeit.timeit(write_big_str, number=100)/100.` – jfs Feb 09 '15 at 23:37
  • @J.F.Sebastian dang i learn something new sometimes I guess :P thanks for the tip :) – Joran Beasley Feb 09 '15 at 23:39