3

I am trying to write a list of the numbers from 0 to 1000000000 as strings, directly to a text file. I would also like each number to be zero-padded to ten digits, e.g. 0000000000, 0000000001, 0000000002, 0000000003, ..., n. However, I find that it is taking much too long for my taste.

I can use seq, but there is no support for leading zeros, and I would prefer to avoid using awk and other auxiliary tools to handle these tasks. I am aware of the dramatic speed-up I would get just from coding this in C, but I don't want to resort to that. I was considering mapping some functions over a large list and executing them in a loop; however, I only have 2 GB of RAM available, so please keep this in mind when approaching my problem.

I am using Python-Progressbar, and I am getting an ETA of approximately 2 hours. I would appreciate it if someone can offer me some advice as to how to approach this problem:

from progressbar import ProgressBar, Percentage, Bar, ETA, FileTransferSpeed

pbar = ProgressBar(widgets=[Percentage(), Bar(), ' ', ETA(), ' ', FileTransferSpeed()], maxval=1000000000).start()

with open('numlistbegin', 'w') as numlist:
    # Bind the bound methods locally to avoid repeated attribute lookups.
    limit, nw, pu = 1000000000, numlist.write, pbar.update

    for x in range(limit):
        nw('%010d\n' % (x,))   # zero-pad to ten digits
        pu(x)                  # progress bar updated on every single write

pbar.finish()

EDIT: So I have discovered that the formatting (regardless of the programming language you are using) creates vast amounts of overhead. seq gets the job done quickly on its own, but much more slowly with the formatting option (-f). However, if anyone would like to offer a Python solution nonetheless, it would be most welcome.
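
A rough way to see the formatting cost in isolation is a micro-benchmark that only builds the strings, with no file I/O (a minimal sketch; the count of 1,000,000 and the resulting timings are purely illustrative):

import timeit

# Compare plain string conversion against zero-padded formatting.
# Only string building is measured here; no file I/O is involved.
setup = "n = 1000000"
plain  = timeit.timeit("[str(x) for x in range(n)]", setup=setup, number=1)
padded = timeit.timeit("['%010d' % x for x in range(n)]", setup=setup, number=1)
print('plain str():        %.3f s' % plain)
print('zero-padded format: %.3f s' % padded)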

eazar001
  • Why do you need to do this? What use would such a file be? – Steven Rumbalski Jul 22 '13 at 15:10
  • Possibly, OP is using `tail -f` on the file to monitor progress – inspectorG4dget Jul 22 '13 at 15:10
  • @Steven Rumbalski, I am penetration testing my own network. – eazar001 Jul 22 '13 at 15:11
  • What if you wrote out a file approximately 1/1000 of the size and then add leading numbers to replicate it 1000-fold to make the entire list? The pasting of leading numbers would be cheap and could be threaded. You'd have to concat some files together at the end. – Jiminion Jul 22 '13 at 15:12
  • Also, your progress bar is updated on every write. No progress bar has that fine-grained resolution. `if x % 100000 == 0: pu(x)`. Otherwise you'll spend more time updating your progress bar than actually writing to file. – Steven Rumbalski Jul 22 '13 at 15:12
  • @eazar001: What are you going to do with this ridiculously huge file to test your network? If you're going to send it over the network, could you get away with constructing the data on the fly? – user2357112 Jul 22 '13 at 15:12
  • @Steven Rumbalski, thanks for pointing that out – eazar001 Jul 22 '13 at 15:14
  • @user2357112, what do you mean by on the fly? – eazar001 Jul 22 '13 at 15:16
  • @eazar001: Rather than making a file and sending it, send data as if you had made a file. – user2357112 Jul 22 '13 at 15:18
  • Or maybe just dump /dev/zero over the wire or something. – user2357112 Jul 22 '13 at 15:20
  • @user2357112, I understand it seems a little strange to generate something that large and THEN send it; however, this initially started off as just a network issue and has now become a Python issue. The scope of my concerns has changed. – eazar001 Jul 22 '13 at 15:20
  • @user2357112, you have a great point though, and I will keep that in mind. – eazar001 Jul 22 '13 at 15:22
  • If you are going through with this, use `xrange(n)`. Constructing the list `range(n)` is an unnecessary tax on your system. – wflynny Jul 22 '13 at 15:35
  • @Bill, I'm using Python 3 – eazar001 Jul 22 '13 at 15:38
  • @eazar001 Then nevermind :) – wflynny Jul 22 '13 at 15:39
  • @Jim, interesting concept, though not as elegant as I would like; I'll take it into consideration, thanks for the suggestion. – eazar001 Jul 22 '13 at 15:58
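
The "construct the data on the fly" idea from the comments above could look roughly like a generator that yields the formatted lines in chunks instead of materializing a 10 GB file first (a minimal sketch; the chunk size and the commented-out send loop are illustrative, and `sock` is a hypothetical socket):

def numbered_lines(limit=1000000000, chunk=1000):
    # Yield the zero-padded lines in chunks, never holding the full
    # list (or the full file) in memory at once.
    for i in range(0, limit, chunk):
        yield ''.join('%010d\n' % j for j in range(i, min(i + chunk, limit)))

# Hypothetical usage: stream the chunks to whatever is doing the sending,
# e.g. a socket, without ever creating the file on disk.
# for block in numbered_lines(limit=1000000):
#     sock.sendall(block.encode())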

2 Answers

3

FWIW seq does have a formatting option:

$ seq -f "%010g" 1 5
0000000001
0000000002
0000000003
0000000004
0000000005

Your long runtime might be memory-related. With large ranges, using xrange is more memory-efficient since it doesn't calculate and store the entire range in memory before it starts. See the post titled Should you always favor xrange() over range()?

Edit: Using Python 3 means the xrange vs. range distinction is irrelevant, since Python 3's range already behaves like Python 2's xrange.

brechin
  • +1, I didn't know about the formatting option; however, I still want to leave this question open for a bit, because I am interested in direct Python solutions. Please keep in mind that I am doing this with Python 3, therefore I don't have the option of using `xrange`. – eazar001 Jul 22 '13 at 16:06
  • Also, large ints tend to be a little troublesome in Python 2.x. Thank you very much! – eazar001 Jul 22 '13 at 16:07
  • Notice, however, that the slowdown is tremendous when running `seq` with the formatting option as opposed to running it normally. It appears that the majority of the overhead responsible for the slowdown is not due to memory or anything else, but to the leading zeros. I suppose this should have been obvious to me, but the source of the problem becomes even more salient now that I have run tests on `seq`. At least now I have an idea of where to start. – eazar001 Jul 22 '13 at 16:31
  • Aha, 15 minutes is a more than acceptable runtime, and this sort of job shouldn't be trifled with in Python. Thank you for your help again. – eazar001 Jul 22 '13 at 17:18
  • @eazar001: Python 3's `range` has the behavior of Python 2's `xrange`. Since you are using 3.x this is a non-issue. – Steven Rumbalski Jul 22 '13 at 17:31
  • @StevenRumbalski I realized all this, which is why I paid special attention to tagging this as Python 3.x, as I knew many people would impulsively criticize the `xrange` vs. `range` issue. – eazar001 Jul 22 '13 at 17:55
2

I did some experimenting and noticed that writing in larger batches improved the performance by about 30%. I'm not sure why your code is taking 2 hours to generate the file -- unless the progress bar is killing performance. If so, you should apply the same batching logic to updating the progress bar. My old Windows box will create a file one-tenth the required size in about 73 seconds.

# Python 2.
# Change xrange to range for Python 3.

import time

start_time = time.time()

limit = 100000000  # 1/10 your limit.
skip  = 1000       # Batch size.

with open('numlistbegin', 'w') as fh:
    for i in xrange(0, limit, skip):
        batch = ''.join('%010d\n' % j for j in xrange(i, i + skip, 1))
        fh.write(batch)

print time.time() - start_time   # 73 sec. (106 sec. without batching).
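
Since the question is tagged Python 3, here is a rough Python 3 sketch of the same batching idea combined with the throttled progress-bar updates suggested above (assuming the same Python-Progressbar package from the question; the batch size is illustrative, not tuned):

from progressbar import ProgressBar, Percentage, Bar, ETA

limit = 1000000000
batch = 1000  # lines per write and per progress update; illustrative

pbar = ProgressBar(widgets=[Percentage(), Bar(), ' ', ETA()], maxval=limit).start()
with open('numlistbegin', 'w') as fh:
    for i in range(0, limit, batch):
        # One formatted string, one write(), and one update() per batch.
        fh.write(''.join('%010d\n' % j for j in range(i, i + batch)))
        pbar.update(i)
pbar.finish()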
FMc
  • Yes, I get an ETA of approximately 10 minutes using 1/10th of my original limit as well; it just appears that a tenfold increase from that point is a huge jump for the computer. Interesting about the batch sizes; I'll run some tests of my own and also look into the overhead incurred by the progress bar. – eazar001 Jul 22 '13 at 16:15
  • @eazar001 That doesn't make much sense. I have a lot of experience reading and writing very large text files in line-by-line fashion, and I've never observed slowdowns as the files get large. I ran the code I posted using the full size (`limit=1000000000`). It took about 700 seconds. Are you sure there's nothing else going on? – FMc Jul 22 '13 at 20:27
  • Thanks for following up, I'm running a test with limit back at 10^9 as you did. I'll try to get back to you with results. There could be something bottlenecking my I/O, I'm not sure if anything else is going on .... if you are indeed getting that kind of performance, then my first suspect would be my hard drive, which in that case might be indicative of a greater problem for me... – eazar001 Jul 22 '13 at 21:15
  • Now that's interesting: running your code gives me a finishing time of ~16 minutes. Perhaps the progress bar, as Steven Rumbalski and you have suggested, is really killing the performance. – eazar001 Jul 22 '13 at 21:36
  • Thanks, I gave credit to your answer in the end because of the results I achieved with your batching (it more than doubled the performance compared to no batching), and you are right about my earlier results being unreasonable. Your code also illustrated that the progress bar was killing the performance as well. So I now have it update the progress only every 1024 increments (arbitrary). Why is it that the batching works so well, btw? – eazar001 Jul 22 '13 at 22:18
  • @eazar001 Not exactly sure why batching helps, but glad it worked out for you. – FMc Jul 23 '13 at 01:10