
I am trying to read a text file with 3 million lines using the following code:

import time

f = open("somefile.txt", "r")
i = 0
st = time.time()
mydata = []
for line in f:
    mydata.append(do_something(line))
    i += 1
    if i % 10000 == 0:
        # report how long the last 10000 lines took
        print "%d done in %d time..." % (i, time.time() - st)
        st = time.time()

Following is the output printed on the console:

10000 done in 6 time...
20000 done in 9 time...
30000 done in 11 time...
40000 done in 14 time...
50000 done in 15 time...
60000 done in 17 time...
70000 done in 19 time...
80000 done in 21 time...
90000 done in 23 time...
100000 done in 24 time...
110000 done in 26 time...
120000 done in 28 time...
130000 done in 30 time...
140000 done in 32 time...
150000 done in 33 time...
160000 done in 36 time...
170000 done in 39 time...
180000 done in 41 time...
190000 done in 45 time...
200000 done in 48 time...
210000 done in 48 time...
220000 done in 53 time...
230000 done in 56 time...
......and so on.....

I am not sure why the time taken to read the same number of lines (10,000) keeps increasing over iterations. Is there a way to avoid this, or a better way to read big files?

Chandrahas
  • Is it better without the mydata.append line? – Francesco Apr 05 '16 at 19:42
  • You would be surprised at how much slower an application runs with print statements. Remove your print statements and see if it helps improve performance. – idjaw Apr 05 '16 at 19:45
  • @Francesco, I need to save the processed information for later use. So, I can't avoid it. – Chandrahas Apr 05 '16 at 19:49
  • @idjaw, that's true, but initially I was running this code without the print statement and left it for hours, and it still didn't finish. I added the print statement later to check what was happening. – Chandrahas Apr 05 '16 at 19:49
  • @Chandrahas Have you read the suggestions [here](http://stackoverflow.com/questions/16669428/process-very-large-20gb-text-file-line-by-line) – idjaw Apr 05 '16 at 19:51
  • What is happening in `do_something`? Is `do_something` returning something large? You could eliminate it by not calling it... do `mydata.append(1)` to see how timing changes. – tdelaney Apr 05 '16 at 19:52
  • @idjaw: Thanks for the direction. I can try reading in chunks. But I have checked other threads on Stack Overflow which suggest that reading line by line is a good way to read files: [http://stackoverflow.com/a/519653/3336409] and [http://stackoverflow.com/a/14944267/3336409] – Chandrahas Apr 05 '16 at 20:07
  • @tdelaney: The return value is not really large. It's almost the same size as the line. However, I can check if that helps. Thanks for the suggestion. – Chandrahas Apr 05 '16 at 20:08
  • In that case, `mydata.append(line)`. The goal is to see if this loop can be eliminated. Since the increase is linear, I'm suspicious of `do_something` more than this loop. – tdelaney Apr 05 '16 at 20:14
  • @tdelaney: Yes, I think that's the issue. `do_something` is linear in the length of mydata, as it needs to find the index with `mydata.index(x)` and also checks for duplicates with `x in mydata`. I guess that's the culprit. Thanks :) – Chandrahas Apr 05 '16 at 20:30
  • If `do_something` is returning something hashable like a string, maybe you can do this differently. Make `mydata` a `dict` with the data you want to track as the key and its index as value. It will be much faster than searching the list. – tdelaney Apr 05 '16 at 20:54
  • Yes. I used an inverse index dictionary along with actual list which made it run much faster. – Chandrahas Apr 05 '16 at 22:56

2 Answers


Odd, but the most probable reason is memory consumption making your process slow. For that to be the case, though, your lines would have to be extremely long.

As your list grows, it takes up more RAM, and it becomes harder for the OS to find a contiguous chunk to allocate, since the list's underlying storage is reallocated (roughly doubling, or something like that) as you add more lines.
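As a rough illustration, you can watch CPython over-allocate a list's storage with sys.getsizeof; the exact growth steps are an implementation detail and may vary between versions, so treat this only as a sketch:

import sys

lst = []
last = sys.getsizeof(lst)
for i in range(200):
    lst.append(i)
    size = sys.getsizeof(lst)          # allocated size of the list object, in bytes
    if size != last:                   # a jump means the list just over-allocated
        print "len=%d size=%d bytes" % (len(lst), size)
        last = size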

It would also be helpful to know: 1. How many bytes does one line occupy? 2. How much RAM do you have?

You should also try profiling your task.

You could also pre-allocate your list with mydata = [0] * 3000000 (the multiplier has to be an int, so [0]*3e6 would raise a TypeError).
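A rough sketch of that pre-allocation idea, assuming the line count (about 3 million here) is known up front; do_something below is just a trivial placeholder for the question's own processing function:

NUM_LINES = 3000000                    # assumed: roughly how many lines the file has

def do_something(line):                # placeholder for the question's real function
    return line.strip()

mydata = [None] * NUM_LINES            # allocate the whole list once
f = open("somefile.txt", "r")
for i, line in enumerate(f):
    mydata[i] = do_something(line)     # fill slots by index instead of appending
f.close()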

Sid
  • Thanks Sid for the quick reply. I also suspect that the dynamically growing list might be causing trouble. Regarding your questions: 1. lines are really small, at most 250 characters per line; 2. RAM size is also large, the system has 256 GB of RAM. So I hope the memory limit is not an issue, but frequent memory allocation might be one culprit. Do you know any method to pre-allocate memory? – Chandrahas Apr 05 '16 at 20:01
  • mylist = [None] * 1000000 – Sid Apr 05 '16 at 20:16
  • Thanks for the suggestion, but I think I found the culprit: do_something does something which takes time linear in the length of mydata. – Chandrahas Apr 05 '16 at 20:31
  • Memory consumption wouldn't look like this. The numbers would be smooth until the swapping started and there would be a rather dramatic dropoff all at once. – tdelaney Apr 05 '16 at 20:48
  • True, but the resize (if happening consistently) could. I don't think that is the issue, though, since 2.5 MB doesn't seem enough to cause a resize; also, if it were a resize issue, the time increase should drop off in power-of-2 steps, so that may not be it either. Being able to look at the code of do_something may help. – Sid Apr 05 '16 at 20:53

Just to round this off with the answer... the time is increasing linearly, which is unexpected for a simple append but normal if you are re-processing the whole list with each new line. The slowdown is in do_something.

From the comments, do_something processes mydata and that operation keeps taking more time as mydata grows:

"do_something" is linear in length of mydata as it needs to find out index mydata.index(x) and also it checks for duplicacy x in mydata

tdelaney