
I have a Python script which starts by reading a few large files and then does something else. Since I want to run this script multiple times while changing some of the code until I am happy with the result, it would be nice if the script did not have to read the files anew on every run, because they will not change. So I mainly want to use this for debugging.

It happens too often that I run scripts with bugs in them, but I only see the error message after several minutes, because the reading took so long.

Are there any tricks to do something like this?

(Where feasible, I create smaller test files.)

Godrebh
  • Do you need to read the entire file in at once, or can you restructure your code to read it line by line? – Eric Apr 17 '15 at 09:50
  • @Eric: In this case, I really need the entire file at once. – Godrebh Apr 17 '15 at 09:53
  • Maybe I could use the interactive python to read the files and then run the script from there, which then can use the data via imports? – Godrebh Apr 17 '15 at 10:04
  • Are you sure that reading in the file is where your time is going? – Eric Apr 17 '15 at 10:05
  • Yes, I am reading eleven cPickle files in total and after each I print out a message. Until now, I always debugged on smaller files, which only contain a subset of the original file. But I am still curious if it is possible to read all information once, and use it in another script which I am currently working on. And I am not always using pickle files, I also want to do this with text files or large tables. – Godrebh Apr 17 '15 at 10:36

1 Answer


I'm not good at Python, but it seems to be able to dynamically reload code from a changed module: How to re import an updated package while in Python Interpreter?
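A minimal sketch of that workflow, assuming a hypothetical module analysis.py with a run(data) function and an illustrative pickle file name, could look like this: load the data once in an interactive interpreter, then reload the module you keep editing instead of restarting the whole script.

    # Run these lines inside an interactive interpreter (python or ipython),
    # so that the loaded data stays in memory between edits.
    import importlib
    import pickle

    import analysis  # hypothetical module holding the code you keep changing

    # Read the large file once.
    with open('big_data.pkl', 'rb') as f:
        data = pickle.load(f)

    # Run the code under development on the already-loaded data.
    analysis.run(data)

    # ...edit analysis.py, then pick up the changes without re-reading the file:
    importlib.reload(analysis)
    analysis.run(data)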

Here are some other suggestions, not directly related to Python.

Firstly, try to create a smaller test file. Is the whole file required to reproduce the bug you are observing? Most probably only a small part of your input file is relevant.

Secondly, are these particular files required, or will the problem show up with any large amount of data? If it shows up only with these particular files, then once again it is most probably related to some feature of those files, and it will also show up with a smaller file that has the same feature. If the main cause is simply the large amount of data, you might be able to avoid reading it at all by generating some random data directly in the script.
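As a sketch of that last idea, where the record structure is purely illustrative, you could build throwaway data in memory instead of reading any file:

    import random
    import string

    def make_fake_records(n=100000, seed=0):
        """Generate throwaway test data in memory instead of reading a large file."""
        rng = random.Random(seed)  # fixed seed keeps runs reproducible
        return [(''.join(rng.choice(string.ascii_lowercase) for _ in range(8)),
                 rng.random())
                for _ in range(n)]

    records = make_fake_records()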

Thirdly, what is the bottleneck in reading the file? Is it just hard drive performance, or do you do some heavy processing of the data in your script before actually reaching the part that causes problems? In the latter case, you might be able to do that processing once, write the results to a new file, and then have your script load the processed data instead of redoing the processing on every run.
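A sketch of that approach, where preprocess() and read_big_files() are placeholders for whatever slow steps your script currently performs, and the cache file name is illustrative:

    import os
    import pickle

    CACHE = 'preprocessed.pkl'  # illustrative cache file name

    def load_preprocessed():
        """Run the expensive processing once and reuse the result on later runs."""
        if os.path.exists(CACHE):
            with open(CACHE, 'rb') as f:
                return pickle.load(f)
        data = preprocess(read_big_files())  # placeholders for your existing slow steps
        with open(CACHE, 'wb') as f:
            pickle.dump(data, f)
        return data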

If hard drive performance is the issue, consider a faster filesystem. On Linux, for example, you might be able to use /dev/shm, which is backed by RAM.
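For example (the paths are illustrative; /dev/shm is a RAM-backed tmpfs on most Linux distributions), you could copy the input there once and read it from memory-speed storage on every later run:

    import os
    import shutil

    src = 'big_data.pkl'                      # original file on disk (illustrative)
    fast = os.path.join('/dev/shm', os.path.basename(src))

    if not os.path.exists(fast):
        shutil.copy(src, fast)                # pay for the slow disk read only once

    with open(fast, 'rb') as f:
        payload = f.read()                    # subsequent runs read from RAM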

Petr