
I am trying to read a file with 27,000+ lines, and it takes far too long: 30 minutes or more. Just to clarify, I am running it in a Coursera external Jupyter notebook, so I don't think it is a system limitation.

with open(filename) as training_file:
    # Your code starts here
    file = training_file.read()
    lines = file.split('\n')

    images = []
    labels = []
    images = np.array(images)
    labels = np.array(labels)

    c = 0
    for line in lines[1:]:
        row = line.split(',')
        labels = np.append(labels, row[0])
        images = np.append(images, np.array_split(row[1:], 28))
        c += 1
        print(c)

    images = images.astype(np.float64)
    labels = labels.astype(np.float64)
    # Your code ends here
return images, labels
Kruti Deepan Panda
  • *"so I don't think it is a system limitation"* what makes you assume that? It's actually sensible to assume that system limitations are in place in free, public, shared environments. – Tomalak May 24 '20 at 12:56
  • I assumed so because it runs deep learning quite easily. I was able to train on datasets of 10,000 images in a matter of minutes. – Kruti Deepan Panda May 24 '20 at 12:58
  • Other than that, I am quite sure there are [facilities in numpy to consume CSV files](https://stackoverflow.com/a/3519314/18771) that are faster than any custom roll-your-own-string-split approach. – Tomalak May 24 '20 at 13:02
  • But overall, your approach is inefficient because you are creating a new copy of the arrays with every loop iteration, which is a *huge* amount of work and a waste of memory. If you want your approach to be faster, you need to figure out the Numpy array size before you allocate it, and then allocate the right size *once*, and then go over the input file to fill the array. – Tomalak May 24 '20 at 13:14
  • @taha You were right; that was the problem with my code. I was using NumPy's append, and it finished way faster when I used the list's append. – Kruti Deepan Panda May 24 '20 at 14:49
  • @taha Your solution worked. NumPy's append was way slower than the list's append. Can you post it as an answer so I can close this question? – Kruti Deepan Panda May 24 '20 at 14:54

3 Answers


Use the built-in NumPy functions for reading a CSV (fromfile, genfromtxt, etc.) rather than rolling your own; they're written in C and much faster than doing the same thing in Python.
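
For instance, a minimal sketch along these lines, assuming the file has a header row followed by rows of the form label, pixel1, ..., pixel784 as in the question (the function name is illustrative, not from the answer):

import numpy as np

def parse_csv(filename):
    # Parse the whole CSV in one call; skip_header=1 drops the column-name row
    data = np.genfromtxt(filename, delimiter=',', skip_header=1, dtype=np.float64)
    labels = data[:, 0]                       # first column holds the labels
    images = data[:, 1:].reshape(-1, 28, 28)  # remaining 784 columns -> 28x28 images
    return images, labels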

Jiří Baum

You can save memory by reading the file line by line with a for loop:

with open("filename") as f: 
    for line in f: 
        <your code>

But as mentioned in the other comments, there are CSV tools that will be much faster: see the csv module or NumPy.
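
For example, a rough sketch using the csv module, assuming the same label-plus-784-pixels layout as in the question (the function name is illustrative):

import csv
import numpy as np

def read_rows(filename):
    images, labels = [], []      # plain Python lists: appending is cheap
    with open(filename) as f:
        reader = csv.reader(f)
        next(reader)             # skip the header row
        for row in reader:
            labels.append(row[0])
            images.append(row[1:])
    # convert to NumPy arrays once, at the end
    labels = np.array(labels, dtype=np.float64)
    images = np.array(images, dtype=np.float64).reshape(-1, 28, 28)
    return images, labels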

Almog-at-Nailo

Are you sure the time is actually spent reading the file? Comment out the NumPy code and run only the file-reading part. In my opinion, numpy.append is the slowest part. Have a look at this: NumPy append vs Python append
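
For reference, a rough sketch of the two alternatives raised in the comments: collect the rows in a plain Python list and convert once at the end (what the asker switched to), or preallocate the array up front as Tomalak suggested. The file name and the 785-column row layout are assumptions based on the question's code:

import numpy as np

with open("training.csv") as f:           # hypothetical file name
    lines = f.read().split('\n')[1:]      # drop the header row
lines = [line for line in lines if line]  # drop any trailing blank line

# Alternative 1: build a list of rows, then do a single conversion at the end
rows = [line.split(',') for line in lines]
data = np.array(rows, dtype=np.float64)

# Alternative 2: preallocate once, then fill in place (no per-row copies)
data = np.empty((len(lines), 785), dtype=np.float64)
for i, line in enumerate(lines):
    data[i] = np.asarray(line.split(','), dtype=np.float64)

labels = data[:, 0]
images = data[:, 1:].reshape(-1, 28, 28)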

taha