
I am trying to read a file with 27,000+ lines, and it takes far too long: 30 minutes or more. Just to clarify, I am running it in a Coursera external Jupyter notebook, so I don't think it is a system limitation.

with open(filename) as training_file:
    # Your code starts here
    file = training_file.read()
    lines = file.split('\n')

    images = []
    labels = []
    images = np.array(images)
    labels = np.array(labels)

    c = 0
    for line in lines[1:]:
        row = line.split(',')
        labels = np.append(labels, row[0])
        images = np.append(images, np.array_split(row[1:], 28))
        c += 1
        print(c)

    images = images.astype(np.float64)
    labels = labels.astype(np.float64)
    # Your code ends here
return images, labels
Kruti Deepan Panda
  • *"so I don't think it is a system limitation"* what makes you assume that? It's actually sensible to assume that system limitations are in place in free, public, shared environments. – Tomalak May 24 '20 at 12:56
  • I assumed so because it runs deep learning quite easily. I was able to train on datasets of 10,000 images in a matter of minutes. – Kruti Deepan Panda May 24 '20 at 12:58
  • Other than that, I am quite sure there are [facilities in numpy to consume CSV files](https://stackoverflow.com/a/3519314/18771) that are faster than any custom roll-your-own-string-split approach. – Tomalak May 24 '20 at 13:02
  • But overall, your approach is inefficient because you are creating a new copy of the arrays with every loop iteration, which is a *huge* amount of work and a waste of memory. If you want your approach to be faster, you need to figure out the Numpy array size before you allocate it, and then allocate the right size *once*, and then go over the input file to fill the array. – Tomalak May 24 '20 at 13:14
  • @taha You were right; that was the problem with my code. I was using NumPy's append, and it finished way faster when I used the list's append. – Kruti Deepan Panda May 24 '20 at 14:49
  • @taha Your solution worked. NumPy's append was way slower than the list's append. Can you post it as an answer so I can close this question? – Kruti Deepan Panda May 24 '20 at 14:54

3 Answers


Use the built-in NumPy functions for reading a CSV (fromfile, genfromtxt, etc.) rather than rolling your own; they're written in C and much faster than doing the same thing in Python.
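
For instance, a minimal sketch along these lines, assuming the file has a header row followed by rows of the form label, pixel1, ..., pixel784 as in the question (the function name is illustrative, not from the answer):

import numpy as np

def parse_csv(filename):
    # Parse the whole CSV in one call; skip_header=1 drops the column-name row
    data = np.genfromtxt(filename, delimiter=',', skip_header=1, dtype=np.float64)
    labels = data[:, 0]                       # first column holds the labels
    images = data[:, 1:].reshape(-1, 28, 28)  # remaining 784 columns -> 28x28 images
    return images, labels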

Jiří Baum

You can save memory by reading the file line by line with a for loop:

with open("filename") as f: 
    for line in f: 
        <your code>

But as mentioned in the other comments, there are CSV tools that will be much faster: see the csv module or NumPy.
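
For example, a rough sketch using the csv module, assuming the same label-plus-784-pixels layout as in the question (the function name is illustrative):

import csv
import numpy as np

def read_rows(filename):
    images, labels = [], []      # plain Python lists: appending is cheap
    with open(filename) as f:
        reader = csv.reader(f)
        next(reader)             # skip the header row
        for row in reader:
            labels.append(row[0])
            images.append(row[1:])
    # convert to NumPy arrays once, at the end
    labels = np.array(labels, dtype=np.float64)
    images = np.array(images, dtype=np.float64).reshape(-1, 28, 28)
    return images, labels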

Almog-at-Nailo

Are you sure the time is actually spent reading the file? Comment out the NumPy code and run only the file-reading part. In my opinion, numpy.append is the slowest part. Have a look at this: NumPy append vs Python append
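
For reference, a rough sketch of the two alternatives raised in the comments: collect the rows in a plain Python list and convert once at the end (what the asker switched to), or preallocate the array up front as Tomalak suggested. The file name and the 785-column row layout are assumptions based on the question's code:

import numpy as np

with open("training.csv") as f:           # hypothetical file name
    lines = f.read().split('\n')[1:]      # drop the header row
lines = [line for line in lines if line]  # drop any trailing blank line

# Alternative 1: build a list of rows, then do a single conversion at the end
rows = [line.split(',') for line in lines]
data = np.array(rows, dtype=np.float64)

# Alternative 2: preallocate once, then fill in place (no per-row copies)
data = np.empty((len(lines), 785), dtype=np.float64)
for i, line in enumerate(lines):
    data[i] = np.asarray(line.split(','), dtype=np.float64)

labels = data[:, 0]
images = data[:, 1:].reshape(-1, 28, 28)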

taha