
I'm doing a machine learning project whose dataset consists of thousands of X-ray images. Every time I work on the project I have to reload the images and pre-process them, which is very time-consuming. I want to read the images once and write the resulting list of thousands of 224x224x3 matrices to a file that I can load whenever I return to the project.

I've already found some functions that let me write/read lists, but they don't seem to write the whole matrices, only part of them:

This is the code I used to write the file:

with open(obj_dir + "train_data_p", "w") as file:
    file.write(str(train_data_p))

This is what I get when I open my training dataset file in Notepad; as you can see from the "...," parts, it shows only snippets of the matrices:

[array([[[0.26666668, 0.26666668, 0.26666668],
        [0.32156864, 0.32156864, 0.32156864],
        [0.33333334, 0.33333334, 0.33333334],
        ...,
        [0.75686276, 0.75686276, 0.75686276],
        [0.77254903, 0.77254903, 0.77254903],
        [0.7764706 , 0.7764706 , 0.7764706 ]],

       [[0.27058825, 0.27058825, 0.27058825],
        [0.28627452, 0.28627452, 0.28627452],
        [0.31764707, 0.31764707, 0.31764707],
        ...,
        [0.7607843 , 0.7607843 , 0.7607843 ],
        [0.7647059 , 0.7647059 , 0.7647059 ],
        [0.8039216 , 0.8039216 , 0.8039216 ]],

       [[0.3019608 , 0.3019608 , 0.3019608 ],
        [0.34901962, 0.34901962, 0.34901962],
        [0.27058825, 0.27058825, 0.27058825],
        ...,
        [0.78431374, 0.78431374, 0.78431374],
        [0.7764706 , 0.7764706 , 0.7764706 ],
        [0.78431374, 0.78431374, 0.78431374]],

       ...,

       [[0.1254902 , 0.1254902 , 0.1254902 ],
        [0.1254902 , 0.1254902 , 0.1254902 ],
        [0.12156863, 0.12156863, 0.12156863],

How can I write/store the whole dataset so I don't have to read and process the images every time? Help me please!

9879ypxkj

3 Answers


You can do it with the numpy.save() and numpy.load() methods:

import numpy as np

# the .npy extension is appended automatically on save
np.save('/tmp/123', np.array([[1, 2, 3], [4, 5, 6]]))
np.load('/tmp/123.npy')  # returns the saved array
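
Applied to your case, a minimal sketch (assuming train_data_p is the list of equally sized 224x224x3 arrays and obj_dir is the directory from your question):

import numpy as np

# Stack the list of 224x224x3 arrays into one (N, 224, 224, 3) array
# and save it in NumPy's binary .npy format (no precision loss).
np.save(obj_dir + "train_data_p.npy", np.stack(train_data_p))

# In a later session, load the whole dataset back in one call.
train_data_p = np.load(obj_dir + "train_data_p.npy")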

The reason you are seeing ellipses in the file is that you are writing str(train_data_p) to the file, not the actual train_data_p object. NumPy's string representation truncates large arrays by default.

As pointed out in other answers, there are numerous packages that help with storing large data. If you are using numpy, this answer may help you too.
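
You can reproduce the truncation with a dummy array of the same shape (a quick sketch; the variable name is illustrative):

import numpy as np

big = np.zeros((224, 224, 3))
# str()/print() summarize arrays above NumPy's default threshold
# (1000 elements), inserting "..." instead of every value; this
# output is meant for humans and cannot be loaded back as data.
print(str(big))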

S.Au.Ra.B.H

You can easily serialize your data using built-in modules.

There are several options, most notably the built-in pickle module, or any other third-party serialization package available via pip.

More about serialization: https://en.wikipedia.org/wiki/Serialization
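
For example, a minimal pickle sketch (the file name is illustrative):

import pickle

# Write the whole list of arrays to one binary file.
with open("train_data_p.pkl", "wb") as f:
    pickle.dump(train_data_p, f)

# Read it back in a later session.
with open("train_data_p.pkl", "rb") as f:
    train_data_p = pickle.load(f)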

Alexandr Shurigin
  • Not a great idea for large arrays – Mad Physicist Dec 10 '19 at 22:20
  • Why do you think so? It is great for arrays of any size; pickle is the best choice for serializing any "general object" without structuring it with specialized packers that require a data schema declaration, etc. – Alexandr Shurigin Dec 10 '19 at 22:27
  • Anyway, it is much better and faster than re-processing the images (or any other data source) every time, and that's the point of the question, I believe. – Alexandr Shurigin Dec 10 '19 at 22:31