
I'm doing a machine learning project whose dataset consists of thousands of X-ray images. Every time I work on the project I have to reload the images and pre-process them, which is very time-consuming. I want to read the images once and write the resulting list of thousands of 224x224x3 matrices to a file that I can load whenever I return to the project.

I've already found some functions that let me write/read lists, but they don't seem to write the whole matrices, only part of them:

This is the code I used to write the file:

with open(obj_dir + "train_data_p", "w") as file:
    file.write(str(train_data_p))

This is what I get when I open my training dataset file in Notepad; as you can see from the "...," parts, it shows only snippets of the matrices:

[array([[[0.26666668, 0.26666668, 0.26666668],
        [0.32156864, 0.32156864, 0.32156864],
        [0.33333334, 0.33333334, 0.33333334],
        ...,
        [0.75686276, 0.75686276, 0.75686276],
        [0.77254903, 0.77254903, 0.77254903],
        [0.7764706 , 0.7764706 , 0.7764706 ]],

       [[0.27058825, 0.27058825, 0.27058825],
        [0.28627452, 0.28627452, 0.28627452],
        [0.31764707, 0.31764707, 0.31764707],
        ...,
        [0.7607843 , 0.7607843 , 0.7607843 ],
        [0.7647059 , 0.7647059 , 0.7647059 ],
        [0.8039216 , 0.8039216 , 0.8039216 ]],

       [[0.3019608 , 0.3019608 , 0.3019608 ],
        [0.34901962, 0.34901962, 0.34901962],
        [0.27058825, 0.27058825, 0.27058825],
        ...,
        [0.78431374, 0.78431374, 0.78431374],
        [0.7764706 , 0.7764706 , 0.7764706 ],
        [0.78431374, 0.78431374, 0.78431374]],

       ...,

       [[0.1254902 , 0.1254902 , 0.1254902 ],
        [0.1254902 , 0.1254902 , 0.1254902 ],
        [0.12156863, 0.12156863, 0.12156863],

How can I write/store the whole dataset so I don't have to read and process the images every time? Help me please!

9879ypxkj

3 Answers


You can do it with the numpy.save() and numpy.load() methods:

import numpy as np

# the .npy extension is appended automatically on save
np.save('/tmp/123', np.array([[1, 2, 3], [4, 5, 6]]))
np.load('/tmp/123.npy')  # returns the saved array
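
Applied to your case, a minimal sketch (assuming train_data_p is the list of equally sized 224x224x3 arrays and obj_dir is the directory from your question):

import numpy as np

# Stack the list of 224x224x3 arrays into one (N, 224, 224, 3) array
# and save it in NumPy's binary .npy format (no precision loss).
np.save(obj_dir + "train_data_p.npy", np.stack(train_data_p))

# In a later session, load the whole dataset back in one call.
train_data_p = np.load(obj_dir + "train_data_p.npy")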

The reason you are seeing ellipses in the file is that you are writing str(train_data_p) to the file, not the actual train_data_p object. NumPy's string representation truncates large arrays by default.

As pointed out in other answers, there are numerous packages that help with storing large data. If you are using numpy, this answer may help you too.
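
You can reproduce the truncation with a dummy array of the same shape (a quick sketch; the variable name is illustrative):

import numpy as np

big = np.zeros((224, 224, 3))
# str()/print() summarize arrays above NumPy's default threshold
# (1000 elements), inserting "..." instead of every value; this
# output is meant for humans and cannot be loaded back as data.
print(str(big))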

S.Au.Ra.B.H

You can easily serialize your data using built-in modules.

There are several options, most notably the built-in pickle module, or any other third-party serialization package available via pip.

More about serialization: https://en.wikipedia.org/wiki/Serialization
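
For example, a minimal pickle sketch (the file name is illustrative):

import pickle

# Write the whole list of arrays to one binary file.
with open("train_data_p.pkl", "wb") as f:
    pickle.dump(train_data_p, f)

# Read it back in a later session.
with open("train_data_p.pkl", "rb") as f:
    train_data_p = pickle.load(f)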

Alexandr Shurigin
  • Not a great idea for large arrays – Mad Physicist Dec 10 '19 at 22:20
  • Why do you think so? It is great for arrays of any size; pickle is the best choice for serializing any "general object" without structuring it with specialized packers that require a data schema declaration, etc. – Alexandr Shurigin Dec 10 '19 at 22:27
  • Anyway, it is much better and faster than re-processing the images (or any other data source) every time, and that's the point of the question, I believe. – Alexandr Shurigin Dec 10 '19 at 22:31