Which one is faster in production: a file on disk or a file in memory (StringIO, BytesIO)?

Question

I am converting a dictionary to a pandas object with to_csv. I have two ways of doing this:

1 - by writing a file to disk (with an open statement)

2 - by writing to memory (StringIO, BytesIO)

I have used both approaches, creating a file on disk and using StringIO, to convert to a pandas object. I tried to read comparisons between these three, but I am a bit confused about which one is faster, so that I can use it in production to process tons of data.

@jizhihaoSAMA How? But I also need some experienced advice: what do people use in production for processing millions of records? – Chidananda Nayak Apr 09 '20 at 04:51
If we just consider the speed, how could the disk possibly be faster than memory? – Sraw Apr 09 '20 at 04:57
Your description doesn't make sense to me. `to_csv` doesn't convert a dictionary to a pandas object, it writes a CSV using a dataframe. – juanpa.arrivillaga Apr 09 '20 at 05:04
@juanpa.arrivillaga I mean I am converting a dict to a dataframe and then to CSV. – Chidananda Nayak Apr 09 '20 at 05:06

score 2 · Answer 1 · answered Apr 09 '20 at 04:52

Writing to and reading from memory is fast. But keep in mind that you have tons of data: storing all of it in memory might use up your RAM, which can slow the system down or cause out-of-memory errors. So analyze which data should be kept in memory and which should be written to files.
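To make the trade-off concrete, here is a minimal sketch of both targets the question mentions. The sample dictionary and the file name `out.csv` are made up for illustration, since the real data is not shown in the question.

```python
import io
import os
import tempfile

import pandas as pd

# Hypothetical sample data; the real dictionary is not shown in the question.
data = {"id": [1, 2, 3], "name": ["a", "b", "c"]}
df = pd.DataFrame(data)

# 1. In memory: the whole CSV text stays in RAM until the buffer is released.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# 2. On disk: the CSV goes to a file, so RAM only has to hold the DataFrame.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "out.csv")  # hypothetical file name
    df.to_csv(path, index=False)
    size_on_disk = os.path.getsize(path)

print(len(csv_text), size_on_disk)
```

With the in-memory variant the DataFrame and its full CSV text exist in RAM at the same time, which is the cost to watch when the data is large.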

Nandu Raj


I think it would be better to use a file object on disk than memory. Can I use the tempfile module for that? – Chidananda Nayak Apr 09 '20 at 04:55
Yes, that's the usual practice when we have lots of data. We store it in files but make use of caches: the data that is accessed frequently is kept in memory (specifically, in a cache), and the rest is read from disk or a DB on request. I am not sure about the tempfile module. – Nandu Raj Apr 09 '20 at 04:58
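On the tempfile question raised in the comments: the standard-library `tempfile` module can provide a disk-backed file object that pandas writes to and reads from. The following is only a sketch with made-up sample data, not something either commenter confirmed using.

```python
import tempfile

import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# TemporaryFile returns a file object that is removed automatically on close.
# newline="" leaves line endings to pandas instead of translating them again.
with tempfile.TemporaryFile(mode="w+", newline="", encoding="utf-8") as tmp:
    df.to_csv(tmp, index=False)
    tmp.seek(0)  # rewind before reading the data back
    round_trip = pd.read_csv(tmp)

print(round_trip)
```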

score 2 · Answer 2 · answered Apr 09 '20 at 05:03

In general - writing to RAM (memory) will be faster.

However, you might want to use iterators (which save memory) if you have too much data, because your machine might run out of memory or write a lot to your swap file (in short, an "extension" of your RAM on your hard drive, you can read about it here), which will hurt your performance a lot.
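As a rough illustration of the iterator idea, the sketch below streams rows from a generator straight to a CSV file, so only one row is held in memory at a time. It uses the standard-library `csv` module rather than pandas, and `generate_rows`, `write_rows` and the column names are hypothetical.

```python
import csv
import os
import tempfile


def generate_rows(n):
    """Yield one row at a time instead of building a full list in memory."""
    for i in range(n):
        yield {"id": i, "square": i * i}


def write_rows(path, rows, fieldnames):
    """Stream rows from any iterator to a CSV file and return the row count."""
    count = 0
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for row in rows:
            writer.writerow(row)
            count += 1
    return count


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "rows.csv")  # hypothetical file name
    written = write_rows(path, generate_rows(100_000), ["id", "square"])
    with open(path, newline="", encoding="utf-8") as f:
        first_lines = [next(f).strip() for _ in range(3)]

print(written, first_lines)
```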

For benchmarking, if your code is pretty simple, I would recommend using timeit, but there are even better resources for that, such as this one, from scipy.
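A minimal timeit sketch for this comparison could look like the following. The DataFrame size, the repeat counts and the file name `bench.csv` are arbitrary assumptions, so treat the numbers it prints as indicative only.

```python
import io
import os
import tempfile
import timeit

import pandas as pd

# Hypothetical benchmark data; adjust the size to resemble the real workload.
df = pd.DataFrame({"id": range(10_000), "value": range(10_000)})


def to_memory():
    """Write the DataFrame as CSV into an in-memory text buffer."""
    buffer = io.StringIO()
    df.to_csv(buffer, index=False)
    return buffer.getvalue()


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "bench.csv")  # hypothetical file name

    def to_disk():
        """Write the DataFrame as CSV to a file on disk."""
        df.to_csv(path, index=False)

    # Take the best of several runs to reduce noise from other processes.
    memory_seconds = min(timeit.repeat(to_memory, number=5, repeat=3))
    disk_seconds = min(timeit.repeat(to_disk, number=5, repeat=3))

print(f"memory: {memory_seconds:.4f}s, disk: {disk_seconds:.4f}s")
```

Small writes to disk are often absorbed by the operating system's page cache, so the gap measured here can be smaller than it would be with data that exceeds available RAM.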
