
I'm currently working on a Python project that handles massive amounts of data, which I have to write out to files. The output is always a single line, but it can consist of millions of digits.

The actual mathematical operations in Python take only seconds, minutes at most. Writing the result to a file takes up to several hours, which I don't always have.

Is there any way of speeding up the I/O?
From what I can tell, the number is stored in RAM (or at least I assume so; it's the only thing that would take up 11 GB of RAM), but Python does not print it to a text file immediately. Is there a way to dump that information -- the number, if that's what it is -- to a file? I've tried Task Manager's Dump, which gave me a 22 GB dump file (yes, you read that right), and what I was looking for doesn't seem to be in there, although it wasn't easy to read.

If it makes a difference, I have Python 3.5.1 (Anaconda and Spyder), Windows 8.1 x64 and 16GB RAM.

By the way, I do run garbage collection (the gc module) inside the script, and I delete variables that are no longer needed, so those 11 GB aren't just junk.

Master-chip
  • is it being written to the terminal or a file? – Tadhg McDonald-Jensen Apr 27 '16 at 18:58
  • @TadhgMcDonald-Jensen A text file. – Master-chip Apr 27 '16 at 18:59
  • hmmm, you might be able to send the data to be written to a queue and have another thread processing the IO but I'm not sure it will help much. – Tadhg McDonald-Jensen Apr 27 '16 at 19:01
  • @TadhgMcDonald-Jensen That's an interesting idea... However, I'm not too sure how it would work either. Python is not that flexible. – Master-chip Apr 27 '16 at 19:03
  • see http://stackoverflow.com/questions/9259380/how-to-write-to-a-file-using-non-blocking-io – Tadhg McDonald-Jensen Apr 27 '16 at 19:22
  • How are you actually going to use the data in the file? Reading on the screen before falling asleep? By another process? Some visualisation? – Jan Vlcinsky Apr 27 '16 at 19:37
  • @TadhgMcDonald-Jensen It doesn't seem to work for what I'm trying to do; it prints very slowly. It could still be quicker than mine, though (say it prints in an hour and mine takes ten), so I'll let it run for a while. – Master-chip Apr 27 '16 at 19:38
  • @JanVlcinsky It's going to eventually be used by another process, but I don't think that applies to this. – Master-chip Apr 27 '16 at 19:40
  • If using async IO doesn't help then all I can suggest is get a faster hard drive. (I wonder if writing two files to separate hard drives would provide any boost?) – Tadhg McDonald-Jensen Apr 27 '16 at 19:42
  • @TadhgMcDonald-Jensen Wouldn't writing multiple files be a significant slowdown, though? Unless each contains half the data, meaning you only have to write half the data to each, at twice the speed (since it's two hard disks...). Thank you... – Master-chip Apr 27 '16 at 19:48
  • yes sorry, I meant writing half of the data to one hard drive and half to the other (both being written at the same time). Although that assumes you can split the output into self-contained chunks. – Tadhg McDonald-Jensen Apr 27 '16 at 19:53
  • @TadhgMcDonald-Jensen Theoretically, dividing it by two would do. Since the maths operations seem to be so fast, I'll just get the next process to multiply the two files together instead. (It _is_ one number, not a lot of data) – Master-chip Apr 27 '16 at 19:56
  • ... I'm sorry, you are writing just digits of a single number... to a text file? As in each byte of the file is storing a single digit in base 10? A binary representation of the number would be `ln(10) / ln(2)` times shorter, and even using 4 bits to store a base-10 digit would halve the size of the file. (Although both methods mean that it will need some amount of parsing before it is humanly readable.) – Tadhg McDonald-Jensen Apr 27 '16 at 20:06
  • @TadhgMcDonald-Jensen Yes, that's right. I wasn't too excited myself when I coded text files in. But I want as little handling of the data as possible, because the more I handle it, the more I'm afraid it will slow _everything_ down, which is why I don't want to convert it to a different representation from what it currently is (an int value). – Master-chip Apr 28 '16 at 11:34
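
To illustrate the byte-dump idea from the comments above, a minimal sketch (the number and file name are stand-ins). `int.to_bytes` and `int.from_bytes` exist from Python 3.2 on, so they work on 3.5.1. Note that in CPython, converting a huge int to a decimal string is itself quadratic in the number of digits, while the byte conversion is far cheaper, so this can save time as well as space:

```python
# Minimal sketch: store a huge int as raw bytes instead of decimal digits.
# The value and file name are illustrative stand-ins.
n = 2 ** 1000000  # stand-in for the real number

# Big-endian byte dump; length is ceil(bit_length / 8).
raw = n.to_bytes((n.bit_length() + 7) // 8, byteorder='big')

with open('number.bin', 'wb') as f:
    f.write(raw)

# Reading it back is just as direct.
with open('number.bin', 'rb') as f:
    restored = int.from_bytes(f.read(), byteorder='big')

assert restored == n
```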

3 Answers


If you are indeed I/O bound by the time it takes to write the file, multithreading with a pool of threads may help. There is a limit to that, of course, but at least it would allow you to issue non-blocking file writes.
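
A minimal sketch of the idea, using the standard-library `concurrent.futures` module (the file name and `result` variable are illustrative):

```python
# Hand the conversion and write off to a worker thread so the main
# thread is not blocked while the file is being written.
from concurrent.futures import ThreadPoolExecutor

result = 2 ** 1000000  # stand-in for the computed number

def write_result(path, value):
    with open(path, 'w') as f:
        f.write(str(value))  # conversion and write both happen off the main thread

with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(write_result, 'result.txt', result)
    # ... the main thread is free to keep computing here ...
    future.result()  # block only when the write actually has to be finished
```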

Cyb3rFly3r
  • I'd recommend showing (or adding links to) examples of safe use of asynchronous IO since it can be easy to corrupt files if not handled properly. – Tadhg McDonald-Jensen Apr 27 '16 at 19:03
  • I tried multithreading without much luck; however, that might have been some instance of bad coding. Other than that, I have never been too familiar with multithreading. How do you suggest I do this? – Master-chip Apr 27 '16 at 19:15

Multithreading could speed it up: run the writers on other threads and feed them through an in-memory queue. A sketch follows.
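
A minimal sketch of that, assuming a single writer thread draining a standard-library `queue.Queue` (the names are illustrative):

```python
# One producer (the computation) hands data to one consumer (the writer)
# through a thread-safe queue, so the producer never blocks on disk I/O.
import queue
import threading

q = queue.Queue()

def writer(path):
    with open(path, 'w') as f:
        while True:
            chunk = q.get()
            if chunk is None:  # sentinel: producer is done
                break
            f.write(chunk)

t = threading.Thread(target=writer, args=('result.txt',))
t.start()

big_number = 2 ** 1000000   # stand-in for the real value
q.put(str(big_number))      # the producer hands the data off and moves on
q.put(None)                 # signal end of stream
t.join()
```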

From a system-design standpoint, it may also be worth evaluating whether you need to write everything to the file at all. Perhaps consider creating various levels of logging, so that a release mode could run faster (if that makes sense in your context).
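
For the logging-levels idea, a minimal sketch with the standard `logging` module (the names and values are illustrative):

```python
# Expensive debug output is skipped entirely when the level is raised,
# so a "release" run never pays for the full dump.
import logging

logging.basicConfig(level=logging.INFO)  # raise to WARNING for release runs
log = logging.getLogger(__name__)

big_value = 2 ** 1000000  # stand-in for the real value

if log.isEnabledFor(logging.DEBUG):      # guard the costly str() conversion
    log.debug("full value: %s", big_value)

log.info("result has %d bits", big_value.bit_length())  # cheap summary
```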

Bradford Medeiros
  • The only way I could simplify my data would be to display it as an exponent, but then I'd lose accuracy, which I can't allow. However, this is an interesting idea. How do you suggest I create different loggers? – Master-chip Apr 27 '16 at 19:13

Use the HDF5 file format

The problem is, you have to write a lot of data.

HDF5 is a format that is very efficient in size and allows access by various tools.

Be prepared for a few challenges:

  • there are multiple Python packages for HDF5; you will have to find the one that fits your needs
  • installation is not always simple (but there may be a Windows installation binary)
  • expect a bit of study to understand the data structures to be stored
  • it will occasionally need some CPU cycles - typically you write a lot of data quickly, and at some moment it has to be flushed to the disk; at that moment it starts compressing the data, which can take a few seconds. See GIL for IO bounded thread in C extension (HDF5)

Anyway, I think it is very likely you will manage, and apart from faster writes to the files you will also gain smaller files, which are simpler to handle. A minimal sketch follows.
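
A minimal sketch, assuming the `h5py` package (one possible HDF5 binding; the names and compression choice are illustrative), and storing the number as raw bytes as discussed in the comments on the question:

```python
# Store a huge int in an HDF5 file as a compressed array of bytes.
import numpy as np
import h5py

n = 2 ** 1000000  # stand-in for the real number
raw = n.to_bytes((n.bit_length() + 7) // 8, byteorder='big')
data = np.frombuffer(raw, dtype=np.uint8)

with h5py.File('result.h5', 'w') as f:
    f.create_dataset('number', data=data, compression='gzip')

with h5py.File('result.h5', 'r') as f:
    restored = int.from_bytes(f['number'][...].tobytes(), byteorder='big')

assert restored == n
```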

Jan Vlcinsky