
In one of my recent projects I need to perform a seemingly simple task, but I'm not sure what the most efficient way to do it is.

I have several large text files (>5 GB) and I need to continuously extract random lines from them. The requirements are: I can't load the files into memory, I need to do this very efficiently (>>1000 lines per second), and I would prefer to do as little preprocessing as possible.

The files consist of many short lines (~20 million lines). The "raw" files have varying line lengths, but with a short preprocessing step I can make all lines the same length (though the ideal solution would not require preprocessing).
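To make the fixed-length option concrete, here is a minimal sketch, assuming every line fits in a chosen `width` in bytes and that trailing spaces carry no meaning. The function names (`pad_lines`, `read_records`, `sample_lines`) are made up for illustration. Once every record has the same size, line `i` starts at byte `i * (width + 1)`, so no index is needed.

```python
import os
import random


def pad_lines(src_path, dst_path, width):
    """One-off preprocessing: pad every line with spaces to `width` bytes."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in src:
            body = line.rstrip(b"\r\n")
            if len(body) > width:
                raise ValueError("line longer than the chosen record width")
            dst.write(body.ljust(width) + b"\n")


def read_records(path, indices, width):
    """Return the lines at the given 0-based indices of a fixed-width file."""
    record_size = width + 1  # payload plus the newline byte
    found = {}
    with open(path, "rb") as f:
        # Visit offsets in ascending order so a batch reads mostly forward.
        for i in sorted(set(indices)):
            f.seek(i * record_size)
            # Assumption: trailing spaces are padding, not data.
            found[i] = f.read(width).rstrip(b" ").decode("utf-8")
    return [found[i] for i in indices]


def sample_lines(path, k, width, rng=random):
    """Draw k random lines (with replacement) from a fixed-width file."""
    n_records = os.path.getsize(path) // (width + 1)
    picks = [rng.randrange(n_records) for _ in range(k)]
    return read_records(path, picks, width)
```

Sorting the offsets within a batch is only a guess at what helps when thousands of lines are fetched at once; whether it matters depends on the storage and the OS page cache.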

I already tried the default Python solutions mentioned here, but they were too slow (and the linecache solution loads the file into memory, so it is not usable here).

The next solution I thought about is to create some kind of index. I found this solution, but it is very outdated, so it would need some work to get running, and even then I'm not sure whether the overhead of processing the index file would slow things down to the time scale of the solutions above.
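For reference, the index idea can be sketched with the standard library alone. This is an illustration under assumptions, not the outdated solution mentioned above; `build_index` and `read_lines` are hypothetical names. The index stores one 8-byte start offset per line, so roughly 160 MB for 20 million lines, which is much smaller than the file itself.

```python
from array import array


def build_index(path):
    """Scan the file once and record the byte offset where each line starts."""
    offsets = array("Q")  # unsigned 64-bit integers
    pos = 0
    with open(path, "rb") as f:
        for line in f:
            offsets.append(pos)
            pos += len(line)
    return offsets


def read_lines(path, offsets, indices):
    """Return the lines at the given 0-based indices using the offset index."""
    found = {}
    with open(path, "rb") as f:
        # Ascending offsets keep a large batch reading mostly forward.
        for i in sorted(set(indices)):
            f.seek(offsets[i])
            found[i] = f.readline().rstrip(b"\r\n").decode("utf-8")
    return [found[i] for i in indices]
```

The index only has to be built once per file; `array.tofile` and `array.fromfile` can persist it between runs so the scan is not repeated.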

Another solution is to convert the file into a binary file and get instant access to lines that way. For this I couldn't find any Python package that supports binary-text work, and I feel that writing a robust parser this way could take a very long time and could create many hard-to-diagnose errors down the line because of small miscalculations or mistakes.

The final solution I thought about is using some kind of database (SQLite in my case), which would require transferring the lines into a database and loading them from there.
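A rough sketch of the database route with the standard `sqlite3` module is shown here. The table name `lines`, its columns, and the helper names are assumptions made for the example. Lines are stored under their 1-based line number as an integer primary key, so a batch of random lines becomes a primary-key lookup.

```python
import sqlite3


def load_into_sqlite(text_path, db_path, batch_size=100_000):
    """One-off import: store each line under its 1-based line number."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE lines (id INTEGER PRIMARY KEY, content TEXT)")
    batch = []
    with open(text_path, "r", encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            batch.append((i, line.rstrip("\n")))
            if len(batch) >= batch_size:
                conn.executemany("INSERT INTO lines VALUES (?, ?)", batch)
                batch.clear()
    if batch:
        conn.executemany("INSERT INTO lines VALUES (?, ?)", batch)
    conn.commit()
    conn.close()


def fetch_lines(conn, ids, chunk_size=500):
    """Fetch many lines by id, returned in the order the ids were given."""
    ids = list(ids)
    rows = {}
    # SQLite limits the number of bound parameters per statement,
    # so the ids are queried in chunks.
    for start in range(0, len(ids), chunk_size):
        chunk = ids[start:start + chunk_size]
        marks = ",".join("?" * len(chunk))
        query = f"SELECT id, content FROM lines WHERE id IN ({marks})"
        for row_id, content in conn.execute(query, chunk):
            rows[row_id] = content
    return [rows[i] for i in ids]
```

Whether this reaches the required throughput would have to be measured; the sketch only shows the shape of the import and the batched lookup.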

Note: I will also load thousands of (random) lines each time, therefore solutions which work better for groups of lines will have an advantage.

Thanks in advance,

Art.

artembus
  • 2
    I recommend sqlite. It fits very well with this problem and there is no need to install. – MEdwin Nov 26 '18 at 14:44
  • 2
    As MEdwin said and as you alluded to, I think your best bet is to change the file away from a text file, and into either some form of SQL file or HDF5. You might be able to read in the file quicker if you pickle it or something, but in my experience, that doesn't make a huge difference. – John Rouhana Nov 26 '18 at 14:46
  • 2
    Another possible approach is to seek to a random position in the file and move backwards/forwards until you can isolate a line, and repeat... – Jon Clements Nov 26 '18 at 14:49
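The seek-based approach from the last comment could look roughly like the sketch below, which assumes a non-empty file opened in binary mode; `random_line_by_seek` is a hypothetical name. It needs no preprocessing, but it is not uniform when line lengths vary: a line is picked with probability roughly proportional to the length of the line before it.

```python
import random


def random_line_by_seek(f, size, rng=random):
    """Jump to a random byte, skip the partial line, return the next full line."""
    pos = rng.randrange(size)
    f.seek(pos)
    if pos:
        f.readline()  # discard the rest of the line we landed in
    line = f.readline()
    if not line:
        # We landed inside the last line, so wrap around to the first one.
        f.seek(0)
        line = f.readline()
    return line.rstrip(b"\r\n").decode("utf-8")
```

A caller would open the file once with `open(path, "rb")`, get `size` from `os.path.getsize(path)`, and call the function repeatedly. With equal-length lines the bias disappears, since every line then has the same chance of being landed before.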

1 Answer


As said in the comments, I believe using HDF5 would be a good option. This answer shows how to read that kind of file.

Pedro Borges