2

I have a very large file and want to read specific lines from it. I always know the line number of the data I want, but I don't want to read the entire file each time just to get that one line.

Is there a way to read only specific lines in Python? If not, what is the most efficient way to do this (i.e. reading as little of the file as possible, to speed up execution)?

KillerKode
  • 1
    Does this answer your question? [Reading specific lines only](https://stackoverflow.com/questions/2081836/reading-specific-lines-only) – Selcuk Dec 10 '20 at 00:18
  • 1
    If all the lines have EXACTLY the same number of characters/bytes then there may be a way to seek to that position, but if the lines can be different lengths then there is no way to know where line two starts until after you have read line one and found the newline at its end. – Jerry Jeremiah Dec 10 '20 at 00:20
  • The second answer [here](https://stackoverflow.com/questions/620367/how-to-jump-to-a-particular-line-in-a-huge-text-file) is a good approach. You will have to go over the file at least once though. – ssp Dec 10 '20 at 00:25
  • 1
    linecache: https://docs.python.org/3/library/linecache.html – KetZoomer Dec 10 '20 at 00:28
  • @Selcuk thanks for the link, I did not see that before. However, it appears it's a slightly different question / answers - as the answers there are focused on reading specific lines in a memory efficient way, but it still seems like they are reading the entire file each time, just not storing all lines in memory. – KillerKode Dec 10 '20 at 15:30
  • @KetZoomer, thanks that's a good find but I don't think linecache is a good idea as I understand it loads the entire file in memory first? – KillerKode Dec 10 '20 at 15:34
  • 1
    @KillerKode yup, that's correct – KetZoomer Dec 10 '20 at 16:46

2 Answers

5

Here are some options:

  1. Go over the file at least once and keep track of the file offsets of the lines you are interested in. This is a good approach if you might be seeking these lines multiple times and the file won't be changed.
  2. Consider changing the data format. For example csv instead of json (see comments).
  3. If you have no other alternative, use the traditional:
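def get_lines(path, linenums):
    wanted = set(linenums)  # set membership tests are O(1)
    last = max(wanted, default=-1)
    with open(path) as f:
        for lno, ln in enumerate(f):  # lno is 0-based
            if lno in wanted:
                yield ln
            if lno >= last:
                break  # nothing left to find, so stop reading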

On a 4GB file this took ~6s for linenums = [n // 4, n // 2, n - 1] where n = lines_in_file.
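
Option 1 could look roughly like the sketch below. It assumes a UTF-8 text file that is not modified between building the index and reading from it; the helper names `build_line_index` and `read_lines_at` are made up for illustration. The file is opened in binary mode so that the recorded offsets are exact byte positions.

```python
def build_line_index(path):
    """Scan the file once and return the byte offset where each line starts."""
    offsets = []
    pos = 0
    with open(path, "rb") as f:  # binary mode: offsets are exact byte positions
        for line in f:
            offsets.append(pos)
            pos += len(line)
    return offsets


def read_lines_at(path, offsets, linenums):
    """Return the requested 0-based lines by seeking directly to each one."""
    result = []
    with open(path, "rb") as f:
        for n in linenums:
            f.seek(offsets[n])
            # Assumption: the file is UTF-8 encoded.
            result.append(f.readline().decode("utf-8"))
    return result
```

The index costs one full pass over the file, after which each lookup is a single seek plus one `readline`. The offsets list can also be saved to a separate file and reloaded later, which is the variant described in the comments below.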

ssp
  • 1
    I like the idea of point number 1, I can go over the file and store metadata to tell me how much to seek for each line. I may accept this solution as I think it's most likely the best approach, but will wait a little longer for more answers just in case. – KillerKode Dec 10 '20 at 15:35
  • 1
    Have just done some testing: reading my sample dataset the way I currently do from this large JSON file takes 159 seconds just to read specific parts of that file. When I reformat the same data into a CSV style format and then store metadata about its offsets in another file (using approach 1), it takes 0.88 seconds to read both files and get me the lines I am interested in. That's over 159 times faster!!! Very good, remove point 2 from your answer and I will accept. – KillerKode Dec 10 '20 at 20:35
  • 1
    I also estimated that even after another 20 years of data collection, this won't take more than 3 seconds to load the bits I need which is more than enough for my needs. The combination of ditching JSON and seeking the file makes a huge difference. – KillerKode Dec 10 '20 at 20:36
  • @KillerKode Wow that's quite a difference; nice! I think I'll leave point 2 because even though it's not the best solution for you, if someone else comes across this question and has a medium-sized-or-less file _that will be changed_ it would probably be the better alternative. – ssp Dec 11 '20 at 00:23
  • @KillerKode ok upon further investigation it looks like `linecache` is simply not a great solution in general. It can't read binary files for example and its main purpose is within the Python source code itself. I will update my answer. – ssp Dec 11 '20 at 00:56
0

This is sadly not possible, for a simple reason: lines do not exist. What your text editor shows you as separate lines is just pieces of text with a newline character between them (you can type it with \n in Python). If all lines have the same length, you can compute the byte offset of any line and seek straight to it, but I assume that is not the case here.
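
For the equal-length case, a minimal sketch could look like this. It assumes every line, including its trailing newline, occupies exactly `record_len` bytes; the function name and parameters are hypothetical.

```python
def read_fixed_width_line(path, lineno, record_len):
    """Return 0-based line `lineno` as bytes from a file of equal-length lines.

    Assumption: every line, newline included, is exactly `record_len` bytes.
    """
    with open(path, "rb") as f:
        f.seek(lineno * record_len)  # jump straight to the start of the line
        return f.read(record_len)
```

Nothing before the requested line is read, because its position follows directly from the line number.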

The least reading is done if you read only up to and including the line you want. That means you should not use read or readlines, which load the whole file. Instead, call readline repeatedly to read and discard the unneeded lines, then call it once more to get the line you want. That is probably the most efficient way.
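
A rough sketch of that idea, with a made-up function name and 0-based line numbers:

```python
def read_line_sequential(path, lineno):
    """Return 0-based line `lineno`, or None if the file has fewer lines."""
    with open(path) as f:
        for _ in range(lineno):
            if not f.readline():  # empty string means end of file
                return None
        line = f.readline()
        return line if line else None
```

Python stops requesting data once the wanted line has been returned, although the underlying buffering may already have read slightly past it.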

DownloadPizza