2

I am working with a very large text file (TSV), around 200 million entries. One of the columns is a date and the records are sorted on that date. Now I want to start reading the records from a given date onwards. Currently I just read from the start, which is very slow since I need to read almost 100-150 million records just to reach the first one I want. I was thinking that if I could use binary search to speed it up, I could get there in at most about 28 extra record reads (log2 of 200 million). Does Python allow reading the nth line without caching or reading the lines before it?
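
For concreteness, here is a rough sketch of the bisection idea I have in mind (only a sketch: it assumes the date is the first tab-separated column in m/d/yyyy format, and the file/function names are made up):

import os
from datetime import datetime

def parse_date(raw_line):
    # assumes the date is the first tab-separated column, e.g. "3/18/2011"
    return datetime.strptime(raw_line.split(b'\t', 1)[0].decode(), '%m/%d/%Y')

def offset_of_first_record(path, target):
    """Byte offset of the first line whose date is >= target (file sorted by date)."""
    size = os.path.getsize(path)
    with open(path, 'rb') as f:
        first = f.readline()
        if not first or parse_date(first) >= target:
            return 0
        lo, hi = 0, size
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(mid)
            f.readline()            # discard the partial line we landed in
            line = f.readline()     # first complete line starting after offset mid
            if not line or parse_date(line) >= target:
                hi = mid
            else:
                lo = mid + 1
        f.seek(lo)                  # lo is the start of the last line with an earlier date
        f.readline()                # skip it
        return f.tell()

# usage: find the offset once, then stream the remaining records as usual
# offset = offset_of_first_record('records.tsv', datetime(2011, 3, 18))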

Naman
  • 2,569
  • 4
  • 27
  • 44
  • Unless your lines all have a fixed length, Python doesn't simply know what a line is. It has to read everything to find the `\n` characters which denote the end of a line. Unless you can somehow calculate the byte offset of line endings because your data structure allows that, there's no magic workaround. – deceze Jun 29 '15 at 19:40
  • possible duplicate of [How to jump to a particular line in a huge text file?](http://stackoverflow.com/questions/620367/how-to-jump-to-a-particular-line-in-a-huge-text-file) – NightShadeQueen Jun 29 '15 at 19:41
  • @deceze Yes you are correct, there is no way for Python to know where the '\n' characters are. Unfortunately my current file doesn't have a fixed byte size per line. I will keep this in mind for future purposes. How do you skip lines when you know the byte size of a line? – Naman Jun 29 '15 at 19:44
  • might be useful http://stackoverflow.com/questions/30964244/read-specific-line-in-csv-file-python/30964275#30964275 http://stackoverflow.com/questions/30604459/pull-out-information-from-last-line-from-a-if-else-statement-within-a-for-loop-p/30604600#30604600 – Padraic Cunningham Jun 29 '15 at 19:52
  • @PadraicCunningham I guess linecache reads the entire file into memory. With a file of 44 GB, that's not possible. From all the helpful answers and comments I realize that it might not be possible to do this without reading all the earlier records; it was a big mistake not to keep the line size in bytes fixed. But it's an important lesson to keep in mind going forward. – Naman Jun 29 '15 at 20:17
  • do you have control over the writing of the file? If so include a header with entry length (I'm assuming fixed entry length) then you could use `seek()` – DrBwts Jun 29 '15 at 20:21
  • @DrBwts Doesn't fixed entry length mean fixed line size? Unfortunately, in the files I have, even the entry length changes because the date format is 2/2/2015, not 02/02/2015. :) – Naman Jun 29 '15 at 20:23
  • @Naman, are you checking each column as you go to find the value? – Padraic Cunningham Jun 29 '15 at 20:23
  • @PadraicCunningham I am checking the date column after a csv.DictReader. Is there any faster way than that? – Naman Jun 29 '15 at 20:24
  • @Naman is there any pattern in the dates? Also how are you checking the date? – Padraic Cunningham Jun 29 '15 at 20:24
  • @PadraicCunningham Sorry I don't get the meaning of pattern. They are sorted as I mentioned in my problem description. There will be multiple entries per day. – Naman Jun 29 '15 at 20:26
  • I mean if you know the first date you can consume `n` lines instead of comparing and checking; "consume" runs at C speed, which is probably about as good as you are going to get – Padraic Cunningham Jun 29 '15 at 20:28
  • By pattern I meant some increment in the date that you could calculate roughly how many `n` lines away your date was. How are you performing the date comparison? – Padraic Cunningham Jun 29 '15 at 20:29
  • @PadraicCunningham That does sound interesting. I can consume lines in chunks of n. How do I read 100,000 lines in a chunk without keeping them in memory? Could you please explain that or add it in an answer? I think that sounds like the best alternative. – Naman Jun 29 '15 at 20:32
  • Itertools.islice will let you take n lines at a time. Have a look at the consume recipe in the links (a sketch appears just after these comments). – Padraic Cunningham Jun 29 '15 at 20:37
  • 2
    If you have to do this enough, it might be worth converting the tsv into a database (like sqlite) and putting an index on the columns of interest. – jpkotta Jun 29 '15 at 20:41
  • @Naman Yes, entry length in your case is the same as line length. Don't you have control of the code that writes the files? It would be easy enough to pad out the entries/lines so they were all the same length, or even to change the date format. – DrBwts Jun 29 '15 at 20:52
  • @DrBwts No, I already have the data in such large files. I could read all the files and rewrite the data with a fixed line size, but that exercise doesn't seem worth it to me. It's a good suggestion, though. – Naman Jun 29 '15 at 21:03
  • @jpkotta Actually that's exactly what I am looking into right now. These queries on dates are very limited: just one, and then I will iterate over the remaining lines. Do you think even this will be faster with a DB? If so I can easily switch over to some DB. – Naman Jun 29 '15 at 21:05
  • @Naman, what format are your dates in and how are you checking for the date you want? – Padraic Cunningham Jun 29 '15 at 21:17
  • @PadraicCunningham Date format is simple mm/dd/yyyy with no zero padding (e.g. 3/18/2011). There are other columns as well, which are not constant size. – Naman Jun 29 '15 at 21:21
  • @Naman, are you converting to datetime or checking for an exact string? If you can upload a reasonable snippet of the file I will try a few different methods and see what is the most efficient – Padraic Cunningham Jun 29 '15 at 21:25
  • 1
    @Naman Hard to say. There will obviously be the overhead of importing into the database (once per file). Once it's in the database, I'd guess querying and extracting would be at least as fast as reading the tsv, but I'm not sure. You should make a quick prototype database with dummy data to find out. – jpkotta Jun 29 '15 at 22:58
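
For reference, a small sketch of the islice/"consume" approach referred to in the comments above (the chunk size and file name are placeholders; consume is adapted from the recipe in the itertools documentation):

from itertools import islice

def consume(iterator, n):
    # advance the iterator n steps without keeping the skipped lines in memory
    # (based on the "consume" recipe in the itertools documentation)
    next(islice(iterator, n, n), None)

with open('records.tsv') as f:
    consume(f, 100000)        # skip the first 100,000 lines at C speed
    print(next(f), end='')    # the 100,001st line; keep iterating f from here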

5 Answers

2

If the lines are not fixed length, you are out of luck: something will have to read through the file to find them. If the lines are fixed length, you can open the file, call file.seek(line * linesize), and then read from there.
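
For example, a minimal sketch assuming every line has been padded to a fixed width (the 51-byte record size and the file/function names are made up for illustration):

LINE_SIZE = 51  # every line padded to exactly 51 bytes, including the newline

def read_nth_line(path, n):
    with open(path, 'rb') as f:
        f.seek(n * LINE_SIZE)   # jump straight to line n (0-based) without reading anything before it
        return f.readline().decode()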

Robert Jacobs
  • 3,266
  • 1
  • 20
  • 30
0

If the file to read is big, and you don't want to read the whole file into memory at once:

with open("file") as fp:
    for i, line in enumerate(fp):
        if i == 25:
            pass  # 26th line: do something with `line` here
        elif i == 29:
            pass  # 30th line
        elif i > 29:
            break

Note that i == n-1 for the nth line.

TYZ
  • 8,466
  • 5
  • 29
  • 60
0

You can use the method fileObject.seek(offset[, whence])

# offset -- the position of the read/write pointer within the file.

# whence -- optional, defaults to 0 (absolute file positioning); 1 means seek relative to the current position, and 2 means seek relative to the file's end.


file = open("test.txt", "rb")  # binary mode, so seek() works on exact byte offsets
line_size = 7  # 6 digits plus the newline (use 8 if the file has '\r\n' line endings)
line_number = 5
file.seek(line_number * line_size, 0)  # jump straight to the start of the 6th line
for i in range(5):
    print(file.readline().decode().rstrip())
file.close()

For this code I use the following file:

100101
101102
102103
103104
104105
105106
106107
107108
108109
109110
110111
Adolfo Correa
  • 813
  • 8
  • 17
0

Python has no way to skip "lines" in a file. The best way that I know of is to employ a generator to yield lines based on a certain condition, e.g. date > 'YYYY-MM-DD'. At least this way you reduce memory usage and time spent on I/O.

example:

# using Python 3 parameter type annotations

from datetime import datetime

def yield_right_dates(filepath: str, mydate: datetime):
    with open(filepath, 'r') as myfile:
        for line in myfile:
            # assume:
            #    the file is tab separated (because .tsv is the extension)
            #    the date column has column-index == 0
            #    the date format is '%Y-%m-%d'
            line_splt = line.split('\t')
            if datetime.strptime(line_splt[0], '%Y-%m-%d') > mydate:
                yield line_splt

my_file_gen = yield_right_dates(filepath='/path/to/my/file', mydate=datetime(2015, 1, 1))
# then you can do whatever processing you need on the stream, or put it in one giant list.
desired_lines = [line for line in my_file_gen]

But this is still limiting you to one processor :(

Assuming you're on a unix-like system and bash is your shell, I would split the file using the shell utility split, then use multiprocessing and the generator defined above.
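
A rough sketch of that idea (the chunk size, file names and pool size are illustrative; it assumes the chunks were produced with something like split -l 10000000 huge.tsv chunk_ and that yield_right_dates above is defined in the same module):

import glob
from datetime import datetime
from multiprocessing import Pool

def collect_right_dates(chunk_path):
    # run the generator defined above over a single chunk
    return list(yield_right_dates(chunk_path, datetime(2015, 1, 1)))

if __name__ == '__main__':
    with Pool(processes=4) as pool:
        per_chunk = pool.map(collect_right_dates, sorted(glob.glob('chunk_*')))
    desired_lines = [line for chunk in per_chunk for line in chunk]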

I don't have a large file to test with right now, but I'll update this answer later with a benchmark on iterating it whole, vs. splitting and then iterating it with the generator and multiprocessing module.

With greater knowledge of the file (e.g. if all the desired dates are clustered at the beginning, middle, or end), you might be able to optimize the read further.

Haleemur Ali
  • 26,718
  • 5
  • 61
  • 85
0

As others have commented, Python doesn't support this, as it doesn't know where lines start and end (unless they're fixed length). If you're doing this repeatedly, I'd recommend either padding out the lines to a constant length (if practical) or, failing that, reading them into some kind of basic database. You'll take a bit of a hit on storage size, but unless you're only indexing once in a blue moon it'll probably be worth it.
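
For example, a minimal sketch of the sqlite route (the two-column schema, file names and ISO date format are assumptions; storing the date as YYYY-MM-DD text makes string comparison match chronological order, so the index is usable):

import csv
import sqlite3

conn = sqlite3.connect('records.db')
conn.execute('CREATE TABLE IF NOT EXISTS records (date TEXT, value TEXT)')

# one-off import; assumes two tab-separated columns with an ISO-formatted date first
with open('records.tsv', newline='') as f:
    conn.executemany('INSERT INTO records VALUES (?, ?)', csv.reader(f, delimiter='\t'))
conn.execute('CREATE INDEX IF NOT EXISTS idx_date ON records (date)')
conn.commit()

# later: the index lets you jump straight to the date of interest
for row in conn.execute('SELECT * FROM records WHERE date >= ?', ('2011-03-18',)):
    pass  # process the row here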

If space is a big concern and padding isn't possible, you could also add a line-number tag at the start of each line. You would have to guess the size of the jumps and then parse a sample line to check where you landed, but that would let you build a search algorithm that finds the right line quickly, at a cost of only around 10 extra characters per line.

Pioneer_11
  • 670
  • 4
  • 19