
I am trying to read and process a large file in chunks with Python. I am following this blog, which proposes a very fast way of reading and processing large chunks of data spread over multiple processes. I have only slightly updated the existing code, i.e. using stat(fin).st_size instead of os.path.getsize. In the example below I also haven't implemented multiprocessing, as the issue manifests itself in a single process as well. That makes it easier to debug.

The issue that I am having with this code is that it returns broken sentences. This makes sense: the pointers do not take line endings into account and just return some given byte size. In practice, one would assume that you could solve this by leaving out the last item in the fetched batch of lines, as that would most probably be the broken line. Unfortunately, that does not work reliably either.

from os import stat


def chunkify(pfin, buf_size=1024):
    file_end = stat(pfin).st_size
    with open(pfin, 'rb') as f:
        chunk_end = f.tell()

        while True:
            chunk_start = chunk_end
            f.seek(buf_size, 1)
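            # read on to the end of the current line so the chunk boundary lands on a newline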
            f.readline()
            chunk_end = f.tell()
            yield chunk_start, chunk_end - chunk_start

            if chunk_end > file_end:
                break


def process_batch(pfin, chunk_start, chunk_size):
    with open(pfin, 'r', encoding='utf-8') as f:
        f.seek(chunk_start)
        batch = f.read(chunk_size).splitlines()

    # changing this to batch[:-1] will result in 26 lines total
    return batch


if __name__ == '__main__':
    fin = r'data/tiny.txt'
    lines_n = 0
    for start, size in chunkify(fin):
        lines = process_batch(fin, start, size)
        # Uncomment to see broken lines
        # for line in lines:
        #    print(line)
        # print('\n')
        lines_n += len(lines)

    print(lines_n)
    # 29

The code above will print 29 as the total of processed lines. When you do not return the last item of the batch, naively assuming that that is a broken line anyway, you'll get 26. The actual number of lines is 27. The testing data can be found below.

She returned bearing mixed lessons from a society where the tools of democracy still worked.
If you think you can sense a "but" approaching, you are right.
Elsewhere, Germany take on Brazil and Argentina face Spain, possibly without Lionel Messi.
What sort of things do YOU remember best?'
Less than three weeks after taking over from Lotz at Wolfsburg.
The buildings include the Dr. John Micallef Memorial Library.
For women who do not have the genes, the risk drops to just 2% for ovarian cancer and 12% for breast cancer.
In one interview he claimed it was from the name of the Cornish language ("Kernewek").
8 Goldschmidt was out of office between 16 and 19 July 1970.
Last year a new law allowed police to shut any bar based on security concerns.
But, Frum explains: "Glenn Beck takes it into his head that this guy is bad news."
Carrying on the Romantic tradition of landscape painting.
This area has miles of undeveloped beach adjacent to the headlands.
The EAC was created in 2002 to help avoid a repeat of the disputed 2000 presidential election.
In May 1945, remnants of the German Army continue fight on in the Harz mountains, nicknamed "The Void" by American troops.
Dietler also said Abu El Haj was being opposed because she is of Palestinian descent.
The auction highlights AstraZeneca's current focus on boosting returns to shareholders as it heads into a wave of patent expiries on some of its biggest selling medicines including Nexium, for heartburn and stomach ulcers, and Seroquel for schizophrenia and bipolar disorder.
GAAP operating profit was $13.2 million and $7.1 million in the second quarter of 2008 and 2007, respectively.
Doc, Ira, and Rene are sent home as part of the seventh bond tour.
only I am sick of always hearing him called the Just.
Also there is Meghna River in the west of Brahmanbaria.
The explosives were the equivalent of more than three kilograms of dynamite - equal to 30 grenades," explained security advisor Markiyan Lubkivsky to reporters gathered for a news conference in Kyiv.
Her mother first took her daughter swimming at the age of three to help her with her cerebal palsy.
A U.S. aircraft carrier, the USS "Ticonderoga", was also stationed nearby.
Louis shocked fans when he unexpectedly confirmed he was expecting a child in summer 2015.
99, pp.
Sep 19: Eibar (h) WON 6-1

If you print out the created lines, you'll see that, indeed, broken sentences occur. I find this odd: shouldn't f.readline() ensure that the file is read until the next line ending? In the output below, the empty line separates two batches. That means that you cannot check a line against the next line in a batch and remove it if it's a substring - the broken sentence belongs to a different batch than the full sentence.

...
This area has miles of undeveloped beach adjacent to the headlands.
The EAC was created in 2002 to help avoid a repeat of the disputed 2000 presidential election.
In May 1945, r


In May 1945, remnants of the German Army continue fight on in the Harz mountains, nicknamed "The Void" by American troops.
...

Is there a way to get rid of these broken sentences, without removing too much?

You can download a larger test file (100,000 lines) here.


After a lot of digging, it seems that some inaccessible buffer is actually responsible for the inconsistent behaviour of seek, as discussed here and here. I tried out the proposed solution of using iter(f.readline, '') with seek, but that still gives me inconsistent results. I have updated my code to return the file pointer after each batch of 1500 lines, but in reality the returned batches will overlap.

from os import stat
from functools import partial


def chunkify(pfin, max_lines=1500):
    file_end = stat(pfin).st_size
    with open(pfin, 'r', encoding='utf-8') as f:
        chunk_end = f.tell()

        for idx, l in enumerate(iter(f.readline, '')):
            if idx % max_lines == 0:
                chunk_start = chunk_end
                chunk_end = f.tell()
                # yield start position and size
                yield chunk_start, chunk_end - chunk_start

    chunk_start = chunk_end
    yield chunk_start, file_end


def process_batch(pfin, chunk_start, chunk_size):
    with open(pfin, 'r', encoding='utf-8') as f:
        f.seek(chunk_start)
        chunk = f.read(chunk_size).splitlines()

    batch = list(filter(None, chunk))

    return batch


if __name__ == '__main__':
    fin = r'data/100000-ep+gutenberg+news+wiki.txt'

    process_func = partial(process_batch, fin)
    lines_n = 0

    prev_last = ''
    for start, size in chunkify(fin):
        lines = process_func(start, size)

        if not lines:
            continue

        # print first and last ten sentences of batch
        for line in lines[:10]:
            print(line)
        print('...')
        for line in lines[-10:]:
            print(line)
        print('\n')

        lines_n += len(lines)

    print(lines_n)

An example of overlapping batches is below. The first two and a half sentences of the last batch are duplicated from the last sentences of the batch before. I don't know how to explain or solve this.

...
The EC ordered the SFA to conduct probes by June 30 and to have them confirmed by a certifying authority or it would deduct a part of the funding or the entire sum from upcoming EU subsidy payments.
Dinner for two, with wine, 250 lari.
It lies a few kilometres north of the slightly higher Weissmies and also close to the slightly lower Fletschhorn on the north.
For the rest we reached agreement and it was never by chance.
Chicago Blackhawks defeat Columbus Blue Jackets for 50th win
The only drawback in a personality that large is that no one els


For the rest we reached agreement and it was never by chance.
Chicago Blackhawks defeat Columbus Blue Jackets for 50th win
The only drawback in a personality that large is that no one else, whatever their insights or artistic pedigree, is quite as interesting.
Sajid Nadiadwala's reboot version of his cult classic "Judwaa", once again directed by David Dhawan titled "Judwaa 2" broke the dry spell running at the box office in 2017.
They warned that there will be a breaking point, although it is not clear what that would be.
...

In addition to this, I have also tried removing the readline from the original code, and keeping track of a remaining, incomplete chunk. The incomplete chunk is then passed to the next chunk and added to its front. The issue that I am running into now is that, because the text is read in byte chunks, a chunk can end without completely finishing a character's bytes. This will lead to decoding errors.
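A minimal, standalone sketch of what goes wrong ('é' encodes to two bytes in UTF-8; cutting a chunk between those bytes breaks decoding, while rejoining the raw bytes first decodes fine):

data = 'café'.encode('utf-8')            # b'caf\xc3\xa9' - 'é' is two bytes
first, second = data[:4], data[4:]       # the cut lands inside 'é'
try:
    first.decode('utf-8')
except UnicodeDecodeError as e:
    print(e)                             # unexpected end of data
print((first + second).decode('utf-8'))  # café - rejoined bytes decode fine

With that in mind, this is the updated code: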

from os import stat


def chunkify(pfin, buf_size=1024):
    file_end = stat(pfin).st_size
    with open(pfin, 'rb') as f:
        chunk_end = f.tell()

        while True:
            chunk_start = chunk_end
            f.seek(buf_size, 1)
            chunk_end = f.tell()
            is_last = chunk_end >= file_end
            # yield start position, size, and is_last
            yield chunk_start, chunk_end - chunk_start, is_last

            if is_last:
                break


def process_batch(pfin, chunk_start, chunk_size, is_last, leftover):
    with open(pfin, 'r', encoding='utf-8') as f:
        f.seek(chunk_start)
        chunk = f.read(chunk_size)

    # Add previous leftover to current chunk
    chunk = leftover + chunk
    batch = chunk.splitlines()
    batch = list(filter(None, batch))

    # If this chunk is not the last one,
    # pop the last item as that will be an incomplete sentence
    # We return this leftover to use in the next chunk
    if not is_last:
        leftover = batch.pop(-1)

    return batch, leftover


if __name__ == '__main__':
    fin = r'ep+gutenberg+news+wiki.txt'

    lines_n = 0
    left = ''
    for start, size, last in chunkify(fin):
        lines, left = process_batch(fin, start, size, last, left)

        if not lines:
            continue

        for line in lines:
            print(line)
        print('\n')

        numberlines = len(lines)
        lines_n += numberlines

    print(lines_n)

Running the code above will inevitably result in a UnicodeDecodeError.

Traceback (most recent call last):
  File "chunk_tester.py", line 46, in <module>
    lines, left = process_batch(fin, start, size, last, left)
  File "chunk_tester.py", line 24, in process_batch
    chunk = f.read(chunk_size)
  File "lib\codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 0: invalid start byte
Bram Vanroy
  • From this, I infer that you have failed to get a comparable failure with the smaller file, and the points of discrepancy change from run to run? This still doesn't look like a MCVE. – Prune Apr 24 '19 at 22:59
  • @Prune The results _are_ reproducible, i.e. for the same file they are identical with every run. However, the behavior does seem to differ depending on the buf_size, as explained in the last paragraph. Still, my errors are reproducible, making this a MCVE. – Bram Vanroy Apr 25 '19 at 06:30
  • @MisterMiyagi Again, the goal and topic is the same. You didn't have to delete your answer as it still applies; it just doesn't work. The goal is to read a file as chunks, get the file pointers' positions of these chunks, pass them down to a function, and then read the actual sentences in those chunks without overlapping between different chunks or leaving text out. – Bram Vanroy Apr 25 '19 at 14:15
  • Have you tried using `ctypes.CDLL(ctypes.util.find_library('c'))` directly? Or, alternatively, handling the data as `bytes` until you're completely certain you've got a line, and only _then_ converting it to `str`? – wizzwizz4 Apr 27 '19 at 10:53
  • @wizzwizz4 I haven't tried such approach, no. – Bram Vanroy Apr 27 '19 at 19:06
  • @BramVanroy Well, it works! The meta effect finally achieved something _positive_. (You wasted a bounty, mate. :-p) – wizzwizz4 Apr 27 '19 at 19:34

3 Answers


You were so close! A relatively simple change to your final code (reading in the data as bytes and not str) makes it all (almost) work.

The main issue is that reading from binary files counts bytes, while reading from text files counts characters; you did your first counting in bytes and your second in characters, so your assumptions about what data had already been read were wrong. It's nothing to do with an internal, hidden buffer.
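A quick sketch of that mismatch, using io.BytesIO to stand in for a file on disk:

import io

raw = io.BytesIO('héllo\nwörld\n'.encode('utf-8'))  # 14 bytes, 12 characters
f = io.TextIOWrapper(raw, encoding='utf-8')

print(len(raw.getvalue()))  # 14 - bytes, the unit chunkify's offsets are in
print(len(f.read(5)))       # 5  - but read(5) consumes 5 *characters*: 'héllo'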

Other changes:

  • The code needs to split on b'\n' instead of using bytes.splitlines(), and blank lines should only be removed after the leftover detection (see the sketch after this list).
  • Unless the size of the file changes (in which case your existing code will break anyway), chunkify can be replaced by a simpler, faster loop that's functionally identical without having to keep the file open.
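Why split(b'\n') matters, in a small sketch: split keeps an empty last element when a chunk ends exactly at a newline, so the leftover logic still sees a clean cut, whereas splitlines() silently drops that information (and two unrelated lines would get glued together across chunks).

chunk = b'full line\npartial li'
print(chunk.split(b'\n'))  # [b'full line', b'partial li'] - last item is the leftover
chunk = b'full line\n'
print(chunk.split(b'\n'))  # [b'full line', b''] - the empty leftover marks a clean cut
print(chunk.splitlines())  # [b'full line'] - the clean-cut marker is lost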

This gives the final code:

from os import stat

def chunkify(pfin, buf_size=1024**2):
    file_end = stat(pfin).st_size

    i = -buf_size
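    # (i's start value above makes the final yield correct when the whole file is smaller than buf_size)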
    for i in range(0, file_end - buf_size, buf_size):
        yield i, buf_size, False

    leftover = file_end % buf_size
    if leftover == 0:  # if the last section is buf_size in size
        leftover = buf_size
    yield i + buf_size, leftover, True

def process_batch(pfin, chunk_start, chunk_size, is_last, leftover):
    with open(pfin, 'rb') as f:
        f.seek(chunk_start)
        chunk = f.read(chunk_size)

    # Add previous leftover to current chunk
    chunk = leftover + chunk
    batch = chunk.split(b'\n')

    # If this chunk is not the last one,
    # pop the last item as that will be an incomplete sentence
    # We return this leftover to use in the next chunk
    if not is_last:
        leftover = batch.pop(-1)

    return [s.decode('utf-8') for s in filter(None, batch)], leftover


if __name__ == '__main__':
    fin = r'ep+gutenberg+news+wiki.txt'

    lines_n = 0
    left = b''
    for start, size, last in chunkify(fin):
        lines, left = process_batch(fin, start, size, last, left)

        if not lines:
            continue

        for line in lines:
            print(line)
        print('\n')

        numberlines = len(lines)
        lines_n += numberlines

    print(lines_n)
wizzwizz4
  • if you put a large chunk size into this code, and vary it, it returns an inconsistent number of lines in total (100,000, 99,999 etc.) I used chunk sizes of 1_000_000 and 1_000_003 on the Gutenberg file to verify this. My guess is it isn't handling the case where more than one chunk in succession falls across Unicode boundaries. Also note that there is a problem with the Gutenberg file when you use `readlines()`: `UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 5608: character maps to ` – Nic Apr 27 '19 at 20:25
  • @Nick No, that's a problem with the `process_batch` loop I think; if that happens to line up so that the chunk ends at a line break, I think it joins the lines together. – wizzwizz4 Apr 27 '19 at 20:32
  • Yup, just checked; this _is_ the case. It's easiest to make occur on the first line, by setting the buffer size to one more than the first line. – wizzwizz4 Apr 27 '19 at 20:33
  • @Nick I still don't understand why you had a `'charmap' codec` error; this code uses the `'utf-8'` codec. – wizzwizz4 Apr 27 '19 at 20:43
  • @Nick Wait… Were you doing your tests in the Windows console using a pre-3.6 version of Python? If so, we're on something like 3.8 now – time to upgrade. – wizzwizz4 Apr 27 '19 at 20:47
  • no, using Python 3.7. I'm on Windows using Intellij and copy-pasted the gutenberg text into a local Windows file from the browser. Just confirming that your final code doesn't have the inconsistent lines problem :) – Nic Apr 27 '19 at 21:47
  • Thanks, this works well. I have proposed an edit which ensures that the function still works when buf_size is larger than the file (otherwise you'd get an UnboundLocalError). – Bram Vanroy Apr 28 '19 at 17:58
  • Also, I'm not sure why you would need the lines `leftover = file_end % ...` and the following if-statement. It works fine without because if a chunk's given size is larger than the remainder of the file, you'll just get the final contents back. No errors. – Bram Vanroy Apr 28 '19 at 18:08
  • @BramVanroy Your patch was buggy. I've fixed it. Thanks, though! – wizzwizz4 Apr 28 '19 at 18:25
  • Buggy in what way? Could you elaborate? Seems to work well here. – Bram Vanroy Apr 28 '19 at 18:38
  • @BramVanroy I can't remember, but it would either output the file contents twice or add an extra blank chunk at the end. – wizzwizz4 Apr 28 '19 at 18:40
  • I can't reproduce what you are referring to. As provided, the solution does not work with files smaller than the buffer size. – Bram Vanroy Apr 28 '19 at 18:43
  • @BramVanroy Yeah; you almost completely fixed it. Iirc, your original fix made two empty newlines at the end. – wizzwizz4 Apr 28 '19 at 18:45
  • You don't understand. Your 'fix of my fix' is not working: you end up with 0 results if the buf_size is larger than the file. – Bram Vanroy Apr 28 '19 at 18:53
  • @BramVanroy Yup, you're right again. I fixed the bug in the fixing of the bug in the bug fix. – wizzwizz4 Apr 28 '19 at 21:22
  • I based [a small repo](https://github.com/BramVanroy/spacy-extreme) on this, dealing with memory issues in spaCy. – Bram Vanroy May 05 '19 at 13:08
  • @BramVanroy Nice! – wizzwizz4 May 05 '19 at 13:22

You have an interesting problem here. You have n processes that are each given the location of a chunk of data to process, but you can't provide the exact location of the chunks, because you are dealing with lines and your locations are in bytes. Even if you split the file into lines to get the precise locations of the chunks, you still run into issues.

Here's a solution that is suboptimal (I assume that you do not want to process the lines sequentially; that much seems obvious):

  • cut the file in chunks as in your first try;
  • for each chunk, find the first and the last line feed. The chunk format is B\nM\nA, where B (before) and A (after) do not contain any line feed, but M may contain line feeds;
  • process the lines in M and put B\nA in a list at the current chunk index;
  • finally, process all B\nA elements.

This is suboptimal because once you have processed every M, you still have to process all the B\nA parts, and that last bit of work must wait for the other processes to complete.

Here's the code:

def chunkify(file_end, buf_size=1024):
    """Yield chunks of `buf_size` bytes"""
    for chunk_start in range(0, file_end, buf_size):
        yield chunk_start, min(buf_size, file_end - chunk_start)

def process_batch(remainders, i, f, chunk_start, chunk_size):
    """Process a chunk"""
    f.seek(chunk_start)
    chunk = f.read(chunk_size)
    chunk, remainders[i] = normalize(chunk)
    # process chunk here if chunk is not None
    return chunk

def normalize(chunk):
    """Return `M, B\\nA`
    The chunk format is `B\\nM\\nA` where `B` (before) and `A` (after) do not contains any line feed,
    but `M` may contain line feeds"""
    i = chunk.find(b"\n")
    j = chunk.rfind(b"\n")
    if i == -1 or i == j:
        return None, chunk
    else:
        return chunk[i+1:j], chunk[:i]+chunk[j:]

Note that if the chunk has no middle (no M part), we return None as the chunk and everything is sent to the remainders.
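For instance, on a tiny made-up chunk:

chunk = b"B-part\nline 1\nline 2\nA-part"
m, rest = normalize(chunk)
# m    == b"line 1\nline 2" - complete lines, safe to process immediately
# rest == b"B-part\nA-part" - goes to the remainders for the second pass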

Some tests:

text = """She returned bearing mixed lessons from a society where the tools of democracy still worked.
If you think you can sense a "but" approaching, you are right.
Elsewhere, Germany take on Brazil and Argentina face Spain, possibly without Lionel Messi.
What sort of things do YOU remember best?'
Less than three weeks after taking over from Lotz at Wolfsburg.
The buildings include the Dr. John Micallef Memorial Library.
For women who do not have the genes, the risk drops to just 2% for ovarian cancer and 12% for breast cancer.
In one interview he claimed it was from the name of the Cornish language ("Kernewek").
8 Goldschmidt was out of office between 16 and 19 July 1970.
Last year a new law allowed police to shut any bar based on security concerns.
But, Frum explains: "Glenn Beck takes it into his head that this guy is bad news."
Carrying on the Romantic tradition of landscape painting.
This area has miles of undeveloped beach adjacent to the headlands.
The EAC was created in 2002 to help avoid a repeat of the disputed 2000 presidential election.
In May 1945, remnants of the German Army continue fight on in the Harz mountains, nicknamed "The Void" by American troops.
Dietler also said Abu El Haj was being opposed because she is of Palestinian descent.
The auction highlights AstraZeneca's current focus on boosting returns to shareholders as it heads into a wave of patent expiries on some of its biggest selling medicines including Nexium, for heartburn and stomach ulcers, and Seroquel for schizophrenia and bipolar disorder.
GAAP operating profit was $13.2 million and $7.1 million in the second quarter of 2008 and 2007, respectively.
Doc, Ira, and Rene are sent home as part of the seventh bond tour.
only I am sick of always hearing him called the Just.
Also there is Meghna River in the west of Brahmanbaria.
The explosives were the equivalent of more than three kilograms of dynamite - equal to 30 grenades," explained security advisor Markiyan Lubkivsky to reporters gathered for a news conference in Kyiv.
Her mother first took her daughter swimming at the age of three to help her with her cerebal palsy.
A U.S. aircraft carrier, the USS "Ticonderoga", was also stationed nearby.
Louis shocked fans when he unexpectedly confirmed he was expecting a child in summer 2015.
99, pp.
Sep 19: Eibar (h) WON 6-1"""

import io, os

def get_line_count(chunk):
    return 0 if chunk is None else len(chunk.split(b"\n"))

def process(f, buf_size):
    f.seek(0, os.SEEK_END)
    file_end = f.tell()
    remainders = [b""]*(file_end//buf_size + 1)
    L = 0
    for i, (start, n) in enumerate(chunkify(file_end, buf_size)):
        chunk = process_batch(remainders, i, f, start, n)
        L += get_line_count(chunk)

    print("first pass: lines processed", L)
    print("remainders", remainders)
    last_chunk = b"".join(remainders)
    print("size of last chunk {} bytes, {} lines".format(len(last_chunk), get_line_count(last_chunk)))
    L += get_line_count(last_chunk)
    print("second pass: lines processed", L)

process(io.BytesIO(bytes(text, "utf-8")), 256)
process(io.BytesIO(bytes(text, "utf-8")), 512)

with open("/home/jferard/prog/stackoverlfow/ep+gutenberg+news+wiki.txt", 'rb') as f:
    process(f, 4096)
with open("/home/jferard/prog/stackoverlfow/ep+gutenberg+news+wiki.txt", 'rb') as f:
    process(f, 16384)

Output:

first pass: lines processed 18
remainders [b'She returned bearing mixed lessons from a society where the tools of democracy still worked.\nWhat sort', b" of things do YOU remember best?'\nFor women who do not have the genes, the risk drops to just 2% for ovarian cancer and 12% for br", b'east cancer.\nBut, Frum explai', b'ns: "Glenn Beck takes it into his head that this guy is bad news."\nThe EAC was created in 2002 to help avoid a repeat of the dispu', b'ted 2000 presidential election.\nThe auction hig', b"hlights AstraZeneca's current focus on boosting returns to shareholders as it heads into a wave of patent expiries on some of its biggest selling medicines including Nexium, for heartburn and stomach ulcers, and Seroquel for schizophrenia and bipolar disor", b'der.\nAlso there is Meghn', b'a River in the west of Brahmanbaria.\nHer mother first to', b'ok her daughter swimming at the age of three to help her with her cerebal palsy.\nS', b'ep 19: Eibar (h) WON 6-1']
size of last chunk 880 bytes, 9 lines
second pass: lines processed 27

first pass: lines processed 21
remainders [b'She returned bearing mixed lessons from a society where the tools of democracy still worked.\nFor women who do not have the genes, the risk drops to just 2% for ovarian cancer and 12% for br', b'east cancer.\nThe EAC was created in 2002 to help avoid a repeat of the dispu', b"ted 2000 presidential election.\nThe auction highlights AstraZeneca's current focus on boosting returns to shareholders as it heads into a wave of patent expiries on some of its biggest selling medicines including Nexium, for heartburn and stomach ulcers, and Seroquel for schizophrenia and bipolar disor", b'der.\nHer mother first to', b'ok her daughter swimming at the age of three to help her with her cerebal palsy.\nSep 19: Eibar (h) WON 6-1']
size of last chunk 698 bytes, 6 lines
second pass: lines processed 27

first pass: lines processed 96963
remainders [b'She returned bearing mixed lessons from a society where the tools of democracy still worked, but where the native Dutch were often less than warm to her and her fellow exiles.\nOne of the Ffarquhar ', ...,  b'the old device, Apple will give customers a gift card that can be applied toward the purchase of the new iPhone.']
size of last chunk 517905 bytes, 3037 lines
second pass: lines processed 100000

first pass: lines processed 99240
remainders [b'She returned bearing mixed lessons from a society where the tools of democracy still worked, but where the native Dutch were often less than warm to her and her fellow exiles.\nSoon Carroll was in push-up position walking her hands tow', b'ard the mirror at one side of the room while her feet were dragged along by the casual dinnerware.\nThe track "Getaway" was inspired by and allud', ..., b'the old device, Apple will give customers a gift card that can be applied toward the purchase of the new iPhone.']
size of last chunk 130259 bytes, 760 lines
second pass: lines processed 100000

The last example shows that you can process 99,240 out of 100,000 lines in parallel, but you then have to process the last 760 lines (130 KiB) after all the processes are complete.

Note on concurrency: each subprocess owns a fixed cell of the remainders list, hence there should be no memory corruption. It might be cleaner to store each remainder in its own process object (a wrapper around the real subprocess) and join all the remainders once the processes are finished.
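A hedged sketch of that parallel scheme with multiprocessing.Pool, reusing chunkify and normalize from above (the worker function, file name and buffer size are my assumptions, not a tested implementation):

from functools import partial
from multiprocessing import Pool
from os import stat

def worker(path, args):
    # each worker reads and normalizes its own chunk; the remainder is
    # returned rather than written to shared state
    start, n = args
    with open(path, 'rb') as f:
        f.seek(start)
        chunk = f.read(n)
    middle, remainder = normalize(chunk)
    lines = 0 if middle is None else len(middle.split(b"\n"))
    return lines, remainder

if __name__ == '__main__':
    path = 'ep+gutenberg+news+wiki.txt'
    file_end = stat(path).st_size
    with Pool() as pool:
        results = pool.map(partial(worker, path), chunkify(file_end, 16384))
    # pool.map preserves input order, so joining the remainders keeps file order
    lines_n = sum(n for n, _ in results)
    last_chunk = b"".join(r for _, r in results)
    lines_n += len(last_chunk.split(b"\n"))  # second, sequential pass
    print(lines_n)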

jferard
  • This is in fact exactly the solution that I came up with as well, following the accepted solution and further optimising it. You can find it in [this repo](https://github.com/BramVanroy/spacy-extreme). – Bram Vanroy May 05 '19 at 13:09
  • @BramVanroy Maybe there is something I don't understand, but I think my solution is different because I tried to avoid something like the line `lines, left = process_batch(fin, start, size, last, left)` (in your solution). I don't see how you will manage to execute multiple `process_batch` concurrently, given the second instance needs (or seems to need) the `left` of the first instance, etc. – jferard May 05 '19 at 14:20
  • I used the accepted solution as a starting point, and created something different (see the link that I posted). That thing that I created, based on the accepted answer, is almost identical to your proposed solution. We independently got the same solution. – Bram Vanroy May 05 '19 at 15:18
  • @BramVanroy Okay, I didn't understand. That's cool! (BTW I had never heard of spaCy before.) – jferard May 05 '19 at 17:07

When files are opened in text mode (your second code example), read treats its size argument as a "number of characters" (not bytes), but seek and tell are only related to the current position in the file for an "empty buffer", so:

  • you can calculate the chunk size (for use by read) from len(l)
  • using file_end = stat(pfin).st_size to calculate the size of the last chunk is not correct (because for the utf-8 encoding, the number of characters for non-Latin alphabets may not equal the number of bytes used)

  • f.tell() still can't be used to calculate chunk sizes, but it gives a correct result for chunk_start. I think this is somehow related to the buffering of TextIOWrapper: tell gives info about the buffer+decoder state, and not about the real position in the text stream. You can look at the reference implementation (def _read_chunk, def tell) and see that it's all complicated and that no one should trust deltas calculated from different tell/seek calls ("# Grab all the decoded text (we will rewind any extra bits later)." gives another hint about the reasons for the "incorrect" positions)

Seek/tell work correctly for "seeking", but can't be used to calculate the number of characters between tells (and even the number of bytes would not be correct). To get correct byte deltas, binary unbuffered mode should be used (with open(path, 'rb', buffering=0) as f: ...), but in that case the developer has to ensure that all reads return "full characters" (in "utf-8", different characters have different byte lengths).
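A small sketch of that mismatch (io.BytesIO stands in for a real file; the exact tell() value is an implementation detail, an opaque cookie rather than a character count):

import io

raw = io.BytesIO('αβγ\nabc\n'.encode('utf-8'))
f = io.TextIOWrapper(raw, encoding='utf-8')

line = f.readline()  # 'αβγ\n' - 4 characters, 7 bytes in utf-8
print(len(line))     # 4 - what read() needs, hence chunk_size += len(l)
print(f.tell())      # 7 here (tied to the byte position), not 4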

But simply using chunk_size += len(l) solves all the problems, so you can keep opening the files in text mode! The following modified version of your code seems to work as expected:

from functools import partial


def chunkify(pfin, max_lines=1500):
    with open(pfin, 'r', encoding='utf-8') as f:
        chunk_start = f.tell()
        chunk_size = 0
        done = True

        for idx, l in enumerate(iter(f.readline, '')):
            chunk_size += len(l)
            done = False
            if idx != 0 and idx % max_lines == 0:
                yield chunk_start, chunk_size
                done = True
                chunk_start = f.tell()
                chunk_size = 0

        if not done:
            yield chunk_start, chunk_size


def process_batch(pfin, chunk_start, chunk_size):
    with open(pfin, 'r', encoding='utf-8') as f:
        f.seek(chunk_start)
        chunk = f.read(chunk_size).splitlines()

    batch = list(filter(None, chunk))

    return batch


if __name__ == '__main__':
    fin = r'data/100000-ep+gutenberg+news+wiki.txt'

    process_func = partial(process_batch, fin)
    lines_n = 0

    prev_last = ''
    for start, size in chunkify(fin):
        lines = process_func(start, size)

        if not lines:
            continue

        # print first and last ten sentences of batch
        for line in lines[:10]:
            print(line)
        print('...')
        for line in lines[-10:]:
            print(line)
        print('\n')

        lines_n += len(lines)

    print(lines_n)
imposeren