
I have a huge text file like this (input.txt):

word1 word2 word3 word2
word5
word6 word7 word8 word9 word10

I want to transform it to have 3 words per line like this (output.txt):

word1 word2 word3 
word2 word5 word6 
word7 word8 word9 
word10

Of course, the last line could have fewer than 3 words.

Note 1: 3 is a parameter (a more realistic value is 200)

Note 2: words are separated by spaces, so they can be obtained with split(" ")

I have a solution that works when I can load the whole input.txt into memory and process it, but my input.txt is around 300 GB, so it doesn't fit. Loading the whole input.txt isn't necessary, as I think it can be processed in a streaming fashion with no real memory problem, but it shouldn't take ages either.

A pure Python solution would be great, but a more performant or concise solution using a popular library would also be fine.

Quant Christo
  • This sounds contradictory to me: `I've solution that works when I can load whole input.txt into memory and process it, but my input.txt is around 300GB so it doesn't fit.` – Fiddling Bits Aug 11 '22 at 19:19
  • It would be easier to convert the original file into several files limited to a certain number of lines. Is that not an option? – Fiddling Bits Aug 11 '22 at 19:21
  • If I have a file of size e.g. less than 1GB, I can load it whole into memory and then it's basically fiddling with a list of lists, but I can't load 300GB, so it needs to be done "bit by bit" – Quant Christo Aug 11 '22 at 19:23
  • Your best bet is to use the syntax `with open("file.txt") as f: for line in f: ...`, as described in [the doc](https://docs.python.org/3.10/library/io.html#io.IOBase.readlines) – crissal Aug 11 '22 at 19:24
  • @crissal I was hoping there was a "higher level solution"; processing line by line would be a little tricky, as I need to accumulate enough words to create a line in the output file – Quant Christo Aug 11 '22 at 19:28
  • You can also `f.read()` any amount of bytes you like, even `1`, and build your own text parser around it (a sketch of this approach follows these comments) – crissal Aug 11 '22 at 19:31
  • @FiddlingBits I think "that works when I can" was intended as "that would work if I could". – Karl Knechtel Aug 12 '22 at 02:09
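
A minimal sketch of the chunk-reading idea from crissal's comment; the chunk size, file names, and the helper name `words_from_chunks` are illustrative assumptions, not part of the original post:

CHUNK_SIZE = 1024 * 1024  # characters per read; purely illustrative
WORDS_PER_LINE = 3

def words_from_chunks(path, chunk_size=CHUNK_SIZE):
    """Yield words one at a time, reading the file in fixed-size chunks."""
    tail = ""  # possible partial word carried over from the previous chunk
    with open(path) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            parts = (tail + chunk).split()
            if chunk[-1].isspace():
                tail = ""
            else:
                # The chunk may have ended in the middle of a word;
                # keep the last piece and complete it on the next read.
                tail = parts.pop() if parts else ""
            yield from parts
    if tail:
        yield tail

with open("output.txt", "w") as f_out:
    group = []
    for word in words_from_chunks("input.txt"):
        group.append(word)
        if len(group) == WORDS_PER_LINE:
            f_out.write(" ".join(group) + "\n")
            group = []
    if group:
        f_out.write(" ".join(group) + "\n")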

2 Answers


Reading the input file line by line and using a deque as a "buffer of words" that you empty when needed should be memory efficient.

from collections import deque

WORDS_PER_LINE = 3

buffer = deque()
with open("input.txt") as f_in, open("output.txt", "wt") as f_out:
    for line in f_in:
        buffer.extend(line.split())
        # Flush a complete output line as soon as enough words have accumulated.
        while len(buffer) >= WORDS_PER_LINE:
            f_out.write(" ".join(buffer.popleft() for _ in range(WORDS_PER_LINE)) + "\n")

    # Write whatever is left over (fewer than WORDS_PER_LINE words).
    if buffer:
        f_out.write(" ".join(buffer) + "\n")
Terry Spotts

Make a lazy iterator of words in the file:

def words_of(fileobj):
    """Given a file-like object, generate its words.
    Does not open or close the file. Yields words from the current point."""
    for line in fileobj:
        yield from line.split()

Then set up to iterate in fixed-size chunks, using a `grouper` that accepts a lazy iterator (this is the recipe from the `itertools` documentation):

from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    """Collect data into non-overlapping fixed-length chunks, padding the last with fillvalue."""
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

Then simply iterate over the input and write to the output:

def regroup(ifile, ofile, size):
    with open(ifile) as f_in, open(ofile, 'w') as f_out:
        for group in grouper(words_of(f_in), size):
            # The last group may be padded with None; drop the padding before joining.
            f_out.write(' '.join(w for w in group if w is not None) + '\n')
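
A call might then look like this (the file names and group size are just examples):

regroup('input.txt', 'output.txt', 200)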
Karl Knechtel