2

I have a huge text file (1 GB), where each "line" is separated by ##.
For example:

## sentence 1 ## sentence 2
## sentence 3

I'm trying to print the file's contents piece by piece, split on the ## separators.

I tried the following code, but the read() call crashes (because of the size of the file).

import re

dataFile = open('post.txt', 'r')
p = re.compile('##(.+)')

iterator = p.finditer(dataFile.read())
for match in iterator:
    print (match.group())

dataFile.close()

Any ideas?

Presen

2 Answers

4

This will read the file in chunks (of chunksize bytes), thus avoiding the memory issues that come from reading too much of the file at once:

import re

def open_delimited(filename, delimiter, *args, **kwargs):
    """
    Yield the pieces of the file split on `delimiter`, reading in chunks.
    http://stackoverflow.com/a/17508761/190597
    """
    with open(filename, *args, **kwargs) as infile:
        chunksize = 10000
        remainder = ''
        # read fixed-size chunks until read() returns '' (end of file)
        for chunk in iter(lambda: infile.read(chunksize), ''):
            pieces = re.split(delimiter, remainder + chunk)
            # the last piece may be cut off mid-record, so hold it back
            for piece in pieces[:-1]:
                yield piece
            remainder = pieces[-1]
        if remainder:
            yield remainder

filename = 'post.txt'
for chunk in open_delimited(filename, '##', 'r'):
    print(chunk)
    print('-'*80)
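Since the extra positional and keyword arguments are forwarded to open(), on Python 3 you can also pass an encoding. Assuming (purely as an example) that the data is UTF-8:

for piece in open_delimited('post.txt', '##', 'r', encoding='utf-8'):
    print(piece.strip())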
unutbu
  • a bit overkill since I don't think his regex ever spans line boundaries, but still a useful tool to have in one's kit. – roippi Aug 12 '13 at 00:46
  • Reading large files line-by-line is too slow. You'll do better by processing the file in chunks. – unutbu Aug 12 '13 at 00:48
1

You can use itertools.islice to read the file a fixed number of lines at a time.

from itertools import islice

buffer = 10000  # number of lines to read per batch

with open('file.txt', 'r') as f:
    while True:
        to_process = list(islice(f, buffer))
        if not to_process:
            break
        # process the to_process list here

buffer is the number of lines you want to read at a time (adjust the value to whatever fits in memory).
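For the original question you would still need to split those batched lines on ## yourself. A minimal sketch along those lines (the batch size and the handling of a record that straddles two batches are my own assumptions, not part of the answer above):

from itertools import islice

buffer = 10000   # lines per batch; pick whatever fits in memory
leftover = ''    # partial record carried over between batches

with open('post.txt', 'r') as f:
    while True:
        batch = list(islice(f, buffer))
        if not batch:
            break
        pieces = (leftover + ''.join(batch)).split('##')
        leftover = pieces.pop()  # may be cut off mid-record
        for piece in pieces:
            if piece.strip():
                print(piece.strip())

if leftover.strip():
    print(leftover.strip())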

Vaibhav Aggarwal