I'm trying to read a file in chunks of multiple lines. For example, if a file has 100 lines and I want each chunk to have 10 lines, then there should be 10 chunks. I should then be able to iterate over the chunks as follows:

# Here 'read_chunk' function should return a generator.
for chunk in read_chunk(file_path="./file.txt", line_count=10):
    print(chunk)

This is how I attempted it.

from typing import Generator

def read_chunk(
    *,
    file_path: str,
    line_count: int = 10,  # Number of lines per chunk.
) -> Generator[str, None, None]:

    """Read a file in chunks of 'line_count' lines."""

    with open(file_path, "r") as f:
        chunk = []
        for idx, line in enumerate(f):
            if line.strip():
                chunk.append(line)

            if not idx == 0 and idx % line_count == 0:
                yield "\n".join(chunk)
                chunk = []

        # This yields the last chunk.
        yield "\n".join(chunk)

Let's run this on the following file:

# file.txt

* [What is Normalization in DBMS (SQL)? 1NF, 2NF, 3NF, BCNF Database with Example - Richard Peterson](https://www.guru99.com/database-normalization.html) 
-> Normalization roughly means deduplication of data in a table by leveraging foreign keys, multiple tables, and intermediary join tables. 
This article explains it in finer detail.

* [OLTP vs OLAP System](https://www.guru99.com/oltp-vs-olap.html) 
-> OLTP is an online transactional system that manages database modification whereas OLAP is an online analysis and data retrieving process.


for chunk in read_chunk(file_path='./file.txt', line_count=2):
    print('============\n')      # This is to discern between the chunks better.
    print(chunk)
    print('============\n')

This returns:

============

# file.txt

* [What is Normalization in DBMS (SQL)? 1NF, 2NF, 3NF, BCNF Database with Example - Richard Peterson](https://www.guru99.com/database-normalization.html)

============

============

-> Normalization roughly means deduplication of data in a table by leveraging foreign keys, multiple tables, and intermediary join tables.

This article explains it in finer detail.

============

============

* [OLTP vs OLAP System](https://www.guru99.com/oltp-vs-olap.html)

============

============

-> OLTP is an online transactional system that manages database modification whereas OLAP is an online analysis and data retrieving process.

============

The first two chunks look right, but the rest doesn't make sense to me. Shouldn't the last two lines form a single chunk of 2 lines instead of two chunks of 1 line each? Also, is there a better way of doing this?
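One likely explanation: the flush condition tests `idx`, which counts every physical line, including the blank ones that never get appended. With `line_count=2` the chunk is flushed at indices 2, 4 and 6 regardless of how many lines it holds, so a blank line at index 5 leaves the third chunk with a single line. The `not idx == 0` guard also lets the first chunk span indices 0 through `line_count`. A minimal sketch of one possible fix, assuming the goal is `line_count` non-blank lines per chunk, is to count the collected lines instead:

```python
from typing import Iterator


def read_chunk(*, file_path: str, line_count: int = 10) -> Iterator[str]:
    """Yield chunks of up to 'line_count' non-blank lines from a file."""
    with open(file_path, "r") as f:
        chunk = []
        for line in f:
            if not line.strip():
                continue  # Blank lines are skipped and do not count.
            chunk.append(line)
            if len(chunk) == line_count:
                # Each line keeps its own "\n", so join with "" rather than "\n".
                yield "".join(chunk)
                chunk = []
        if chunk:
            # Yield the final, possibly shorter, chunk only if it is non-empty.
            yield "".join(chunk)
```

Joining with `""` also avoids the doubled newlines visible in the output above, and the final `if chunk:` guard prevents an empty trailing chunk when the number of non-blank lines is an exact multiple of `line_count`.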

Redowan Delowar
  • Do you want help with this code or the `itertools.islice` oneliner that does everything? – timgeb Sep 10 '21 at 15:53
  • If there's a better way of doing this, then sure. Please add your answer with the `itertools.islice` solution. Thanks! – Redowan Delowar Sep 10 '21 at 15:55
  • Check out https://stackoverflow.com/questions/6335839/python-how-to-read-n-number-of-lines-at-a-time – timgeb Sep 10 '21 at 16:03
  • @timgeb in the future, please vote to close duplicate questions as duplicates, rather than simply giving the link. As you have a gold badge in the Python tag, you [can now close Python duplicates unilaterally](https://meta.stackoverflow.com/tags/dupehammer/info). – Karl Knechtel Aug 01 '22 at 23:40
  • @KarlKnechtel I blew my close vote prematurely for another reason. – timgeb Aug 02 '22 at 08:49
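
The `itertools.islice` approach mentioned in the comments could look roughly like the following. This is only a sketch: `read_chunk_islice` is a hypothetical name, and dropping blank lines before chunking is an assumption carried over from the question's code. It yields lists of lines rather than joined strings.

```python
from itertools import islice
from typing import Iterator, List


def read_chunk_islice(*, file_path: str, line_count: int = 10) -> Iterator[List[str]]:
    """Yield lists of up to 'line_count' non-blank lines (hypothetical helper)."""
    with open(file_path, "r") as f:
        # Assumption: blank lines are dropped, as in the question's code.
        lines = (line for line in f if line.strip())
        while True:
            # islice pulls at most 'line_count' items from the shared iterator.
            chunk = list(islice(lines, line_count))
            if not chunk:
                return  # Iterator exhausted: stop without an empty chunk.
            yield chunk
```

Because `islice` consumes from the same underlying iterator on every pass, each call picks up where the previous one stopped, and the file is still read lazily.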
