I'm trying to read a file in chunks of multiple lines. For example, if a file has 100 lines and I want each chunk to have 10 lines, there should be 10 chunks. I should then be able to iterate over the chunks as follows:
# Here, the 'read_chunk' function should return a generator.
for chunk in read_chunk(file_path="./file.txt", line_count=10):
    print(chunk)
This is how I attempted it.
from typing import Generator


def read_chunk(
    *,
    file_path: str,
    line_count: int = 10,  # Number of lines per chunk.
) -> Generator[str, None, None]:
    """Read a file in chunks of 'line_count' lines."""
    with open(file_path, "r") as f:
        chunk = []
        for idx, line in enumerate(f):
            # Only keep non-blank lines.
            if line.strip():
                chunk.append(line)
            # Yield whenever the file index is a non-zero multiple of 'line_count'.
            if not idx == 0 and idx % line_count == 0:
                yield "\n".join(chunk)
                chunk = []
        # This returns the last chunk.
        yield "\n".join(chunk)
Let's run this on the following file:
# file.txt
* [What is Normalization in DBMS (SQL)? 1NF, 2NF, 3NF, BCNF Database with Example - Richard Peterson](https://www.guru99.com/database-normalization.html)
-> Normalization roughly means deduplication of data in a table by leveraging foreign keys, multiple tables, and intermediary join tables.
This article explains it in finer detail.
* [OLTP vs OLAP System](https://www.guru99.com/oltp-vs-olap.html)
-> OLTP is an online transactional system that manages database modification whereas OLAP is an online analysis and data retrieving process.
for chunk in read_chunk(file_path='./file.txt', line_count=2):
    print('============\n')  # This is to discern between the chunks better.
    print(chunk)
    print('============\n')
This prints:
============
# file.txt
* [What is Normalization in DBMS (SQL)? 1NF, 2NF, 3NF, BCNF Database with Example - Richard Peterson](https://www.guru99.com/database-normalization.html)
============
============
-> Normalization roughly means deduplication of data in a table by leveraging foreign keys, multiple tables, and intermediary join tables.
This article explains it in finer detail.
============
============
* [OLTP vs OLAP System](https://www.guru99.com/oltp-vs-olap.html)
============
============
-> OLTP is an online transactional system that manages database modification whereas OLAP is an online analysis and data retrieving process.
============
The first two chunks look right, but the rest doesn't make sense to me. Shouldn't the last two lines form a single chunk of 2 lines instead of two chunks of 1 line each? Also, is there a better way of doing this?
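For reference, here is one possible direction, offered as a rough sketch rather than a definitive fix. It assumes the odd chunks come from `idx` counting every line in the file, including blank lines that the `if line.strip()` check skips, so the yield condition fires on the file index instead of on the number of lines actually collected. The name `read_chunk_sketch` is hypothetical, and the sketch assumes Python 3.8+ (for `:=`) and that blank lines should be dropped without counting toward a chunk.

```python
from itertools import islice
from typing import Iterator


def read_chunk_sketch(*, file_path: str, line_count: int = 10) -> Iterator[str]:
    """Yield chunks of up to 'line_count' non-blank lines (hypothetical helper)."""
    with open(file_path, "r") as f:
        # Assumption: blank lines are filtered out first, so they never count.
        lines = (line for line in f if line.strip())
        # islice takes at most 'line_count' items from the shared generator on
        # each pass; an empty list means the file is exhausted and ends the loop.
        while chunk := list(islice(lines, line_count)):
            # Each line keeps its trailing newline, so join with "" not "\n".
            yield "".join(chunk)
```

Because the chunk size is driven by how many lines were actually taken, the last chunk is only shorter than `line_count` when the file runs out of non-blank lines, and no empty chunk is yielded at the end.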