If the approach that Mark suggested in his answer is sufficient for you, it's probably a good compromise and there is no need to over-engineer it.
However, more generally, if you want to extract information from a stream, the "proper" way of doing it is to parse the text character by character. That way, you avoid any issues with chunk boundaries entirely.
For example, let's say that we want to extract values that are surrounded by @ symbols:
Lorem ipsum dolor @sit amet, consectetur@ adipiscing elit. Mauris
dapibus fermentum orci, vitae commodo odio suscipit et. Etiam
pellentesque turpis ut leo malesuada, quis scelerisque turpis condimentum.
Nulla consequat velit id pretium bibendum. Suspendisse potenti. Ut id
sagittis ante, quis tempor mauris. Sed volutpat sem a purus malesuada
varius. Pellentesque sit amet dolor at velit tristique fermentum. In
feugiat mauris ut @diam viverra aliquet.@ Morbi quis eros interdum,
lacinia mi at, suscipit lectus.
Donec in magna sed mauris auctor sollicitudin. Aenean molestie, diam sed
aliquet malesuada, eros nunc ornare nunc, at bibendum ligula nulla et eros.
Maecenas posuere eleifend elementum. Ut bibendum at arcu quis aliquam. Aliquam
erat volutpat. Fusce luctus libero ac nisi lobortis lacinia. Aliquam ac rutrum
odio. In hac habitasse platea dictumst. Vestibulum semper ullamcorper commodo.
In hac habitasse platea dictumst. @Aenean ut pulvinar magna.@ Donec at euismod
erat, eu iaculis metus. Proin vulputate mollis arcu, ut efficitur ligula
fermentum et. Suspendisse tincidunt ultricies urna quis congue. Interdum et
malesuada fames ac ante ipsum primis in faucibus.
This can be done by creating a generator that parses the incoming stream and extracts a sequence of values:
import io
import typing
# Suppose this file is extremely long and doesn't fit into memory.
input_file = io.BytesIO(b"""\
Lorem ipsum dolor @sit amet, consectetur@ adipiscing elit. Mauris
dapibus fermentum orci, vitae commodo odio suscipit et. Etiam
pellentesque turpis ut leo malesuada, quis scelerisque turpis condimentum.
Nulla consequat velit id pretium bibendum. Suspendisse potenti. Ut id
sagittis ante, quis tempor mauris. Sed volutpat sem a purus malesuada
varius. Pellentesque sit amet dolor at velit tristique fermentum. In
feugiat mauris ut @diam viverra aliquet.@ Morbi quis eros interdum,
lacinia mi at, suscipit lectus.
Donec in magna sed mauris auctor sollicitudin. Aenean molestie, diam sed
aliquet malesuada, eros nunc ornare nunc, at bibendum ligula nulla et eros.
Maecenas posuere eleifend elementum. Ut bibendum at arcu quis aliquam. Aliquam
erat volutpat. Fusce luctus libero ac nisi lobortis lacinia. Aliquam ac rutrum
odio. In hac habitasse platea dictumst. Vestibulum semper ullamcorper commodo.
In hac habitasse platea dictumst. @Aenean ut pulvinar magna.@ Donec at euismod
erat, eu iaculis metus. Proin vulputate mollis arcu, ut efficitur ligula
fermentum et. Suspendisse tincidunt ultricies urna quis congue. Interdum et
malesuada fames ac ante ipsum primis in faucibus.
""")
# This is a generator function which is essentially a custom iterator.
def extract_marked_values(raw_input_stream: typing.BinaryIO):
    # Wrapping the raw stream makes it buffered, which ensures that 'read(1)' is not extremely slow.
    # On top of that, it decodes the UTF-8 bytes, so each read returns 'str' and not 'bytes'.
    text_input_stream = io.TextIOWrapper(raw_input_stream, encoding="utf-8")
    # Go through the text character by character and parse it.
    # Once a complete value has been parsed, return it with 'yield'.
    current_value: typing.Optional[str] = None
    while character := text_input_stream.read(1):
        if current_value is None:
            # Outside a marked value: wait for the opening '@'.
            if character == "@":
                current_value = ""
        else:
            # Inside a marked value: accumulate characters until the closing '@'.
            if character == "@":
                yield current_value
                current_value = None
            else:
                current_value += character
for value in extract_marked_values(input_file):
print(value)
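This prints the three marked values:

sit amet, consectetur
diam viverra aliquet.
Aenean ut pulvinar magna.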
The trick here is that the parser goes character by character, so it doesn't have to care about the boundaries between chunks. (The chunks still exist; TextIOWrapper will internally read the input in chunks.)
You can generalize this to your problem: if your syntax is very complex, you can break the parsing up into multiple steps, where you first extract the relevant substring and then, in a second step, extract the information from it (see the sketch below).
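For example, suppose each marked value had internal structure, say a hypothetical key=value format (the sample text above doesn't use this, it's purely for illustration). Then the second step can be a plain function applied to each substring that the generator produces:

# Hypothetical second step: assume each marked value has the form "key=value".
# (Illustration only; the sample text above does not actually use this format.)
def parse_key_value(marked_value: str) -> tuple[str, str]:
    key, separator, value = marked_value.partition("=")
    if not separator:
        raise ValueError(f"not a key=value pair: {marked_value!r}")
    return key.strip(), value.strip()

def extract_key_value_pairs(raw_input_stream: typing.BinaryIO):
    # First step: streaming extraction. Second step: per-substring parsing.
    for marked_value in extract_marked_values(raw_input_stream):
        yield parse_key_value(marked_value)

Keeping the second step separate from the streaming code means it can work on ordinary in-memory strings and be tested in isolation.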
When parsing more complex input, you don't necessarily have to hand-write the code that processes each character one by one. Instead, you can create abstractions to help, for example a Lexer class that wraps the stream and provides methods like lexer.try_consume("<marker>") or something similar.
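Here is a minimal sketch of what such a Lexer could look like. The class and its methods are just one possible design, not an established API; read_until and _fill are invented companions to try_consume:

class Lexer:
    # Minimal sketch of a streaming lexer; one possible design, not an established API.
    def __init__(self, text_input_stream: typing.TextIO):
        self._stream = text_input_stream
        self._buffer = ""  # Characters read from the stream but not yet consumed.

    def _fill(self, n: int) -> None:
        # Read characters until the buffer holds at least 'n' of them (or EOF is reached).
        while len(self._buffer) < n:
            character = self._stream.read(1)
            if not character:
                break
            self._buffer += character

    def try_consume(self, marker: str) -> bool:
        # Consume 'marker' if the input starts with it; otherwise leave the input untouched.
        self._fill(len(marker))
        if self._buffer.startswith(marker):
            self._buffer = self._buffer[len(marker):]
            return True
        return False

    def read_until(self, marker: str) -> str:
        # Consume everything up to and including 'marker'; return the text before it.
        result = ""
        while not self.try_consume(marker):
            self._fill(1)
            if not self._buffer:
                raise ValueError(f"unexpected end of input, expected {marker!r}")
            result += self._buffer[0]
            self._buffer = self._buffer[1:]
        return result

With an abstraction like this, the extraction logic reads closer to the grammar it implements: consume the opening "@", then read until the closing one.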