
I am reading an API response in chunks using the following code:

import ast
import zlib

# This loop runs inside a function; `data` holds the compressed (gzip) response bytes.
d = zlib.decompressobj(zlib.MAX_WBITS|16)  # for gzip
for i in range(0, len(data), 4096):
    chunk = data[i:i+4096]
    # print(chunk)
    str_chunk = d.decompress(chunk)
    str_chunk = str_chunk.decode()
    # print(str_chunk)
    if '"@odata.nextLink"' in str_chunk:
        ab = '{' + str_chunk[str_chunk.index('"@odata.nextLink"'):len(str_chunk)+1]
        ab = ast.literal_eval(ab)
        url = ab['@odata.nextLink']
        return url

An example of this working is: "@odata.nextLink":"someurl?$count=true

It works in most cases, but sometimes this key-value pair gets cut off and appears like this: "@odata.nextLink":"someurl?$coun

I can play around with the chunk size in `for i in range(0, len(data), 4096)`, but that doesn't guarantee the data never gets cut off, since the page size (data size) can be different for each page.

How can I ensure that this key-value pair is never cut off? Also, note that this key-value pair is the last line / last key-value pair of the API response.

P.S.: I can't play around with API request parameters.

I even tried reading it backwards, but that gives an "incorrect header check" error:

for i in range(len(data), 0, -4096):
    chunk = data[i-4096:i]
    str_chunk = d.decompress(chunk)
    str_chunk = str_chunk.decode()
    if '"@odata.nextLink"' in str_chunk:
        ab = '{' + str_chunk[str_chunk.index('"@odata.nextLink"'):len(str_chunk)+1]
        ab = ast.literal_eval(ab)
        url = ab['@odata.nextLink']
        # print(url)
        return url

The above produces the following error, which is really strange:

str_chunk = d.decompress(chunk)
zlib.error: Error -3 while decompressing data: incorrect header check
qwerty
  • You certainly cannot read a compressed stream backwards, and you are getting exactly the expected error since the first thing it's looking for is a gzip header. – Mark Adler Aug 30 '22 at 21:28
  • To be clear: the data that should be passed to `ast.literal_eval` is the Python representation of a dictionary? And the dictionary will always list `"@odata.nextLink"` as the **first** key? (Is it actually intended to represent a Python dictionary in Python syntax, or is it in fact JSON?) – Karl Knechtel Sep 06 '22 at 19:28
  • I can't understand how this code is supposed to work, actually. `ast.literal_eval` will not accept trailing characters after a valid literal, either - e.g. `ast.literal_eval("{'foo': 'bar'} extra stuff")` **does not work**. So how exactly is `ab` going to be valid data, if we blindly take everything until the end of the chunk? – Karl Knechtel Sep 06 '22 at 19:33
  • BTW: a slice until `len(str_chunk)` captures everything and the `+1` is unnecessary; and it's also possible to do this slice by just omitting the end, like `str_chunk[str_chunk.index('"@odata.nextLink"'):]` - notice the colon with nothing after it. Please read [Understanding slicing](/q/509211/). – Karl Knechtel Sep 06 '22 at 19:33
  • @KarlKnechtel To your first two questions, the `ast.literal_eval` is going to evaluate whether the data passed to it is a valid dictionary or not. Based on the use case the data passed onto it has to be a dictionary, there is no `extra stuff` that'll present itself. To your third point, yes you're correct there. – qwerty Sep 06 '22 at 19:47
  • "the ast.literal_eval is going to evaluate whether the data passed to it is a valid dictionary or not." I don't follow. If I try that at the command line, I can easily demonstrate that it doesn't work. There should be `extra stuff` any time that the closing `}` of the dictionary syntax doesn't *happen* to align with a chunk boundary. – Karl Knechtel Sep 06 '22 at 20:04
  • @KarlKnechtel Yes, that is where the issue was, when the chunk cut off before the end of dictionary (basically when the chunk wasn't the last chunk in the loop). – qwerty Sep 06 '22 at 20:12
  • Yes; and my point is, why couldn't it **just as easily** cut off **after** the end of the dictionary, leaving extra stuff after the `}`? – Karl Knechtel Sep 06 '22 at 20:13
  • @KarlKnechtel Because there is no extra stuff after the `}`. I don't think I get what you're getting at. This dictionary is the last thing in the data being received. There is no data expected post this dictionary which is why it's looking until the end of file. – qwerty Sep 06 '22 at 20:15
  • Oh, well in *that* case, once you find the start by looking a chunk at a time, chunking *no longer helps you* and it is necessary to read to the end of the overall file. (The two-chunk trick will still be necessary because *the marker* could be split across chunks. But I understand the overall problem now.) – Karl Knechtel Sep 06 '22 at 20:17
  • @KarlKnechtel Gotcha, yeah, chunking is only helping in terms of memory management in the sense that I am not loading the entire data at once. – qwerty Sep 06 '22 at 20:21

2 Answers

str_chunk is a contiguous sequence of bytes from the API response that can start anywhere in the response, and end anywhere in the response. Of course it will sometimes end in the middle of some semantic content.

(New information from a comment that the OP neglected to put in the question. In fact, it is still not in the question: the OP requires that the entire uncompressed content not be held in memory.)

If "@odata.nextLink" is a reliable marker for what you're looking for, then keep the last two decompressed chunks, concatenate those, then look for that marker. Once found, continue to read more chunks, concatenating them, until you have the full content you're looking for.
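A minimal sketch of that two-chunk idea, assuming as in the question that the compressed response is already in `data` and that chunks are 4096 bytes (the function name `find_next_link` is just illustrative):

import ast
import zlib

def find_next_link(data, chunk_size=4096):
    d = zlib.decompressobj(zlib.MAX_WBITS | 16)  # for gzip
    marker = b'"@odata.nextLink"'
    previous = b""      # the previously decompressed chunk
    collected = None    # everything from the marker onward, once it has been seen
    for i in range(0, len(data), chunk_size):
        chunk = d.decompress(data[i:i + chunk_size])
        if collected is None:
            # Search the last two chunks together, in case the marker straddles a boundary.
            window = previous + chunk
            if marker in window:
                collected = window[window.index(marker):]
            previous = chunk
        else:
            # Marker already found: keep concatenating until the end of the response.
            collected += chunk
    if collected is None:
        return None
    collected += d.flush()
    # The marker is the last key-value pair, so everything up to the final '}' belongs to it.
    text = collected[:collected.rindex(b'}') + 1].decode()
    return ast.literal_eval('{' + text)['@odata.nextLink']

Working in bytes here also sidesteps another chunking pitfall: calling .decode() on each chunk separately can fail if a multi-byte character happens to be split across a chunk boundary.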

Mark Adler
  • The reason I am chunking is so that I can avoid reading the whole response as a whole to avoid memory issues. I tried reading it backwards but unfortunately that gives out this error: `zlib.error: Error -3 while decompressing data: incorrect header check`. Edited the question to show that as well. – qwerty Aug 30 '22 at 21:05
  • My apologies, I thought the code itself would indicate that chunking was in fact necessary. Thank you for your answer. – qwerty Aug 31 '22 at 01:41
  • @qwerty What Mark pointed out here is that you could simply read from the stream until you encounter the marker twice. Then you can be confident that the first marker can be read completely. This assumes that the data between the two markers does not exceed the memory size, which could be a fair assumption, I do not know. In other words, you would still be chunking, but you might process multiple chunks at once. – asynts Sep 06 '22 at 09:08

If the approach that Mark suggested in his answer is sufficient for you, it's probably a good compromise and there is no need to over-engineer it.

However, more generally, if you want to extract information from a stream, the "proper" way of doing it is to parse the text character by character. That way, you avoid any issues with chunk boundaries.

For example, let's say that we want to extract values that are surrounded with @ symbols:

Lorem ipsum dolor @sit amet, consectetur@ adipiscing elit. Mauris
dapibus fermentum orci, vitae commodo odio suscipit et. Etiam
pellentesque turpis ut leo malesuada, quis scelerisque turpis condimentum.
Nulla consequat velit id pretium bibendum. Suspendisse potenti. Ut id
sagittis ante, quis tempor mauris. Sed volutpat sem a purus malesuada
varius. Pellentesque sit amet dolor at velit tristique fermentum. In
feugiat mauris ut @diam viverra aliquet.@ Morbi quis eros interdum,
lacinia mi at, suscipit lectus.

Donec in magna sed mauris auctor sollicitudin. Aenean molestie, diam sed 
aliquet malesuada, eros nunc ornare nunc, at bibendum ligula nulla et eros. 
Maecenas posuere eleifend elementum. Ut bibendum at arcu quis aliquam. Aliquam 
erat volutpat. Fusce luctus libero ac nisi lobortis lacinia. Aliquam ac rutrum 
odio. In hac habitasse platea dictumst. Vestibulum semper ullamcorper commodo. 
In hac habitasse platea dictumst. @Aenean ut pulvinar magna.@ Donec at euismod 
erat, eu iaculis metus. Proin vulputate mollis arcu, ut efficitur ligula 
fermentum et. Suspendisse tincidunt ultricies urna quis congue. Interdum et 
malesuada fames ac ante ipsum primis in faucibus. 

This can be done by creating a generator that parses the incoming stream and extracts a sequence of values:

import io
import typing

# Suppose this file is extremely long and doesn't fit into memory.
input_file = io.BytesIO(b"""\
Lorem ipsum dolor @sit amet, consectetur@ adipiscing elit. Mauris
dapibus fermentum orci, vitae commodo odio suscipit et. Etiam
pellentesque turpis ut leo malesuada, quis scelerisque turpis condimentum.
Nulla consequat velit id pretium bibendum. Suspendisse potenti. Ut id
sagittis ante, quis tempor mauris. Sed volutpat sem a purus malesuada
varius. Pellentesque sit amet dolor at velit tristique fermentum. In
feugiat mauris ut @diam viverra aliquet.@ Morbi quis eros interdum,
lacinia mi at, suscipit lectus.

Donec in magna sed mauris auctor sollicitudin. Aenean molestie, diam sed 
aliquet malesuada, eros nunc ornare nunc, at bibendum ligula nulla et eros. 
Maecenas posuere eleifend elementum. Ut bibendum at arcu quis aliquam. Aliquam 
erat volutpat. Fusce luctus libero ac nisi lobortis lacinia. Aliquam ac rutrum 
odio. In hac habitasse platea dictumst. Vestibulum semper ullamcorper commodo. 
In hac habitasse platea dictumst. @Aenean ut pulvinar magna.@ Donec at euismod 
erat, eu iaculis metus. Proin vulputate mollis arcu, ut efficitur ligula 
fermentum et. Suspendisse tincidunt ultricies urna quis congue. Interdum et 
malesuada fames ac ante ipsum primis in faucibus.
""")

# This is a generator function which is essentially a custom iterator.
def extract_marked_values(raw_input_stream: typing.BinaryIO):
    # This wraps the stream in a buffer, ensuring that 'read(1)' is not extremely slow.
    # On top of that, it decodes the UTF-8 bytes, so the result is of type 'str' and not 'bytes'.
    text_input_stream = io.TextIOWrapper(raw_input_stream, encoding="utf-8")

    # Go through the text character by character and parse it.
    # Once a value is complete, return it with 'yield'.
    current_value: typing.Optional[str] = None
    while character := text_input_stream.read(1):
        if current_value is None:
            if character == "@":
                current_value = ""
        else:
            if character == "@":
                yield current_value
                current_value = None
            else:
                current_value += character

for value in extract_marked_values(input_file):
    print(value)

The trick here is that the parser is able to go character by character, so it doesn't have to care about the boundaries between the chunks. (The chunks still exist; TextIOWrapper will internally read the input in chunks.)

You can generalize this to your problem. If your syntax is very complex, you can break it up into multiple steps: first extract the relevant substring, then extract the information from it in a second step.
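For instance, here is a rough sketch of that generalization under the same assumptions as the question (the compressed response bytes are in `data`, and "@odata.nextLink" is the last key-value pair); the function names are made up for illustration:

import ast
import codecs
import zlib

def decompressed_characters(data, chunk_size=4096):
    # Step 1: turn the compressed response into a stream of characters.
    d = zlib.decompressobj(zlib.MAX_WBITS | 16)  # for gzip
    decoder = codecs.getincrementaldecoder("utf-8")()  # tolerates characters split across chunks
    for i in range(0, len(data), chunk_size):
        yield from decoder.decode(d.decompress(data[i:i + chunk_size]))
    yield from decoder.decode(d.flush(), final=True)

def extract_next_link(data):
    # Step 2: scan the character stream for the marker, then collect everything after it.
    marker = '"@odata.nextLink"'
    recent = ""        # the last len(marker) characters seen, for spotting the marker
    collected = None   # everything from the marker onward
    for character in decompressed_characters(data):
        if collected is None:
            recent = (recent + character)[-len(marker):]
            if recent == marker:
                collected = marker
        else:
            collected += character
    if collected is None:
        return None
    # The marker is the last key-value pair, so slice up to the final '}'.
    return ast.literal_eval('{' + collected[:collected.rindex('}') + 1])['@odata.nextLink']

Only the marker-sized `recent` window and the tail of the response are ever held in memory at once, which matches the original goal of not loading the whole decompressed payload.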


When parsing more complex input, you don't necessarily need to write the code that processes each character one by one yourself. Instead, you can create abstractions to help.

For example, a Lexer class that wraps the stream and provides methods like lexer.try_consume("<marker>") or something like that.
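A rough sketch of what such an abstraction might look like (the class and its method names are purely illustrative, not an existing API):

class Lexer:
    def __init__(self, text_stream):
        self.stream = text_stream
        self.buffer = ""  # characters read ahead but not yet consumed

    def _peek(self, n):
        # Read ahead until the buffer holds n characters (or the stream ends).
        while len(self.buffer) < n:
            character = self.stream.read(1)
            if not character:
                break
            self.buffer += character
        return self.buffer[:n]

    def try_consume(self, marker):
        # Consume the marker if it comes next; otherwise leave everything unconsumed.
        if self._peek(len(marker)) == marker:
            self.buffer = self.buffer[len(marker):]
            return True
        return False

    def read_until(self, terminator):
        # Collect characters up to the terminator (which is consumed) or the end of the stream.
        result = ""
        while not self.try_consume(terminator):
            if not self._peek(1):
                break
            result += self.buffer[0]
            self.buffer = self.buffer[1:]
        return result

With something like this, the extraction logic reads as a handful of try_consume/read_until calls instead of manual character bookkeeping, while the underlying stream is still only read in small buffered pieces.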

asynts
  • Won't parsing it character by character affect the performance rather than searching for the keyword? – qwerty Sep 06 '22 at 19:48
  • Ultimately, the logic that searches for a keyword will have to look at each character anyways. That's what I suggested by adding helper functions. You can create a function that searches for a keyword by comparing the characters one-by-one. – asynts Sep 06 '22 at 19:53
  • 1
    If you mean the `read(1)` that should not be a problem because `TextIOWrapper` has a buffer internally, it will do a `read(1024)` or something like that and then hand out the result one byte at a time. (I didn't test it, but the documentation says that there is a buffer in there.) – asynts Sep 06 '22 at 19:56