
I currently have a script that requests a file via requests.post(). The server sends me two files in the same stream. The way I am processing this right now is to save it all as one file, open it again, split the file based on a regex string, save it as a new file, and delete the old one. The file is large enough that I have to use stream=True in my requests.post() call and write it in chunks.

I was hoping someone might know a better way to issue the post, or to work with the data coming back, so that the files are stored correctly the first time. Or is this the best way to do it?

----Adding current code----

if not os.path.exists(output_path):
    os.makedirs(output_path)

memFile = requests.post(url, data=etree.tostring(etXML), headers=headers, stream=True)

outFile = open('output/tempfile', 'wb')

for chunk in memFile.iter_content(chunk_size=512):
    if chunk:
        outFile.write(chunk)
outFile.close()

# the file was written in binary mode, so split on a bytes literal
f = open('output/tempfile', 'rb').read().split(b'\r\n\r\n')

arf = open('output/recording.arf', 'wb')
arf.write(f[3])
arf.close()
os.remove('output/tempfile')
sevredox
  • How about running that regex on each chunk? That way, you can close the output file and open a new one when the delimiting regex shows up – inspectorG4dget Jul 31 '17 at 19:08
  • @inspectorG4dget But how can he/she be sure the bytes aren't split in the middle of the regex match? – Richard Dunn Jul 31 '17 at 19:15
  • @sevredox I reckon your approach is about as good as it gets, however, there's no need to save the file to disk unless it's so massive it can't be comfortably kept in memory. Try streaming to a file object, searching that for your regex, and save the lines from the respective halves thereafter... – Richard Dunn Jul 31 '17 at 19:19
  • @RichardDunn: Depending on the regex, OP will have to do some buffering for those edge cases – inspectorG4dget Jul 31 '17 at 19:20
  • @inspectorG4dget True... If they overlap the bytes inspected in each buffer by at least the same amount of bytes used in the regex match, then they are guaranteed to find the match. But that's going to add overheads as some of the bytes are searched twice with each block. Generating a struct from the string that is to be matched, then checking bytewise for a match would have the benefit of being able to continue a match into the next block of bytes. (Not sure if I've phrased that well, let me know if you want clarification, OP...) – Richard Dunn Jul 31 '17 at 19:29
  • @RichardDunn: I understand what you mean, but given OPs apparent novice level, I'm inclined to say it won't be the ideal solution in this case. Not to mention that adds more overhead in str/byte conversions, and tracking two different matches – inspectorG4dget Jul 31 '17 at 20:04

1 Answer


Okay, I was bored and wanted to figure out the best way to do this. It turns out my initial suggestion in the comments above was overly complicated (unless you're in some scenario where time is absolutely critical or memory is severely constrained). A buffer is a much simpler way to achieve this, as long as you take two or more blocks at a time. The code below emulates the question's scenario for demonstration.

Note: depending on the regex engine implementation, this is more efficient and requires significantly fewer str/byte conversions, since using a regex would mean decoding each block of bytes to a string. The approach below requires no string conversions at all: it operates solely on the bytes returned from requests.post(), and in turn writes those same bytes to file.

from pprint import pprint

someString = '''I currently have a script that requests a file via a requests.post(). The server sends me two files in the same stream. The way I am processing this right now is to save it all as one file, open it again, split the file based on a regex string, save it as a new file, and delete the old one. The file is large enough that I have to stream=True in my requests.post() statement and write it in chunks.

I was hoping that maybe someone knows a better way to issue the post or work with the data coming back so that the files are stored correctly the first time? Or is this the best way to do it?'''

n = 16
# emulate a stream by slicing the string into blocks of 16 bytes
byteBlocks = [bytearray(someString[i:i+n], 'utf-8') for i in range(0, len(someString), n)]
pprint(byteBlocks)

# this string is present twice, but both times it is split across two bytearrays
matchBytes = bytearray('requests.post()', 'utf-8')

# our buffer
buff = bytearray()

count = 0
for bb in byteBlocks:
    buff += bb
    count += 1

    # every two blocks
    if (count % 2) == 0:

        if count == 2:
            start = 0
        else:
            start = len(matchBytes)

        # check the bytes from ((count-2)*n - start) to (len(buff) - len(matchBytes));
        # the overlap of len(matchBytes) bytes catches a match split across blocks,
        # while still inspecting each byte only once
        searchFrom = ((count-2)*n) - start
        if matchBytes in buff[searchFrom : len(buff)-len(matchBytes)]:
            # search from the start of the current window so the second
            # occurrence reports its own index, not the first one's
            idx = buff.index(matchBytes, searchFrom)
            print('Match starting at index:', idx, 'ending at:', idx + len(matchBytes))
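For the question's actual goal, splitting the incoming bytes into separate files, the same carry-over idea can be applied while writing. Here's a minimal sketch (untested against a real response; the blocks list simply stands in for iter_content(), and b'\r\n\r\n' is assumed to be the delimiter):

```python
DELIM = b'\r\n\r\n'

def split_stream(blocks, delim=DELIM):
    # Split a stream of byte blocks on a delimiter, keeping a small
    # carry-over so a delimiter straddling two blocks is still found.
    parts = [bytearray()]
    carry = b''
    for block in blocks:
        data = carry + block
        while True:
            idx = data.find(delim)
            if idx == -1:
                break
            parts[-1] += data[:idx]
            parts.append(bytearray())
            data = data[idx + len(delim):]
        # hold back the last len(delim)-1 bytes in case the delimiter
        # continues into the next block
        keep = len(delim) - 1
        parts[-1] += data[:-keep] if len(data) > keep else b''
        carry = data[-keep:] if len(data) > keep else data
    parts[-1] += carry
    return [bytes(p) for p in parts]

# e.g. the delimiter here is split across the second and third blocks
print(split_stream([b'header', b'\r\n', b'\r\nbody', b'data']))
```

Each returned part could be written straight to its own file, so nothing ever needs to be saved and re-split.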

Update:

So, given the updated question, this code may remove the need to create a temporary file. I haven't been able to test it exactly, as I don't have a similar response to hand, but you should be able to iron out any bugs yourself.

Since you aren't actually working with a stream directly, i.e. you're given the finished response object from requests.post(), you don't have to worry about chunks in the networking sense. The "chunks" that requests refers to are really just its way of dishing out the bytes, all of which it already has. You can access the bytes directly using r.raw.read(n), but as far as I can tell the response object doesn't let you see how many bytes r.raw holds, so you're more or less forced to use the iter_content method.

Anyway, this code copies all the bytes from the response object into a single bytes object, which you can then search and split as before.

memFile = requests.post(url, data=etree.tostring(etXML), headers=headers, stream=True)

# iter_content() yields bytes, so the delimiter and accumulator must be bytes too
match = b'\r\n\r\n'
data = b''

for chunk in memFile.iter_content(chunk_size=512):
    if chunk:
        data += chunk

f = data.split(match)

arf = open('output/recording.arf', 'wb')
arf.write(f[3])
arf.close()
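As noted in the comments, the response is a multipart SOAP message with attachments, so another option is to let the stdlib email package do the splitting instead of a hand-rolled delimiter search. This is an untested sketch: it assumes the response's Content-Type header carries the multipart boundary, which is the usual case for SOAP with attachments.

```python
from email.parser import BytesParser
from email.policy import default

def extract_parts(body, content_type):
    # Rebuild a minimal MIME message so the email package can parse the
    # multipart body; content_type must include the boundary parameter,
    # e.g. 'multipart/related; boundary="----=_Part_0"'.
    raw = b'Content-Type: ' + content_type.encode() + b'\r\n\r\n' + body
    msg = BytesParser(policy=default).parsebytes(raw)
    return [part.get_payload(decode=True) for part in msg.iter_parts()]
```

With a real response that would be something like `extract_parts(memFile.content, memFile.headers['Content-Type'])`, after which the binary attachment can be written straight to recording.arf, with no regex and no temp file.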
Richard Dunn
  • Wow, thanks @RichardDunn. My apologies, but I am pretty new at this, and I am just trying to figure out how to adapt this into my script. How do you recommend writing just the part after all of the regex matches to a file with this method? Maybe it would be helpful to explain that what I am doing here is removing a text header from a file that is supposed to be binary. Thanks! – sevredox Aug 01 '17 at 13:17
  • More info in case it is helpful... I am receiving a SOAP response that is multipart with attachments. Unfortunately, SOAPpy doesn't support attachments, and zeep and ZSI throw errors with the WSDL file that was offered. So, this is an attempted workaround. Thanks! – sevredox Aug 01 '17 at 13:25
  • Okay, I'm at work so can't really look at this now, but from what you're saying I'd be more interested in where that header comes from, and how to avoid having to handle it in the first place. It's not clear in your question, but did you create the server side code? Or is this something that's not in your control. Alternatively, and again I don't have time at the moment, but have you checked the *requests* API for ways to retrieve the message body only? I'd be shocked if these two entities weren't easily separable using the provided methods... – Richard Dunn Aug 01 '17 at 13:40
  • If you *are* writing the server, then maybe check out this post: https://stackoverflow.com/questions/27043402/python-requests-remove-the-content-length-header-from-post – Richard Dunn Aug 01 '17 at 13:43
  • Hey Richard, thanks for your time on this. I do not have access to the server side. Here is the documentation on the request and response format: https://developer.cisco.com/site/webex-developer/develop-test/nbr-web-services-api/api-functions.gsp#downloadNBRStorageFile. I think maybe the best angle to approach this is trying to figure out why ZSI and zeep cannot consume the WSDL so I can leverage their built-in functionality for handling attachments. Thanks again for your help! – sevredox Aug 01 '17 at 18:50
  • No problem, just hanging around a game lobby anyway... Can you update the question with a little code perhaps? Specifically, what you've currently got running between the requests.post() and the files getting split. Cheers. – Richard Dunn Aug 01 '17 at 19:09
  • Yes sir. Done. Thanks. – sevredox Aug 01 '17 at 19:32