
I'm using the Microsoft Graph API to retrieve SharePoint document content from within a Python script. I search for documents with the https://graph.microsoft.com/v1.0/search/query endpoint, and then try to retrieve each document's content via https://graph.microsoft.com/v1.0/sites/{site_id}/drives/{drive_id}/items/{item_id}/content. I want to write the content as a .pdf file to blob storage for further processing.

When I call the content endpoint with the Python requests library, I get the .pdf back as a string, which I can read from response.text. This text looks as you would expect for .pdf content (snippet):

%PDF-1.7
%����
1 0 obj
<</Type/Catalog/Pages 2 0 R/Lang(nl-NL) /StructTreeRoot 29 0 R/MarkInfo<</Marked true>>/Metadata 117 0 R/ViewerPreferences 118 0 R>>
endobj
2 0 obj
<</Type/Pages/Count 2/Kids[ 3 0 R 24 0 R] >>
endobj
3 0 obj
<</Type/Page/Parent 2 0 R/Resources<</Font<</F1 5 0 R/F2 10 0 R/F3 12 0 R/F4 17 0 R/F5 19 0 R>>/ExtGState<</GS7 7 0 R/GS8 8 0 R>>/XObject<</Image9 9 0 R>>/ProcSet[/PDF/Text/ImageB/ImageC/ImageI] >>/MediaBox[ 0 0 594.96 842.04] /Contents 4 0 R/Group<</Type/Group/S/Transparency/CS/DeviceRGB>>/Tabs/S/StructParents 0>>
endobj
4 0 obj
<</Filter/FlateDecode/Length 3438>>
stream
x��\mS�8�N�A��EX�$�s[T

So what I try to do is write this content to a file like this:

with open('pdffilefromsharepoint.pdf', 'w') as f:
  f.write(response.text)

This writes the PDF without error. However, when I open the document in a PDF reader, I get just two empty pages with no content at all. Moreover, when I compare the raw contents of my original SharePoint file with the .pdf file written from the Graph API response, they seem to be identical: the same number of lines, and seemingly the same content line by line.

One notable thing is that the original document is just 68 KB, while the one written from the API response is 113 KB.

Has anyone tried to achieve something similar? Do I need a special package to write this content to a .pdf from Python?

Tim
  • You may want to check the file's encoding: are you writing binary or text? If you write text, you are probably writing UTF-8, whereas you should write raw bytes instead. Encoding a byte as UTF-8 text may add more bytes to it, hence the embiggened file. Compare both files in a hex editor to check. – Patrick Artner Oct 25 '22 at 12:16
  • Hmmm yes, so the original file seems to be ANSI-encoded, while the written file is indeed encoded as UTF-8. This explains what's going wrong, but I still don't quite understand how to handle it. I have this raw (ANSI-encoded) string to start with. As far as I understand, I want to convert it to a `bytes` object and then write that object to a .pdf with mode 'wb'. However, `bytes` requires an encoding to convert the string, and 'ansi' is not an accepted encoding, which leaves me stuck when trying to convert the PDF text to bytes. – Tim Oct 25 '22 at 13:03
  • Looks like a duplicate of: https://stackoverflow.com/questions/13137817/how-to-download-image-using-requests (i.e. the question is not specific to SharePoint, it's more like 'how to download a binary file with Python') – Nikolay Oct 25 '22 at 14:04
  • Does this answer your question? [How to download image using requests](https://stackoverflow.com/questions/13137817/how-to-download-image-using-requests) – Nikolay Oct 25 '22 at 14:06
  • Well, the point is that the Graph API returns the content as a string, as can be seen in the snippet in my original question. So the content doesn't seem to be binary to begin with. – Tim Oct 25 '22 at 14:17
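
To make the comments above concrete, here is a minimal sketch of the binary-versus-text point. It assumes the asker's `response` is a `requests` response, whose `.content` attribute holds the raw bytes while `.text` is a decoded string. The helper names `save_binary` and `download_pdf`, the bearer-token header, and the sample bytes are all hypothetical and only for illustration; the exact decoding that `.text` applied in the question is not known, so UTF-8 with replacement characters is used as a stand-in.

```python
import os
import tempfile


def save_binary(path, data):
    """Write raw bytes to path with no text-encoding step (mode 'wb')."""
    with open(path, "wb") as f:
        f.write(data)


def download_pdf(url, token, path):
    """Fetch url and save the body as bytes (hypothetical helper, needs requests)."""
    import requests  # third-party; imported here so the demo below runs without it

    response = requests.get(
        url, headers={"Authorization": f"Bearer {token}"}, timeout=30
    )
    response.raise_for_status()
    # .content is bytes; .text would be a str decoded with a guessed encoding.
    save_binary(path, response.content)


# Made-up bytes resembling the start of a PDF (not a real document).
raw = b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\nstream\nx\x9c\xed\\mS\xdb8\n"

# Assumed stand-in for the text round trip: bytes that cannot be decoded become
# U+FFFD, and each U+FFFD takes three bytes once re-encoded as UTF-8.
as_text = raw.decode("utf-8", errors="replace")
round_tripped = as_text.encode("utf-8")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.pdf")
    save_binary(path, raw)
    with open(path, "rb") as f:
        written = f.read()

# Original size, size after the text round trip, and whether 'wb' preserved the bytes.
print(len(raw), len(round_tripped), written == raw)
```

The round-tripped data is longer than the original and no longer byte-identical, which would be consistent with the 68 KB versus 113 KB difference in the question, while the file written in mode 'wb' matches the input exactly. In the asker's script the equivalent change would be to open the file with 'wb' and write `response.content` instead of `response.text`.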

0 Answers