Using lambda (node.js) + puppeteer to .fetch() Excel files and write to S3, file format is wrong

Question

I am hitting a OneDrive/Sharepoint shared link of an Excel file, trying to download it and save to S3.

CSV does work; .xlsx does not

I am using Puppeteer so the code looks something like this:

var downloadURL = "https://netorgft7979143-my.sharepoint.com/:x:/g/personal/gabe_scoop_report/ETCqa1EwxrVNiPYD0aLIq44BJUpEFLYIhcKOFWXuNnYPXQ?download=1"

const buffer = await page.evaluate(({downloadURL}) =>
    {
      return fetch(downloadURL, {
        method: 'GET'
      }).then(r => r.text());
    }, {downloadURL});

...

const s3result = await s3
    .upload({
      Bucket: S3BucketPath,
      Key: `${Date.now()}.csv`,
      Body: buffer,
      ContentType: 'text'})
    .promise()

Again, this totally works when the endpoint is CSV. When it's an Excel file, the bits do get written to S3; but, not as a valid Excel file.

The URL above is real (sample data so feel free to hit it) and if you run this code, you will see an 11k file written to S3, but Excel will complain the format is invalid.

I am 99%+ sure it has something to do with binary vs. text, and have spent 2 days poring over SO trying anything from base64 conversion to .blob() or .buffer(), different content-types for S3... but nothing did the trick. I am also 99%+ sure Puppeteer has nothing to do with the problem, tho wrapping the .fetch() within page.evaluate() does make it harder to do things like then(r => r.buffer()) complaining buffer is not a function...

Ideas? Thx!

Perhaps the problem is caused by `ContentType: 'text'`? See: [excel - What is a correct MIME type for .docx, .pptx, etc.? - Stack Overflow](https://stackoverflow.com/questions/4212861/what-is-a-correct-mime-type-for-docx-pptx-etc) — John Rotenstein, Jun 06 '21 at 22:18
Thanks @JohnRotenstein but wish it were as trivial. My feeling is it's more about how the bits are fetch()ed than how they're written. I tried ```ContentType: 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'``` and the result on disk (i.e. S3) is identical! I literally looked into the binary content of the 2 files produced on S3 with contnetType being text and excel-specific (as per the doc you referenced) and the 2 are identical down to the byte on S3. — gabe, Jun 06 '21 at 22:50
Are they identical to the original file? If not, can you see what is different? eg is the file length the same? — John Rotenstein, Jun 06 '21 at 23:17
@JohnRotenstein I haven't compared the "good" download (by just hitting the URL and saving locally to my laptop) and the "bad" version (what ends up on S3) at the byte level yet; but the difference in size is obvious: 11k vs. 7k so clearly more than a trivial difference. Thx — gabe, Jun 07 '21 at 00:04
A larger file suggests that it might be converting formats, such as base64. Can you run the same code on your own computer, and check the contents of the original file, downloaded file and uploaded file? — John Rotenstein, Jun 07 '21 at 00:07
Thx, @JohnRotenstein. Bad: ```PK?? ???????!?????]???????????????????docProps/app.xml??????MO???0???????H??????(???5?H???????L???K\&8???????????"Z;J?????????{???q???f???????????????C/N???r ?????????R?zj?v???|????.n?????????a???zB???????Yn??????yI?!????Y0?s-????????R????ap???b?Yi) ???p???:Em??????????????z???`,??? 4???8????>??????B??????????a?E6l???6???>``` Good: ```PK?? ???????!???]?????????????docProps/app.xml??MO?0???H??(?5?H???L?K\&8?????"Z;J???{???q?f???????C/N?r ????R?zj?v?|??.n???a?zB???Yn???yI?!??Y0?s-????R??ap?b?Yi) ?p?:Em????????z?`,? 4?8????>??B?????a?E6l?6?>``` — gabe, Jun 07 '21 at 02:26
What type of transform would cause the file contents to be fairly similar, yet a bit different? Btw "bad" is Node.js->fetch()->fs.WriteFileSynch(). and "good" is downloading straight from a browser just hitting the link — gabe, Jun 07 '21 at 02:29

Using lambda (node.js) + puppeteer to .fetch() Excel files and write to S3, file format is wrong

0 Answers0