I am using the Gmail API to download e-mails. When these e-mails are HTML, I try to convert them to PDF using Python's pdfkit.

This works in many cases, but in some cases the HTML payload contains image tags with a source like src="cid:169abdc4ae2c4da871d2".

It seems that this "cid:" reference points to an image sent as another part of the multipart e-mail, but pdfkit cannot process it. The error is:

wkhtmltopdf reported an error:
Loading pages (1/6)
Error: Failed to load cid:169abf0d0cdfffb7aff2, with network status code 301 and http status code 0 - Protocol "cid" is unknown

How can I solve this? Is there a way to convert the HTML I get from the Gmail payload to standard HTML with proper image sources?

Alexis Eggermont
  • Try the steps here? https://stackoverflow.com/questions/55130360/python-download-as-pdf-all-emails-from-a-label-gmail – Ezra Jul 02 '21 at 20:33
  • Please read [this answer](https://stackoverflow.com/a/53658868/5022913). Hope it helps! – s0mbre Jul 07 '21 at 04:26

1 Answer

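You can use the remove_tags function from the w3lib package (module w3lib.html) to strip tags from the HTML before converting it. Note that this drops the tags rather than resolving their "cid:" sources; passing which_ones=('img',) would remove only the image tags.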

Remove all tags:

>>> import w3lib.html
>>> doc = '<div><p><b>This is a link:</b> <a href="http://www.example.com">example</a></p></div>'
>>> w3lib.html.remove_tags(doc)
'This is a link: example'

Remove specific tags:

>>> w3lib.html.remove_tags(doc, which_ones=('a','b'))
'<div><p>This is a link: example</p></div>'
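
If you would rather keep the images than strip them, one possible approach is to replace each "cid:" reference with a data URI built from the matching MIME part. The following is only a sketch, not a tested solution for every mail client: it assumes you have the full RFC 822 message as bytes (for example by requesting the message with format="raw" from the Gmail API and base64url-decoding the `raw` field), and it uses only the standard library. The function name `inline_cid_images` is made up for this example, and it assumes the cid values in the HTML are not percent-encoded.

```python
import base64
import email
import re
from email import policy


def inline_cid_images(raw_message: bytes) -> str:
    """Return the first text/html body with cid: references replaced by data URIs.

    Hypothetical helper for illustration; references with no matching
    Content-ID part are left unchanged.
    """
    msg = email.message_from_bytes(raw_message, policy=policy.default)

    html = None
    data_uris = {}  # Content-ID without angle brackets -> data URI

    for part in msg.walk():
        if part.is_multipart():
            continue
        content_type = part.get_content_type()
        content_id = part.get("Content-ID")
        if html is None and content_type == "text/html":
            html = part.get_content()
        elif content_id:
            payload = part.get_payload(decode=True) or b""
            encoded = base64.b64encode(payload).decode("ascii")
            key = str(content_id).strip().strip("<>")
            data_uris[key] = f"data:{content_type};base64,{encoded}"

    if html is None:
        raise ValueError("message has no text/html part")

    def replace(match):
        # Fall back to the original text when the cid is unknown.
        return data_uris.get(match.group(1), match.group(0))

    return re.sub(r"cid:([^\"'\s>)]+)", replace, html)
```

The returned string could then be passed to pdfkit.from_string(html, "out.pdf"). This relies on wkhtmltopdf rendering data URIs, which I believe it does but have not verified for large attachments.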