How to parse data-uri in python?

Question

HTML image elements have this simplified format:

<img src='something'>

That something can be data-uri, for example:

data:image/png;base64,iVBORw0KGg...

Is there a standard way of parsing this with python, so that I get content_type and base64 data separated, or should I create my own parser for this?

http://stackoverflow.com/questions/30267199/downloading-image-data-uris-from-webpages-via-beautifulsoup — furas, Nov 23 '15 at 12:04
Python 3 parses these natively https://docs.python.org/3/library/urllib.request.html#urllib.request.DataHandler — Duke Dougal, May 20 '16 at 19:43

JRodDynamite · Accepted Answer · 2023-02-24T08:22:34.017

31

Split the data URI on the "base64," marker to separate the header from the base64-encoded data. Call base64.b64decode to decode that data to bytes. Last, write the bytes to a file.

from base64 import b64decode

data_uri = "data:image/png;base64,iVBORw0KGg..."

# Python 2 and <Python 3.4
header, encoded = data_uri.split("base64,", 1)
data = b64decode(encoded)

# Python 3.4+
# from urllib import request
# with request.urlopen(data_uri) as response:
#     data = response.read()

with open("image.png", "wb") as f:
    f.write(data)

edited Feb 24 '23 at 08:22

answered Nov 23 '15 at 12:04

JRodDynamite

Splitting on the first comma is not necessarily correct, because the MIME type may contain a comma as well, for example: `data:video/webm; codecs="vp8, opus";base64,GkXfowEAAAAAAAAfQoaBAUL3g...` – Darkyen Aug 11 '21 at 09:48

And quotes won't help, because this is also possible: `data:video/webm;codecs=vp8,opus;base64,GkXfo59...` – Darkyen Aug 11 '21 at 10:16
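One way to sidestep the comma concern raised in these comments, sketched under the assumption that the payload is always base64: the base64 alphabet contains no comma, so the last comma in the URI must be the one that ends the header. The helper name split_base64_data_uri is hypothetical and not part of any library, and the short payload below is only the first four bytes of a WebM file.

```python
from base64 import b64decode


def split_base64_data_uri(data_uri):
    """Hypothetical helper: return (media_type, raw_bytes) for a base64 data URI."""
    if not data_uri.startswith("data:"):
        raise ValueError("not a data URI")
    # Base64 text never contains a comma, so the last comma ends the header.
    header, _, encoded = data_uri[len("data:"):].rpartition(",")
    if not header.endswith(";base64"):
        raise ValueError("only base64 data URIs are handled in this sketch")
    media_type = header[: -len(";base64")]
    return media_type, b64decode(encoded)


media_type, data = split_base64_data_uri(
    "data:video/webm;codecs=vp8,opus;base64,GkXfow=="
)
print(media_type)  # video/webm;codecs=vp8,opus
print(data)        # b'\x1aE\xdf\xa3'
```

This does not cover data URIs with percent-encoded (non-base64) payloads, which can legitimately contain commas in the data part.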

score 25 · Answer 2 · edited Dec 16 '21 at 20:22

Since Python 3.4, urllib.request.urlopen accepts data URIs directly. Under the hood they are handled by urllib.request.DataHandler.

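from urllib.request import urlopen

# Placeholder value, same truncated example as in the question.
data_uri = "data:image/png;base64,iVBORw0KGg..."
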
with urlopen(data_uri) as response:
    data = response.read()
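The question also asks for the content type, not just the bytes. A minimal sketch of how that could be read from the same response, assuming the short sample URI used elsewhere on this page: the object returned by urlopen carries a headers attribute (an email.message.Message), and its get_content_type() method reports the media type.

```python
from urllib.request import urlopen

# Sample data URI; the payload is just the first seven bytes of a PNG file.
data_uri = "data:image/png;base64,iVBORw0KGg=="

with urlopen(data_uri) as response:
    # The media type from the URI header is exposed as a Content-type header.
    content_type = response.headers.get_content_type()
    data = response.read()

print(content_type)  # image/png
print(data)          # b'\x89PNG\r\n\x1a'
```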

edited Dec 16 '21 at 20:22

Noelle L.

answered Oct 30 '19 at 20:29

bl79

score 11 · Answer 3 · answered Oct 04 '18 at 21:50

w3lib (a library used by Scrapy) has a function to parse data uris:

>>> from w3lib.url import parse_data_uri
>>> parse_data_uri('data:image/png;base64,iVBORw0KGg==')
ParseDataURIResult(media_type='image/png', media_type_parameters={}, data=b'\x89PNG\r\n\x1a')

answered Oct 04 '18 at 21:50

Mikhail Korobov

the prettiest solution imho: short and produces well-structured result – Andrey Belyak Jan 07 '19 at 12:12

Andrés Pérez-Albela H. · Answer 4 · 2015-11-23T12:31:07.143

This may help:

import re
from base64 import b64decode

from lxml import html

BASE_NAME = "image_"

source_code = """<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAUA
AAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO
9TXL0Y4OHwAAAABJRU5ErkJggg==" alt="Red dot" />
<img src="data:image/gif;base64,R0lGODlhAQABAIAAAAUEBAAAACwAAAAAAQABAAACAkQBADs=" alt="Black dot" />"""

tree = html.fromstring(source_code)

# Select the src attribute of every <img> whose source is an image data URI.
for i, image in enumerate(tree.xpath('//img[contains(@src, "data:image")]/@src')):
    # Everything before the first comma is the header, the rest is the payload.
    image_type, image_content = image.split(',', 1)
    image_type = re.findall(r'data:image/(\w+);base64', image_type)[0]
    with open("{}{}.{}".format(BASE_NAME, i, image_type), "wb") as f:
        # b64decode replaces the Python 2-only str.decode('base64').
        f.write(b64decode(image_content))
    print("[*] '{}' image found with content: {}\n".format(image_type, image_content))

Output:

[*] 'png' image found with content: iVBORw0KGgoAAAANSUhEUgAAAAUA
AAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO
9TXL0Y4OHwAAAABJRU5ErkJggg==

[*] 'gif' image found with content: R0lGODlhAQABAIAAAAUEBAAAACwAAAAAAQABAAACAkQBADs=

It saves every base64 image found in <img> tags with its respective file extension. Each file name is BASE_NAME, followed by the index provided by enumerate, followed by the image extension (image_0.png and image_1.gif for the input above).
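For readers who would rather avoid the lxml dependency, here is a rough standard-library sketch of the same idea. It assumes Python 3.4+ so that urlopen can decode the data URIs, and it returns the decoded images instead of writing files. The names DataImageCollector and extract_images are hypothetical, chosen only for this illustration.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class DataImageCollector(HTMLParser):
    """Hypothetical helper: gather src values of <img> tags holding data URIs."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <img ... />.
        if tag == "img":
            src = dict(attrs).get("src") or ""
            if src.startswith("data:image"):
                self.sources.append(src)


def extract_images(markup):
    """Return a list of (content_type, raw_bytes) for each data-URI image."""
    collector = DataImageCollector()
    collector.feed(markup)
    images = []
    for src in collector.sources:
        with urlopen(src) as response:
            images.append((response.headers.get_content_type(), response.read()))
    return images


markup = (
    '<img src="data:image/gif;base64,'
    'R0lGODlhAQABAIAAAAUEBAAAACwAAAAAAQABAAACAkQBADs=" alt="Black dot" />'
)
images = extract_images(markup)
print(images[0][0])      # image/gif
print(images[0][1][:6])  # b'GIF89a'
```

The content type returned alongside the bytes can then be used to pick a file extension, as the regular expression does in the answer above.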

score 1 · Answer 5 · answered Feb 19 '17 at 23:55

Correcting JRodDynamite's post:

from base64 import b64decode

png_arr = "data:image/png;base64,iVBORw0KGg..."
# Keep only the payload after the first comma.
png_arr = png_arr.split(",", 1)[1]

# b64decode replaces decodestring, which was removed in Python 3.9.
with open("imageToSave.png", "wb") as fh:
    fh.write(b64decode(png_arr))

answered Feb 19 '17 at 23:55

Frodo McPytel

score 0 · Answer 6 · edited Nov 30 '20 at 07:33

from urllib import request

def download(data_uri, name):
    with request.urlopen(data_uri) as response:
        data = response.read()
    with open(name, "wb") as f:
        f.write(data)

en = "https://encrypted-tbn0.gstatic.com/images..."
src = "data:image/png;base64,..."

download(en, "en")
download(src, "src")

edited Nov 30 '20 at 07:33

shreyasm-dev

answered Nov 29 '20 at 20:51

dimitris katsanos
