Extract a content from `

Question

I have the string value of an HTML <a> tag's href attribute, which holds CSV content:

href = "data:text/csv;charset=UTF-8,%22csvcontentfollows"

Is there a way to get this CSV content without the meta headers (data:text/csv;charset=UTF-8) using standard modules (requests, lxml, pure Python)?

I would rather not use manual text parsing (regexp, index/startswith, split/partition).

UPDATE:

Thanks, I know how to work with HTML. My question is about these meta headers. I have reformulated it.

Many answers https://stackoverflow.com/questions/33870538/how-to-parse-data-uri-in-python/58633199 — bl79, Oct 30 '19 at 20:34

dmmfll · Answer 1 · 2019-10-31T11:47:38.910

Here are three possible solutions. The first uses the native Python urllib.request.urlopen function, with the third-party library lxml to pull the href values out of the markup. The second uses lxml together with python-datauri, a third-party library for parsing data URLs. The third uses the HTMLParser class from the native html.parser module, again with python-datauri.

html_string = """
<a href="data:text/csv;charset=UTF-8,%22csvcontentfollows">
<a href="data:text/csv;charset=UTF-8,%22csvcontentfollows">
<a href="data:text/csv;charset=UTF-8,%22csvcontentfollows">
"""

from contextlib import ExitStack
from urllib.request import urlopen
import lxml.etree

HREF = "href"

tree = lxml.etree.fromstring(html_string, lxml.etree.HTMLParser())

uris = (
    item.attrib[HREF]
    for item in tree.iterdescendants()
    if HREF in item.attrib
)

with ExitStack() as stack:
    resources = (stack.enter_context(urlopen(uri)) for uri in uris)
    data = [fh.read().decode() for fh in resources]
print(data)

OUTPUT: ['"csvcontentfollows', '"csvcontentfollows', '"csvcontentfollows']
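
For a single URI, the same urlopen call also exposes the media type and charset through the response headers, so nothing has to be parsed by hand. A minimal sketch, assuming Python 3.4 or later (where urllib.request gained data: URL support); the function name read_data_uri is made up for illustration:

```python
from urllib.request import urlopen


def read_data_uri(uri):
    """Return (mimetype, charset, text) for a data: URI using only the stdlib."""
    with urlopen(uri) as response:
        headers = response.headers  # an email.message.Message instance
        mimetype = headers.get_content_type()
        # Fall back to US-ASCII, the RFC 2397 default, when no charset is given.
        charset = headers.get_content_charset() or "us-ascii"
        text = response.read().decode(charset)
    return mimetype, charset, text


mimetype, charset, text = read_data_uri(
    "data:text/csv;charset=UTF-8,%22csvcontentfollows"
)
print(mimetype, charset, repr(text))
```

Note that the percent-escape %22 is part of the payload, so the decoded text starts with a double quote.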

import lxml.etree
from datauri import DataURI

tree = lxml.etree.fromstring(html_string, lxml.etree.HTMLParser())

HREF = "href"

uris = (
    DataURI(item.attrib[HREF])
    for item in tree.iterdescendants()
    if HREF in item.attrib
)
attrs = ("mimetype", "charset", "is_base64", "data")
print([{attr: getattr(uri, attr) for attr in attrs} 
       for uri in uris])

OUTPUT:

[{'mimetype': 'text/csv', 'charset': 'UTF-8', 'is_base64': False, 'data': 'csvcontentfollows'}, {'mimetype': 'text/csv', 'charset': 'UTF-8', 'is_base64': False, 'data': 'csvcontentfollows'}, {'mimetype': 'text/csv', 'charset': 'UTF-8', 'is_base64': False, 'data': 'csvcontentfollows'}]

from html.parser import HTMLParser
from datauri import DataURI

uri_attrs = ("mimetype", "charset", "is_base64", "data")

class MyHTMLParser(HTMLParser):

    def __init__(self):
        super().__init__()
        self.data = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for attr, value in attrs:
                if attr == "href":
                    # Parse only the href value, not every attribute on the tag.
                    uri = DataURI(value)
                    self.data.append({name: getattr(uri, name) for name in uri_attrs})

parser = MyHTMLParser()
parser.feed(html_string)
print(parser.data)

OUTPUT:

[{'mimetype': 'text/csv', 'charset': 'UTF-8', 'is_base64': False, 'data': 'csvcontentfollows'}, {'mimetype': 'text/csv', 'charset': 'UTF-8', 'is_base64': False, 'data': 'csvcontentfollows'}, {'mimetype': 'text/csv', 'charset': 'UTF-8', 'is_base64': False, 'data': 'csvcontentfollows'}]
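
If third-party packages are not an option, the html.parser approach can probably be combined with urlopen so that only the standard library is involved. This is a sketch rather than a tested recipe: the class name DataHrefParser and the SAMPLE_HTML string are hypothetical, and the startswith check is only a guard so that urlopen is never handed an http: link.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

# Hypothetical sample markup, repeated here so the sketch runs on its own.
SAMPLE_HTML = '<a href="data:text/csv;charset=UTF-8,%22csvcontentfollows">link</a>'


class DataHrefParser(HTMLParser):
    """Collect the decoded payload of every <a href="data:..."> tag."""

    def __init__(self):
        super().__init__()
        self.payloads = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Skip non-data links so no network request is ever made.
            if name == "href" and value and value.startswith("data:"):
                with urlopen(value) as response:
                    charset = response.headers.get_content_charset() or "us-ascii"
                    self.payloads.append(response.read().decode(charset))


parser = DataHrefParser()
parser.feed(SAMPLE_HTML)
print(parser.payloads)
```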

I guess I poorly formulated the question. I know how to extract the value of a tag attribute. I wanted to get that content without manual text parsing (via regexp, index / beginswith, split / partition). — bl79, Oct 30 '19 at 02:36
Interesting question. I did some research. It's a data URI, so you need a data URI parser: https://en.wikipedia.org/wiki/Data_URI_scheme More on data URLs at MDN web docs: https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/Data_URIs Python library for parsing data URIs: https://pypi.org/project/python-datauri/ — dmmfll, Oct 30 '19 at 13:21
I added an update. Thanks for asking this question. I have recently been using data URLs in image tags and have been joining the strings by hand. I prefer a parser when one exists. I hadn't realized it was a properly defined scheme. — dmmfll, Oct 30 '19 at 13:42
Turns out that native Python has data URI support in urllib.request. You can also just use `data_uri.split(",", 1)`, and there are other ways. See the link in my comment above. — bl79, Oct 30 '19 at 20:46
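
For comparison, the split approach mentioned in the last comment might look like the sketch below. It is manual parsing, which the question wanted to avoid, and it is shown only to illustrate what urlopen does on your behalf; the helper name split_data_uri is hypothetical.

```python
import base64
from urllib.parse import unquote_to_bytes


def split_data_uri(data_uri):
    """Split a data: URI into its header and its decoded payload bytes."""
    header, payload = data_uri.split(",", 1)
    raw = unquote_to_bytes(payload)  # undo percent-escapes such as %22
    if header.endswith(";base64"):
        raw = base64.b64decode(raw)
    return header, raw


header, raw = split_data_uri("data:text/csv;charset=UTF-8,%22csvcontentfollows")
print(header, raw)
```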
