Switching from beautifulsoup to htmlement - how to find elements

Question

I have an existing process that extracts elements from html documents that make use of the xbrli xml standard.

An example of a document can be found here:

The process works well (I'm using multiprocessing to work in parallel) but I have ~20m html and xml files to process and I'm finding beautifulsoup is the core bottleneck.

I am looking at htmlement as a hopefully quicker alternative for extracting the data I need, but I'm struggling to find elements. For example, in BS I can do the following:

for tag in soup.find_all('xbrli:unit'):

    l_unitid = tag.attrs.get('id')
    l_value = tag.text

    l_unit_dict[l_unitid] = {'unitid':l_unitid,'value':l_value}

Which will find all xbrli:unit tags and I can extract their values easily.

However, when I try something similar in htmlement I get the following exception:

import htmlement

source = htmlement.parse("Prod223_2542_00010416_20190331.html")

for tag in source.iterfind('.//xbrli:unit'):

    l_unitid = tag.attrs.get('id')
    l_value = tag.text

    print(l_unitid)
    print(l_value)

SyntaxError: prefix 'xbrli' not found in prefix map

A bit of googling led me to a few articles, but I can't seem to make progress:

SyntaxError: prefix 'a' not found in prefix map

Parsing XML with namespace in Python via 'ElementTree'

I've tried adding in a namespace map, but it's just not finding anything, no matter which way round I put things or what tags I look for:

source = htmlement.parse("Prod223_2542_00010416_20190331.html")

namespaces = {'xbrli': 'period'}

for tag in source.iterfind('.//xbrli:period',namespaces):

    l_unitid = tag.attrs.get('id')
    l_value = tag.text

namespaces = {'xbrli': 'period'}

for tag in source.iterfind('.//{xbrli}period',namespaces):

    l_unitid = tag.attrs.get('id')
    l_value = tag.text

    print(l_unitid)
    print(l_value)

namespaces = {'period':'xbrli'}
for tag in source.iterfind('.//{xbrli}period',namespaces):

    l_unitid = tag.attrs.get('id')
    l_value = tag.text

    print(l_unitid)
    print(l_value)

namespaces = {'period':'xbrli'}

for tag in source.iterfind('.//period',namespaces):

    l_unitid = tag.attrs.get('id')
    l_value = tag.text

    print(l_unitid)
    print(l_value)

All return nothing: they never enter the loop. I've clearly got something very wrong in my understanding of how to use the ElementTree structure vs BS, but I don't quite know how to move from one to the other.

Any suggestions would be welcome.

I'm a little confused: can you edit your question to show how exactly you obtain `source`? — Jack Fleeting, Apr 16 '20 at 12:38
@JackFleeting sorry yes, I hadn't spotted I'd missed that out. It's in there now — user7863288, Apr 16 '20 at 12:53

score 1 · Accepted Answer · answered Apr 16 '20 at 15:47

Two general comments before I get to a proposed answer: First, you are dealing with an xml document, so it's generally better to use an xml parser, not an html one. So that's what I'm using below instead of BeautifulSoup or htmlement.

Second, about xbrl generally: from bitter experience (and as many others pointed out), xbrl is terrible. It's shiny on the surface, but once you pop the hood, it's a mess. So I don't envy you...

With that said, I tried to approximate what you are likely looking for. I didn't bother to create dictionaries or lists, and just used print() statements. Obviously, if it helps you, you can modify it to your own requirements:

from lxml import etree
import requests

r = requests.get('https://beta.companieshouse.gov.uk/company/00010416/filing-history/MzI1MTU3MzQzMmFkaXF6a2N4/document?format=xhtml&download=1')

root = etree.fromstring(r.content)

# local-name() matches on the bare tag name, so no namespace map is needed
units = root.xpath(".//*[local-name()='unit'][@id]/@id")
for unit in units:
    print('unit: ', unit)
print('----------------------------')

# start_date/end_date are only set for duration periods; initialise them so the
# print below cannot raise NameError if the first context is an instant.
# Note that they keep the previous context's values for instant-only contexts.
start_date = end_date = None

contexts = root.xpath(".//*[local-name()='context']")
for tag in contexts:
    context_id = tag.xpath('./@id')  # xpath returns a list of matching ids
    print('ID: ', context_id)
    info = tag.xpath('./*[local-name()="entity"]')
    identifier = info[0].xpath('.//*[local-name()="identifier"]')[0].text
    print('identifier: ', identifier)
    member = info[0].xpath('.//*[local-name()="explicitMember"]')
    if len(member) > 0:
        dimension = member[0].attrib['dimension']
        explicit_member = member[0].text
        print('dimension: ', dimension, ' explicit member: ', explicit_member)
    periods = tag.xpath('.//*[local-name()="period"]')
    for period in periods:
        # iterate over the children directly; getchildren() is deprecated
        for child in period:
            if 'instant' in child.tag:
                instant = child.text
                print('instant: ', instant)
            else:
                dates = period.xpath('.//*')
                start_date = dates[0].text
                end_date = dates[1].text
        print('start date: ', start_date, ' end date: ', end_date)

    print('===================')

A random sample from the output:

ID:  ['cfwd_31_03_2018']
identifier:  00010416
instant:  2018-03-31
start date:  2017-04-01  end date:  2018-03-31
===================
ID:  ['CountriesHypercube_FY_31_03_2019_Set1']
identifier:  00010416
dimension:  ns15:CountriesRegionsDimension  explicit member:  ns15:EnglandWales
instant:  2018-03-31
start date:  2018-04-01  end date:  2019-03-31
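
For anyone who would rather keep the xbrli: prefix in the search path, here is a minimal sketch of what the prefix map is expected to look like. It uses the standard library's xml.etree.ElementTree so that it runs on its own (lxml accepts the same iterfind(path, namespaces) call), and the inline SAMPLE document is made up for illustration; only the URI http://www.xbrl.org/2003/instance is the real XBRL instance namespace. The key point is that the map goes from prefix to namespace URI, whereas the attempts in the question mapped the prefix to a tag name. Attributes are also read with .get() here, since .attrs is a BeautifulSoup attribute. Whether htmlement keeps namespace information at all is not something this sketch checks.

```python
import xml.etree.ElementTree as ET

# Made-up stand-in document; the real filings are far larger.
SAMPLE = """<root xmlns:xbrli="http://www.xbrl.org/2003/instance">
  <xbrli:unit id="GBP"><xbrli:measure>iso4217:GBP</xbrli:measure></xbrli:unit>
  <xbrli:unit id="shares"><xbrli:measure>xbrli:shares</xbrli:measure></xbrli:unit>
</root>"""

# The map goes from prefix to namespace URI, not from prefix to tag name.
NAMESPACES = {"xbrli": "http://www.xbrl.org/2003/instance"}

root = ET.fromstring(SAMPLE)

# Prefixed form: the prefix is resolved through NAMESPACES.
unit_ids = [unit.get("id") for unit in root.iterfind(".//xbrli:unit", NAMESPACES)]

# Clark notation: the braces hold the URI itself, so no map is needed.
unit_ids_clark = [
    unit.get("id")
    for unit in root.iterfind(".//{http://www.xbrl.org/2003/instance}unit")
]

print(unit_ids)
print(unit_ids_clark)
```

Both searches should find the same two elements. The local-name() XPath used above avoids the map entirely, which is handy when the prefix differs between filings.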

Wow, what a difference that makes. I've taken your example and extended it to replicate what I have already and the numbers are pretty impressive. I did a quick benchmark of 1000 documents stored locally using both methods - BS4 80.09s, LXML 12.62s. So a huge improvement. Thank you ! Also re xbrl - yes, it makes my eyes bleed! — user7863288, Apr 16 '20 at 19:17
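
As a rough idea of how the unit dictionary from the question could be rebuilt on top of an ElementTree-style parser, here is a small sketch. collect_units and SAMPLE are hypothetical names invented for this example, and it uses the standard library with the {*} namespace wildcard (Python 3.8 or later) instead of lxml's local-name() XPath, so treat it as one possible shape rather than the code that was benchmarked.

```python
import xml.etree.ElementTree as ET

# Made-up stand-in document for illustration only.
SAMPLE = """<root xmlns:xbrli="http://www.xbrl.org/2003/instance">
  <xbrli:unit id="GBP"><xbrli:measure>iso4217:GBP</xbrli:measure></xbrli:unit>
  <xbrli:unit id="shares"><xbrli:measure>xbrli:shares</xbrli:measure></xbrli:unit>
</root>"""


def collect_units(root):
    """Return {unit id: {'unitid': ..., 'value': ...}} for every unit element."""
    units = {}
    # '{*}unit' matches an element named 'unit' in any namespace (Python 3.8+).
    for unit in root.iterfind(".//{*}unit"):
        unit_id = unit.get("id")
        # itertext() also gathers text from child elements, much like
        # BeautifulSoup's tag.text; Element.text alone would miss it.
        value = "".join(unit.itertext()).strip()
        units[unit_id] = {"unitid": unit_id, "value": value}
    return units


l_unit_dict = collect_units(ET.fromstring(SAMPLE))
print(l_unit_dict)
```

Because the function takes an already parsed root, it can be called once per file from whatever worker function the multiprocessing setup already uses.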
