I have an existing process that extracts elements from HTML documents that use the xbrli XML standard.
An example of a document can be found here:
The process works well (I'm using multiprocessing to work in parallel), but I have ~20m HTML and XML files to process and I'm finding that BeautifulSoup is the core bottleneck.
I am looking at htmlement as a hopefully quicker alternative for extracting the data I need, but I'm struggling to find elements. For example, in BS I can do the following:
l_unit_dict = {}
for tag in soup.find_all('xbrli:unit'):
    l_unitid = tag.attrs.get('id')
    l_value = tag.text
    l_unit_dict[l_unitid] = {'unitid': l_unitid, 'value': l_value}
This finds all xbrli:unit tags, and I can extract their values easily.
However, when I try something similar in htmlement I get the following exception:
import htmlement
source = htmlement.parse("Prod223_2542_00010416_20190331.html")
for tag in source.iterfind('.//xbrli:unit'):
    l_unitid = tag.attrs.get('id')
    l_value = tag.text
    print(l_unitid)
    print(l_value)
SyntaxError: prefix 'xbrli' not found in prefix map
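For context on this error, here is a minimal sketch of how ElementTree-style paths resolve a prefix. It uses only the standard library `xml.etree.ElementTree` on a small made-up XML string; `SAMPLE`, `NS`, and `extract_units` are illustrative names, not part of the real process. It assumes a namespace-aware XML parser, and whether htmlement keeps namespaces the same way is not verified here. The map passed to `iterfind` goes from prefix to namespace URI (for xbrli that is http://www.xbrl.org/2003/instance), not to a tag name.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed XML standing in for a real filing (assumption).
SAMPLE = """<root xmlns:xbrli="http://www.xbrl.org/2003/instance">
  <xbrli:unit id="GBP"><xbrli:measure>iso4217:GBP</xbrli:measure></xbrli:unit>
  <xbrli:unit id="shares"><xbrli:measure>xbrli:shares</xbrli:measure></xbrli:unit>
</root>"""

# Prefix -> namespace URI. The prefix on the left is only a local alias;
# the URI on the right is what the parser actually stores in each tag.
NS = {"xbrli": "http://www.xbrl.org/2003/instance"}


def extract_units(root):
    """Collect every xbrli:unit element into a dict keyed by its id."""
    units = {}
    for tag in root.iterfind(".//xbrli:unit", NS):
        # ElementTree exposes attributes as .attrib (BeautifulSoup uses .attrs).
        unit_id = tag.attrib.get("id")
        # .text is only the text before the first child element, so
        # itertext() is used to gather the nested text as BS's .text does.
        value = "".join(tag.itertext()).strip()
        units[unit_id] = {"unitid": unit_id, "value": value}
    return units


units = extract_units(ET.fromstring(SAMPLE))
print(units)
```

The same lookup can be written without a map by spelling out the URI in braces, as in `.//{http://www.xbrl.org/2003/instance}unit`.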
A bit of googling led me to a few articles, but I can't seem to make progress:
SyntaxError: prefix 'a' not found in prefix map
Parsing XML with namespace in Python via 'ElementTree'
I've tried adding a namespace map, but it's just not finding anything, no matter which way round I put things or which tags I look for:
source = htmlement.parse("Prod223_2542_00010416_20190331.html")

# Attempt 1
namespaces = {'xbrli': 'period'}
for tag in source.iterfind('.//xbrli:period', namespaces):
    l_unitid = tag.attrs.get('id')
    l_value = tag.text

# Attempt 2
namespaces = {'xbrli': 'period'}
for tag in source.iterfind('.//{xbrli}period', namespaces):
    l_unitid = tag.attrs.get('id')
    l_value = tag.text
    print(l_unitid)
    print(l_value)

# Attempt 3
namespaces = {'period': 'xbrli'}
for tag in source.iterfind('.//{xbrli}period', namespaces):
    l_unitid = tag.attrs.get('id')
    l_value = tag.text
    print(l_unitid)
    print(l_value)

# Attempt 4
namespaces = {'period': 'xbrli'}
for tag in source.iterfind('.//period', namespaces):
    l_unitid = tag.attrs.get('id')
    l_value = tag.text
    print(l_unitid)
    print(l_value)
All of these return nothing: none of them enters the loop. I've clearly got something very wrong in my understanding of how to use the ElementTree structure compared with BS, but I don't quite know how to move from one to the other.
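When every path comes back empty, one way to narrow it down is to list the tag names the parser actually produced and search for exactly those. The sketch below uses only the standard library. `tag_counts` is a hypothetical helper, and `html_like_root` is hand-built to simulate a tree in which the prefix was kept as a literal part of the tag name. That is an assumption about how an HTML parser might behave, not a confirmed description of htmlement.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tag_counts(root):
    """Count every tag name in the tree exactly as the parser stored it."""
    # Comments and processing instructions have non-string tags; skip them.
    return Counter(el.tag for el in root.iter() if isinstance(el.tag, str))


# Case 1: a namespace-aware XML parse stores tags as '{uri}localname'.
xml_root = ET.fromstring(
    '<r xmlns:xbrli="http://www.xbrl.org/2003/instance">'
    '<xbrli:unit id="u1"/></r>'
)

# Case 2 (simulated assumption): the prefix survives as plain text in the tag.
html_like_root = ET.Element("html")
ET.SubElement(html_like_root, "xbrli:unit", {"id": "u1"})

xml_tags = tag_counts(xml_root)
html_like_tags = tag_counts(html_like_root)
print(xml_tags)
print(html_like_tags)

# Element.iter() compares tag strings exactly, so a literal 'xbrli:unit'
# tag can be matched without any namespace map.
literal_matches = [el.attrib.get("id") for el in html_like_root.iter("xbrli:unit")]
print(literal_matches)
```

Running `tag_counts` on the real parsed tree should show which of the two forms applies, and therefore which search string has a chance of matching.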
Any suggestions would be welcome.