You have a quote mismatch. Note carefully the difference between 'text-info'
and "text-info"
in your example string and in your compiled regex. IIRC escaping quotes in raw strings is a bit of a pain in Python (if it's even possible?), but string concatenation sidesteps the issue handily.
area = re.compile(r'</span></li><li>(\d+)<span class='"'"'text-info'"'"'>Área Útil')
Focusing on the quotes, this is concatenating the strings '...class'
, "'"
, 'text-info'
, "'"
, and '>...
. The rule there is that if you want a single quote '
in a single-quote raw string you instead write '"'"'
and try to ignore Turing turning in his grave. I haven't tested the performance, but I think it might behave much like '...class' + "'" + 'text-info' + "'" + '>...
. If that's the case, there is a bunch of copying happening behind the scenes, and that strategy has a quadratic runtime in the number of pieces being concatenated (assuming they're roughly the same size and otherwise generally nice for such an analysis). You'd be better off with nearly any other strategy (such as ''.join(...)
or using triple quoted raw strings r'''...'''
). It might not be a problem though. Benchmark your solution and see if it's good enough before messing with alternatives.
As one of the comments mentioned, you probably want to be parsing the HTML with something more powerful than regex. Regex cannot properly parse arbitrary HTML since it can't parse arbitrarily nested structures. There are plenty of libraries to make the job easier though and handle all of the bracket matching and string munging for you so that you can focus on a high-level description of exactly the data you want. I'm a fan of lxml
. Without putting a ton of time into it, something like the following would be roughly equivalent to what you're doing.
from lxml import html
with open("body2.txt", 'r') as f:
tree = html.fromstring(f.read())
area = tree.xpath("//li[contains(span/text(), 'Área Útil')]/text()")
print(area)
The html.fromstring()
method parses your data as html. The tree.xpath
method uses xpath
syntax to query that parsed tree. Roughly speaking it means the following:
//
Arbitrarily far down in the tree
li
A list node
[*]
Satisfying whatever property is in the square brackets
contains(span/text(), 'Área Útil')
The li
node needs to have a span/text()
node containing the text 'Área Útil'
/text()
We want any text that is an immediate child of the root li
we're describing.
I'm working on a pretty small amount of text here and don't know what your document structure is in the general case. You could add or change any of those properties to better describe the exact document you're parsing. When you inspect an element, any modern browser is able to generate a decent xpath expression to pick out exactly the element you're inspecting. Supposing this snippet came from a larger document I would imagine that functionality would be a time saver for you.