I want to extract a specific number from a string, based on the text that follows it.

The following fragment is stored in body2.txt:

string = "<li>3 <span class='text-info'>quartos</span></li><li>1 <span class='text-info'>suíte</span></li><li>96<span class='text-info'>Área Útil (m²)</span></li>"

import re

with open("body2.txt", 'r') as f:
  area = re.compile(r'</span></li><li>(\d+)<span class="text-info">Área Útil')
  area = area.findall(f.read())
  print(area)

Output: []
Expected output: 96

rjunior8

2 Answers

You have a quote mismatch. Note carefully the difference between 'text-info' in your example string and "text-info" in your compiled regex. Escaping quotes in raw strings is a bit of a pain in Python: r'\'' is legal, but the backslash stays in the string (a regex happens to read \' as a literal quote, so it would still match). String concatenation sidesteps the issue handily.

area = re.compile(r'</span></li><li>(\d+)<span class='"'"'text-info'"'"'>Área Útil')

Focusing on the quotes, this is concatenating the strings '...class', "'", 'text-info', "'", and '>...'. The rule there is that if you want a single quote ' in a single-quoted raw string you instead write '"'"' and try to ignore Turing turning in his grave. Only the first piece carries the r prefix, which is fine here because the other pieces contain no backslashes. Adjacent string literals like these are joined when the code is compiled, so there is no run-time concatenation and no performance concern; the only price is readability. If the quote soup bothers you, delimit the pattern with double quotes (r"...") or use a triple-quoted raw string (r'''...'''). Both let you write the single quotes directly.
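As a quick check, here is a minimal sketch that runs the question's double-quoted pattern and three equivalent spellings of the single-quoted pattern against the sample string. The variable names are illustrative and not from the original post.

```python
import re

# Sample fragment from the question (the class attribute uses single quotes).
string = ("<li>3 <span class='text-info'>quartos</span></li>"
          "<li>1 <span class='text-info'>suíte</span></li>"
          "<li>96<span class='text-info'>Área Útil (m²)</span></li>")

# The question's pattern: double quotes around text-info, so nothing matches.
mismatched = r'</span></li><li>(\d+)<span class="text-info">Área Útil'

# Three spellings of the same single-quoted pattern.
concatenated = r'</span></li><li>(\d+)<span class='"'"'text-info'"'"'>Área Útil'
double_quoted = r"</span></li><li>(\d+)<span class='text-info'>Área Útil"
triple_quoted = r'''</span></li><li>(\d+)<span class='text-info'>Área Útil'''

print(re.findall(mismatched, string))                   # []
print(concatenated == double_quoted == triple_quoted)   # True
print(re.findall(double_quoted, string))                # ['96']
```

The double-quoted spelling is probably the easiest to read when the pattern itself contains only single quotes.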

As one of the comments mentioned, you probably want to be parsing the HTML with something more powerful than regex. Regex cannot properly parse arbitrary HTML since it can't parse arbitrarily nested structures. There are plenty of libraries to make the job easier though and handle all of the bracket matching and string munging for you so that you can focus on a high-level description of exactly the data you want. I'm a fan of lxml. Without putting a ton of time into it, something like the following would be roughly equivalent to what you're doing.

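from lxml import html

# encoding="utf-8" is an assumption about how body2.txt was saved;
# it keeps 'Área Útil' intact regardless of the platform default.
with open("body2.txt", 'r', encoding="utf-8") as f:
    tree = html.fromstring(f.read())

# Text directly under every <li> whose <span> mentions 'Área Útil'.
area = tree.xpath("//li[contains(span/text(), 'Área Útil')]/text()")
print(area)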

The html.fromstring() function parses your data as HTML. The tree.xpath() method uses XPath syntax to query that parsed tree. Roughly speaking, the expression means the following:

  1. // : Arbitrarily far down in the tree
  2. li : A list item node
  3. [*] : Satisfying whatever property is in the square brackets
  4. contains(span/text(), 'Área Útil') : The li node needs to have a span/text() node containing the text 'Área Útil'
  5. /text() : We want any text that is an immediate child of the li we're describing.
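For readers without lxml installed, the same five steps can be approximated with the standard library. This is only a sketch: xml.etree.ElementTree is an XML parser with limited XPath support (no contains() or text()), so the filter is written as a Python loop, and it assumes the fragment is well-formed XML, which real-world HTML often is not. The wrapping <ul> element and the helper name area_values are illustrative additions.

```python
import xml.etree.ElementTree as ET

# Sample fragment from the question.
string = ("<li>3 <span class='text-info'>quartos</span></li>"
          "<li>1 <span class='text-info'>suíte</span></li>"
          "<li>96<span class='text-info'>Área Útil (m²)</span></li>")


def area_values(fragment, label="Área Útil"):
    """Return the leading text of each <li> whose child <span> contains `label`."""
    # Wrap the fragment so the XML parser sees a single root element.
    root = ET.fromstring("<ul>" + fragment + "</ul>")
    values = []
    for li in root.iter("li"):                    # steps 1-2: any <li> in the tree
        spans = li.findall("span")                # direct <span> children only
        # steps 3-4: keep the <li> if one of its spans mentions the label
        if any(label in (span.text or "") for span in spans):
            # step 5: li.text is the text before the first child element
            values.append((li.text or "").strip())
    return values


print(area_values(string))  # ['96']
```

Unlike the XPath /text() step, li.text ignores any text that follows a child element, which is enough for this fragment but may differ on other documents.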

I'm working on a pretty small amount of text here and don't know what your document structure is in the general case. You could add or change any of those properties to better describe the exact document you're parsing. When you inspect an element, any modern browser is able to generate a decent xpath expression to pick out exactly the element you're inspecting. Supposing this snippet came from a larger document I would imagine that functionality would be a time saver for you.

Hans Musgrave

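This is meant to capture the right digits regardless of how the span tag is written: the class attribute can appear anywhere in the tag, with either quote style and arbitrary whitespace.
Capture group 1 contains the digits.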

r"(\d*)\s*<span(?=\s)(?=(?:[^>\"']|\"[^\"]*\"|'[^']*')*?\sclass\s*=\s*(?:(['\"])\s*text-info\s*\2))\s+(?=((?:\"[\S\s]*?\"|'[\S\s]*?'|[^>]?)+>))\3\s*Área\s+Útil"

https://regex101.com/r/pMATkj/1
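A minimal sketch of how this pattern might be used from Python. The variant string with a double-quoted class attribute is a made-up test input, not data from the question.

```python
import re

pattern = re.compile(
    r"(\d*)\s*<span(?=\s)(?=(?:[^>\"']|\"[^\"]*\"|'[^']*')*?\sclass\s*=\s*(?:(['\"])\s*text-info\s*\2))\s+(?=((?:\"[\S\s]*?\"|'[\S\s]*?'|[^>]?)+>))\3\s*Área\s+Útil"
)

# Sample fragment from the question (single-quoted attribute).
string = ("<li>3 <span class='text-info'>quartos</span></li>"
          "<li>1 <span class='text-info'>suíte</span></li>"
          "<li>96<span class='text-info'>Área Útil (m²)</span></li>")

# Hypothetical variant: double-quoted attribute and a space before the tag.
variant = '<li>120 <span class="text-info">Área Útil (m²)</span></li>'

for text in (string, variant):
    match = pattern.search(text)
    # Group 1 holds the digits; groups 2 and 3 are internal helpers.
    print(match.group(1) if match else None)
```

Because the pattern has three capture groups, re.findall would return tuples; search() or finditer() with group(1) keeps the result to just the digits. Note that (\d*) also accepts an empty run of digits, so it may be worth checking that group 1 is non-empty.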