
I'm working on a web scraper. Among the fields it scrapes is a description tag like this one, which is different for each product:

<div class="productDescription" style="overflow: hidden; display: block;">
Black Tshirt
<br>
<br>
REF.: V23T87C88EC
<br>
<br>
COMPOSIÇÃO:
<br>
90% Poliamida
</div>

I can get the content of the description tag without problems, but I also need to get the value of REF inside the description (V23T87C88EC for this example).

The problem is that the description is different for every product. However, there is always a "REF.: XXXXXXXXX" substring in it. The length of the REF id can vary, and it can appear anywhere in the string. What's the best way to reliably get the REF id?

Alain

2 Answers


A possible solution is the following:

html = """<div class="productDescription" style="overflow: hidden;  
display: block;">
Black Tshirt
<br>
<br>
REF.: V23T87C88EC
<br>
<br>
COMPOSIÇÃO:
<br>
90% Poliamida
</div>"""

import re

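# re.MULTILINE lets $ match at the end of each line, not only at the end of the string
pattern = re.compile(r'REF\.: (.+?)$', re.MULTILINE)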

found = pattern.findall(html)

Returns ['V23T87C88EC']
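
If the scraped HTML arrives on a single line, or the spacing after "REF.:" varies, anchoring on the end of a line may not be enough. Here is a minimal sketch that stops at the next whitespace or tag instead. It assumes the id never contains whitespace or "<"; that assumption, the name REF_PATTERN and the two sample strings are mine, not from the question.

```python
import re

# Hypothetical pattern: "REF.:" followed by optional whitespace, then the id.
# The id is assumed to run up to the next whitespace character or "<".
REF_PATTERN = re.compile(r"REF\.:\s*([^\s<]+)")

# Illustrative inputs: the same description on one line, and as plain text
# with extra spaces around the id.
one_line = "<div>Black Tshirt<br><br>REF.: V23T87C88EC<br><br>COMPOSIÇÃO:<br>90% Poliamida</div>"
multi_line = "Black Tshirt\n\nREF.:   V23T87C88EC  \n\nCOMPOSIÇÃO:\n90% Poliamida"

refs = [REF_PATTERN.search(text).group(1) for text in (one_line, multi_line)]
print(refs)  # ['V23T87C88EC', 'V23T87C88EC']
```

If real ids can contain spaces or other punctuation, the character class would need adjusting.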


gremur

You can do this with a regex (read more about regex: https://docs.python.org/3/howto/regex.html):

html = '''
<div class="productDescription" style="overflow: hidden; display: block;">
Black Tshirt
<br>
<br>
REF.: V23T87C88EC
<br>
<br>
COMPOSIÇÃO:
<br>
90% Poliamida
</div>
'''

import re

# The dot is escaped so it matches a literal "." rather than any character
myref = re.search(r"(?<=REF\.: )\w+", html)[0]

print(myref)

# V23T87C88EC
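
Note that indexing the result of re.search directly raises a TypeError when nothing matches, because re.search returns None in that case. The question says the REF is always present, but a scraper may still meet a malformed page. A minimal sketch of a safer wrapper follows; the function name extract_ref and the sample strings are hypothetical.

```python
import re
from typing import Optional


def extract_ref(description: str) -> Optional[str]:
    """Return the REF id found in a description, or None if there is none."""
    # Same lookbehind idea as above, with the dot escaped.
    match = re.search(r"(?<=REF\.: )\w+", description)
    return match[0] if match else None


print(extract_ref("Black Tshirt\nREF.: V23T87C88EC\nCOMPOSIÇÃO:"))  # V23T87C88EC
print(extract_ref("Description without an id"))  # None
```

Returning None lets the caller decide whether to skip the product or log it.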
Swifty