Get substring with code from different strings

Question

I'm working on a web scraper. Among the fields it scrapes there is a Description tag like this one, different for each product:

<div class="productDescription" style="overflow: hidden; display: block;">
Black Tshirt
<br>
<br>
REF.: V23T87C88EC
<br>
<br>
COMPOSIÇÃO:
<br>
90% Poliamida
</div>

I can get the content of the description tag without problems, but I also need to get the value of REF inside the description (V23T87C88EC for this example).

The problem is that the description is different for every product; HOWEVER, there is ALWAYS a "REF.: XXXXXXXXX" substring in there. The length of the REF id can change, and it can be anywhere in the string. What's the best way to always get the REF id?

I'd say a lookbehind regexp would do the trick: https://www.geeksforgeeks.org/python-regex-lookbehind/ : try this pattern r"(?<=REF.: )\w+" — Swifty, Oct 27 '22 at 18:08
You can use `regex` to extract the string that comes after REF.: — Juan C, Oct 27 '22 at 18:08
related: https://stackoverflow.com/questions/8936030/using-beautifulsoup-to-search-html-for-string — Gábor Fekete, Oct 27 '22 at 18:10
Perfect! Thanks @Swifty, that does the trick. I'll accept your answer if you post it... — Alain, Oct 27 '22 at 18:20

score 1 · Answer 1 · answered Oct 27 '22 at 18:28

A possible solution is the following:

html = """<div class="productDescription" style="overflow: hidden; display: block;">
Black Tshirt
<br>
<br>
REF.: V23T87C88EC
<br>
<br>
COMPOSIÇÃO:
<br>
90% Poliamida
</div>"""

import re

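# re.MULTILINE lets `$` match at the end of each line, not only at the end of the string
pattern = re.compile(r'REF\.: (.+?)$', re.MULTILINE)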

found = pattern.findall(html)

Returns ['V23T87C88EC']
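One detail worth checking in the snippet above: by default `$` matches only at the very end of the string (or just before a trailing newline), and `.` does not cross newlines, so the pattern relies on the `re.MULTILINE` flag to stop at the end of the REF line. A minimal sketch comparing both cases, using a shortened `description` string made up for illustration:

```python
import re

# Shortened stand-in for the scraped description (illustrative only).
description = "Black Tshirt\n<br>\nREF.: V23T87C88EC\n<br>\n90% Poliamida\n</div>"

# Without re.MULTILINE, `$` anchors only at the end of the whole string,
# so nothing matches when more lines follow the REF line.
without_flag = re.findall(r'REF\.: (.+?)$', description)

# With re.MULTILINE, `$` also matches at the end of each line.
with_flag = re.findall(r'REF\.: (.+?)$', description, flags=re.MULTILINE)

print(without_flag)  # []
print(with_flag)     # ['V23T87C88EC']
```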


score 1 · Accepted Answer · answered Oct 27 '22 at 19:01

You can do this with a regex (read more about regex: https://docs.python.org/3/howto/regex.html):

html = '''
<div class="productDescription" style="overflow: hidden; display: block;">
Black Tshirt
<br>
<br>
REF.: V23T87C88EC
<br>
<br>
COMPOSIÇÃO:
<br>
90% Poliamida
</div>
'''

import re

# Escape the dot so it matches a literal "." rather than any character
myref = re.search(r"(?<=REF\.: )\w+", html)[0]

print(myref)

# V23T87C88EC
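In a scraper that runs over many products, it may be worth guarding against a description with no REF at all, since `re.search` returns `None` in that case and indexing it with `[0]` raises a `TypeError`. A minimal sketch, assuming the REF id consists only of word characters (letters, digits, underscore); the names `REF_PATTERN` and `extract_ref` are hypothetical helpers, not part of the original answer:

```python
import re
from typing import Optional

# Escaped dot so only a literal "REF.:" matches; \s* tolerates missing or extra spaces.
# A capture group is used instead of a lookbehind because Python's `re`
# requires lookbehind patterns to be fixed-width, and \s* is not.
REF_PATTERN = re.compile(r"REF\.:\s*(\w+)")


def extract_ref(description: str) -> Optional[str]:
    """Return the REF id found in a description, or None if there is none."""
    match = REF_PATTERN.search(description)
    return match.group(1) if match else None


print(extract_ref("Black Tshirt\nREF.: V23T87C88EC\nCOMPOSIÇÃO:"))  # V23T87C88EC
print(extract_ref("No reference here"))  # None
```

If REF ids can contain other characters such as hyphens, the `\w+` part would need to be widened accordingly.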
