Variable length positive lookbehind RegEx for scrapy itemloader

Question

I'm trying to extract the x and y values from a string with a Scrapy spider. The string uses "," or "." as the decimal separator and a lower- or upper-case "x" between the x and y values.

The HTML looks something like:

<div data-qa="detailId6">
    <strong>MARAZZI GRIGIO</strong><br />
    Maße: 100X100 cm<br />
    <br />
    <strong>Naturalia</strong>
</div>

Variations of the dimensions I've seen so far are:

Maße: 10,5X100,2 cm
Maße: 20,5 X 100,2cm
Maße:10,6 x 90
Maße:10.6 x 90.3cm

I can get the first value quite easily with:

il.add_xpath(
    "x_dimension",
    '//div[@data-qa="detailId6"]//text()',
    re=r'\d+(?:[,\.]\d+)?(?=\s*[xX]\s*\d+)',
)

but I'm having trouble with the second value. I wanted to use (?<=\d+\s*[xX]\s*)\d+(?:[,.]\d+)?, but this does not work, because Python's re module requires the pattern inside a lookbehind (?<=...) to have a fixed width, and \d+\s*[xX]\s* is of variable length.

Is there a way to get the value in the itemloader, or do I need to write a pipeline?

No need for a lookbehind here as you are not dealing with overlapping matches. Just use a capturing group. I.e. `\d+\s*[xX]\s*(\d+(?:[,.]\d+)?)`. — Wiktor Stribiżew, May 11 '23 at 12:40
Yes, it will work in scrapy because it uses [`parsel.utils.extract_regex`](https://parsel.readthedocs.io/en/latest/_modules/parsel/utils.html) which relies on `re.findall` to extract regex matches (and it only returns captures if capturing group(s) are defined in the pattern). — Wiktor Stribiżew, May 11 '23 at 12:48
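
To make the two comments above concrete, here is a minimal sketch using plain `re.findall` on the sample strings from the question. It only shows the regex behaviour, not a full Scrapy spider. `Y_PATTERN` is the pattern from the comment; `X_PATTERN`, `extract_dimensions`, and the variable names are hypothetical names introduced for this illustration.

```python
import re

# Pattern from the comment: findall returns only the capturing group (the y value).
Y_PATTERN = r"\d+\s*[xX]\s*(\d+(?:[,.]\d+)?)"
# Assumed mirror-image pattern for the x value (not from the original post).
X_PATTERN = r"(\d+(?:[,.]\d+)?)\s*[xX]\s*\d+"

samples = [
    "Maße: 100X100 cm",
    "Maße: 10,5X100,2 cm",
    "Maße: 20,5 X 100,2cm",
    "Maße:10,6 x 90",
    "Maße:10.6 x 90.3cm",
]


def extract_dimensions(text):
    """Return (x_matches, y_matches) as lists of strings, as re.findall yields them."""
    return re.findall(X_PATTERN, text), re.findall(Y_PATTERN, text)


results = [extract_dimensions(s) for s in samples]
for sample, (x, y) in zip(samples, results):
    print(f"{sample!r}: x={x}, y={y}")

# The lookbehind from the question is rejected by Python's re module at
# compile time, because its content does not have a fixed width.
try:
    re.compile(r"(?<=\d+\s*[xX]\s*)\d+(?:[,.]\d+)?")
    lookbehind_rejected = False
except re.error:
    lookbehind_rejected = True
```

If the comment above holds for your Scrapy version, the same `Y_PATTERN` string could be passed as the `re=` argument of a second `il.add_xpath(...)` call (for example with a field named `y_dimension`, which is an assumed name). The matches still use whatever decimal separator the page had, so converting "100,2" to a number would be a separate step.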
