I'm trying to extract the x and y values from a string with a Scrapy spider. The string uses "," or "." as the decimal separator and a lower- or upper-case "x" between the x and y values.
The HTML looks something like this:
<div data-qa="detailId6">
<strong>MARAZZI GRIGIO</strong><br />
Maße: 100X100 cm<br />
<br />
<strong>Naturalia</strong>
</div>
The variations for the dimensions I've seen so far are:
Maße: 10,5X100,2 cm
Maße: 20,5 X 100,2cm
Maße:10,6 x 90
Maße:10.6 x 90.3cm
I can get the first value quite easily with:
il.add_xpath(
    "x_dimension",
    '//div[@data-qa="detailId6"]//text()',
    re=r'\d+(?:[,\.]\d+)?(?=\s*[xX]\s*\d+)',
)
but I'm having trouble with the second value. I wanted to use (?<=\d+\s*[xX]\s*)\d+(?:[,.]\d+)?, but this does not work: Python's re module requires a lookbehind to be fixed-width, and the pattern inside (?<= ) is of variable length.
Is there a way to get the value in the item loader, or do I need to write a pipeline?
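
For reference, here is a minimal sketch of what I'm after, using only the standard re module and capturing groups instead of a lookbehind. The names DIMENSIONS and parse_dimensions are made up for illustration, and the sample strings are the variations listed above. It is not wired into the item loader.

```python
import re

# Hypothetical pattern: two capturing groups around the "x", so no lookbehind is needed.
DIMENSIONS = re.compile(r"(\d+(?:[,.]\d+)?)\s*[xX]\s*(\d+(?:[,.]\d+)?)")


def parse_dimensions(text):
    """Return (x, y) as floats, or None if no 'x'-separated pair is found."""
    match = DIMENSIONS.search(text)
    if match is None:
        return None
    # Normalise the decimal comma to a dot before converting to float.
    x, y = (float(value.replace(",", ".")) for value in match.groups())
    return x, y


if __name__ == "__main__":
    samples = [
        "Maße: 100X100 cm",
        "Maße: 10,5X100,2 cm",
        "Maße: 20,5 X 100,2cm",
        "Maße:10,6 x 90",
        "Maße:10.6 x 90.3cm",
    ]
    for sample in samples:
        print(sample, "->", parse_dimensions(sample))
```

As far as I understand, the re argument of add_xpath returns the captured group rather than the whole match when the pattern contains one, so a pattern that captures only the second number might work directly in the loader, but I have not verified that.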