Extract number when scraping

Question

I am trying to scrape some data from an apartment listing site.

I want to use the price in calculations, so I need to store it as a number. But on the website it is written as text, like this: 5 670 money/month

I want to remove all the letters and spaces, then convert the result to an integer to save in my db.

I tried a regular expression, but I get this error:

TypeError: expected string or bytes-like object

This is an element I collect the price from:

<p class="info-price">399&nbsp;euro&nbsp;per&nbsp;month</p>

I get the price with xpath like this

p = response.xpath('//p[@class="info-price"]/text()').extract()

And the output when I collect the name of the object and the price looks like this:

{'object': ['North West End 24'], 'price': ['399\xa0euro\xa0per\xa0month']}

How and when should I convert it?

It's several sites, all with the same result: I get the whole text with price and currency. It looks like this when I scrape: "3 995 kr" or "249 €/month", when I want to have "3995" and "249". — sumpen, Jan 04 '21 at 16:40
See this about [html and regex](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/). Then use an HTML parser. — wuerfelfreak, Jan 04 '21 at 16:50
@wuerfelfreak I didn't understand that, but I managed to solve it anyway. Thanks for the response! — sumpen, Jan 04 '21 at 18:10
It just means that regex is rarely a good choice for parsing html. But I am happy that you got it to work. **Have a nice day!** — wuerfelfreak, Jan 04 '21 at 18:12

score 0 · Answer 1 · answered Jan 04 '21 at 18:08

So I found a solution. Maybe it's a dirty one, and someone will come along with an elegant one-liner.

But as I understand, the text I scrape with this line

 p = response.xpath('//p[@class="info-price"]/text()').extract()

is a list object.

So I add a line to 'convert' it to a string with this code

p = ''.join(map(str, p))    #Convert to string from list object

And finally, to remove all spaces and text so that I end up with just the digits of the price, I use this code

p = re.sub(r'\D', '', p)    #Remove all but numbers

So all in all, this snippet takes the text of the price, converts it to a string and then removes everything but the digits. The result is still a string, so wrap it in int() before saving it as an integer.

import re

p = response.xpath('//p[@class="info-price"]/text()').extract()
p = ''.join(map(str, p))    #Convert to string from list object
p = re.sub(r'\D', '', p)    #Remove all but numbers
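
The same steps can be wrapped in a small helper that also does the final conversion to an integer. This is only a sketch: the name price_to_int is made up for illustration, and it assumes the price has no decimal part (a value like "5 670,50" would come out as 567050).

```python
import re

def price_to_int(parts):
    """Join the text fragments returned by .extract() and keep only the digits."""
    text = ''.join(parts)              # list of strings -> one string
    digits = re.sub(r'\D', '', text)   # drop everything that is not a digit
    # An empty result means no digits were found, so there is no price to store.
    return int(digits) if digits else None

print(price_to_int(['399\xa0euro\xa0per\xa0month']))  # 399
```

The non-breaking spaces (\xa0) are not digits, so they are removed together with the currency text.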

score 0 · Answer 2 · answered Jan 05 '21 at 08:13

What the .extract() method does is find all occurrences of your xpath expression; that's why it returns a list: there might be more than one result. Passing that list straight to re.sub is what raises the TypeError, because re.sub expects a string. If you know there's only one result, or only care about the first one, use .extract_first() instead. It returns the first result as a string (or None, if no match is found), so you don't have to convert the list to a string. (See https://docs.scrapy.org/en/latest/topics/selectors.html#id1)

p = response.xpath('//p[@class="info-price"]/text()').extract_first()
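
Since .extract_first() can return None, it is worth guarding against that before running the regular expression. A minimal sketch, assuming whole-number prices; the helper name parse_price is hypothetical and not part of Scrapy:

```python
import re

def parse_price(text):
    """Turn a scraped price string such as '3 995 kr' into an int, or None."""
    if text is None:                   # no element matched the xpath
        return None
    digits = re.sub(r'\D', '', text)   # keep only the digits
    return int(digits) if digits else None

print(parse_price('3 995 kr'))  # 3995
print(parse_price(None))        # None
```

In the spider this would be called as parse_price(p) right after the .extract_first() line, and the returned integer is what gets saved.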
