I am trying to extract a ref. id from HTML with scrapy:

<div class="col" itemprop="description">
  <p>text Ref.&nbsp;<span>220.20.34.20.53.001</span></p>
  <p>more text</p>
</div>

The span and p tag are not always present.

Using xpath selector:

text = ' '.join(response.xpath('//div[@itemprop="description"]/p/text()').extract()).replace(u'\xa0', u' ')
try:
    ref_id = re.findall(r"Ref\.? ?((?:[A-Z\d\.]+)|(?:[\d.]+))", text)[0].strip()
except IndexError:
    # No "Ref." match in the extracted text
    ref_id = None

In this case it returns only an empty string, because there is HTML inside the <p> tag.

Now trying to extract the text with CSS selector in order to use remove_tags:

>>> ''.join([remove_tags(w).strip() for w in response.css('div[itemprop="description"]::text').extract()])

This returns an empty result, as I somehow cannot grab the item.

How can I extract the ref_id regardless of whether there are HTML <p> tags within the div? Some items of the crawl have no <p> tag and no <span>; for those, my first attempt with XPath works.

Mr Lister
merlin
  • I assume you are aware of the famous [RegEx match open tags except XHTML self-contained tags](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags)? Why aren't you using BeautifulSoup? – roganjosh Dec 22 '18 at 13:58
  • No, never saw that one. I thought scrapy does not need beautiful soup as it can handle tasks like this with css and xpath selectors built in? – merlin Dec 22 '18 at 14:00
  • Possibly (I'm not overly familiar with the library), but you're quite clearly throwing the regex module into the mix. – roganjosh Dec 22 '18 at 14:01
  • Yes, once I have the text from inside the div, I will use regex to extract the ref id from the text. This works in 90% of the cases, except the ones where there is an additional tag within the text. – merlin Dec 22 '18 at 14:02
  • Which is exactly what that answer I linked is trying to illustrate in the general case; don't use regex on HTML, use an HTML parser :) – roganjosh Dec 22 '18 at 14:03
  • Why do you want to extract ID from `div` if you can simply extract it from `span`? – Andersson Dec 22 '18 at 14:05
  • @Andersson the span tag is not always present. – merlin Dec 22 '18 at 14:05
  • You should mention this in your question and add HTML for both cases (span is present/span is not present) – Andersson Dec 22 '18 at 14:06

2 Answers

Try to remove ::text from your last expression:

''.join([remove_tags(w).strip() for w in response.css('div[itemprop=description]').extract()]) 

But if you need to extract only 220.20.34.20.53.001 from your html, why don't you use response.css('div[itemprop=description] p span::text').extract()?

Or even response.css('div[itemprop=description]').re(r'([\.\d]+)').
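
To see what these patterns pick up, here is a minimal sketch using only the standard `re` module. It assumes that `.re()` runs the pattern over the serialized HTML of the selected div, and the `div_html` string below is a hand-written stand-in for that serialization (with `\xa0` in place of `&nbsp;`), not actual Scrapy output.

```python
import re

# Hand-written stand-in for the serialized <div>; "\xa0" replaces &nbsp;.
div_html = (
    '<div class="col" itemprop="description">\n'
    '  <p>text Ref.\xa0<span>220.20.34.20.53.001</span></p>\n'
    '  <p>more text</p>\n'
    '</div>'
)

# Any run of dots and digits: also catches the lone "." after "Ref".
loose = re.findall(r'([\.\d]+)', div_html)

# Same idea, but the match has to start with a digit.
strict = re.findall(r'(\d[\.\d]+)', div_html)

# The "Ref"-anchored pattern from the question, applied to raw HTML:
# the \xa0 and the <span> tag sit between "Ref." and the number.
anchored = re.findall(r"Ref\.? ?((?:[A-Z\d\.]+)|(?:[\d.]+))", div_html)

print(loose)     # ['.', '220.20.34.20.53.001']
print(strict)    # ['220.20.34.20.53.001']
print(anchored)  # ['.']
```

The anchored pattern only finds a dot here because the optional `\.?` backtracks and the capture group then consumes the dot itself, so any `Ref`-based pattern needs the tags and the non-breaking space dealt with first.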

vezunchik
  • Thank you! That was what I was looking for. I am not using p span .. as this is only present within 10% of the items and then this would return an empty result. – merlin Dec 22 '18 at 14:04
  • Your attempt with the regex is interesting; however, your regex returns all numbers, and there are more numbers present. Adapting it to my regex returns only a dot: `response.css('div[itemprop=description]').re(r'Ref\.? ?((?:[A-Z\d\.]+)|(?:[\d.]+))')` Is this the proper syntax? The regex should be OK. – merlin Dec 22 '18 at 14:10
  • If you add condition that it should start with number, then it works ok: `response.css('div[itemprop=description]').re(r'(\d[\.\d]+)')` – vezunchik Dec 22 '18 at 14:14
  • In your regexp you should avoid depending on the `Ref.` string, as it is in another tag's body. – vezunchik Dec 22 '18 at 14:15
  • The ref. is the only common element; the id could look totally different. I got a result now with this command: `re.findall(r"Ref\.? ?((?:[A-Z\d\.]+)|(?:[\d.]+))",remove_tags(response.css('div[itemprop=description]').extract_first()).replace(u'\xa0', u' ').strip())[0]` Do you see any improvement that could be made to this query? – merlin Dec 22 '18 at 14:16
  • In my opinion it is better to split it into variables and do some extra checks, for example whether `response.css('div[itemprop=description]').get()` returns something or None, or whether `re.findall(...)` returns a non-empty list, etc. – vezunchik Dec 22 '18 at 14:28
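
The "split it into variables and add checks" suggestion could look roughly like the sketch below. The helper name `parse_ref_id` is hypothetical, and it assumes the tags have already been stripped from the div's content (for example with `remove_tags`), so it only deals with the text; the pattern is the one from the question.

```python
import re
from typing import Optional

# Pattern taken from the question; compiled once for reuse.
REF_PATTERN = re.compile(r"Ref\.? ?((?:[A-Z\d\.]+)|(?:[\d.]+))")


def parse_ref_id(text: Optional[str]) -> Optional[str]:
    """Return the id following "Ref." in tag-free text, or None.

    Hypothetical helper: `text` is assumed to be the div's text with
    all tags already removed (it may be None if the div was missing).
    """
    if not text:
        # The selector returned None or an empty string.
        return None

    # Turn non-breaking spaces into plain spaces so " ?" can match them.
    normalized = text.replace("\xa0", " ")

    match = REF_PATTERN.search(normalized)
    if match is None:
        # No "Ref." marker followed by an id in this item.
        return None

    return match.group(1).strip()


print(parse_ref_id("text Ref.\xa0220.20.34.20.53.001 more text"))
print(parse_ref_id(None))
```

Returning `None` instead of indexing `[0]` on a possibly empty list keeps a single item without a ref id from raising inside the spider callback.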

You don't need remove_tags, as you can get the text directly with the selectors:

sel.css('div[itemprop=description] ::text')

That will get all inner text from the div tag with itemprop="description" and later you can extract your information with a regex:

sel.css('div[itemprop=description] ::text').re_first(r'(?:\d+\.)+\d+')
eLRuLL
  • This actually helped me, since remove_tags also removed a separator when removing the `<p>`. Could you please explain more about the way you are using the CSS selector? I figured that this will return one tag per line, where you later on grab the first one. – merlin Dec 22 '18 at 15:44
  • The interesting part here is the `space` between the selector and `::text` which tells the selector to get all the `text` from the inner elements, not only the current one (which would be the `div`). Later the [`Parsel`](https://github.com/scrapy/parsel) function `re_first` checks all the elements and gets the first one that matches the specified regex. – eLRuLL Dec 22 '18 at 15:47
  • Thank you for the clarification. – merlin Dec 22 '18 at 15:51