How to extract text in python from div tag if other html is within the tag?

Question

I am trying to extract a ref. id from HTML with scrapy:

<div class="col" itemprop="description">
  <p>text Ref.&nbsp;<span>220.20.34.20.53.001</span></p>
  <p>more text</p>
</div>

The span and p tag are not always present.

Using xpath selector:

text = ' '.join(response.xpath('//div[@itemprop="description"]/p/text()').extract()).replace(u'\xa0', u' ')
try:
    ref_id = re.findall(r"Ref\.? ?((?:[A-Z\d\.]+)|(?:[\d.]+))", text)[0].strip()
except IndexError:
    ref_id = None

In this case it returns only an empty string, because there is HTML inside the tag.

Next I tried to extract the text with a CSS selector in order to use remove_tags:

>>> ''.join([remove_tags(w).strip() for w in response.css('div[itemprop="description"]::text').extract()])

This returns an empty result, so I am somehow not grabbing the element.

How can I extract the ref_id regardless of whether there are HTML <p> tags within the div? Some items in the crawl have no <p> tag and no <span>, and for those my first attempt with XPath works.

I assume you are aware of the famous [RegEx match open tags except XHTML self-contained tags](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags)? Why aren't you using BeautifulSoup? — roganjosh, Dec 22 '18 at 13:58
No, never saw that one. I thought scrapy does not need beautiful soup as it can handle tasks like this with css and xpath selectors built in? — merlin, Dec 22 '18 at 14:00
Possibly (I'm not overly familiar with the library), but you're quite clearly throwing the regex module into the mix. — roganjosh, Dec 22 '18 at 14:01
Yes, once I have the text from inside the div, I will use a regex to extract the ref. id from it. This works in 90% of the cases, except the ones where there is an additional tag within the text. — merlin, Dec 22 '18 at 14:02
Which is exactly what the answer I linked is trying to illustrate in the general case; don't use regex on HTML, use an HTML parser :) — roganjosh, Dec 22 '18 at 14:03
Why do you want to extract ID from `div` if you can simply extract it from `span`? — Andersson, Dec 22 '18 at 14:05
You should mention this in your question and add HTML for both cases (span is present/span is not present) — Andersson, Dec 22 '18 at 14:06

vezunchik · Answer 1 · 2018-12-22T14:06:00.747

1

Try to remove ::text from your last expression:

''.join([remove_tags(w).strip() for w in response.css('div[itemprop=description]').extract()])

But if you need to extract only 220.20.34.20.53.001 from your html, why don't you use response.css('div[itemprop=description] p span::text').extract()?

Or even response.css('div[itemprop=description]').re(r'([\.\d]+)').

edited Dec 22 '18 at 14:06

answered Dec 22 '18 at 14:02

vezunchik

Thank you! That was what I was looking for. I am not using p span .. as this is only present within 10% of the items and then this would return an empty result. – merlin Dec 22 '18 at 14:04
Your attempt with the regex is interesting, however your regex returns all numbers and there are more numbers present. Trying to adapt this to my regex returns only a dot: response.css('div[itemprop=description]').re(r'Ref\.? ?((?:[A-Z\d\.]+)|(?:[\d.]+))') Is this the proper syntax? The regex should be OK. – merlin Dec 22 '18 at 14:10
If you add the condition that it should start with a digit, it works: `response.css('div[itemprop=description]').re(r'(\d[\.\d]+)')` – vezunchik Dec 22 '18 at 14:14
In your regexp you should avoid depending on the `Ref.` string, as it is in another tag's body. – vezunchik Dec 22 '18 at 14:15
The ref. is the only common element; the id could look totally different. I got a result now with this command: re.findall(r"Ref\.? ?((?:[A-Z\d\.]+)|(?:[\d.]+))",remove_tags(response.css('div[itemprop=description]').extract_first()).replace(u'\xa0', u' ').strip())[0] Do you see any improvement that could be made to this query? – merlin Dec 22 '18 at 14:16
In my opinion it is better to split it into variables and do some extra checks, for example whether `response.css('div[itemprop=description]').get()` returns something or None, or whether `re.findall(...)` returns a non-empty list, etc. – vezunchik Dec 22 '18 at 14:28
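The advice to split the one-liner into named steps with a check after each could look roughly like the sketch below. It uses only the standard library and assumes its input is the text already pulled out of the div (for instance the output of `remove_tags`). The function name `extract_ref_id` is hypothetical, the pattern is the one from the question, and the rule that a candidate must contain a letter or digit is an added assumption.

```python
import re
from typing import Optional

# Pattern taken from the question; the capture group holds the id after "Ref."
REF_PATTERN = re.compile(r"Ref\.? ?((?:[A-Z\d\.]+)|(?:[\d.]+))")


def extract_ref_id(text: Optional[str]) -> Optional[str]:
    """Return the id that follows 'Ref.' in text, or None if none is found."""
    if not text:
        # The selector returned None or an empty string
        return None
    # &nbsp; arrives as U+00A0, which the literal space in the pattern does not match
    normalized = text.replace("\xa0", " ")
    match = REF_PATTERN.search(normalized)
    if match is None:
        return None
    candidate = match.group(1)
    # Assumption: a candidate without any letter or digit (such as the "."
    # of "Ref." itself) is not a real id
    if not any(ch.isalnum() for ch in candidate):
        return None
    return candidate


print(extract_ref_id("\n  text Ref.\xa0220.20.34.20.53.001\n  more text\n"))
print(extract_ref_id("text Ref.  more text"))
print(extract_ref_id(None))
```

Returning None instead of indexing `[0]` on a possibly empty list avoids an IndexError for items that carry no reference at all.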

score 1 · Accepted Answer · answered Dec 22 '18 at 14:18

You don't need remove_tags, as you can get the text directly with the selectors:

sel.css('div[itemprop=description] ::text')

That will get all inner text from the div tag with itemprop="description" and later you can extract your information with a regex:

sel.css('div[itemprop=description] ::text').re_first(r'(?:\d+\.)+\d+')

answered Dec 22 '18 at 14:18

eLRuLL

This actually helped me, since remove_tags also removed a separator when removing the tag. Could you please explain more about the way you are using the CSS selector? I figured that this returns one tag per line, where you later grab the first one. – merlin Dec 22 '18 at 15:44
The interesting part here is the `space` between the selector and `::text` which tells the selector to get all the `text` from the inner elements, not only the current one (which would be the `div`). Later the [`Parsel`](https://github.com/scrapy/parsel) function `re_first` checks all the elements and gets the first one that matches the specified regex. – eLRuLL Dec 22 '18 at 15:47
Thank you for the clarification. – merlin Dec 22 '18 at 15:51
