How can I extract only the text from a Scrapy selector in Python?

Question

I have this code

   site = hxs.select("//h1[@class='state']")
   log.msg(str(site[0].extract()),level=log.ERROR)

The output is

 [scrapy] ERROR: <h1 class="state"><strong>
            1</strong>
            <span> job containing <strong>php</strong> in <strong>region</strong> paying  <strong>$30-40k per year</strong></span>
                </h1>

Is it possible to get only the text, without any HTML tags?

score 59 · Accepted Answer · edited Jul 11 '17 at 21:44

//h1[@class='state']

This XPath selects the h1 element whose class attribute is state, so extract() returns the whole element, markup included.

If you want only the text that sits directly inside the h1 tag, use:

//h1[@class='state']/text()

If you want the text of the h1 tag as well as the text of all the tags nested inside it, use:

//h1[@class='state']//text()

The difference: /text() selects only the text nodes that are direct children of the element, while //text() selects the text nodes of the element and of everything nested inside it.

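In the markup from the question all of the visible text sits inside the strong and span children of the h1, so the //text() form is the one to use:

site = ''.join(hxs.select("//h1[@class='state']//text()").extract()).strip()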
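
To see the difference between direct and descendant text nodes without a running Scrapy project, here is a minimal sketch that uses the standard library's xml.etree.ElementTree as a stand-in. ElementTree's limited XPath has no text() step, so the two selections are emulated by hand. The helper names direct_text, all_text and squeeze are hypothetical, and the markup is copied from the question.

```python
import xml.etree.ElementTree as ET

# Markup copied from the question (it happens to be well-formed XML).
HTML = """<h1 class="state"><strong>
            1</strong>
            <span> job containing <strong>php</strong> in <strong>region</strong> paying  <strong>$30-40k per year</strong></span>
                </h1>"""


def direct_text(elem):
    """Emulate elem/text(): only text nodes that are direct children."""
    parts = [elem.text or ""]
    # Text that follows a child element is stored on that child as .tail.
    parts.extend(child.tail or "" for child in elem)
    return "".join(parts)


def all_text(elem):
    """Emulate elem//text(): text of the element and all its descendants."""
    return "".join(elem.itertext())


def squeeze(text):
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return " ".join(text.split())


h1 = ET.fromstring(HTML)
print(repr(squeeze(direct_text(h1))))  # direct children only
print(repr(squeeze(all_text(h1))))     # element plus descendants
```

For this markup the direct-children version comes back empty, because the h1 itself holds nothing but whitespace between its child tags, while the descendant version yields the full sentence.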

edited Jul 11 '17 at 21:44

budi

answered Nov 21 '12 at 10:00

akhter wahab

Excellent explanation of the difference of `/text()` and `//text()` – kas Apr 29 '17 at 15:33
`xpath` should never be underestimated – tread Aug 30 '17 at 16:00

score 3 · Answer 2 · answered Nov 21 '12 at 09:22

I haven't got a Scrapy instance running, so I couldn't test this, but you could try using text() within your search expression.

For example:

site = hxs.select("//h1[@class='state']/text()")

(got it from the tutorial)

answered Nov 21 '12 at 09:22

E.Z.

score 3 · Answer 3 · answered Dec 30 '15 at 14:57

You can use BeautifulSoup's get_text() method.

from bs4 import BeautifulSoup

text = '''
<td><a href="http://www.fakewebsite.com">Please can you strip me?</a>
<br/><a href="http://www.fakewebsite.com">I am waiting....</a>
</td>
'''
# name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(text, "html.parser")

print(soup.get_text())
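
If installing bs4 is not an option, the same idea (keep the text nodes, drop the tags) can be sketched with the standard library's html.parser. This is not BeautifulSoup's implementation, only a rough equivalent; TextCollector and strip_tags are hypothetical names, and the sample markup is the one from this answer.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: keeps text nodes and ignores every tag."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text found between tags.
        self.chunks.append(data)


def strip_tags(markup):
    """Return the concatenated text content of an HTML fragment."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()  # flush any text still buffered by the parser
    return "".join(collector.chunks)


text = '''
<td><a href="http://www.fakewebsite.com">Please can you strip me?</a>
<br/><a href="http://www.fakewebsite.com">I am waiting....</a>
</td>
'''
print(strip_tags(text))
```

Note that this sketch also keeps the contents of script and style elements, so it is best suited to small fragments like the one above.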

score 2 · Answer 4 · edited Nov 21 '12 at 16:26

You can use BeautifulSoup to strip the HTML tags. Here is an example:

from bs4 import BeautifulSoup
text = ''.join(BeautifulSoup(site[0].extract(), "html.parser").find_all(string=True))

You can then strip the leftover whitespace, newlines, etc.

If you don't want to use additional modules, you can try a simple regex:

import re

# replace HTML tags with ' '
text = re.sub(r'<[^>]*?>', ' ', str(site[0].extract()))
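
The whitespace cleanup mentioned above can be done with a second substitution. The following is a minimal sketch, assuming the extracted markup is available as a plain string; raw is a hypothetical stand-in for str(site[0].extract()) and holds a shortened copy of the question's markup.

```python
import re

# Stand-in for str(site[0].extract()): a shortened copy of the question's markup.
raw = """<h1 class="state"><strong>
            1</strong>
            <span> job containing <strong>php</strong> in <strong>region</strong></span>
                </h1>"""

# Replace every tag with a space, as in the answer above.
no_tags = re.sub(r"<[^>]*?>", " ", raw)

# Collapse spaces, tabs and newlines into single spaces and trim the ends.
clean = re.sub(r"\s+", " ", no_tags).strip()

print(clean)
```

A regex like this is fine for simple fragments, but it breaks on markup where a > appears inside an attribute value or a comment, which is where an XPath or parser-based approach is safer.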

edited Nov 21 '12 at 16:26

answered Nov 21 '12 at 09:28

pm007

score 0 · Answer 5 · answered Dec 30 '15 at 14:50

You can use html2text

import html2text
converter = html2text.HTML2Text()
print(converter.handle("<div>Please!!!<span>remove me</span></div>"))

answered Dec 30 '15 at 14:50

Aminah Nuraini
