
I am trying to scrape the text only from body using python Scrapy, but haven't had any luck yet.

Wishing some scholars might be able to help me here scraping all the text from the <body> tag.

ROMANIA_engineer
mmrs151

2 Answers


Scrapy uses XPath notation to extract parts of a HTML document. So, have you tried just using the /html/body path to extract <body>? (assuming it's nested in <html>). It might be even simpler to use the //body selector:

x.select("//body").extract()    # extract body

You can find more information about the selectors Scrapy provides here.

Eli Bendersky
  • Thanks Eli, I know that part. But my question was related to get plain text instead of html. Is there any way in scrapy that you know? – mmrs151 Mar 24 '11 at 09:40
  • @mmrs151: try appending `/text()` to the selector. – Eli Bendersky Mar 24 '11 at 11:19
  • Adding /text() will get the text directly inside the body; using //text() will get the text of all sub-elements of body. But some of those elements will contain undesirables, like the contents of script tags. – spazm Jun 09 '12 at 02:25

It would be nice to get output like that produced by lynx -nolist -dump, which renders the page and then dumps the visible text. I've gotten close by extracting the text of all children of paragraph elements.

I started with //body//text(), which pulled all the text nodes inside the body, but this included the contents of script elements. //body//p gets all of the paragraph elements inside the body, including the implied paragraph tag around untagged text. Extracting the text with //body//p/text() misses text nested in subtags (like bold, italic, span, div). //body//p//text() seems to get most of the desired content, as long as the page doesn't have script tags embedded in paragraphs.

In XPath, / selects direct children only, while // selects all descendants.
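
To make the child versus descendant distinction concrete, here is a minimal sketch using only Python 3's standard library rather than Scrapy. It assumes a small, well-formed sample document (the `DOC` string is made up for illustration). ElementTree supports only a subset of XPath and has no text() node test, so `.text` and `itertext()` stand in for p/text() and p//text() here.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; ElementTree needs well-formed XML, unlike real-world HTML.
DOC = (
    "<html><body>"
    "<p>Direct <b>bold</b></p>"
    "<div><p>Nested paragraph</p></div>"
    "</body></html>"
)

root = ET.fromstring(DOC)      # root is the <html> element
body = root.find("body")       # direct child of <html>, like /html/body

# "p" matches only direct children of body, like /html/body/p.
direct_paragraphs = body.findall("p")

# ".//p" matches paragraphs at any depth, like /html/body//p.
all_paragraphs = body.findall(".//p")

first = direct_paragraphs[0]
# Text directly inside <p>, before its first child element
# (for this sample, what p/text() would return).
own_text = first.text
# Text of <p> and all of its descendants, like joining p//text().
all_text = "".join(first.itertext())

print(len(direct_paragraphs), len(all_paragraphs))
print(repr(own_text), repr(all_text))
```

The direct-child query skips the paragraph nested inside the div, and the direct text misses the word inside the bold tag, which is the same effect described above for //body//p/text().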

% scrapy shell
In[1]: fetch('http://stackoverflow.com/questions/5390133/scrapy-body-text-only')
In[2]: hxs.select('//body//p//text()').extract()

Out[2]:
[u"I am trying to scrape the text only from body using python Scrapy, but haven't had any luck yet.",
u'Wishing some scholars might be able to help me here scraping all the text from the ',
u'&lt;body&gt;',
u' tag.',
u'Thank you in advance for your time.',
u'Scrapy uses XPath notation to extract parts of a HTML document. So, have you tried just using the ',
u'/html/body',
u' path to extract ',
u'&lt;body&gt;',
u"? (assuming it's nested in ",
u'&lt;html&gt;',
u'). It might be even simpler to use the ',
u'//body',
u' selector:',
u'You can find more information about the selectors Scrapy provides ',
u'here',

Join the strings together with a space and you have a pretty good output:

In [43]: ' '.join(hxs.select("//body//p//text()").extract())
Out[43]: u"I am trying to scrape the text only from body using python Scrapy, but haven't had any luck yet. Wishing some scholars might be able to help me here scraping all the text from the  &lt;body&gt;  tag. Thank you in advance for your time. Scrapy uses XPath notation to extract parts of a HTML document. So, have you tried just using the  /html/body  path to extract  &lt;body&gt; ? (assuming it's nested in  &lt;html&gt; ). It might be even simpler to use the  //body  selector: You can find more information about the selectors Scrapy provides  here . This is a collaboratively edited question and answer site for  professional and enthusiast programmers . It's 100% free, no registration required. about \xbb \xa0\xa0\xa0 faq \xbb \r\n             tagged asked 1 year ago viewed 280 times active 1 year ago"
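
If restricting the query to paragraphs drops too much, another option is to keep all body text and filter out script and style contents instead. The following is a rough sketch of that idea using Python 3's standard-library `html.parser`, not Scrapy's selectors. The class name `BodyTextExtractor`, the helper `body_text` and the `SAMPLE` markup are all made up for illustration.

```python
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collects text inside <body>, ignoring <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip_depth = 0   # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep only non-empty text that is in the body and not in a skipped tag.
        if self.in_body and not self.skip_depth and text:
            self.chunks.append(text)


def body_text(html):
    """Return the body text of an HTML string, joined with single spaces."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical sample page.
SAMPLE = (
    "<html><head><title>T</title></head><body>"
    "<p>Hello <b>world</b></p>"
    "<script>var x = 1;</script>"
    "<div>Bye</div>"
    "</body></html>"
)

result = body_text(SAMPLE)
print(result)
```

Within Scrapy itself, an XPath predicate along the lines of `//body//text()[not(ancestor::script)]` should give a similar filter, though I have not tested it against this page.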
spazm