Scrapy - Cleaning up text[/p] from nested links[/a] etc

Question

I am new to Python and to Scrapy as well. Nevertheless, I spent a few days trying to scrape news articles from a site's archive, and succeeded.

The problem is that when I scrape the content of an article from its <p> elements, that content is full of additional tags such as strong, a, etc. Scrapy does not pull out the text inside those tags, so I am left with a news article containing about 2/3 of the text. I will try to show the HTML below:

<p> According to <a> Japan's newspapers </a> it happened ... </p>

I tried googling around and searching the forum here. There were some suggestions, but the ones I tried either did not work or broke my spider.

I have read about normalize-space and remove_tags, but they didn't work for me. Thank you in advance for any insights.

Welcome to Stack Overflow! Please don't post your code as an image. It's hard to read, prevents text-based searching, and lowers the overall presentation value of the post. — a stone arachnid, Oct 08 '18 at 02:12

pwinz · Accepted Answer · 2018-10-08T02:22:16.087

1

Please provide your selector for more detailed help.

Given what you're describing, I'd guess you're selecting p/text() (XPath) or p::text (CSS), which only returns the text nodes directly inside the <p> and not the text inside its child elements.

You should try selecting response.xpath('//p/descendant-or-self::*/text()') to get the text in the <p> and all its children.

You could also just select the <p> itself rather than its text, and you'll get its children along with it as HTML. From there you can start cleaning up the tags. There are already answered questions on how to do that.
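One possible way to do that cleanup, sketched with only the standard library: take the HTML string of the selected <p> (for example from .get() on the selector) and keep just its text. The TextOnly class and strip_tags helper are hypothetical names for illustration, not part of Scrapy.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects text content and ignores every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text between tags.
        self.parts.append(data)


def strip_tags(fragment):
    """Return the text of an HTML fragment with whitespace collapsed."""
    parser = TextOnly()
    parser.feed(fragment)
    parser.close()
    # split() + join() collapses runs of whitespace into single spaces.
    return " ".join("".join(parser.parts).split())


# Hypothetical input: what selecting the whole <p> might return.
raw = "<p> According to <a href='#'> Japan's newspapers </a> it happened ... </p>"
clean = strip_tags(raw)
print(clean)
```

If w3lib is available (it is a Scrapy dependency), its w3lib.html.remove_tags function is another option for the same job.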

edited Oct 08 '18 at 02:22

answered Oct 08 '18 at 02:16


Going to read it. Tried the suggested solution above, atm trying to google how to implement it. The selector for the content is: item['content'] = response.xpath('//div[@class="postBody"]/p/text()').extract() – Kamil Liskutin Oct 08 '18 at 02:21
Yes please see my edit about how to select text of descendant or self. I believe that will fix you up. – pwinz Oct 08 '18 at 02:23
Amazing, thank you so much! Just one question, if I may: can I apply the same thing to other selectors [items]? The same thing happens when the website has a link on the author or a prominent date ...? – Kamil Liskutin Oct 08 '18 at 02:31
You should be able to apply this selector pattern wherever you like, as long as the path is valid you'll get what's there. – pwinz Oct 08 '18 at 02:34
Also, another way to fix this problem in Scrapy: where I have /text() I can just write //text(), and it apparently does the same thing. – Kamil Liskutin Oct 08 '18 at 19:59

score 0 · Answer 2 · answered Oct 08 '18 at 02:02


You could use str.replace(old, new):

new_string = old_string.replace("<a>", "")

You could integrate this into a loop which iterates over a list that contains all of the substrings that you want to discard.
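A minimal sketch of that loop, assuming the tags appear exactly as listed (a tag with attributes, such as <a href="...">, would not match and would need a different approach). The strip_substrings helper and the tags_to_drop list are hypothetical names for illustration.

```python
def strip_substrings(text, unwanted):
    """Remove every occurrence of each unwanted substring from text."""
    for token in unwanted:
        text = text.replace(token, "")
    return text


old_string = "<p>According to <a>Japan's newspapers</a> it happened ...</p>"

# Exact substrings to discard; extend this list as needed.
tags_to_drop = ["<p>", "</p>", "<a>", "</a>"]

new_string = strip_substrings(old_string, tags_to_drop)
print(new_string)
```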


Thank you for your answer. In the end, it seems that Pwinz solved it. Nevertheless thanks for your effort :) – Kamil Liskutin Oct 08 '18 at 02:34
