
I am new to Python and to scraping. Nevertheless, I spent a few days trying to scrape news articles from a news site's archive, and succeeded.

The problem is that when I scrape the content of an article (its <p> elements), that content is filled with additional tags such as <strong> and <a>. Scrapy does not pull out the text inside those tags, so I am left with a news article containing only about 2/3 of the text. Example HTML below:

<p> According to <a> Japan's newspapers </a> it happened ... </p>

I tried googling and searching this forum. There were some suggestions, but the ones I tried either did not work or broke my spider:

[Image: screenshot of the attempted code]

I have read about normalize-space and about removing tags, but neither worked for me. Thank you in advance for any insights.

a stone arachnid
  • Welcome to Stack Overflow! Please don't post your code as an image. It's hard to read, prevents text-based searching, and lowers the overall presentation value of the post. – a stone arachnid Oct 08 '18 at 02:12

2 Answers

1

Please provide your selector for more detailed help.

Given what you're describing, I'd guess you're selecting p/text() (XPath) or p::text (CSS). Both return only the text nodes that are direct children of the <p> element, so they will not get the text inside the child elements of <p>.

You should try selecting response.xpath('//p/descendant-or-self::*/text()') to get the text in the <p> and all its descendants.

You could also just select the <p>, not its text, and you'll get its children as well. From there you can start cleaning up the tags. There are answered questions regarding how to do that.

pwinz
  • Going to read it. Tried the suggested solution above, atm trying to google how to implement it. The selector for the content is: item['content'] = response.xpath('//div[@class="postBody"]/p/text()').extract() – Kamil Liskutin Oct 08 '18 at 02:21
  • Yes please see my edit about how to select text of descendant or self. I believe that will fix you up. – pwinz Oct 08 '18 at 02:23
  • Amazing, thank you so much! Just one question, if I may: can I apply the same thing to the selectors for my other items? The same thing happens when the website puts a link on the author or on a prominent date. – Kamil Liskutin Oct 08 '18 at 02:31
  • You should be able to apply this selector pattern wherever you like, as long as the path is valid you'll get what's there. – pwinz Oct 08 '18 at 02:34
  • Also, another way to fix this problem in Scrapy: where I have /text() I can just write //text(), and it apparently does the same thing. – Kamil Liskutin Oct 08 '18 at 19:59
0

You could use str.replace(old, new):

new_string = old_string.replace("<a>", "")

You could integrate this into a loop which iterates over a list that contains all of the substrings that you want to discard.
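The loop described above might look like the following sketch. The helper name `strip_literal_tags` and the tag list are illustrative assumptions, not part of any library. Note that literal replacement only matches tags written exactly as listed, so a tag with attributes such as `<a href="...">` would be left in place.

```python
def strip_literal_tags(html, tags):
    """Remove each literal substring in `tags` from `html`."""
    for tag in tags:
        html = html.replace(tag, "")
    return html


# Sample markup from the question.
old_string = "<p> According to <a> Japan's newspapers </a> it happened ... </p>"

# Hypothetical list of substrings to discard.
tags_to_discard = ["<p>", "</p>", "<a>", "</a>", "<strong>", "</strong>"]

new_string = strip_literal_tags(old_string, tags_to_discard)

# Collapse the runs of whitespace left behind where the tags were.
new_string = " ".join(new_string.split())
```

For real pages, selecting the text nodes with XPath is usually more robust than string replacement, because it does not depend on how each tag happens to be written.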