I am new to Python and to Scrapy as well. Nevertheless, I spent a few days trying to scrape news articles from a site's archive, and it worked.
The problem is that when I scrape the content of an article from its <p> elements, that content is filled with additional tags such as <strong>, <a>, etc. Scrapy does not pull out the text inside those nested tags, so I am left with a news article containing only about 2/3 of the text. Example HTML below:
<p> According to <a> Japan's newspapers </a> it happened ... </p>
I tried googling around and searching the forum here. There were some suggestions, but the ones I tried either did not work or broke my spider:
I have read about normalize-space and remove_tags, but neither worked for me. Thank you in advance for any insights.