I am trying to scrape the text of 100 news articles using BeautifulSoup and a for loop, and store the texts in the list myarticle. I expect myarticle to contain only the content of the news articles, which I find all have h attribute. However, the result I get contains many irrelevant parts, such as "Thanks for contacting us. We've received your submission." and "This story has been shared 205,105 times. 205,105", and so on.
Another issue is that when I run print(myarticle[0]), it prints many news articles, but I expect it to print only one article.
I would like to know how I could remove the irrelevant parts and keep only the main content, as it reads on the news site. I would also like to know how to adjust the code so that print(myarticle[0]) gives me only the first news article.
One of the 100 news articles is on this page: https://nypost.com/2020/04/21/missouri-sues-china-over-coronavirus-deceit/
The other news articles I want to scrape are listed on this search page: https://nypost.com/search/China+COVID-19/page/1/?orderby=relevance
Below are the lines of code relevant to my question.
import requests
from bs4 import BeautifulSoup as bs

# pagelinks, paragraphtext and thearticle are defined earlier (not shown)
for pagelink in pagelinks:
    # get page text
    page = requests.get(pagelink)
    # parse with BeautifulSoup
    soup = bs(page.text, 'lxml')
    articletext = soup.find_all('p')
    for paragraph in articletext[:-1]:
        # get the text only
        text = paragraph.get_text()
        paragraphtext.append(text)
    # combine all paragraphs into an article
    thearticle.append(paragraphtext)
# join paragraphs to re-create the article
myarticle = [''.join(article) for article in thearticle]
# show the first string of the list
print(myarticle[0])
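
For reference, here is a minimal sketch of one way both symptoms might be addressed. It rests on assumptions. If paragraphtext is created once before the loop, every entry of thearticle refers to the same growing list, which would explain why myarticle[0] shows many articles; building a fresh list per page avoids that. Restricting the search for p tags to the article's own container should drop the unrelated page text. The selector div.entry-content is hypothetical and not taken from the real nypost.com markup, the function names extract_article and collect_articles are illustrative, and the sample HTML is made up so the sketch runs without network access.

```python
from bs4 import BeautifulSoup

# Hypothetical selector: the container class is an assumption, not the
# confirmed nypost.com markup. Inspect a real page to find the right one.
ARTICLE_SELECTOR = "div.entry-content"


def extract_article(html, selector=ARTICLE_SELECTOR):
    """Return the text of the <p> tags inside the article container only."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.select_one(selector)
    if container is None:
        # No article container found on this page.
        return ""
    # A fresh list for each article, so paragraphs from other pages
    # never end up in this one.
    paragraphs = [p.get_text(strip=True) for p in container.find_all("p")]
    return "\n".join(paragraphs)


def collect_articles(pages):
    """Build one string per page from an iterable of HTML documents."""
    return [extract_article(html) for html in pages]


# Made-up pages standing in for the downloaded HTML (page.text).
sample_pages = [
    '<div class="entry-content"><p>First article.</p><p>More text.</p></div>'
    '<p>Thanks for contacting us.</p>',
    '<div class="entry-content"><p>Second article.</p></div>',
]
myarticle = collect_articles(sample_pages)
print(myarticle[0])
```

With real pages, the same functions could be fed requests.get(pagelink).text for each link in pagelinks, once the selector has been checked against the actual page source.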