I have successfully scraped the headlines and the links.
I would like to replace the Summary column with the main article text from each link, since the title and summary are the same anyway.
link = "https://www.vanglaini.org" + article.a['href']
(e.g. https://www.vanglaini.org/tualchhung/103834)
Please help me modify my code.
Below is my code.
import pandas as pd
import requests
from bs4 import BeautifulSoup
source = requests.get('https://www.vanglaini.org/').text
soup = BeautifulSoup(source, 'lxml')
list_with_headlines = []
list_with_summaries = []
list_with_links = []
for article in soup.find_all('article'):
    if article.a is None:
        continue
    headline = article.a.text.strip()
    summary = article.p.text.strip()
    link = "https://www.vanglaini.org" + article.a['href']
    list_with_headlines.append(headline)
    list_with_summaries.append(summary)
    list_with_links.append(link)
news_csv = pd.DataFrame({
    'Headline': list_with_headlines,
    'Summary': list_with_summaries,
    'Link': list_with_links,
})
print(news_csv)
news_csv.to_csv('test.csv')
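
One possible direction, as a rough sketch rather than a confirmed solution: request each link and collect the paragraph text from the article page. The helper names `extract_article_text` and `fetch_article_text` are hypothetical, and the sketch assumes the article body sits in `<p>` tags inside an `<article>` element (falling back to every `<p>` on the page). The real markup of the article pages has not been checked, so the selector may need adjusting. It uses the built-in `html.parser` instead of `lxml` only to keep the example dependency-light.

```python
import requests
from bs4 import BeautifulSoup


def extract_article_text(html):
    """Return the paragraph text of an article page as one string."""
    soup = BeautifulSoup(html, "html.parser")
    # Assumption: the body is in <p> tags inside an <article> element.
    # If no <article> exists, search the whole page instead.
    container = soup.find("article")
    if container is None:
        container = soup
    paragraphs = [p.get_text(strip=True) for p in container.find_all("p")]
    # Drop empty paragraphs and join the rest, one per line.
    return "\n".join(text for text in paragraphs if text)


def fetch_article_text(link, timeout=10):
    """Download one article page and extract its text."""
    response = requests.get(link, timeout=timeout)
    response.raise_for_status()
    return extract_article_text(response.text)


# Offline check with made-up HTML (not taken from the real site).
sample_html = (
    "<html><body><article><h1>Title</h1>"
    "<p>First paragraph.</p><p> Second paragraph. </p><p></p>"
    "</article></body></html>"
)
print(extract_article_text(sample_html))
```

Inside the existing loop, the call would be something like `list_with_articles.append(fetch_article_text(link))`, with `list_with_articles` (a hypothetical new list) passed to the DataFrame under an `'Article'` key in place of `'Summary'`. Since this makes one request per headline, a short pause between requests may be worth adding.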