I'm new to Python and am practicing web scraping by extracting data from a news website.
I'm currently facing two problems:
- How do I scrape the article text, which is contained in <p> tags? There are many of them on the web page. For example, the first one is just before the author's name.
- The CSV file I exported only contains the headers, but no text. Why? How do I fix this?
Here's the code. Many thanks for your help.
import requests
import pandas as pd
from bs4 import BeautifulSoup
from pandas import DataFrame
import csv
import re
f = open ('nprtest1.csv', 'w', encoding='utf8', newline="")
writer = csv.writer(f, delimiter=',')
writer.writerow ('headline', 'date', 'author', 'body' )
#set the page you want to visit
url="https://www.npr.org/2019/12/29/792241464/civil-rights-leader-rep-john-lewis-to-start-treatment-for-pancreatic-cancer"
#request page using the request library
page=requests.get(url)
#create soup - parse HTML of webpage
soup=BeautifulSoup(page.content,'html.parser')
headline=soup.find("h1").text
date=soup.find("time").text
body = soup.find_all('p')
### regex to remove tags and other irrelevant bits
date_final = re.sub("\n","",date)
webdata = [headline, date_final, body]
writer.writerow (webdata)
df = pd.read_csv('webscraping_test1.csv')
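
One possible way to approach both points, as a minimal sketch and not a definitive fix. It assumes the article body is the concatenation of every <p> element on the page, which may also pick up captions or footer text on a real site. The helper names `extract_article` and `save_row` and the `sample_html` string are hypothetical: the string stands in for `page.content` so the sketch runs without a network request. The author column is left out because locating it needs a site-specific selector that is not shown in the code above.

```python
import csv

from bs4 import BeautifulSoup


def extract_article(html):
    """Return [headline, date, body] parsed from an article's HTML."""
    soup = BeautifulSoup(html, "html.parser")
    headline = soup.find("h1").get_text(strip=True)
    date = soup.find("time").get_text(strip=True)
    # find_all() returns Tag objects, not strings; get_text() extracts
    # the text inside each <p>, dropping any nested markup.
    paragraphs = [p.get_text().strip() for p in soup.find_all("p")]
    body = " ".join(paragraphs)
    return [headline, date, body]


def save_row(path, header, row):
    """Write a header line and one data row to a CSV file."""
    # Leaving the with block closes the file, which flushes the
    # buffered rows to disk.
    with open(path, "w", encoding="utf8", newline="") as f:
        writer = csv.writer(f)
        # writerow() takes a single sequence, not separate arguments.
        writer.writerow(header)
        writer.writerow(row)


# Hypothetical stand-in for page.content from requests.get(url).
sample_html = """
<html><body>
<h1>Sample headline</h1>
<time>December 29, 2019</time>
<p>First paragraph.</p>
<p>Second <b>paragraph</b>.</p>
</body></html>
"""

row = extract_article(sample_html)
save_row("nprtest1.csv", ["headline", "date", "body"], row)

# Read back the same file that was just written.
with open("nprtest1.csv", encoding="utf8", newline="") as f:
    rows = list(csv.reader(f))
```

With a live page, `extract_article(page.content)` would replace the sample string. Note that the code above writes to 'nprtest1.csv' but reads 'webscraping_test1.csv', and never closes `f`, so either of those could explain a file that appears to be missing its data row.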