I'm a newbie at Python, and am practicing web scraping by extracting data from a news website.

I currently face 2 problems:

  1. How do I scrape the text, which is contained in a <p> tag? It is one of many on the web page. For example, the first one is just before the author's name.

  2. The CSV file I exported only contains the headers, but no text. Why? How do I fix this?

Here's the code. Many thanks for your help.

import requests
import pandas as pd
from bs4 import BeautifulSoup
from pandas import DataFrame
import csv
import re

f = open ('nprtest1.csv', 'w', encoding='utf8', newline="")
writer = csv.writer(f, delimiter=',')
# writerow takes a single iterable, so the header fields go in one list
writer.writerow(['headline', 'date', 'author', 'body'])

#set the page you want to visit
url="https://www.npr.org/2019/12/29/792241464/civil-rights-leader-rep-john-lewis-to-start-treatment-for-pancreatic-cancer"
#request page using the request library
page=requests.get(url)


#create soup - parse HTML of webpage
soup=BeautifulSoup(page.content,'html.parser')

headline=soup.find("h1").text
date=soup.find("time").text
body = soup.find_all('p')


### regex to remove tags and other irrelevant bits
date_final = re.sub("\n","",date)

webdata = [headline, date_final, body]

writer.writerow (webdata)
df = pd.read_csv('webscraping_test1.csv')
CTan

1 Answer

To get the specific text you want, find the element by its class rather than by tag name alone, since soup.find_all('p') returns every paragraph on the page. There is more about this in the answer "How to find elements by class".
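As a minimal sketch of the class-based lookup: the HTML snippet and the class names "byline" and "storytext" below are made up for illustration, so inspect the real page in your browser's developer tools to find the ones it actually uses.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the article page; the class names
# "byline" and "storytext" are assumptions, not taken from the real site.
html = """
<div class="byline"><p>Jane Doe</p></div>
<div class="storytext">
  <p>First paragraph of the story.</p>
  <p>Second paragraph of the story.</p>
</div>
<p>Unrelated footer text.</p>
"""

soup = BeautifulSoup(html, "html.parser")

# Restrict the search to the container with the class you care about,
# then collect only the <p> tags inside it.
container = soup.find("div", class_="storytext")
paragraphs = [p.get_text(strip=True) for p in container.find_all("p")]

# Join the paragraphs into one string so it fits in a single CSV cell.
body_text = " ".join(paragraphs)

# The same idea works for a single element such as the author.
author = soup.find("div", class_="byline").get_text(strip=True)
```

Calling get_text on each tag also matters for the CSV: writing the raw result of find_all puts a list of Tag objects, HTML included, into the cell instead of plain text.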

For the CSV problem, since you are already using pandas, I think you'll be better off building a DataFrame and writing it out with pandas' to_csv function instead of using csv.writer.
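A rough sketch of that approach, using placeholder values in place of the scraped fields (the row contents are invented; the column names and the file name nprtest1.csv come from the question). to_csv writes the header and the rows in one call and closes the file for you, which sidesteps problems such as reading a file back while an open csv writer still has unflushed rows.

```python
import os
import tempfile

import pandas as pd

# Placeholder values standing in for the scraped fields (illustrative only).
rows = [
    {
        "headline": "Example headline",
        "date": "December 29, 2019",
        "author": "Jane Doe",
        "body": "First paragraph. Second paragraph.",
    }
]

df = pd.DataFrame(rows, columns=["headline", "date", "author", "body"])

# Written to a temporary directory here so the sketch does not clutter the
# working directory; in the real script pass the path you want, e.g. "nprtest1.csv".
out_path = os.path.join(tempfile.mkdtemp(), "nprtest1.csv")

# index=False keeps the DataFrame's row index out of the file.
df.to_csv(out_path, index=False, encoding="utf8")

# Read the same file back to confirm the row was written.
check = pd.read_csv(out_path)
```

Note that the file read back is the same one that was written. In the question's code the two file names differ (nprtest1.csv and webscraping_test1.csv), which is worth checking as well.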