I have a list of URLs from which I would like to scrape text using Python. So far I've built a script that returns URLs based on a Google search with keywords, and now I would like to scrape the content of those URLs. The problem is that I'm currently scraping the ENTIRE page, including the layout/style info, while I only want the 'visible text'. Ultimately, my goal is to scrape these URLs for names and store them in a pandas DataFrame, perhaps even counting how often certain names are mentioned, but that is for later. Below is a rather simple start of my code so far:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
from time import sleep
from random import randint
import en_core_web_sm
import pandas as pd
url_list = [
    "https://www.nhtsa.gov/winter-driving-safety",
    "https://www.safetravelusa.com/",
    "https://www.theatlantic.com/business/archive/2014/01/how-2-inches-of-snow-created-a-traffic-nightmare-in-atlanta/283434/",
    "https://www.wsdot.com/traffic/passes/stevens/",
]
df = pd.DataFrame(url_list, columns=['url'])
df_Names = []

# load the English language model
nlp = en_core_web_sm.load()
# extract named entities as [text, label] pairs from a string of text
def spacy_entity(text):
    doc = nlp(text)
    return [[w.text, w.label_] for w in doc.ents]
for index, url in df.iterrows():
    print(index)
    print(url['url'])
    # pause between requests to avoid hammering the servers
    sleep(randint(2, 5))
    req = Request(url['url'], headers={"User-Agent": 'Mozilla/5.0'})
    webpage = urlopen(req).read()
    # NOTE: get_text() on the full soup is what pulls in the layout/style text
    text = BeautifulSoup(webpage, 'html5lib').get_text()
    df_Names.append(spacy_entity(text))
df["Names"] = df_Names