
I have a list of URLs from which I would like to scrape the text using Python. So far I've managed to build a script that returns URLs based on a Google search with keywords, and now I would like to scrape the content of those URLs. The problem is that I'm currently scraping the ENTIRE page, including the layout/style markup, whereas I only want the 'visible text'. Ultimately, my goal is to scrape the names from all these URLs and store them in a pandas DataFrame, perhaps even including how often certain names are mentioned, but that is for later. Below is a rather simple start of my code so far:

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
from time import sleep
from random import randint
import en_core_web_sm
import pandas as pd

url_list = ["https://www.nhtsa.gov/winter-driving-safety", "https://www.safetravelusa.com/", "https://www.theatlantic.com/business/archive/2014/01/how-2-inches-of-snow-created-a-traffic-nightmare-in-atlanta/283434/", "https://www.wsdot.com/traffic/passes/stevens/"]

df = pd.DataFrame(url_list, columns=['url'])
df_Names = []

# load english language model
nlp = en_core_web_sm.load()

# find named entities (e.g. PERSON, GPE) in raw text with spaCy
def spacy_entity(text):
    doc = nlp(text)
    # each entity becomes a [text, label] pair, e.g. ['Atlanta', 'GPE']
    return [[ent.text, ent.label_] for ent in doc.ents]

for index, row in df.iterrows():
    print(index, row['url'])
    sleep(randint(2, 5))  # polite random delay between requests
    req = Request(row['url'], headers={"User-Agent": 'Mozilla/5.0'})
    webpage = urlopen(req).read()
    # get_text() flattens the WHOLE document, including script/style content
    text = BeautifulSoup(webpage, 'html5lib').get_text()
    df_Names.append(spacy_entity(text))
df["Names"] = df_Names
CrossLord

1 Answer


For getting the visible text with BeautifulSoup, there is already this answer: BeautifulSoup Grab Visible Webpage Text
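In case the link goes away, here is a minimal sketch of that approach (assuming you keep html5lib as the parser, and that html is the raw page content fetched as in your loop). It keeps only text nodes whose parents a browser would actually render:

from bs4 import BeautifulSoup, Comment

def tag_visible(element):
    # discard text nodes living in containers the browser never renders
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    # discard HTML comments as well
    if isinstance(element, Comment):
        return False
    return True

def visible_text(html):
    soup = BeautifulSoup(html, 'html5lib')
    texts = soup.find_all(string=True)  # every text node in the tree
    return " ".join(t.strip() for t in texts if tag_visible(t) and t.strip())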

Once you have your visible text, if you want to extract "names" (I'm assuming by names here you mean "nouns"), you can check the nltk package (or TextBlob) in this other answer: Extracting all Nouns from a text file using nltk
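A short sketch of the nltk route from that answer (assumptions on my side: the punkt tokenizer and perceptron tagger models need a one-time download, and extract_nouns is a name I made up):

import nltk

# one-time model downloads for tokenizing and POS tagging
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

def extract_nouns(text):
    tokens = nltk.word_tokenize(text)
    tagged = nltk.pos_tag(tokens)
    # NN, NNS, NNP, NNPS cover common and proper nouns
    return [word for word, tag in tagged if tag.startswith('NN')]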

Once you apply both, you can ingest the output into a pandas DataFrame.
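For example, wiring the two helpers together could look roughly like this (visible_text and extract_nouns are the hypothetical helpers sketched above; Counter gives you the per-page mention counts you describe as a later goal):

from collections import Counter
from urllib.request import Request, urlopen
import pandas as pd

rows = []
for url in url_list:  # url_list as defined in your question
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    html = urlopen(req).read()
    nouns = extract_nouns(visible_text(html))
    # one row per (url, noun) pair, with its mention count
    for noun, count in Counter(nouns).items():
        rows.append({"url": url, "name": noun, "count": count})

df_names = pd.DataFrame(rows, columns=["url", "name", "count"])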

Note: Please notice that extracting the visible text from HTML is still an open problem. These two papers highlight the problem far better than I can, and both use machine learning techniques: https://arxiv.org/abs/1801.02607, https://dl.acm.org/doi/abs/10.1145/3366424.3383547. Their respective GitHub repositories: https://github.com/dalab/web2text, https://github.com/mrjleo/boilernet

purple_lolakos
  • Hi there, good day dear mazzespazze (very cool name by the way ;)). Many thanks for providing this great answer, and for the quick help! – malaga Feb 09 '21 at 12:05
  • @malaga Good to be of help for anyone :) I guess you are dealing with the same 'problem' somehow? – purple_lolakos Feb 09 '21 at 12:10
  • Hi dear mazzespazze, great to hear from you. Yes, I am dealing with some problems here: https://stackoverflow.com/questions/66111803/findall-posts-the-corresponding-threads-in-vbulletin-bs4-scraper-running-over. We have a demo site of vBulletin (it is open, we need no registration). I need to work out the logic of gathering info on a certain user xy: a. gathering the posts and threads of a certain author, and besides that, getting the whole (!) threads an author is involved in. This of course includes going through all the pages (see the attached images). Any idea?! – malaga Feb 09 '21 at 12:16
  • @mazzespazze, thanks for the input! The script in the first link clearly works, and I think I'll be able to make the second script work for my case too :) – CrossLord Feb 10 '21 at 14:36