I'm trying to scrape the data from this web page: https://relatedwords.org/relatedto/sport
I have been able to get it to work locally by manually downloading the web pages, saving them as .txt files,
and then using this code:
from bs4 import BeautifulSoup

def from_file():
    search_files = ['sport.txt', 'event.txt']
    my_word_list = []
    for file in search_files:
        with open(file, 'r', errors='ignore') as f:
            html = f.read()
        soup = BeautifulSoup(html, 'html.parser')
        # Each related word is an <a class="item"> element
        items = soup.find_all('a', class_='item')
        for item in items:
            # Slice out the text between the opening tag's '>' and the closing '</a>'
            item_split = str(item).find('>') + 1
            my_word = str(item)[item_split:-4]
            if my_word not in my_word_list:
                my_word_list.append(my_word)
    return my_word_list
To scrape the site, I tried lots of different BeautifulSoup approaches until I realized the Request wasn't returning the class="item" HTML elements I am trying to parse. I walked my code back to the point where I could isolate the problem:
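from urllib.request import Request, urlopen

def from_web():
    my_link = 'https://relatedwords.org/relatedto/olympic'
    # Send a browser-like User-Agent with the request
    my_page = Request(my_link, headers={'User-Agent': 'Mozilla/5.0'})
    webpage = urlopen(my_page).read()
    # The printed HTML does not contain the class="item" elements
    print(webpage)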
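To make that check explicit rather than reading the printed bytes by eye, something like the following could count the matching anchors in any HTML string. This is only a sketch: `ItemCounter`, `count_items`, and the two sample strings are made up for illustration, and it uses the standard library's `html.parser` instead of BeautifulSoup so that it runs on its own.

```python
from html.parser import HTMLParser


class ItemCounter(HTMLParser):
    """Counts <a> tags whose class attribute includes 'item' (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        # attrs is a list of (name, value) pairs; a value may be None
        classes = (dict(attrs).get('class') or '').split()
        if 'item' in classes:
            self.count += 1


def count_items(html):
    """Return how many <a class="item"> tags appear in an HTML string."""
    parser = ItemCounter()
    parser.feed(html)
    parser.close()
    return parser.count


# Made-up sample markup, standing in for a saved page and a fetched page.
saved_page = '<div><a class="item" href="/a">football</a><a class="item" href="/b">tennis</a></div>'
fetched_page = '<div id="results"></div><script src="app.js"></script>'

print(count_items(saved_page))    # 2
print(count_items(fetched_page))  # 0
```

With the real response, `count_items(webpage.decode('utf-8', errors='ignore'))` would return 0 if the class="item" elements are missing from what the server sends back, and a positive number for the manually saved files.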
To figure this out I looked here, and at several other answers that recommended using requests.get() with either 'html.parser' or 'html5lib' as the parser. Those solutions did not work.
If someone could point me in the right direction I would appreciate it.
Thank you for the help!