Scraping Page Doesn't Return All HTML

Question

I'm trying to scrape the data from this web page: https://relatedwords.org/relatedto/sport

I have been able to get it to work locally by manually downloading the web pages, saving them as a .txt file and then using this code:

from bs4 import BeautifulSoup

def from_file():
    search_files = ['sport.txt', 'event.txt']
    my_word_list = []
    for file in search_files:
        with open(file, 'r', errors='ignore') as f:
            html = f.read()
            soup = BeautifulSoup(html, 'html.parser')
            # Each related word is the text of an <a class="item"> link
            items = soup.find_all('a', class_='item')
            for item in items:
                # Keep the text between the opening tag's '>' and the closing '</a>'
                item_split = str(item).find('>') + 1
                my_word = str(item)[item_split:-4]
                if my_word not in my_word_list:
                    my_word_list.append(my_word)
    return my_word_list

To scrape the site I tried lots of different BeautifulSoup approaches until I realized the request wasn't returning the class="item" HTML elements I am trying to parse. I walked my code back to this point, where I could isolate the problem:

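from urllib.request import Request, urlopen

def from_web():
    my_link = 'https://relatedwords.org/relatedto/olympic'
    my_page = Request(my_link, headers={'User-Agent': 'Mozilla/5.0'})
    # Raw bytes of the response; the class="item" links are missing here
    webpage = urlopen(my_page).read()
    print(webpage)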

To figure this out I looked here, and at several other answers that recommended using requests.get() with either 'html.parser' or 'html5lib' as the parser. Those solutions did not work.

If someone could point me in the right direction I would appreciate it.

Thank you for the help!

Are you trying to get the `href` values, such as `/relatedto/athletics/`? – αԋɱҽԃ αмєяιcαη Nov 23 '19 at 14:29
I don't know whether my answer is what you were looking for. By the way, you can loop over the `href` values, join each one to the base URL, and parse each resulting URL inside the loop. – αԋɱҽԃ αмєяιcαη Nov 23 '19 at 14:39
In case you are asking why `requests` doesn't return those elements: the site runs `JavaScript` that loads the content after the initial page load. If you don't want to use `selenium`, you can use `dryscrape`. – αԋɱҽԃ αмєяιcαη Nov 23 '19 at 14:42
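
One way to check the JavaScript explanation above is to count the item links in the raw response, before any script has run. The following is a minimal sketch using only the standard library; the names `ItemLinkCounter`, `count_item_links` and `count_item_links_at` are made up for illustration, and it assumes the words are rendered as `<a class="item">` links, as in the question.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class ItemLinkCounter(HTMLParser):
    """Counts <a> tags whose class attribute includes 'item'."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get("class") or "").split()
        if tag == "a" and "item" in classes:
            self.count += 1


def count_item_links(html):
    """Return how many <a class="item"> tags appear in an HTML string."""
    counter = ItemLinkCounter()
    counter.feed(html)
    return counter.count


def count_item_links_at(url):
    """Download the raw (unrendered) HTML and count the item links in it."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request) as response:
        html = response.read().decode("utf-8", errors="ignore")
    return count_item_links(html)


# Example call (needs network access):
# count_item_links_at("https://relatedwords.org/relatedto/sport")
```

If the count for the raw response is 0 while the browser shows many links, the links are most likely being added by a script after the page loads, and a plain HTTP request will not see them.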

score 3 · Answer 1 · answered Nov 23 '19 at 14:35

from selenium import webdriver
from bs4 import BeautifulSoup

# A real browser runs the page's JavaScript, so page_source contains the item links
browser = webdriver.Firefox()
url = 'https://relatedwords.org/relatedto/sport'
browser.get(url)
source = browser.page_source
browser.quit()  # close the browser once the HTML has been captured
soup = BeautifulSoup(source, 'html.parser')
for item in soup.find_all('a', attrs={'class': 'item'}):
    print(item.text)

Result:

athletics
spectator sport
competition
game
racing
gymnastics
sportsman
soccer
rugby union
association football
downfield
offside
cycling
tennis
polo
team
hockey
football
skating
professional sport
athletic
run
call
referee
kill
spar
judo
ineligible
wipeout
schuss
luge
athletic game
team sport
archery
upfield
contact sport
professional football
funambulism
toboggan
professional baseball
professional basketball
personal foul
bobsled
outdoor sport
skiing
riding
skateboard
speed skate
jackknife
ski
sportswoman
rollerblade
figure skate
rowing
ice skate
roller skate
fun
regulation time
play
physical activity
disport
lark
blood sport
baseball
daisy cutter
boast
mutation
frolic
skylark
romp
gambol
mutant
feature
frisk
coach
rugby
cavort
rollick
volleyball
water ski
sudden death
basketball
run around
lark about
sportive
television
sportsmanship
multisport
champion
chess
badminton
position
sportaccord
diversion
olympic games
council of europe
leisure
recreational
challenge
athlete
athletes
olympic
coaches
games
playing
variation
summercater
entertainment
tournament
season
playoffs
athleticism
dexterity
sleigh
ref
sled
ironman
manager
skate
handler
jog
defense
defence
trial
series
humour
side
english
tuck
humor
save
pack
possession
foul
stroke
shot
equitation
sledding
row
away
aquatics
recreation
backpack
toss
pass
flip
occupation
line
job
business
rappel
sumo
comedy
mountaineer
home
hike
overhand
tramp
disqualified
box
legal
cut
underhanded
dribble
loose
carry
racket
drive
jocularity
punt
kick
submarine
bandy
witticism
down
underhand
drop
hurdle
clowning
snorkel
umpire
surfboard
shoot
surf
field
start
curl
seed
underarm
surge
turn
round
onside
bout
defending
paddle
kayak
average
wit
lead
deficit
timer
canoe
shooter
scull
scout
bob
timekeeper
lacrosse
tradition
goal
biathlon
dodgeball
running
pastime
floorball
soccerplex
skin-dive
one-on-one
jocosity
wittiness
man-to-man
logrolling
waggishness
double-team
spread-eagle
abseil
overhanded
windsurf
overarm
most-valuable
waggery
birling
shadowbox
outclass
offsides
prizefight
sports
motorsport
sportful
sporter
gameday
sportsaholic
nonsports
footballer
outsport
sportless
lusorious
acrobatic
powerlifting
sportlike
rugger
paddlesport
go
sportsplex
gamesome
sporting
pickleball
postseason
professional
passtime
kiteboarding
competitive
slalom
sportsperson
club
birle
competing
sportling
skateboarder
olympics
world
british english
racquet
american english
compete
bowling
competes
competitions
diving
dropkick
sportsfield
clubs
skater
formula
golf
racer
equestrianism
cheerlead
pharaoh
race
minigame
bike
swimming
snowboard
bicycle
tie-breaking methods
championship
motorcycle
gamely
brand
youth
nascar
iran
federation
model
ever
f1
uci
teams
puck
track
wrestling
racquetball
competitor
riders
cricket
postgame
subbuteo
enthusiasts
trashsport
popular
sports league
super
championships
powerboating
jousting
racers
class
sponsorship
event
netball
friendly
softball
models
driving
best
women
amateur
good
association
experience
peloponnese
car
venue
players
well
roller
for
fia
pigskin
motocross
competed
fit
standards
leagues
drivers
european
national
tour
fitness
cars
esports
transgender
wogball
game plan
free agent
press box
professional boxing
professional wrestling
horseback riding
bench warmer
water sport
track and field
out of play
professional golf
talent scout
ski jump
professional tennis
gymnastic exercise
line of work
field sport
follow through
defending team
free agency
at home
rock climbing
won-lost record
sit out
warm the bench
rope down
iron man
tightrope walking
ride the bench
mind sport
bucketball
tennikoit
association of ioc recognised international sports federations
snowsport
gambling
individual sport
sport game
nongame
contract bridge
gamification
extreme sport
water polo
subgame
metagaming
nongamer
vacationer
gameplayer
basket ball
regulation of sport
rioting
gaymer
sportsbook
hooliganism
table tennis
zourkhaneh
gameography
sports journalism
rough sport
watersport
skibobbing
ping pong
roller hockey
sport venue
fanwear
violent sport
sport card
education
broadcasting of sports events
real tennis
olympic sport
outdoor game
cross-country
soccerball
snow ski
wintersports
field hockey
sports betting
woodball
telegaming
bench warm
old french
devise structure activity
concussion
combat sport
canadian football
blow football
four square
disability
good sport
winter sport
drop shoot
fun game
hockey skate
australian rule football
ball sport
recreational activity
internet
olympic game
professional wrestle
field game
formula 1
bicycle race
ballpark frank
cue sports
equestrian sport
horse ride
video game
ice hockey
sport stack
dangerous sport
fun sport
ball hawk
table game
wage
salary
soccer player
hat tournament
sports day
ice dance
court game
baseball stadium
fun run
e sport
popular sport
table football
indoors
sport in china
pleasure boat
boxing
ball game
mind game
ball carrier
ancient egypt
pay-per-view
pleasure craft
clay pigeon
pink un
green un
synchronize dive
ancient persia
child's game
american sport
goal kick
guess game
one on one
own goal
place kick
fina
in line skate
tenpin bowl
sport bat
ancient greece
electronic game
olympia, greece
parlour game
mouse wheel
game system
treasure hunt
sportsmen
get hyphy
leisure time
mass media
overtaking
gender identity
grantland rice
nonresident
taekwondo
doping
pierre de coubertin
bobsleigh
player
match fixing
curling
discipline
sportswear
sporty
gym
blood doping
jock
enthusiast
violence in sports
adventure
champ
bundesliga
pleasure
child development
gambler
physical
wear
physical fitness
ballgame
movement
nfl
treadmill
athletic scholarship
exercise
mockery
activity
mate
boy
war on drugs
fanatic
campaign
buddy
participation inequality
movements
son
jeux
illegal drug trade
suv
yuk
hong
pal
willy
physical disability
sportif
suvs
kang
apartheid
ireland
hurling
nationalism
stadium
italy
berlin
sportswomen
intellectual disabilities
radio broadcasting
sportfishing
fifa world cup
2006 fifa world cup
fifa world cup finals
2011 cricket world cup final
united states
national football league
super bowl
hawk-eye
sportspeople
snickometer
garmisch-partenkirchen
untermensch
amateur sport
sports science
running shoe
competitive swimwear
sports engineering
wearable technology
hybrid vehicle
formula one
2014 formula one season
drag reduction system
aerodynamic drag
top speed
formula renault 3.5
deutsche tourenwagen masters
goal-line technology
2014 fifa world cup
2015 fifa women's world cup
premier league
2013–14 premier league
2015–16 bundesliga
rugby league
third umpire
umpire decision review system
international cricket council
1934 fifa world cup
adolf hitler
1936 summer olympics
1936 winter olympics
hot spot
nazi ideology
aryan race
redskins rule
south africa
cultural nationalism
gaelic football
benito mussolini
gaelic athletic association
great britain
croke park
lansdowne road
aviva stadium
royal ulster constabulary
good friday agreement
football war
munich massacre
proceedings of the national academy of sciences
washington redskins
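
If the page renders slowly, reading `page_source` straight after `get()` could in principle run before the links exist. Here is a hedged variant that waits explicitly for the first `a.item` element and runs Firefox headless. The function names `unique_in_order` and `fetch_rendered_words` are hypothetical, the 10-second timeout is an arbitrary choice, and it assumes Firefox and geckodriver are available to Selenium.

```python
def unique_in_order(words):
    """Drop duplicates and empty strings while keeping first-seen order."""
    seen = set()
    result = []
    for word in words:
        word = word.strip()
        if word and word not in seen:
            seen.add(word)
            result.append(word)
    return result


def fetch_rendered_words(url, timeout=10):
    """Render the page in headless Firefox and return the item texts."""
    # Imported here so the helper above works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.firefox.options import Options
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("-headless")  # no visible browser window
    browser = webdriver.Firefox(options=options)
    try:
        browser.get(url)
        # Assumption: the words are rendered as <a class="item"> elements.
        WebDriverWait(browser, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "a.item"))
        )
        links = browser.find_elements(By.CSS_SELECTOR, "a.item")
        return unique_in_order(link.text for link in links)
    finally:
        browser.quit()  # always release the browser, even on timeout


# Example call (needs Firefox, geckodriver and network access):
# fetch_rendered_words("https://relatedwords.org/relatedto/sport")
```

The de-duplication mirrors the `if my_word not in my_word_list` check in the question, so results from several pages can be merged through the same helper.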

score 1 · Accepted Answer · answered Nov 23 '19 at 15:03

No need to use Selenium at all; you can get the data from their API. It also returns a 'score'.

import requests

url = 'https://relatedwords.org/api/related'

term = 'sport'
payload = {'term': term}

# The endpoint returns JSON: a list of records, each with a 'word' key
jsonData = requests.get(url, params=payload).json()

for each in jsonData:
    print(each['word'])
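
Since the API also returns a 'score', the records can be filtered and ranked before use. This is only a sketch: `words_by_score` and `fetch_related` are made-up names, and it assumes that 'score' is numeric and that a higher score means a closer relation, which the answer does not confirm.

```python
def words_by_score(records, min_score=0.0):
    """Return (word, score) pairs at or above min_score, highest first."""
    pairs = [
        # Assumption: 'score' is numeric; records without one count as 0.0
        (record["word"], float(record.get("score", 0.0)))
        for record in records
    ]
    kept = [pair for pair in pairs if pair[1] >= min_score]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


def fetch_related(term, timeout=10):
    """Query the endpoint from the answer and return the parsed JSON."""
    # Imported here so words_by_score works without requests installed.
    import requests

    response = requests.get(
        "https://relatedwords.org/api/related",
        params={"term": term},
        timeout=timeout,
    )
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.json()


# Example call (needs network access):
# for word, score in words_by_score(fetch_related("sport"), min_score=1.0):
#     print(word, score)
```

The `min_score=1.0` threshold in the example is arbitrary; inspect a few real records first to see what range the scores actually fall in.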

chitown88

Oops, I just noticed that they are using an `API` called via a `GET` request, which shows up in the network tab. Well done, chito. – αԋɱҽԃ αмєяιcαη Nov 23 '19 at 15:11
Thanks. Glad you posted the Selenium solution too. It's always good to see multiple solutions/approaches. – chitown88 Nov 23 '19 at 15:13
