
So I'm trying to make a Python script for a thesaurus. I'm a student and will be using it for writing essays, etc., to save time when changing words. So far I've been able to open thesaurus.com with my intended search word, but I can't seem to figure out how to copy the first 5 returned words, put them in a list, and then print them out.

At this point, I've checked YouTube and Google. I've also tried searching on Stack Overflow, but it was of little help, so I'm asking for help please. This is what my code looks like:

import webbrowser as wb

word = str(input()).lower()
returned_words_list = []
url = 'https://www.thesaurus.com/browse/{}'.format(word)

# Opens the results page in a new browser tab
wb.open(url, new=2)

I just want it to print returned_words_list to the console at this point. So far I can't even get it to automatically pull the words from the website.

  • https://www.crummy.com/software/BeautifulSoup/bs4/doc/ – Iain Shelvington Jul 28 '19 at 06:25
  • You have to scrape. I wrote a tutorial a while back on how to do so: https://www.ankuroh.com/programming/web-scraping-using-python-text-scraping/, there are many other tutorials too, look for "scraping using beautifulsoup" – Ankur Sinha Jul 28 '19 at 06:28
  • 1
    You can use BautifulSoup (https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to scrap a page and then process it and get exactly what you want. This is very useful if the page has a consistent structure. However you want to check at the site if scrapping is allowed. I will not be very suprised if scrapping is not allowed. However for research purposes it might be allowed. You may want to check with the website. The other possible solution is you may want to check if they have Python APIs to interact with their website. APIs are also a good way to interact. Hope this helped. – Amit Jul 28 '19 at 06:30
  • This is so helpful. Thank you all – Chukwudumebi Orji Jul 28 '19 at 06:38
  • 1
    it is called "scraping", ("web scraping", "screen scraping"). To scrape you have to: get HTML from server - modules [requests](https://2.python-requests.org/en/master/), [urllib](https://docs.python.org/3/library/urllib.request.html) - and get data from HTML - modules [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/), [lxml](https://lxml.de/). If page uses JavaScript then it is not enought because these modules don't run JavaScript. You need [Selenium](https://selenium-python.readthedocs.io/) to control real web browser which runs JavaScript. – furas Jul 28 '19 at 06:41
  • Python also has a more complex module – [Scrapy](https://scrapy.org/). – furas Jul 28 '19 at 06:43
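
If the page relied on JavaScript to render its results, the browser-driven approach furas describes could look roughly like the sketch below. This is a minimal sketch, not taken from any of the answers; it assumes Selenium with a local chromedriver, and that the synonym links carry a data-linkid attribute as used in the last answer below.

# Minimal Selenium sketch (assumption: chromedriver is installed and on PATH,
# and the synonym links on thesaurus.com carry a data-linkid attribute)
from selenium import webdriver
from selenium.webdriver.common.by import By

word = input().lower()
driver = webdriver.Chrome()
driver.get('https://www.thesaurus.com/browse/{}'.format(word))

# Grab the first five synonym links after the page has rendered
links = driver.find_elements(By.CSS_SELECTOR, 'li > span > a[data-linkid]')[:5]
print([link.text for link in links])

driver.quit()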

3 Answers


Looking at the web traffic, the page makes a request to a different URL which returns the results. You can use that endpoint, with a couple of headers, to get all the results in JSON format. Then, following this answer by @Martijn Pieters (+ to him), provided you use a generator you can restrict the iterations with islice from itertools. You could of course just slice the full list comprehension as well. Results are returned in descending order of similarity, which is particularly useful here as you get the words with the highest similarity scores.


generator

import requests
from itertools import islice

# JSON endpoint the page requests in the background; Referer and User-Agent
# headers match what the browser sends
headers = {'Referer': 'https://www.thesaurus.com/browse/word', 'User-Agent': 'Mozilla/5.0'}
word = str(input()).lower()
r = requests.get('https://tuna.thesaurus.com/relatedWords/{}?limit=6'.format(word), headers=headers).json()

if r['data']:
    # Take the first 5 terms from a generator over the synonym entries
    synonyms = list(islice((i['term'] for i in r['data'][0]['synonyms']), 5))
    print(synonyms)
else:
    print('No synonyms found')

list comprehension

import requests

headers = {'Referer':'https://www.thesaurus.com/browse/word','User-Agent' : 'Mozilla/5.0'}
word = str(input()).lower()
r = requests.get('https://tuna.thesaurus.com/relatedWords/{}?limit=6'.format(word), headers = headers).json()
if r['data']:
    synonyms = [i['term'] for i in r['data'][0]['synonyms']][:5]
    print(synonyms)
else:
    print('No synonyms found')
QHarr

As the comments have mentioned, BeautifulSoup (bs4) is a great library for this. You can use bs4 to parse the entire page, then zone in on the elements you want: first the ul element that contains the words, and then the a elements that each hold a word.

import requests
from bs4 import BeautifulSoup

word = "hello"
url = 'https://www.thesaurus.com/browse/{}'.format(word)
r = requests.get(url)
returned_words_list = []

soup = BeautifulSoup(r.text, 'html.parser')
# The ul element that contains the synonym words, matched by its CSS class
word_ul = soup.find("ul", {"class": 'css-1lc0dpe et6tpn80'})
for idx, elem in enumerate(word_ul.findAll("a")):
    returned_words_list.append(elem.text.strip())
    if idx >= 4:
        # Stop after the first 5 words
        break

print(returned_words_list)
James

To find the results in the markup I would rely on the data-linkid attribute:

  1. First way, based on BeautifulSoup:
import requests
from bs4 import BeautifulSoup

word = str(input()).lower()
url = 'https://www.thesaurus.com/browse/{}'.format(word)

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
result = soup.select('li > span > a[data-linkid]')[:5]

for link in result:
    print(link.string)
  2. Second way, based on lxml:
import requests
from lxml import etree

word = str(input()).lower()
url = 'https://www.thesaurus.com/browse/{}'.format(word)

response = requests.get(url)
tree = etree.HTML(response.text)
result = tree.xpath('//li/span/a[@data-linkid]')[:5]

for link in result:
    print(link.text)

P.S. In general, parsing HTML is not the best approach in the long term; I would look at free REST services such as http://thesaurus.altervista.org/ instead.
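
For illustration, a call to such a REST service could look roughly like the sketch below. This is only an assumption about the shape of the API: the endpoint path, the word/language/key/output parameter names, and the response layout are hypothetical placeholders, and you would need to register for an API key and check the service's actual documentation.

import requests

API_KEY = 'your_api_key_here'  # hypothetical: obtain a real key from the service

word = str(input()).lower()

# Hypothetical endpoint and parameter names; consult the service's documentation
response = requests.get(
    'http://thesaurus.altervista.org/thesaurus/v1',
    params={'word': word, 'language': 'en_US', 'key': API_KEY, 'output': 'json'},
)
data = response.json()

# Assumed response layout: a list of entries, each with a '|'-separated synonym string
synonyms = []
for entry in data.get('response', []):
    synonyms.extend(entry['list']['synonyms'].split('|'))

print(synonyms[:5])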

vladimir