Beginner Python Webscraping

Question

I am a beginner in Python and I am working on a web scraping project. In the project, I want to look up the meaning and part of speech (POS) of some words in the Cambridge Dictionary and export them to Excel.

And this is my code:

pip install bs4
pip install requests
from bs4 import BeautifulSoup
import requests
headers = {"User-Agent" : "xxxxxxx"}
r=requests.get('https://dictionary.cambridge.org/dictionary/english/happy', headers=headers)
soup = BeautifulSoup(r.text,'html.parser')
POS = soup.find_all("span", class_="pos dpos")
print(POS)

Result: [<span class="pos dpos" title="A word that describes a noun or pronoun.">adjective</span>, <span class="pos dpos" title="A word that describes a noun or pronoun.">adjective</span>]

From the result, I only want the word 'adjective', but I don't know how to get it. Can anyone help me? Many thanks.

Welcome @pyt. Please follow this for asking question : https://stackoverflow.com/help/how-to-ask — Devang Sanghani, Feb 07 '22 at 10:32
You can parse the HTML like here: https://stackoverflow.com/questions/11804148/parsing-html-to-get-text-inside-an-element — David, Apr 07 '22 at 12:44

score 0 · Answer 1 · answered May 30 '22 at 13:04

First, remove the pip install commands from your script. A library only needs to be installed once, from the command line. After that you can use it by importing it, as you did in lines 3 and 4.

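The attribute you're looking for already appears in your code: it is .text. Store a span in a variable and then read its text with varname.text. Note that find_all returns a list of spans, so .text has to be applied to an individual element, not to the list itself.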
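To illustrate, here is a minimal sketch of that idea. It assumes a small hard-coded HTML snippet (the html string below is a stand-in I made up so the example runs without a network request) instead of the real Cambridge page, and the variable names are only illustrative:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for r.text, so the sketch runs offline.
html = """
<div>
  <span class="pos dpos" title="A word that describes a noun or pronoun.">adjective</span>
  <span class="pos dpos" title="A word that describes a noun or pronoun.">adjective</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
pos_spans = soup.find_all("span", class_="pos dpos")

# Store one span in a variable, then read its text content with .text.
first_span = pos_spans[0]
first_pos = first_span.text
print(first_pos)
```

With the real page you would pass r.text to BeautifulSoup instead of the hard-coded string.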

score 0 · Answer 2 · answered Jun 08 '22 at 20:28

Agreeing with the other answer, you should remove the 2 lines:

     pip install bs4
     pip install requests

as they are not needed in the script: they are shell commands that you run once in a terminal, not Python statements. Your actual problem is that the variable POS is a list containing two "span" tags. You can iterate through the list and print the text of each element, like this:

    for span in POS:
        print(span.text)

This should print "adjective" twice, once for each element. If you only want to print it for a specific element, access that element by index and then call ".text" on it to get the text.
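As a rough sketch of the index approach, again assuming a made-up HTML snippet in place of the downloaded page. The de-duplication step at the end is my own addition, not something the question asked for explicitly, in case you only want each part of speech once:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page.
html = (
    '<span class="pos dpos">adjective</span>'
    '<span class="pos dpos">adjective</span>'
)
soup = BeautifulSoup(html, "html.parser")
POS = soup.find_all("span", class_="pos dpos")

# Access one specific element by index, then read its text.
second_pos = POS[1].text

# Collect the text of every element into a plain list of strings.
pos_texts = [span.text for span in POS]

# Assumption: duplicates are unwanted. dict.fromkeys keeps the first
# occurrence of each value and preserves the original order.
unique_pos = list(dict.fromkeys(pos_texts))

print(second_pos)
print(pos_texts)
print(unique_pos)
```

A plain list of strings like pos_texts is also easier to write to a spreadsheet later than a list of tags.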

The reason you're getting a list is that find_all always returns a list of every matching element, even when there is only one match. Class names are not unique to a single HTML element, so a search by class name can easily match several tags.
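If it helps, the difference between find_all and find can be sketched like this. The HTML string and the class name "no-such-class" are invented for the example:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with two spans that share the same class.
html = (
    '<p><span class="pos dpos">adjective</span> '
    '<span class="pos dpos">noun</span></p>'
)
soup = BeautifulSoup(html, "html.parser")

# find_all returns a list-like result with every match.
all_matches = soup.find_all("span", class_="pos dpos")

# find returns only the first match, as a single tag.
first_match = soup.find("span", class_="pos dpos")

# When nothing matches, find returns None and find_all returns an empty list.
missing_one = soup.find("span", class_="no-such-class")
missing_all = soup.find_all("span", class_="no-such-class")

print(len(all_matches))
print(first_match.text)
print(missing_one is None)
print(len(missing_all))
```

So if you only ever need the first match, find saves you the indexing step, but check for None before calling .text on the result.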

Hope this helps :)
