How can I web scrape certain words that don't have an attribute attached to them?

Question

First, I should point out that I am very much a beginner at web scraping. I am just starting a project that scrapes data from https://coinmarketcap.com. Currently, I am focused on scraping the names of the cryptocurrencies (e.g., Bitcoin, Ethereum, Tether). However, the best I can get is the name of the currency followed by a bunch of formatting such as color, font-size, class, etc. How can I code this so that I store just the names of the currencies, without this extra information? Here is my current code:

import requests
from bs4 import BeautifulSoup

#array of just crypto names
names = []

#gets content from site
site = requests.get("https://coinmarketcap.com")

#opens content from site
info = site.content
soup = BeautifulSoup(info,"html.parser")

#class ID for name of crypto
type_name = 'sc-1eb5slv-0 iJjGCS'

#crypto names + other unnecessary info
names_raw = soup.find_all('p', attrs={'class': 'sc-1eb5slv-0 iJjGCS'})

for type_name in names_raw:
    print(type_name.text, type_name.next_sibling)

As you can see, I am only 20 lines in but having a pretty tough time figuring this out. I appreciate any help or advice you can give me.

score 1 · Answer 1 · answered Jul 26 '21 at 23:50

To get the names and codes of the cryptocurrencies from this page, you can use the following example:

import requests
from bs4 import BeautifulSoup

url = "https://coinmarketcap.com"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

for td in soup.select("td:nth-of-type(3)"):
    t = " ".join(tag.text for tag in td.select("p, span")).strip()
    print("{:<30} {:<10}".format(*t.rsplit(maxsplit=1)))

Prints:

Bitcoin                        BTC       
Ethereum                       ETH       
Tether                         USDT      
Binance Coin                   BNB       
Cardano                        ADA       
XRP                            XRP       
USD Coin                       USDC      
Dogecoin                       DOGE      
Polkadot                       DOT       
Binance USD                    BUSD      
Uniswap                        UNI       
Bitcoin Cash                   BCH       
Litecoin                       LTC       
Chainlink                      LINK      
Solana                         SOL       
Wrapped Bitcoin                WBTC      
Polygon                        MATIC     
Ethereum Classic               ETC       
Stellar                        XLM       
THETA                          THETA     

...and so on.
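
The last line of the loop does two things at once, so here is a small sketch of just that step, using only the standard library. The sample strings and the helper name `split_name_and_code` are illustrative assumptions, not part of the original answer. `rsplit(maxsplit=1)` splits only at the last run of whitespace, which is why a multi-word name such as "Binance Coin" stays in one piece.

```python
# Sample strings in the "name code" shape that the loop above builds
# (illustrative values, not fetched from the site).
rows = ["Bitcoin BTC", "Binance Coin BNB", "USD Coin USDC"]


def split_name_and_code(text):
    """Split on the last run of whitespace only, so multi-word names stay whole."""
    name, code = text.rsplit(maxsplit=1)
    return name, code


for row in rows:
    name, code = split_name_and_code(row)
    # {:<30} left-aligns the name in a 30-character column, {:<10} the code in 10.
    print("{:<30} {:<10}".format(name, code))
```

Note that a cell containing a single word would make the two-name unpacking fail, so this sketch assumes every row has both a name and a code.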

Wow, that definitely works! I am a little lost with the loop though. If anyone can help me to understand the usage in the loop, that would be very helpful. I very much appreciate your response Andrej. — C W, Jul 27 '21 at 00:15
@CW `soup.select("td:nth-of-type(3)")` selects the third column in the table. Then, in each cell, we find every `<p>` and `<span>` tag, join their text together, and split the result into the name and abbreviation. — Andrej Kesely, Jul 27 '21 at 00:21
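
To make the selector steps described in this comment easier to follow, here is a minimal sketch that runs them against a small inline HTML sample instead of the live site. The markup is a hypothetical simplification (the real page's structure and class names may differ), and `extract_names` is a helper name introduced only for illustration. It also shows that `.text` returns only the text inside a tag, not attributes such as `style` or `class`.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: the real page's structure may differ.
html = """
<table>
  <tr>
    <td>1</td><td>*</td>
    <td><p style="color:black">Bitcoin</p><span>BTC</span></td>
  </tr>
  <tr>
    <td>2</td><td>*</td>
    <td><p style="color:black">Binance Coin</p><span>BNB</span></td>
  </tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")


def extract_names(soup):
    """Return one tuple of tag texts per third-column cell."""
    results = []
    # Step 1: pick the third <td> of every row, i.e. the third column.
    for td in soup.select("td:nth-of-type(3)"):
        # Step 2: inside that cell, collect the text of every <p> and <span>.
        parts = [tag.text for tag in td.select("p, span")]
        results.append(tuple(parts))
    return results


for name, code in extract_names(soup):
    print(name, code)
```

Working on a saved or inline sample like this can make it easier to experiment with selectors without requesting the page repeatedly.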
