Python Beautifulsoup scraping script unpacking, hardcoding and duplication

Question

I'm practising some Python scraping and I'm a bit stuck with the following exercise. The aim is to scrape the tickers resulting when applying some filters. Code below:

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup

tickers = []
counter = 1

while True:
    url = ("https://finviz.com/screener.ashx?v=111&f=cap_large&r="+ str(counter))
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    webpage = urlopen(req).read()
    html = soup(webpage, "html.parser")

    rows = html.select('table[bgcolor="#d3d3d3"] tr')
    for i in rows[1:]:
        a1, a2, a3, a4 = (x.text for x in i.find_all('td')[1:5])
        i = a1
        tickers.append(i)
    counter+=20
    if tickers[-1]==tickers[-2]:
        break

I'm not sure how to extract only one column, so I'm unpacking all of them (a1, a2, a3, a4 = (x.text for x in i.find_all('td')[1:5])). Is there a way to get just the first column?

Is there a way to avoid having to hardcode '20' in the script?

When I run the code, it adds a duplicate of the last ticker. Is there another way to make the loop stop once it has gone through all the entries?

how about just using indexes as you almost do, just `i.find_all('td')[1]` or any other column you want — Matiiss, Dec 29 '21 at 21:59
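The indexing suggestion in the comment above can be tried offline. This is a minimal sketch that assumes a table shaped like the one in the question; SAMPLE_HTML and its ticker values are made up for illustration and are not real finviz markup.

```python
from bs4 import BeautifulSoup

# Made-up HTML that mimics the layout described in the question:
# one leading row to skip, then data rows with the ticker in the second cell.
SAMPLE_HTML = """
<table bgcolor="#d3d3d3">
  <tr><td>No.</td><td>Ticker</td><td>Company</td></tr>
  <tr><td>1</td><td>AAA</td><td>Alpha Corp</td></tr>
  <tr><td>2</td><td>BBB</td><td>Beta Inc</td></tr>
</table>
"""

html = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = html.select('table[bgcolor="#d3d3d3"] tr')

# Skip the first row, then index a single cell instead of unpacking four.
first_column = [row.find_all("td")[1].text for row in rows[1:]]
print(first_column)
```

Indexing one cell avoids the unused a2, a3 and a4 names, and it will not raise an unpacking error if a row has a different number of cells in the slice.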

QHarr · Answer 1 · 2021-12-29T23:49:36.937

You can use an nth-child range to skip the first row of the table, then nth-child(2) to pick the ticker column from the remaining rows:

tickers = [td.text for td in html.select('table[bgcolor="#d3d3d3"] tr:nth-child(n+2) td:nth-child(2)')]

To add to an existing list, use:

tickers.extend([td.text for td in html.select('table[bgcolor="#d3d3d3"] tr:nth-child(n+2) td:nth-child(2)')])

Read about nth-child here:

http://nthmaster.com/

and

https://developer.mozilla.org/en-US/docs/Web/CSS/:nth-child
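To see what the two nth-child parts of the selector match, here is a small offline sketch. The HTML string is invented for illustration and only imitates the table structure assumed in this answer.

```python
from bs4 import BeautifulSoup

# Invented markup; the real page may be structured differently.
SAMPLE_HTML = """
<table bgcolor="#d3d3d3">
  <tr><td>No.</td><td>Ticker</td><td>Company</td></tr>
  <tr><td>1</td><td>AAA</td><td>Alpha Corp</td></tr>
  <tr><td>2</td><td>BBB</td><td>Beta Inc</td></tr>
</table>
"""

html = BeautifulSoup(SAMPLE_HTML, "html.parser")

# tr:nth-child(n+2) matches every row from the second one onwards,
# td:nth-child(2) matches the second cell within each of those rows.
selector = 'table[bgcolor="#d3d3d3"] tr:nth-child(n+2) td:nth-child(2)'
nth_tickers = [td.text for td in html.select(selector)]
print(nth_tickers)
```

Changing n+2 to n+3 would skip the first two rows, and changing the cell index to 3 would pick the company column instead.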

You can stop when no "next" link is present on the page. With this approach, counter still needs to increment by 20 on each request.

import requests
from bs4 import BeautifulSoup as bs

tickers = []
counter = 1

with requests.Session() as s:
    s.headers = {'User-Agent':'Mozilla/5.0'}
    while True:
        # print(counter)
        url = ("https://finviz.com/screener.ashx?v=111&f=cap_large&r="+ str(counter))
        res = s.get(url)
        html = bs(res.text, "html.parser")
        tickers.extend([td.text for td in html.select('table[bgcolor="#d3d3d3"] tr:nth-child(n+2) td:nth-child(2)')])
        
        if html.select_one('.tab-link b:-soup-contains("next")') is None:
            break
        counter+=20
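The stop condition can be checked in isolation, without any network access. The sketch below assumes pagination links that look roughly like the two fragments shown; both fragments and the has_next_page helper are hypothetical and exist only for this illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical pagination fragments; the real finviz markup may differ.
PAGE_WITH_NEXT = '<a class="tab-link" href="screener.ashx?r=21"><b>next</b></a>'
LAST_PAGE = '<a class="tab-link" href="screener.ashx?r=1"><b>prev</b></a>'


def has_next_page(markup):
    """Return True when the fragment contains a "next" pagination link."""
    html = BeautifulSoup(markup, "html.parser")
    # select_one returns the first matching element, or None if nothing matches.
    return html.select_one('.tab-link b:-soup-contains("next")') is not None


print(has_next_page(PAGE_WITH_NEXT))
print(has_next_page(LAST_PAGE))
```

Because the loop breaks as soon as the check fails, it does not depend on comparing the last two collected tickers. Note that the :-soup-contains pseudo-class needs a reasonably recent soupsieve release.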

HedgeHog · Accepted Answer · 2021-12-30T09:56:39.967

Since you are only interested in the values of the ticker column, select them more specifically, based on the <a> elements that hold them:

html.select('table[bgcolor="#d3d3d3"] a.screener-link-primary')

To avoid the hardcoded 20, check whether there is a next-page element and use its href:

html.select_one('.tab-link:-soup-contains("next")')

Example

import requests,time
from bs4 import BeautifulSoup

url = "https://finviz.com/screener.ashx?v=111&f=cap_large"
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36','accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9'}
tickers = []

while True:
    r = requests.get(url, headers=headers)
    html = BeautifulSoup(r.text, "html.parser")

    for a in html.select('table[bgcolor="#d3d3d3"] a.screener-link-primary'):
        tickers.append(a.text)

    # look up the "next" link once and reuse the result
    next_link = html.select_one('.tab-link:-soup-contains("next")')
    if next_link is None:
        break
    url = "https://finviz.com/" + next_link['href']
    # be kind and add some delay between your requests
    time.sleep(1)

print(tickers)
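The same follow-the-href idea can be exercised without hitting a real site. The sketch below is illustrative only: the example.com URLs, the FAKE_PAGES dictionary, the a.ticker selector and the crawl helper are all assumptions made up for this demo, and urljoin is swapped in for the string concatenation so that relative hrefs resolve against the current URL.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical two-page site held in memory, keyed by absolute URL.
FAKE_PAGES = {
    "https://example.com/list?page=1": (
        '<a class="ticker">AAA</a>'
        '<a class="tab-link" href="list?page=2">next</a>'
    ),
    "https://example.com/list?page=2": '<a class="ticker">BBB</a>',
}


def crawl(start_url, fetch):
    """Collect ticker texts, following "next" links until none is left."""
    found = []
    url = start_url
    while url:
        html = BeautifulSoup(fetch(url), "html.parser")
        found.extend(a.text for a in html.select("a.ticker"))
        next_link = html.select_one('.tab-link:-soup-contains("next")')
        # Resolve the relative href against the current URL, or stop.
        url = urljoin(url, next_link["href"]) if next_link else None
    return found


result = crawl("https://example.com/list?page=1", FAKE_PAGES.__getitem__)
print(result)
```

Passing the fetch function in as an argument keeps the loop testable; for real requests it could be replaced by a small wrapper around requests.get that returns the response text.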

edited Dec 30 '21 at 09:56

answered Dec 29 '21 at 23:34


Thanks @hedgehog, but when I run the code above I get the following error: File "", line 14 while url if (url:= html.select_one('.tab-link:-soup-contains("next")')) else False: ^ SyntaxError: invalid syntax – Tom Dec 30 '21 at 09:53
Looks like you have an older version of my answer. Refresh the page and take a look at the example, which I edited last night to be clearer. – HedgeHog Dec 30 '21 at 10:01
Thanks, this works perfectly. What's the difference between using a.screener-link-primary and screener-link-primary? Lastly, what's the difference between find, select and select_one? – Tom Dec 30 '21 at 21:53
`a.screener-link-primary` is more specific, in case there are other elements with that class. Websites change from time to time, but changes to structure are rarer than changes to styles. Therefore it is always a good strategy to select by elements or ids rather than by classes. Regarding the second question, see [difference between find() and select()](https://stackoverflow.com/questions/38028384/beautifulsoup-difference-between-find-and-select) – HedgeHog Dec 30 '21 at 23:15
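Both questions in the comments above can be illustrated with a short offline sketch. The markup is invented for the demo: it deliberately puts the class on a non-link element to show why the tag-qualified selector is narrower.

```python
from bs4 import BeautifulSoup

# Invented markup: the same class appears on two links and on one span.
SAMPLE_HTML = """
<div>
  <a class="screener-link-primary" href="#a">AAA</a>
  <span class="screener-link-primary">not a link</span>
  <a class="screener-link-primary" href="#b">BBB</a>
</div>
"""

html = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Class only: matches any element carrying the class.
by_class = [el.text for el in html.select(".screener-link-primary")]

# Tag plus class: matches only <a> elements carrying the class.
by_tag_and_class = [el.text for el in html.select("a.screener-link-primary")]

# select() takes a CSS selector and returns a list of all matches.
# select_one() takes a CSS selector and returns the first match or None.
first_css = html.select_one("a.screener-link-primary")

# find() and find_all() take a tag name and keyword filters, not CSS.
first_find = html.find("a", class_="screener-link-primary")
all_find = html.find_all("a", class_="screener-link-primary")

print(by_class)
print(by_tag_and_class)
print(first_css.text, first_find.text, len(all_find))
print(html.select_one("a.missing"))
```

In short, select and select_one speak CSS, while find and find_all use Beautiful Soup's own filter arguments; each pair has one variant that returns a single element and one that returns every match.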
