
I'm practising some Python scraping and I'm a bit stuck with the following exercise. The aim is to scrape the tickers that the screener returns after applying some filters. Code below:

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup

tickers = []
counter = 1

while True:
    url = ("https://finviz.com/screener.ashx?v=111&f=cap_large&r="+ str(counter))
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    webpage = urlopen(req).read()
    html = soup(webpage, "html.parser")

    rows = html.select('table[bgcolor="#d3d3d3"] tr')
    for i in rows[1:]:
        a1, a2, a3, a4 = (x.text for x in i.find_all('td')[1:5])
        i = a1
        tickers.append(i)
    counter+=20
    if tickers[-1]==tickers[-2]:
        break

I'm not sure how to extract only one column, so I'm using the code for all of them (a1, a2, a3, a4 = (x.text for x in i.find_all('td')[1:5])). Is there a way to get just the first column?

Is there a way to avoid having to hardcode '20' in the script?

When I run the code, it creates a duplicate of the last ticker. Is there another way to make the code stop once it has gone through all the entries?

Tom
  • How about just using indexes, as you almost do: `i.find_all('td')[1]`, or any other column you want? – Matiiss Dec 29 '21 at 21:59

2 Answers


You can use an nth-child range to filter out the first row of the table, then nth-child(2) to get the tickers column within the remaining table rows:

tickers = [td.text for td in html.select('table[bgcolor="#d3d3d3"] tr:nth-child(n+2) td:nth-child(2)')]

With an existing list use

tickers.extend([td.text for td in html.select('table[bgcolor="#d3d3d3"] tr:nth-child(n+2) td:nth-child(2)')])

Read about nth-child here:

http://nthmaster.com/

and

https://developer.mozilla.org/en-US/docs/Web/CSS/:nth-child
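
To see what the two nth-child parts of the selector do without hitting the site, here is a minimal sketch on a hand-written table. The HTML snippet and the ticker names in it are made up for illustration; only the selector comes from the answer above.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the screener table (not real finviz markup).
SAMPLE_HTML = """
<table bgcolor="#d3d3d3">
  <tr><td>No.</td><td>Ticker</td><td>Company</td></tr>
  <tr><td>1</td><td>AAA</td><td>Alpha Corp</td></tr>
  <tr><td>2</td><td>BBB</td><td>Beta Inc</td></tr>
</table>
"""

html = BeautifulSoup(SAMPLE_HTML, "html.parser")

# tr:nth-child(n+2) matches every row from the second one on (skips the first row);
# td:nth-child(2) then picks the second cell of each of those rows.
selector = 'table[bgcolor="#d3d3d3"] tr:nth-child(n+2) td:nth-child(2)'
tickers = [td.text for td in html.select(selector)]

print(tickers)
```

Changing `td:nth-child(2)` to another index selects a different column from the same rows.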


You can stop when there is no "next" link left on the page. Until then, counter needs to increase by 20 on each request.

import requests
from bs4 import BeautifulSoup as bs

tickers = []
counter = 1

with requests.Session() as s:
    s.headers = {'User-Agent':'Mozilla/5.0'}
    while True:
        # print(counter)
        url = ("https://finviz.com/screener.ashx?v=111&f=cap_large&r="+ str(counter))
        res = s.get(url)
        html = bs(res.text, "html.parser")
        tickers.extend([td.text for td in html.select('table[bgcolor="#d3d3d3"] tr:nth-child(n+2) td:nth-child(2)')])
        
        if html.select_one('.tab-link b:-soup-contains("next")') is None:
            break
        counter+=20
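
The stop condition in the loop above can also be looked at in isolation. The following is only a sketch with a fake page source, so it runs without network access: `collect_all`, `fake_fetch`, `FAKE_ROWS` and `PAGE_SIZE` are hypothetical names, not part of finviz or BeautifulSoup. It also advances the offset by the number of rows actually returned, which assumes the `r=` parameter is the 1-based index of the first row on the page.

```python
from typing import Callable, List, Tuple


def collect_all(fetch_page: Callable[[int], Tuple[List[str], bool]], start: int = 1) -> List[str]:
    """Collect tickers page by page until a page reports no "next" link."""
    tickers: List[str] = []
    offset = start
    while True:
        page_tickers, has_next = fetch_page(offset)
        tickers.extend(page_tickers)
        # Stop before requesting another page, so the last page is never read twice.
        if not has_next or not page_tickers:
            break
        # Assumption: the next page starts right after the rows just received.
        offset += len(page_tickers)
    return tickers


# Fake data source standing in for the HTTP request and the HTML parsing.
FAKE_ROWS = ["AAA", "BBB", "CCC", "DDD", "EEE"]
PAGE_SIZE = 2


def fake_fetch(offset: int) -> Tuple[List[str], bool]:
    first = offset - 1  # convert the 1-based offset to a list index
    rows = FAKE_ROWS[first:first + PAGE_SIZE]
    has_next = first + PAGE_SIZE < len(FAKE_ROWS)
    return rows, has_next


result = collect_all(fake_fetch)
print(result)
```

In a real scraper, `fetch_page` would wrap the request and the two selectors used above, returning the ticker list and whether the "next" element was found.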
QHarr

Since you are only interested in the values of the tickers column, select it more specifically, based on its content, the <a>:

html.select('table[bgcolor="#d3d3d3"] a.screener-link-primary')

To avoid working with the hardcoded 20, check whether there is a next-page element and use its href:

html.select_one('.tab-link:-soup-contains("next")')

Example

import time

import requests
from bs4 import BeautifulSoup

url = "https://finviz.com/screener.ashx?v=111&f=cap_large"
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36','accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9'}
tickers = []

while True:
    r = requests.get(url, headers=headers)
    html = BeautifulSoup(r.text, "html.parser")

    for a in html.select('table[bgcolor="#d3d3d3"] a.screener-link-primary'):
        tickers.append(a.text)

    if html.select_one('.tab-link:-soup-contains("next")'):
        url = "https://finviz.com/"+html.select_one('.tab-link:-soup-contains("next")')['href']
    else:
        break
    # be kind and add some delay between your requests
    time.sleep(1)

tickers
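
As a side note on building the next URL: instead of concatenating "https://finviz.com/" with the href, the standard library's `urljoin` can resolve the href against the page that was just requested. This is a sketch only; the href values below are assumed shapes for illustration and were not taken from the live site.

```python
from urllib.parse import urljoin

current_url = "https://finviz.com/screener.ashx?v=111&f=cap_large"

# Assumed href shapes a "next" link might have (hypothetical examples).
relative_href = "screener.ashx?v=111&f=cap_large&r=21"
rooted_href = "/screener.ashx?v=111&f=cap_large&r=21"
absolute_href = "https://finviz.com/screener.ashx?v=111&f=cap_large&r=21"

# urljoin resolves each form against the current page URL.
next_urls = [urljoin(current_url, href) for href in (relative_href, rooted_href, absolute_href)]

for next_url in next_urls:
    print(next_url)
```

All three forms resolve to the same URL, so the loop keeps working if the site switches between relative and absolute links.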
HedgeHog
  • thanks @hedgehog, but when I run the code above I get the following error: File "", line 14 while url if (url:= html.select_one('.tab-link:-soup-contains("next")')) else False: ^ SyntaxError: invalid syntax – Tom Dec 30 '21 at 09:53
  • Looks like you have an older version of my answer - refresh the page and take a look at the example, edited last night to be clearer. – HedgeHog Dec 30 '21 at 10:01
  • thanks this works perfectly. What's the difference between using a.screener-link-primary vs screener-link-primary? Lastly, what's the difference between using find vs select vs select_one? – Tom Dec 30 '21 at 21:53
  • `a.screener-link-primary` is more specific, in case there are other elements with that class. Websites change from time to time, but changes to structure are rarer than changes to styles. Therefore it is always a good strategy to use elements or ids instead of classes for the selection. Concerning the second question -> [difference between find() and select()](https://stackoverflow.com/questions/38028384/beautifulsoup-difference-between-find-and-select) – HedgeHog Dec 30 '21 at 23:15