
I have managed to write code to scrape data from the first page, and now I am stuck on writing a loop to scrape the next 'n' pages. Below is the code.

I would appreciate it if someone could guide/help me write the code to scrape the data from the remaining pages.

Thanks!

from bs4 import BeautifulSoup
import requests
import csv


url = requests.get('https://wsc.nmbe.ch/search?sFamily=Salticidae&fMt=begin&sGenus=&gMt=begin&sSpecies=&sMt=begin&multiPurpose=slsid&sMulti=&mMt=contain&searchSpec=s').text

soup = BeautifulSoup(url, 'lxml')

elements = soup.find_all('div', style="border-bottom: 1px solid #C0C0C0; padding: 10px 0;")
#print(elements)

csv_file = open('wsc_scrape.csv', 'w')

csv_writer = csv.writer(csv_file)

csv_writer.writerow(['sp_name', 'species_author', 'status', 'family'])


for element in elements:
    # Species name is inside the <i> tag
    sp_name = element.i.text.strip()
    print(sp_name)

    # Status badge carries either a success or an error label
    status = element.find('span', class_=['success label', 'error label']).text.strip()
    print(status)

    # The text after the <i> tag holds "author | family"
    author_family = element.i.next_sibling.strip().split('|')
    species_author = author_family[0].strip()
    family = author_family[1].strip()
    print(species_author)
    print(family)
    print()

    csv_writer.writerow([sp_name, species_author, status, family])

csv_file.close()
2 Answers


You have to pass a `page=` parameter in the URL and iterate over all the pages:

from bs4 import BeautifulSoup
import requests
import csv

csv_file = open('wsc_scrape.csv', 'w', encoding='utf-8', newline='')  # newline='' prevents blank rows on Windows
csv_writer = csv.writer(csv_file)
csv_writer.writerow(['sp_name', 'species_author', 'status', 'family'])

for i in range(151):
    # range(151) yields 0..150, so i+1 requests pages 1..151
    url = requests.get('https://wsc.nmbe.ch/search?page={}&sFamily=Salticidae&fMt=begin&sGenus=&gMt=begin&sSpecies=&sMt=begin&multiPurpose=slsid&sMulti=&mMt=contain&searchSpec=s'.format(i+1)).text
    soup = BeautifulSoup(url, 'lxml')
    elements = soup.find_all('div', style="border-bottom: 1px solid #C0C0C0; padding: 10px 0;")
    for element in elements:
        sp_name = element.i.text.strip()
        print(sp_name)
        status = element.find('span', class_=['success label', 'error label']).text.strip()
        print(status)
        author_family = element.i.next_sibling.strip().split('|')
        species_author = author_family[0].strip()
        family = author_family[1].strip()
        print(species_author)
        print(family)
        print()
        csv_writer.writerow([sp_name, species_author, status, family])

csv_file.close()
  • It may be obvious, but I fail to understand: how does it know that the range argument pertains to pages? Sorry, I am absolutely new to programming. – Kiran Feb 25 '19 at 19:37
  • Thanks! I figured that out, but my question is: how does it know that it should loop through the pages? – Kiran Feb 25 '19 at 21:01
  • `for i in range(151)` generates the numbers 0 to 150, and each generated number (plus one) is substituted into the `page=` parameter of the URL. – Alderven Feb 26 '19 at 04:41
  • Thanks for explaining! – Kiran Feb 26 '19 at 08:55
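
To make the mapping concrete, here is a minimal sketch (not part of the original answer) that only prints the first few generated URLs, so you can see how each loop index turns into a `page=` value:

# Minimal sketch: show how each loop index maps to a page URL.
base = ('https://wsc.nmbe.ch/search?page={}&sFamily=Salticidae&fMt=begin'
        '&sGenus=&gMt=begin&sSpecies=&sMt=begin&multiPurpose=slsid'
        '&sMulti=&mMt=contain&searchSpec=s')

for i in range(3):                        # i takes the values 0, 1, 2
    print(i, '->', base.format(i + 1))    # page= becomes 1, 2, 3

The server does not "know" anything special; the loop simply requests each of these URLs in turn, and the site returns the matching results page.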

I am not entirely sure how your descriptions map to what is on the page, but the following shows the principle of the loop and how to extract the info.

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

n = 4
headers = ['Success/Failure', 'Names', 'AuthorInfo', 'Family']
df = pd.DataFrame(columns=headers)

with requests.Session() as s:
    for page in range(1, n + 1):
        r = s.get('https://wsc.nmbe.ch/search?sFamily=Salticidae&fMt=begin&sGenus=&gMt=begin&sSpecies=&sMt=begin&multiPurpose=slsid&sMulti=&mMt=contain&searchSpec=s&page={}'.format(page))
        soup = bs(r.content, 'lxml')
        failSucceed = [item.text for item in soup.select('.success, .error')]
        names = [item.text for item in soup.select('.ym-gbox div > i')]
        authorInfo = [item.next_sibling for item in soup.select('.ym-gbox div > i')]
        family = [item.split('|')[1] for item in authorInfo]
        # columns=headers keeps the per-page frame aligned with df during concat
        dfCurrent = pd.DataFrame(list(zip(failSucceed, names, authorInfo, family)), columns=headers)
        df = pd.concat([df, dfCurrent])

df = df.reset_index(drop=True)
df.to_csv(r"C:\Users\User\Desktop\test.csv", encoding='utf-8')
print(df)

You can get the number of results pages with the following:

numPages = int(soup.select(r'[href*=search\?page]')[-2].text)  # raw string avoids an invalid-escape warning
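
Building on that, here is a minimal sketch (a usage suggestion, not part of the original answer) that fetches page 1 first, reads the page count with the selector above, and then loops over the remaining pages instead of hard-coding n:

import requests
from bs4 import BeautifulSoup as bs

base = ('https://wsc.nmbe.ch/search?sFamily=Salticidae&fMt=begin&sGenus='
        '&gMt=begin&sSpecies=&sMt=begin&multiPurpose=slsid&sMulti='
        '&mMt=contain&searchSpec=s&page={}')

with requests.Session() as s:
    # Fetch the first results page and read the total page count from
    # the second-to-last pagination link, as in the selector above.
    soup = bs(s.get(base.format(1)).content, 'lxml')
    numPages = int(soup.select(r'[href*=search\?page]')[-2].text)
    for page in range(2, numPages + 1):
        soup = bs(s.get(base.format(page)).content, 'lxml')
        # ... extract failSucceed, names, authorInfo, family as shown above ...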