
I am trying to scrape college names and addresses from this site: https://www.collegenp.com/2-science-colleges/ , but I am only getting the data for the first 11 colleges in the list and not the rest. I have tried everything I know, but none of it has worked.

My code is:

from selenium import webdriver
from bs4 import BeautifulSoup
import requests
import pandas as pd
from time import sleep

driver = webdriver.Chrome('C:/Users/acer/Downloads/chromedriver.exe')
driver.get('https://www.collegenp.com/2-science-colleges/')

driver.refresh()
sleep(20)

page = requests.get("https://www.collegenp.com/2-science-colleges/")

college = []
location = []

soup = BeautifulSoup(page.content, 'html.parser')

for a in soup.find_all('div', attrs={'class': 'media'}):
    name = a.find('h3', attrs={'class': 'college-name'})
    college.append(name.text)
    loc = a.find('span', attrs={'class': 'college-address'})
    location.append(loc.text)

df = pd.DataFrame({'College name': college, 'Locations': location})
df.to_csv('hell.csv', index=False, encoding='utf-8')

Is there any way I can scrape all of the data?

Cyber God
1 Answer


The page loads additional results with an AJAX POST request, so you can request the next pages directly. You can use this code to get the information from them:

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.collegenp.com/2-science-colleges/"

headers = {"X-Requested-With": "XMLHttpRequest"}  # tell the server this is an AJAX request
data = {"state": "on", "action": "filter", "count": "0"}

all_data = []
for page in range(0, 5):  # <-- increase number of pages here
    print("Getting page {}..".format(page))

    data["count"] = page * 10  # the endpoint pages by offset, 10 colleges at a time
    soup = BeautifulSoup(
        requests.post(url, data=data, headers=headers).content,
        "html.parser",
    )

    for c in soup.select(".college-name"):
        all_data.append(
            {
                "College name": c.get_text(strip=True),
                "Location": c.find_next(class_="college-address").get_text(
                    strip=True
                ),
            }
        )

df = pd.DataFrame(all_data)
print(df)
df.to_csv("data.csv", index=False)

Prints:

                                         College name                  Location
0                     Caspian Valley College,Lalitpur      Kumaripati, Lalitpur
1      Advance Academy and Republica College,Lalitpur      Kumaripati, Lalitpur
2              Araniko International Academy,Lalitpur       Satdobato, Lalitpur
3   Bagiswori Secondary School, Taulachhen, Bhakta...    Chyamhasing, Bhaktapur
4              Bajra Barahi Secondary School,Lalitpur       Chapagaon, Lalitpur
5              Bhanubhakta Memorial College,Kathmandu       Lazimpat, Kathmandu
6                  Damak Model Secondary School,Jhapa              Damak, Jhapa
7                         Damak Multiple Campus,Jhapa              Damak, Jhapa
8                           Einstein Academy,Lalitpur       Thasikhel, Lalitpur
9                   Hari Khetan Multiple Campus,Parsa            Birganj, Parsa
10                       Kankai Adarsha Campus,Morang         Birtamode, Morang
11          Lumbini Adarsh Degree College,Nawalparasi     Kawasoti, Nawalparasi
12            Madhyabindu Multiple Campus,Nawalparasi     Kawasoti, Nawalparasi
13                Marshyangdi Multiple Campus,Lamjung       Besishahar, Lamjung

...

and saves data.csv.
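A follow-up note on the hard-coded range(0, 5): instead of guessing the number of pages, you can keep requesting pages until one comes back with no .college-name entries. A minimal sketch, with the HTTP call factored out into a fetch_page(count) callable (a hypothetical helper; in practice it would wrap the same requests.post call as above) so the stop-condition logic can be shown offline:

```python
from bs4 import BeautifulSoup


def parse_colleges(html):
    """Extract (name, address) pairs from one page of results."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        (
            c.get_text(strip=True),
            c.find_next(class_="college-address").get_text(strip=True),
        )
        for c in soup.select(".college-name")
    ]


def scrape_all(fetch_page):
    """Keep requesting pages until one comes back empty."""
    all_rows, count = [], 0
    while True:
        rows = parse_colleges(fetch_page(count))
        if not rows:
            break  # empty page -> no more results
        all_rows.extend(rows)
        count += 10  # assumes the endpoint pages by an offset of 10
    return all_rows


# Offline demo: two fake "pages" of markup, then an empty response.
pages = {
    0: '<div class="media"><h3 class="college-name">A College</h3>'
       '<span class="college-address">Town A</span></div>',
    10: '<div class="media"><h3 class="college-name">B College</h3>'
        '<span class="college-address">Town B</span></div>',
}
rows = scrape_all(lambda count: pages.get(count, ""))
print(rows)  # [('A College', 'Town A'), ('B College', 'Town B')]
```

This keeps the parsing logic identical to the answer above; only the loop's exit condition changes, so the scraper stops on its own when the site runs out of colleges.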

Andrej Kesely