0

I am trying to scrape Punjab's school information from this website https://schoolportal.punjab.gov.pk/sed_census/ by going through different districts available on the home page. (e.g. for district Rawalpindi, the html I am scraping is: https://schoolportal.punjab.gov.pk/sed_census/new_emis_details.aspx?distId=373--Rawalpindi)

The goal is to create a dataframe with (at least) columns school_name, school_gender, school_level, and location.

Running below -

from bs4 import BeautifulSoup

r = requests.get('https://schoolportal.punjab.gov.pk/sed_census/new_emis_details.aspx?distId=373--Rawalpindi')
soup = BeautifulSoup(r.text, 'html.parser')
soup.find_all('font', {'color':['#333333', '#284775']})[36:]

Each cell of the table on the website gets returned, instead of a row:

[<font color="#333333"><a href="list_of_emis_detail.aspx?emiscode=37350153">37350153</a></font>,
 <font color="#333333">GGPS BADNIAN</font>,
 <font color="#333333">Female</font>,
 <font color="#333333">Primary</font>,
 <font color="#333333">Badnian</font>,
 <font color="#333333"><a href="http://maps.google.com/?ie=UTF8&amp;q=GGPS BADNIAN@33.47595,73.328" target="_blank"><img height="70" src="images/mapsingle.jpg"/></a></font>,
 <font color="#333333"><a href="sch_surrounding.aspx?mauza=Badnian&amp;distid=373"><img height="70" src="images/mapsmulti.jpg"/></a></font>,
 <font color="#284775"><a href="list_of_emis_detail.aspx?emiscode=37320269">37320269</a></font>,
 <font color="#284775">GGPS JANDALA</font>,
 <font color="#284775">Female</font>,
 <font color="#284775">Primary</font>,
 <font color="#284775">Potha Sharif</font>,
 <font color="#284775"><a href="http://maps.google.com/?ie=UTF8&amp;q=GGPS JANDALA@33.95502,73.50301" target="_blank"><img height="70" src="images/mapsingle.jpg"/></a></font>,
 <font color="#284775"><a href="sch_surrounding.aspx?mauza=Potha Sharif&amp;distid=373"><img height="70" src="images/mapsmulti.jpg"/></a></font>,
 <font color="#333333"><a href="list_of_emis_detail.aspx?emiscode=37310001">37310001</a></font>,
 <font color="#333333">GHSS NARA</font>,
 <font color="#333333">Male</font>,
 <font color="#333333">H.Sec.</font>,
 <font color="#333333">Nara</font>,
 <font color="#333333"><a href="http://maps.google.com/?ie=UTF8&amp;q=GHSS NARA@33.5401766980066,73.5258855577558" target="_blank"><img height="70" src="images/mapsingle.jpg"/></a></font>,
 <font color="#333333"><a href="sch_surrounding.aspx?mauza=Nara&amp;distid=373"><img height="70" src="images/mapsmulti.jpg"/></a></font>,
 <font color="#284775"><a href="list_of_emis_detail.aspx?emiscode=37310003">37310003</a></font>,
 <font color="#284775">GHS HANESAR</font>,
 <font color="#284775">Male</font>,
.....
etc... 

So the first seven elements with <font color="#333333" ... represent one row of the table on the website, and the next seven elements with <font color="#284775" ... represent the next row of the table on the website, etc.

I am stuck on how to create a dataframe from this in a clean, elegant way.

I've thought about grouping them into 7 elements (as per How to group elements in python by n elements?) but I wonder if there is a more accurate and efficient way to go about.

ejcho
  • 19
  • 4

2 Answers2

0

First thing you need to use id, class, css or xpath selectors (kindly google them) to get elements of the page. The reason for that is to avoid brittle locators. For example in your case anything having that font color will be selected. But lets say if you use this css selector

#main1_grd_emis_details tr

Now only table rows of records are selected. I urge you to google web page element selectors and learning about them before continuing on. Now if you wanna get nth element in this table the above selector can be modified like this in java script. replace n with index starting from 1.

#main1_grd_emis_details tr:nth-child(n)

In beautiful soup nth child selector i think is nth-of-type(n) so above selector would become

#main1_grd_emis_details tr:nth-of-type(n)

And the python code to lets say get 2nd child would be

someRow = soup.select_one("#main1_grd_emis_details tr:nth-of-type(2)")

Now to get each col in a row what you can do is again apply css selector maybe this (i have not tested could be wrong)

"td:nth-of-type(n)"

Scrape what ever is required out of each row like text or href (you can google that too) and put it in a dictionary then add that dictionary to a data frame.

0
import requests
import pandas as pd
from bs4 import BeautifulSoup


url = 'https://schoolportal.punjab.gov.pk/sed_census/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
area_urls = ['https://schoolportal.punjab.gov.pk/sed_census/' + href['href'] for href in soup.select('map [href]')]

all_data = []
for u in area_urls:
    print('Getting data from page {} ...'.format(u))
    soup = BeautifulSoup(requests.get(u).content, 'html.parser')
    district = soup.b.text

    for row in soup.select('#main1_grd_emis_details tr:has(td)'):
        tds = [td.get_text(strip=True) for td in row.select('td')]
        all_data.append([district] + tds[:5])

df = pd.DataFrame(all_data, columns='district emiscode school_name school_gender school_level moza'.split())
df.to_csv('data.csv')
print(df)

Prints:

Getting data from page https://schoolportal.punjab.gov.pk/sed_census/new_emis_details.aspx?distId=352--Lahore ...
         district  emiscode                                        school_name school_gender school_level           moza
0     352--Lahore  35210532                                   GGPS CHINKOWINDI        Female      Primary    CHINKOWINDI
1     352--Lahore  35210001                    GHSS COMPRESHENSIVE GHORAY SHAH          Male       H.Sec.    Gujjar Pura
2     352--Lahore  35210002           GGHSS SHEIKH SARDAR MUHAMMAD GARHI SHAHU        Female       H.Sec.         lahore
3     352--Lahore  35210003                                    GGHSS SAMANABAD        Female       H.Sec.               
4     352--Lahore  35210004                                        GGHSS BARKI        Female       H.Sec.          Barki
...           ...       ...                                                ...           ...          ...            ...
1213  352--Lahore  35230680  GGPS OUT SIDE BABLIANA (Shifted from Kasur To ...        Female      Primary  NOOR MUHAMMAD
1214  352--Lahore  35211007                             GGPS CHUNGI AMER SIDHU        Female      Primary              0
1215  352--Lahore  35250306                               GPS GOPAL SINGH WALA          Male      Primary    GOPAL SINGH
1216  352--Lahore  35240728                                 GGPS PATTI KASHMIR        Female      Primary  PATTI KASHMIR
1217  352--Lahore  35230678  GGPS WARA JHANDA SINGH (SHIFTED FROM KASUR TO ...        Female      Primary    WARA JHANDA

[1218 rows x 6 columns]

...etc.

and saves data.csv (screenshot from LibreOffice):

enter image description here


EDIT: To get longitude and latitude, you can do:

import requests
import pandas as pd
from bs4 import BeautifulSoup


url = 'https://schoolportal.punjab.gov.pk/sed_census/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
area_urls = ['https://schoolportal.punjab.gov.pk/sed_census/' + href['href'] for href in soup.select('map [href]')]

all_data = []
for u in area_urls:
    u = 'https://schoolportal.punjab.gov.pk/sed_census/new_emis_details.aspx?distId=383--Mianwali'

    print('Getting data from page {} ...'.format(u))
    soup = BeautifulSoup(requests.get(u).content, 'html.parser')
    district = soup.b.text

    for row in soup.select('#main1_grd_emis_details tr:has(td)'):
        tds = [td.get_text(strip=True) for td in row.select('td')]
        a = row.select_one('a[href*="maps.google.com"]')
        lon, lat = a['href'].split('@')[-1].split(',')
        all_data.append([district] + tds[:5] + [lon, lat])

    break

df = pd.DataFrame(all_data, columns='district emiscode school_name school_gender school_level moza lon lat'.split())
df.to_csv('data.csv')
print(df)

Prints:

           district  emiscode                            school_name school_gender school_level                       moza               lon               lat
0     383--Mianwali  38310001                         GHSS TABBI SAR          Male       H.Sec.  Poss Bangi Khela Darmiani  33.1439236085861  71.5508843678981
1     383--Mianwali  38310002                     GHSS KAMAR MUSHANI          Male       H.Sec.                    Sodhari  32.8450116561725  71.3622024469077
2     383--Mianwali  38310003                           GHS ISA KHEL          Male         High                   Isa Khel  32.6850186428055   71.272792853415
3     383--Mianwali  38310004                       GHS KHAGLAN WALA          Male         High                khaglanwala  32.6359399594366  71.2692983541637
4     383--Mianwali  38310005                      GHS KALLOR SHARIF          Male         High                     Kallur  32.7383419219404  71.2667574640363
...             ...       ...                                    ...           ...          ...                        ...               ...               ...
1294  383--Mianwali  38331264                 GPS DERA BALOCHAN WALA          Male      Primary                  Maly wali  32.2964028501883  71.2868203874677
1295  383--Mianwali  38331267  GES DERA MUHAMMAD NAWAZ SULTANAY WALA          Male       Middle                    Harnoli        32.3521257        71.5292018

...
Andrej Kesely
  • 168,389
  • 15
  • 48
  • 91
  • This is amazing, thank you ever so much! I am also trying to get the location (latitude and longitude) out, but it seems we're only getting an empty string for the that has location info in, from the line "tds = [td.get_text(strip=True) for td in row.select('td')]". So for example, from " – ejcho Aug 31 '20 at 10:37