I am trying to scrape Punjab's school information from https://schoolportal.punjab.gov.pk/sed_census/ by going through the districts listed on the home page. For example, for district Rawalpindi, the page I am scraping is https://schoolportal.punjab.gov.pk/sed_census/new_emis_details.aspx?distId=373--Rawalpindi
The goal is to create a dataframe with (at least) columns school_name, school_gender, school_level, and location.
I am running the following:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://schoolportal.punjab.gov.pk/sed_census/new_emis_details.aspx?distId=373--Rawalpindi')
soup = BeautifulSoup(r.text, 'html.parser')
# keep only the <font> tags in the two alternating row colors, skipping the first 36 matches
soup.find_all('font', {'color':['#333333', '#284775']})[36:]
Each cell of the table on the website gets returned, instead of a row:
[<font color="#333333"><a href="list_of_emis_detail.aspx?emiscode=37350153">37350153</a></font>,
<font color="#333333">GGPS BADNIAN</font>,
<font color="#333333">Female</font>,
<font color="#333333">Primary</font>,
<font color="#333333">Badnian</font>,
<font color="#333333"><a href="http://maps.google.com/?ie=UTF8&q=GGPS BADNIAN@33.47595,73.328" target="_blank"><img height="70" src="images/mapsingle.jpg"/></a></font>,
<font color="#333333"><a href="sch_surrounding.aspx?mauza=Badnian&distid=373"><img height="70" src="images/mapsmulti.jpg"/></a></font>,
<font color="#284775"><a href="list_of_emis_detail.aspx?emiscode=37320269">37320269</a></font>,
<font color="#284775">GGPS JANDALA</font>,
<font color="#284775">Female</font>,
<font color="#284775">Primary</font>,
<font color="#284775">Potha Sharif</font>,
<font color="#284775"><a href="http://maps.google.com/?ie=UTF8&q=GGPS JANDALA@33.95502,73.50301" target="_blank"><img height="70" src="images/mapsingle.jpg"/></a></font>,
<font color="#284775"><a href="sch_surrounding.aspx?mauza=Potha Sharif&distid=373"><img height="70" src="images/mapsmulti.jpg"/></a></font>,
<font color="#333333"><a href="list_of_emis_detail.aspx?emiscode=37310001">37310001</a></font>,
<font color="#333333">GHSS NARA</font>,
<font color="#333333">Male</font>,
<font color="#333333">H.Sec.</font>,
<font color="#333333">Nara</font>,
<font color="#333333"><a href="http://maps.google.com/?ie=UTF8&q=GHSS NARA@33.5401766980066,73.5258855577558" target="_blank"><img height="70" src="images/mapsingle.jpg"/></a></font>,
<font color="#333333"><a href="sch_surrounding.aspx?mauza=Nara&distid=373"><img height="70" src="images/mapsmulti.jpg"/></a></font>,
<font color="#284775"><a href="list_of_emis_detail.aspx?emiscode=37310003">37310003</a></font>,
<font color="#284775">GHS HANESAR</font>,
<font color="#284775">Male</font>,
.....
etc...
So the first seven elements (with <font color="#333333" ...) make up one row of the table on the website, the next seven elements (with <font color="#284775" ...) make up the next row, and so on, with the two colors alternating from row to row.
I am stuck on how to create a dataframe from this in a clean, elegant way.
I've thought about grouping them into chunks of 7 elements (as per "How to group elements in python by n elements?"), but I wonder if there is a more accurate and efficient way to go about it.
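For reference, here is a minimal sketch of the group-by-7 idea I have in mind. It works on the text of a few cells copied from the output above instead of the live page, so it needs no network access. The column names and the group_rows helper are my own labels (assumptions, not anything the site provides), and I am assuming every row has exactly seven cells.

```python
# Text of each <font> cell, flattened in page order (values copied from the output above).
# The last two cells of each row hold only a map image link, so their text is empty.
cells = [
    "37350153", "GGPS BADNIAN", "Female", "Primary", "Badnian", "", "",
    "37320269", "GGPS JANDALA", "Female", "Primary", "Potha Sharif", "", "",
    "37310001", "GHSS NARA", "Male", "H.Sec.", "Nara", "", "",
]

# Hypothetical column names: my own labels, not taken from the website.
COLUMNS = ["emis_code", "school_name", "school_gender", "school_level",
           "location", "map_single", "map_multi"]
ROW_WIDTH = len(COLUMNS)  # assumed: 7 cells per table row


def group_rows(flat_cells, width=ROW_WIDTH):
    """Split a flat list of cells into consecutive rows of `width` cells."""
    if len(flat_cells) % width != 0:
        # Fail loudly instead of silently misaligning every later row.
        raise ValueError("cell count is not a multiple of the row width")
    return [flat_cells[i:i + width] for i in range(0, len(flat_cells), width)]


# One dict per table row, keyed by column name.
records = [dict(zip(COLUMNS, row)) for row in group_rows(cells)]

for record in records:
    print(record["school_name"], record["school_gender"],
          record["school_level"], record["location"])
```

The records list could then be passed to pandas.DataFrame(records) to get the dataframe. My worry is that this depends entirely on the row width being constant: a single row with a missing or extra cell would shift everything after it, which is why I am asking whether there is a more robust approach.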