    import requests
    from bs4 import BeautifulSoup as bs
    import csv

    r = requests.get('https://portal.karandaaz.com.pk/dataset/total-population/1000')
    soup = bs(r.text)
    table = soup.find_all(class_='ag-header-cell-text')

This returns nothing. Any idea how to scrape the data from this site? Any help would be appreciated.

Sajjad Ali

2 Answers


The element you're searching for isn't in the HTML that requests receives, which is why you get no data back. Is there some reason you expect it to be there? You may be seeing different source code in a browser than you get when fetching the page with the requests library.

You can view the code being pulled via:

    import requests
    from bs4 import BeautifulSoup as bs
    import csv

    r = requests.get('https://portal.karandaaz.com.pk/dataset/total-population/1000')
    soup = bs(r.text, "lxml")
    print(soup)
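
Rather than reading the whole dump, you could count how many elements in the raw HTML carry the class you are after. This is a minimal sketch using only the standard library; `ClassCounter`, `count_class`, and the two sample strings are hypothetical, introduced here for illustration and not taken from the site.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Counts start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_class(html, class_name):
    """Return how many elements in `html` list `class_name` among their classes."""
    parser = ClassCounter(class_name)
    parser.feed(html)
    parser.close()
    return parser.count


# Hypothetical sample markup standing in for two kinds of server response.
static_html = '<div class="ag-header-cell-text">Year</div>'
js_only_html = '<div id="root"></div><script src="app.js"></script>'

print(count_class(static_html, "ag-header-cell-text"))   # 1
print(count_class(js_only_html, "ag-header-cell-text"))  # 0
```

With the question's code, a result of 0 from `count_class(r.text, 'ag-header-cell-text')` would suggest the element is only created later, in the browser.
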
  • Yeah, I was looking at the code in the browser's inspector, and it shows what I wrote in my code. But you are right, the HTML I get in Jupyter is different. I don't know why; I am very new to scraping and still learning. Thanks – Sajjad Ali Mar 31 '21 at 16:46
  • Likely there is some JavaScript executing in your browser to generate this HTML. The answer below has more information on how to work around that. – Justin Bodnar Mar 31 '21 at 16:50

BeautifulSoup can only see what's directly baked into the HTML of a resource at the time it is initially requested. The content you're trying to scrape isn't baked into the page, because normally, when you view this particular page in a browser, the DOM is populated asynchronously using JavaScript. Fortunately, logging your browser's network traffic reveals requests to a REST API, which serves the contents of the table as JSON. The following script makes an HTTP GET request to that API, given a desired "dataset_id" (you can change the key-value pair in the params dict as desired). The response is then dumped into a CSV file:

    import csv
    import sys

    import requests


    def main():
        # JSON endpoint the page calls to fill its table (seen in the browser's network log)
        url = "https://portal.karandaaz.com.pk/api/table"

        # Change "dataset_id" to fetch a different dataset
        params = {
            "dataset_id": "1000"
        }

        response = requests.get(url, params=params)
        response.raise_for_status()  # stop here on an HTTP error status

        content = response.json()

        filename = "dataset_{}.csv".format(params["dataset_id"])

        with open(filename, "w", newline="") as file:
            fieldnames = content["data"]["columns"]

            writer = csv.DictWriter(file, fieldnames=fieldnames)
            writer.writeheader()

            # Pair each row's values with the column names before writing
            for row in content["data"]["rows"]:
                writer.writerow(dict(zip(fieldnames, row)))

        return 0


    if __name__ == "__main__":
        sys.exit(main())

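
Judging from the keys the script reads, it assumes the API responds with a `data` object holding a `columns` list and a `rows` list of lists. As a rough offline sketch of that conversion step, the payload below is made up for illustration (its column names and values are not real data from the portal), and `payload_to_csv` is a hypothetical helper name:

```python
import csv
import io


def payload_to_csv(content, file):
    """Write a {"data": {"columns": [...], "rows": [[...], ...]}} payload as CSV."""
    fieldnames = content["data"]["columns"]
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    writer.writeheader()
    for row in content["data"]["rows"]:
        # zip pairs each value with its column name, in order
        writer.writerow(dict(zip(fieldnames, row)))


# Made-up payload with the assumed shape; not real data from the API.
sample = {"data": {"columns": ["year", "value"], "rows": [[2019, 1], [2020, 2]]}}

buffer = io.StringIO(newline="")
payload_to_csv(sample, buffer)
print(buffer.getvalue())
```

If the real response is shaped differently, for example rows delivered as dicts instead of lists, the `zip` step would need to change accordingly.
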
Paul M.
  • Thanks Paul. Your code seems very complicated to me, and it gives me an error: "No module named 'env". Any idea? – Sajjad Ali Mar 31 '21 at 16:50
  • @SajjadAli Sorry about that, I made a mistake copy-pasting my code. The first two lines should not have been there. Refresh the page and try my updated code. – Paul M. Mar 31 '21 at 17:01
  • Thanks Paul. You have saved me another day of trying to find the solution. Stack Overflow just rocks: I just signed up, posted the question, and got the answer I was looking for. BTW, how did you find out about the other URL? It goes over my head. – Sajjad Ali Mar 31 '21 at 17:06
  • @SajjadAli You're welcome. About the URL, you may want to read [this answer](https://stackoverflow.com/questions/65585597/how-to-click-a-link-by-text-with-no-text-in-python/65585861#65585861) I posted for a different question, where someone was trying to scrape information about wines and vineyards. In it, I explain the steps you need to take to log your browser's network traffic, and how to formulate requests to an API. – Paul M. Mar 31 '21 at 18:24