
I am trying to scrape a file from this site:

https://data.gov.in/catalog/complete-towns-directory-indiastatedistrictsub-district-level-census-2011

I am looking to download the Excel sheet with the complete directory of towns of TRIPURA (the first one in the grid list).

My code is:

import requests
from bs4 import BeautifulSoup

URL = 'https://data.gov.in/catalog/complete-towns-directory-indiastatedistrictsub-district-level-census-2011'

with requests.Session() as session:
    session.headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36'}
    response = session.get(URL)

soup = BeautifulSoup(response.content, 'html.parser')
print(soup)

The corresponding element for the file is given below. How do I actually download that particular Excel file? Clicking it directs to another window where a purpose and an email address have to be given. It would be great if you could provide a solution.

<div class="view-content">
<div class="views-row views-row-1 views-row-odd views-row-first ogpl-grid-list">
<div class="views-field views-field-title"> <span class="field-content"><a href="/resources/complete-town-directory-indiastatedistrictsub-district-level-census-2011-tripura"><span class="title-content">Complete Town Directory by India/State/District/Sub-District Level, Census 2011 - TRIPURA</span></a></span> </div>
<div class="views-field views-field-field-short-name confirmation-popup-177303 download-confirmation-box file-container excel"> <div class="field-content"><a class="177303 data-extension excel" href="https://data.gov.in/resources/complete-town-directory-indiastatedistrictsub-district-level-census-2011-tripura" target="_blank" title="excel (Open in new window)">excel</a></div> </div>
<div class="views-field views-field-dms-allowed-operations-3 visual-access"> <span class="field-content">Visual Access: NA</span> </div>
<div class="views-field views-field-field-granularity"> <span class="views-label views-label-field-granularity">Granularity: </span> <div class="field-content">Decadal</div> </div>
<div class="views-field views-field-nothing-1 download-file"> <span class="field-content"><span class="download-filesize">File Size: 44.5 KB</span></span> </div>
<div class="views-field views-field-field-file-download-count"> <span class="field-content download-counts"> Download: 529</span> </div>
<div class="views-field views-field-field-reference-url"> <span class="views-label views-label-field-reference-url">Reference URL: </span> <div class="field-content"><a href="http://www.censusindia.gov.in/2011census/Listofvillagesandtowns.aspx">http://www.censusindia.gov.in/2011census...</a></div> </div>
<div class="views-field views-field-dms-allowed-operations-1 vote_request_data_api"> <span class="field-content"><a class="api-link" href="https://data.gov.in/resources/complete-town-directory-indiastatedistrictsub-district-level-census-2011-tripura/api" title="View API">Data API</a></span> </div>
<div class="views-field views-field-field-note"> <span class="views-label views-label-field-note">Note: </span> <div class="field-content ogpl-more">NA</div> </div>
<div class="views-field views-field-dms-allowed-operations confirmationpopup-177303 data-export-cont"> <span class="views-label views-label-dms-allowed-operations">EXPORT IN: </span> <span class="field-content"><ul></ul></span> </div> </div>

1 Answer

When you click on the excel link, it opens the following page:

https://data.gov.in/node/ID/download

It seems that the ID is the first class of the link, e.g. t.find('a')['class'][0]. There may be a more concise way to get the ID, but it works as is using the class name.
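As a quick check, the ID can be pulled straight out of the HTML element shown in the question (a minimal sketch, parsing just that snippet):

```python
from bs4 import BeautifulSoup

# the excel-link element from the question, trimmed to the relevant part
html = '''
<div class="views-field views-field-field-short-name confirmation-popup-177303 download-confirmation-box file-container excel">
  <div class="field-content">
    <a class="177303 data-extension excel" href="https://data.gov.in/resources/complete-town-directory-indiastatedistrictsub-district-level-census-2011-tripura">excel</a>
  </div>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
div = soup.find("div", {"class": "excel"})
# the first class of the <a> tag is the node ID
node_id = div.find('a')['class'][0]
print(node_id)  # 177303
```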

Then the page https://data.gov.in/node/ID/download redirects to the final URL of the file.
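The redirect is done with an HTML meta refresh tag, which requests does not follow by itself, so the target has to be parsed out of the tag's content attribute. A minimal sketch (the file URL here is made up for illustration):

```python
from bs4 import BeautifulSoup

# a meta refresh tag like the one served by the /node/ID/download page
# (the target URL is a made-up example)
html = '<meta http-equiv="refresh" content="0; url=https://example.com/files/tripura-towns.xls"/>'

soup = BeautifulSoup(html, 'html.parser')
meta = soup.find("meta", attrs={"http-equiv": "refresh"})
# the content attribute has the form "<delay>; url=<target>"
target = meta["content"].split("url=", 1)[1]
print(target)  # https://example.com/files/tripura-towns.xls
```

Matching on http-equiv="refresh" is a bit more robust than indexing into find_all("meta"), in case the page's tag order changes.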

The following gathers all the URLs in a list:

import requests
from bs4 import BeautifulSoup

URL = 'https://data.gov.in/catalog/complete-towns-directory-indiastatedistrictsub-district-level-census-2011'

src = requests.get(URL)
soup = BeautifulSoup(src.content, 'html.parser')

# the node ID is the first class of each excel download link
node_list = [
    t.find('a')['class'][0]
    for t in soup.find_all("div", {"class": "excel"})
]

url_list = []

for node_id in node_list:
    node = requests.get("https://data.gov.in/node/{0}/download".format(node_id))
    soup = BeautifulSoup(node.content, 'html.parser')
    # the meta refresh tag holds the redirect target: content="0; url=<file URL>"
    content = soup.find_all("meta")[1]["content"].split("=")[1]
    url_list.append(content)

print(url_list)

Complete code that downloads the files using their default filenames (using this post):

import os
import shutil
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlsplit

def download(url, file_name=None):
    def get_file_name(response):
        cd = response.headers.get('Content-Disposition')
        if cd:
            # if the response has Content-Disposition, try to get the filename from it
            params = dict(
                part.strip().split('=', 1) if '=' in part else (part.strip(), '')
                for part in cd.split(';'))
            filename = params.get('filename', '').strip("\"'")
            if filename:
                return filename
        # if no filename was found above, parse it out of the final URL
        return os.path.basename(urlsplit(response.url).path)

    with requests.get(url, stream=True) as r:
        file_name = file_name or get_file_name(r)
        with open(file_name, 'wb') as f:
            shutil.copyfileobj(r.raw, f)

URL = 'https://data.gov.in/catalog/complete-towns-directory-indiastatedistrictsub-district-level-census-2011'

src = requests.get(URL)
soup = BeautifulSoup(src.content, 'html.parser')

# the node ID is the first class of each excel download link
node_list = [
    t.find('a')['class'][0]
    for t in soup.find_all("div", {"class": "excel"})
]

url_list = []

for node_id in node_list:
    node = requests.get("https://data.gov.in/node/{0}/download".format(node_id))
    soup = BeautifulSoup(node.content, 'html.parser')
    # the meta refresh tag holds the redirect target: content="0; url=<file URL>"
    content = soup.find_all("meta")[1]["content"].split("=")[1]
    url_list.append(content)
    print("download : " + content)
    download(content)
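As an aside, the ad-hoc Content-Disposition parsing in download() can also be delegated to the standard library's email header machinery, which handles the quoting for you. A minimal sketch on a hardcoded header value (the filename is a made-up example):

```python
from email.message import Message

# wrap a typical Content-Disposition header value in a Message
# so get_filename() can parse the filename parameter
msg = Message()
msg['Content-Disposition'] = 'attachment; filename="tripura-towns.xls"'

filename = msg.get_filename()
print(filename)  # tripura-towns.xls
```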
Bertrand Martel