
Can anyone help me import this page (https://career.webindia123.com/career/exams/examdate.asp) into an Excel sheet using web scraping with Python?

I am not sure what to do. What should I copy, and from where, after inspecting the web page?

Link: https://career.webindia123.com/career/exams/examdate.asp

2 Answers


This script scrapes the exam table from the website and saves it as an Excel file.

The one-line comments in the code explain each step.

Save it as convert-to-excel.py.

import requests
from bs4 import BeautifulSoup
import pandas as pd

# column names for the output sheet
columns = ['Exam Date', 'Last Date of Application', 'Examination', 'Link']


# access the web site
webpage = "https://career.webindia123.com/career/exams/examdate.asp"
response = requests.get(webpage)

if response.status_code == 200:
    soup = BeautifulSoup(response.content, "html.parser")
    rows = []
    # get the title <td> tags
    results = soup.find_all("td", {"valign": "top"})
    for result in results:
        # get the link <a> tag (direct children only)
        item = result.find_all("a", recursive=False)
        if len(item) > 0:
            # get the last date <td> tag
            lastDate = result.find_previous_sibling("td")
            # get the exam date <td> tag
            examDate = lastDate.find_previous_sibling("td")

            # collect the row data
            rows.append({
                'Exam Date': examDate.text.strip(),
                'Last Date of Application': lastDate.text.strip(),
                'Examination': item[0].attrs['title'],
                'Link': 'https://career.webindia123.com' + item[0].attrs['href']
            })

    # build the DataFrame once instead of concatenating inside the loop
    df = pd.DataFrame(rows, columns=columns)

    # print head rows
    print(df.head())

    # print tail rows
    print(df.tail())
    print('Total rows: ' + str(len(df.index)))

    # write the Excel file (pandas needs an Excel writer such as openpyxl for .xlsx)
    df.to_excel('results.xlsx', index=False)

Run it

python convert-to-excel.py

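To see how the sibling navigation in the script works without hitting the network, here is a minimal sketch that runs the same lookups on an inline HTML snippet. The markup, the dates, and the `extract_rows` helper are hypothetical stand-ins, not taken from the real page, which may be structured differently. The sketch also uses `urllib.parse.urljoin` instead of string concatenation to build the link, which is a swapped-in choice rather than what the script above does.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://career.webindia123.com"

# Hypothetical markup that mimics the row layout the script assumes:
# exam date <td>, last date <td>, then a <td valign="top"> holding the link.
SAMPLE_HTML = """
<table>
  <tr>
    <td>01 Jan 2030</td>
    <td>01 Dec 2029</td>
    <td valign="top"><a href="/career/exams/sample.asp" title="Sample Exam">Sample Exam</a></td>
  </tr>
  <tr>
    <td valign="top">A cell without a link</td>
  </tr>
</table>
"""


def extract_rows(html, base_url=BASE_URL):
    """Return one dict per <td valign="top"> cell that directly contains a link."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for cell in soup.find_all("td", {"valign": "top"}):
        links = cell.find_all("a", recursive=False)
        if not links:
            continue  # skip cells without a direct <a> child
        # Walk backwards through the sibling <td> cells of the same row.
        last_date = cell.find_previous_sibling("td")
        exam_date = last_date.find_previous_sibling("td") if last_date else None
        if exam_date is None:
            continue  # row does not have the expected three cells
        link = links[0]
        rows.append({
            "Exam Date": exam_date.get_text(strip=True),
            "Last Date of Application": last_date.get_text(strip=True),
            # Fall back to the link text if there is no title attribute.
            "Examination": link.get("title", link.get_text(strip=True)),
            "Link": urljoin(base_url, link.get("href", "")),
        })
    return rows


rows = extract_rows(SAMPLE_HTML)
print(rows)
```

Guarding against a missing sibling or a missing `title` attribute keeps one malformed row from stopping the whole scrape.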


Bench Vue

This can be done using pandas only.

The data tables have the following attributes: {'border':'1', 'width':'100%', 'align':'center', 'cellspacing':'0', 'cellpadding':'2', 'bgcolor':'#E8E8E8', 'bordercolor':'#E8E8E8'}

Using these, we can select just the data tables from the webpage, clean the data, and save it to a CSV file, which Excel can open.

import pandas as pd

url = 'https://career.webindia123.com/career/exams/examdate.asp'
categories = ['MANAGEMENT','ENGINEERING','MEDICAL','LAW','UPSC','SSC','SCIENCE','DESIGN','UGC','BANK TEST','OTHER ENTRANCE EXAMS']

# attributes shared by the data tables on the page
table_attrs = {'border':'1', 'width':'100%', 'align':'center', 'cellspacing':'0', 'cellpadding':'2', 'bgcolor':'#E8E8E8', 'bordercolor':'#E8E8E8'}

# fetch all tables from the webpage with the given attributes
tables = pd.read_html(url, attrs=table_attrs)
columns = ['Exam Date', 'Last Date of Application', 'Examination']

for i, table in enumerate(tables):
    table.columns = columns  # assign columns to each table
    # drop rows that contain *Completed Applications*, as those are blank rows
    table = table[table['Exam Date'] != 'Completed Applications'].copy()
    table['Category'] = categories[i]  # add the category to each row
    tables[i] = table

result = pd.concat(tables, ignore_index=True)
result.to_csv('Data.csv', index=False)

This code filters out the unneeded rows, adds the exam category to each row, and then saves the result to a CSV file.

Following is a sample of data from the CSV file: [image: Sample Data]
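
The filter-and-label step described above can be tried offline. This is a minimal sketch that swaps the `pd.read_html` call for two small hand-made DataFrames; the exam names and dates are made up for illustration, and only the column names, the category names, and the 'Completed Applications' marker come from the answer.

```python
import pandas as pd

columns = ["Exam Date", "Last Date of Application", "Examination"]
categories = ["MANAGEMENT", "ENGINEERING"]

# Made-up stand-ins for two of the tables that pd.read_html would return.
raw_tables = [
    pd.DataFrame([
        ["01 Jan 2030", "01 Dec 2029", "Exam A"],
        ["Completed Applications", None, None],  # separator row to drop
    ]),
    pd.DataFrame([
        ["15 Feb 2030", "15 Jan 2030", "Exam B"],
    ]),
]

cleaned = []
for category, table in zip(categories, raw_tables):
    table = table.copy()
    table.columns = columns  # name the columns
    # Keep only real data rows.
    table = table[table["Exam Date"] != "Completed Applications"]
    # assign() returns a new frame with the Category column appended.
    cleaned.append(table.assign(Category=category))

combined = pd.concat(cleaned, ignore_index=True)
print(combined)
```

Pairing categories and tables with `zip` assumes the page lists the tables in the same order as the category list; if the page adds or reorders a section, the labels would shift, so it is worth checking that `len(tables)` matches `len(categories)` on the real data.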

Zero