
I wrote code to get the following values: "Exam Code", "Exam Name" and "Total Question". The issue is that in the output CSV file I am getting the wrong value in the "Exam Code" column: it contains the same value as "Exam Name". The XPath looks fine to me, so I don't know where the issue is happening. Following is the code:

import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
import time

option = Options()
option.add_argument("--disable-infobars")
option.add_argument("start-maximized")
option.add_argument("--disable-extensions")
option.add_experimental_option("excludeSwitches", ['enable-automation'])

# Pass the argument 1 to allow and 2 to block
# option.add_experimental_option("prefs", {
#     "profile.default_content_setting_values.notifications": 1
# })
driver = webdriver.Chrome(chrome_options=option, executable_path='C:\\Users\\Awais\\Desktop\\web crawling\\chromedriver.exe')

url = ["https://www.marks4sure.com/210-060-exam.html",
"https://www.marks4sure.com/210-065-exam.html",
"https://www.marks4sure.com/200-355-exam.html",
"https://www.marks4sure.com/9A0-127-exam.html",
"https://www.marks4sure.com/300-470-exam.html",]

driver.implicitly_wait(0.5)

na = "N/A"

# text = 'Note: This exam is available on Demand only. You can Pre-Order this Exam and we will arrange this for you.'
links = []
exam_code = []
exam_name = []
total_q = []

for items in range(0, 5):
    driver.get(url[items])
    # if driver.find_element_by_xpath("//div[contains(@class, 'alert') and contains(@class, 'alert-danger')]") == text:
    #     continue
    items += 1

    try:
        c_url = driver.current_url
        links.append(c_url)
    except:
        pass

    try:
        codes = driver.find_element_by_xpath('''//div[contains(@class, 'col-sm-6') and contains(@class, 'exam-row-data') and position() = 2]''')
        exam_code.append(codes.text)
    except:
        exam_code.append(na)

    try:
        names = driver.find_element_by_xpath('//*[@id="content"]/div/div[1]/div[2]/div[3]/div[2]/a')
        exam_name.append(names.text)
    except:
        exam_name.append(na)

    try:
        question = driver.find_element_by_xpath('//*[@id="content"]/div/div[1]/div[2]/div[4]/div[2]/strong')
        total_q.append(question.text)
    except:
        total_q.append(na)
    continue


all_info = list(zip(links, exam_name, exam_name, total_q))
print(all_info)

df = pd.DataFrame(all_info, columns=["Links", "Exam Code", "Exam Name", "Total Question"])
df.to_csv("data5.csv", index=False)
driver.close()
  • My advice is to open the browser console and try to get the element via JavaScript, then port the XPath selector to Python – lsabi Mar 06 '20 at 08:19
  • @lsabi thanks for your quick reply. Can you guide me on how I can get the XPath like this? –  Mar 06 '20 at 08:20
  • Open the browser on the page you are scraping. Use the console to interact with it. https://developer.mozilla.org/en-US/docs/Web/XPath/Introduction_to_using_XPath_in_JavaScript – lsabi Mar 06 '20 at 08:22
  • Here's an example of how to use it: https://stackoverflow.com/questions/10596417/is-there-a-way-to-get-element-by-xpath-using-javascript-in-selenium-webdriver Basically, copy and paste your XPaths into JavaScript and see if the output matches your expectations (a Python equivalent is sketched after this thread) – lsabi Mar 06 '20 at 08:24
  • @lsabi unfortunately I don't have any JS knowledge. –  Mar 06 '20 at 08:25
  • Open the console: https://kb.mailster.co/how-can-i-open-the-browsers-console/ You should see the "console" tab. Click it and there you can type JavaScript code. Copy and paste the code from the Stack Overflow link above and adapt it to your XPath – lsabi Mar 06 '20 at 08:26
  • Hi @Awais, as per my understanding the URLs you have mentioned already contain the exam code; you can just capture it from the string – Prakhar Jhudele Mar 06 '20 at 09:22
  • Why do you need Selenium? The source code contains the info you need without using JavaScript. – Pedro Lobito Mar 06 '20 at 10:09
  • Also, most pages redirect to https://www.marks4sure.com/200-301-exam.html, so you'll get the same results. Only https://www.marks4sure.com/300-470-exam.html doesn't. – Pedro Lobito Mar 06 '20 at 10:30
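
Following lsabi's suggestion, one way to sanity-check the XPath without the browser console is to fetch the page and evaluate the expression offline (a sketch, assuming requests and lxml are installed; the URL is one of the question's links):

import requests
from lxml import html

page = requests.get("https://www.marks4sure.com/300-470-exam.html")
tree = html.fromstring(page.content)

# The class-based XPath from the question, without the position() predicate,
# so that every matching row prints and the selection can be inspected.
for node in tree.xpath("//div[contains(@class, 'col-sm-6') and contains(@class, 'exam-row-data')]"):
    print(node.text_content().strip())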

3 Answers


Hi, to get the exam code I think it is better to work with a regex and get it from the URL itself. The code below gives me the exam codes correctly, except for the 4th link, which has a different structure compared to the others.
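
For illustration, the regex approach might look like this (a sketch; the pattern assumes the exam code always sits right before '-exam.html', and note the posted code below actually grabs the code with a corrected XPath rather than a regex):

import re

def code_from_url(url):
    # assumed URL shape: https://www.marks4sure.com/<exam-code>-exam.html
    match = re.search(r'/([^/]+)-exam\.html', url)
    return match.group(1) if match else "N/A"

print(code_from_url("https://www.marks4sure.com/9A0-127-exam.html"))  # prints 9A0-127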

# -*- coding: utf-8 -*-
"""
Created on Fri Mar  6 14:48:00 2020

@author: prakh
"""

import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
import time

option = Options()
option.add_argument("--disable-infobars")
option.add_argument("start-maximized")
option.add_argument("--disable-extensions")
option.add_experimental_option("excludeSwitches", ['enable-automation'])

# Pass the argument 1 to allow and 2 to block
# option.add_experimental_option("prefs", {
#     "profile.default_content_setting_values.notifications": 1
# })
driver = webdriver.Chrome(executable_path='C:/Users/prakh/Documents/PythonScripts/chromedriver.exe') 

url = ["https://www.marks4sure.com/210-060-exam.html",
"https://www.marks4sure.com/210-065-exam.html",
"https://www.marks4sure.com/200-355-exam.html",
"https://www.marks4sure.com/9A0-127-exam.html",
"https://www.marks4sure.com/300-470-exam.html",]

driver.implicitly_wait(0.5)

na = "N/A"

# text = 'Note: This exam is available on Demand only. You can Pre-Order this Exam and we will arrange this for you.'
links = []
exam_code = []
exam_name = []
total_q = []

for items in range(0, 5):
    driver.get(url[items])
    # if driver.find_element_by_xpath("//div[contains(@class, 'alert') and contains(@class, 'alert-danger')]") == text:
    #     continue
    items += 1

    try:
        c_url = driver.current_url
        links.append(c_url)
    except:
        pass

    try:
        codes = driver.find_element_by_xpath('//*[@id="content"]/div/div[1]/div[2]/div[2]/div[2]')
        exam_code.append(codes.text)
    except:
        exam_code.append(na)

    try:
        names = driver.find_element_by_xpath('//*[@id="content"]/div/div[1]/div[2]/div[3]/div[2]/a')
        exam_name.append(names.text)
    except:
        exam_name.append(na)

    try:
        question = driver.find_element_by_xpath('//*[@id="content"]/div/div[1]/div[2]/div[4]/div[2]/strong')
        total_q.append(question.text)
    except:
        total_q.append(na)
    continue


all_info = list(zip(links, exam_code, exam_name, total_q))
print(all_info)

df = pd.DataFrame(all_info, columns=["Links", "Exam Code", "Exam Name", "Total Question"])
df.to_csv("data5.csv", index=False)
driver.close()
Prakhar Jhudele

You are getting the exam name in there twice, instead of the exam code, because that's what you are telling it to do. There is a minor typo here: exam_name appears twice in the zip:

all_info = list(zip(links, exam_name, exam_name, total_q))

change to: all_info = list(zip(links, exam_code, exam_name, total_q))

A few things I'm confused about:

1) Why use Selenium? There is no need for Selenium, as the data is returned in the initial request in the HTML source. So I would just use requests, as it would speed up the processing.

2) The link and the exam code are already in the URL you are iterating through. I would just split that string, or use a regex on it, to get the link and the code. You then only really need to get the exam name and the number of questions.

With that being said, I adjusted it slightly to just get exam name and number of questions:

import requests
from bs4 import BeautifulSoup
import pandas as pd


urls = ["https://www.marks4sure.com/210-060-exam.html",
"https://www.marks4sure.com/210-065-exam.html",
"https://www.marks4sure.com/200-355-exam.html",
"https://www.marks4sure.com/9A0-127-exam.html",
"https://www.marks4sure.com/300-470-exam.html",]

links = []
exam_code = []
exam_name = []
total_q = []

for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    links.append(url)
    # the exam code is the URL segment just before '-exam'
    exam_code.append(url.rsplit('-exam')[0].split('/')[-1])

    exam_row = soup.select('div[class*="exam-row-data"]')
    for exam in exam_row:

        # the label cell reads 'Exam Name: '; the value sits in the next sibling div
        if exam.text == 'Exam Name: ':
            exam_name.append(exam.find_next_sibling("div").text)
            continue

        # a value cell containing 'Questions' (but not the 'Total Questions'
        # label itself) holds the question count
        if 'Questions' in exam.text and 'Total Questions' not in exam.text:
            total_q.append(exam.text.strip())
            continue

all_info = list(zip(links, exam_code, exam_name, total_q))
print(all_info)

df = pd.DataFrame(all_info, columns=["Links", "Exam Code", "Exam Name", "Total Question"])
df.to_csv("data5.csv", index=False)
chitown88
  • Thanks, that fixed the problem. And thanks again for writing the version in bs4 and requests. Actually, I just started learning requests and bs4, which is why I used Selenium. –  Mar 06 '20 at 10:38
  • Ya, there's nothing wrong with doing it with Selenium. There will be times when you may need to use it; you just don't NEED it here. But now it gives you another tool in your toolbox. – chitown88 Mar 06 '20 at 10:40

You don't need Selenium, because the page source contains the info you need without running JavaScript.
Also, most of these pages redirect to marks4sure.com/200-301-exam.html, so you'll get the same results for them. Only marks4sure.com/300-470-exam.html doesn't.
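
A quick way to confirm which of the URLs redirect is to inspect the redirect history that requests keeps on the response (a sketch; r.history is non-empty whenever the request was redirected):

import requests

for url in ["https://www.marks4sure.com/210-060-exam.html",
            "https://www.marks4sure.com/300-470-exam.html"]:
    r = requests.get(url)
    if r.history:
        print(url, "redirected to", r.url)
    else:
        print(url, "served directly")

With that confirmed, the scraping itself: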

import requests
from bs4 import BeautifulSoup

urls = ["https://www.marks4sure.com/210-060-exam.html",
 "https://www.marks4sure.com/210-065-exam.html",
 "https://www.marks4sure.com/200-355-exam.html",
 "https://www.marks4sure.com/9A0-127-exam.html",
 "https://www.marks4sure.com/300-470-exam.html",]

with open("output.csv", "w") as f:
    f.write("exam_code,exam_name,exam_quest\n")
    for url in urls:
        page = requests.get(url)
        soup = BeautifulSoup(page.text, 'html5lib')
        # the exam-row-data divs alternate label/value; the values sit at
        # indices 1 (code), 3 (name) and 5 (question count)
        for n, v in enumerate(soup.find_all(class_ = "col-sm-6 exam-row-data")):
            if n == 1:
                exam_code = v.text.strip()
            if n == 3:
                exam_name = v.text.strip()
            if n == 5:
                exam_quest = v.text.strip()
        f.write(f"{exam_code},{exam_name},{exam_quest}\n")
Pedro Lobito