I've written a script using scrapy to fetch the answers to different questions from a webpage. The problem is that the answers sit outside the elements I'm currently targeting. I know I could grab them with .next_sibling if I were using BeautifulSoup, but I can't find an equivalent in scrapy.

website link

The HTML elements look like this:

  <p>
   <b>
    <span class="blue">
     Q:1-The NIST Information Security and Privacy Advisory Board (ISPAB) paper "Perspectives on Cloud Computing and Standards" specifies potential advantages and disdvantages of virtualization. Which of the following disadvantages does it include?
    </span>
    <br/>
    Mark one answer:
   </b>
   <br/>
   <input name="quest1" type="checkbox" value="1"/>
   It initiates the risk that malicious software is targeting the VM environment.
   <br/>
   <input name="quest1" type="checkbox" value="2"/>
   It increases overall security risk shared resources.
   <br/>
   <input name="quest1" type="checkbox" value="3"/>
   It creates the possibility that remote attestation may not work.
   <br/>
   <input name="quest1" type="checkbox" value="4"/>
   All of the above
  </p>

I've tried so far with:

import requests
from scrapy import Selector

url = "https://www.test-questions.com/csslp-exam-questions-01.php"

res = requests.get(url,headers={"User-Agent":"Mozilla/5.0"})
sel = Selector(res)
for item in sel.css("[name^='quest']::text").getall():
    print(item)

The above script prints nothing when executed. It throws no error either.

One of the expected outputs from the HTML elements pasted above is:

It initiates the risk that malicious software is targeting the VM environment.

I'm only after a CSS selector solution.

How can I grab the answers to the different questions from that site?

MITHU

3 Answers

The following combination of simple CSS selectors and Python list functions can solve this task:

import scrapy
from scrapy.crawler import CrawlerProcess


class QuestionsSpider(scrapy.Spider):
    name = "TestSpider"
    start_urls = ["https://www.test-questions.com/csslp-exam-questions-01.php"]

    def parse(self, response):
        # Select the <p> elements that hold a question and its answers.
        questions_p_tags = [p for p in response.css("form p")
                            if '<span class="blue"' in p.extract()]
        for p in questions_p_tags:
            # Extract the question and its answer variants from each <p>.
            item = dict()
            item["question"] = p.css("span.blue::text").extract_first()
            # Collect all text nodes, drop the empty ones, and keep the
            # last four as the answer variants.
            item["variants"] = [variant.strip() for variant in p.css("::text").extract() if variant.strip()][-4:]
            yield item


if __name__ == "__main__":
    c = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0'})
    c.crawl(QuestionsSpider)
    c.start()
Georgiy
  • Would you mind taking a look at [this post](https://stackoverflow.com/questions/55907516/cant-get-desired-results-using-proxies), just in case there is any solution to offer, @Georgiy? Thanks. – robots.txt Apr 29 '19 at 17:23

You can try to get the text after the tags with following-sibling::text(). Check this example:

>>> sel.css("[name^='quest']").xpath('./following-sibling::text()').extract()
[u'\n   It initiates the risk that malicious software is targeting the VM environment.\n   ', u'\n   ', u'\n   It increases overall security risk shared resources.\n   ', u'\n   ', u'\n   It creates the possibility that remote attestation may not work.\n   ', u'\n   ', u'\n   All of the above\n  ', u'\n   It increases overall security risk shared resources.\n   ', u'\n   ', u'\n   It creates the possibility that remote attestation may not work.\n   ', u'\n   ', u'\n   All of the above\n  ', u'\n   It creates the possibility that remote attestation may not work.\n   ', u'\n   ', u'\n   All of the above\n  ', u'\n   All of the above\n  ']
vezunchik
  • Thanks @vezunchik for your solution. The thing is, I already know that, and I created this post to seek a solution based on CSS selectors. Thanks. – MITHU Apr 23 '19 at 12:09

You cannot do that at the moment using CSS only.

cssselect, the underlying library behind response.css(), does not support selecting sibling text.

At most you can select the first following element:

>>> selector.css('[name^="quest"] + *').get()
'<br>'
Gallaecio