HTML div class that contains the data I wish to print

<div class="gs_a">LR Binford&nbsp;- American antiquity, 1980 - cambridge.org </div>

This is my code so far :

from selenium import webdriver

def Author (SearchVar):
    driver = webdriver.Chrome("/Users/tutau/Downloads/chromedriver")
    driver.get ("https://scholar.google.com/")
    SearchBox = driver.find_element_by_id ("gs_hdr_tsi")
    SearchBox.send_keys(SearchVar)
    SearchBox.submit()
    At = driver.find_elements_by_css_selector ('#gs_res_ccl_mid > div:nth-child(1) > div.gs_ri > div.gs_a')
    print (At)

Author("dog")

All that comes out when I print is

selenium.webdriver.remote.webelement.WebElement (session="9aa956e2bd51f510dd626f6937b01c0e", element="0.6506218589189958-1")

not the text. I am new to Selenium, so any help is appreciated.

Andersson
Te Uruti Tau
  • Possible duplicate of [How to get text with selenium web driver in python](https://stackoverflow.com/questions/20996392/how-to-get-text-with-selenium-web-driver-in-python) – Andersson Jun 07 '18 at 05:07
  • Can you please paste the HTML? The screenshot is not so helpful. – Monika Jun 07 '18 at 05:07
  • You should use `driver.find_element_by_css_selector` rather than `driver.find_elements_by_css_selector`, and it should be `print (At.text)`. – yong Jun 07 '18 at 05:15
  • You are printing the element with `print(At)`; use `print(At.text)` instead. Not related, but I suggest using requests with BeautifulSoup instead of Selenium. – raviraja Jun 07 '18 at 05:43

3 Answers

Intro

First, I recommend running the CSS selection against Selenium's page_source with a faster parser.

import lxml
import lxml.html

# put this below SearchBox.submit()

CSS_SELECTOR = '#gs_res_ccl_mid > :nth-child(1) > .gs_ri > .gs_a' # Define css
source = driver.page_source                                       # Get all html
At_raw = lxml.html.document_fromstring(source)                    # Convert
At = At_raw.cssselect(CSS_SELECTOR)                               # Select by CSS

Solution 1

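Then, you need to extract the text_content() from the matched element (cssselect returns a list, so take the first match) and encode it properly.

# Python 2 idiom; on Python 3, drop .encode('utf-8') to print a str instead of bytes
At = At[0].text_content().encode('utf-8') # Get text of the first match and encode
print(At)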
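
For readers who do not have lxml installed, the same idea (parse the page source as a string instead of querying the live element) can be sketched with the standard library. This is only an illustration under stated assumptions: `ClassTextExtractor` and `texts_by_class` are hypothetical helpers, html.parser stands in for lxml, and the input is the `gs_a` div from the question rather than a real `driver.page_source`.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collect the text inside every <div> carrying a given class (hypothetical helper)."""

    def __init__(self, class_name):
        super().__init__(convert_charrefs=True)  # decode entities such as &nbsp;
        self.class_name = class_name
        self.depth = 0      # > 0 while we are inside a matching div
        self.chunks = []    # text pieces of the div currently being read
        self.results = []   # finished text, one entry per matching div

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        # Count nested divs too, so the matching closing tag is found correctly.
        if self.depth or self.class_name in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.chunks).strip())
                self.chunks = []

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def texts_by_class(html, class_name):
    """Return the text of each <div class="..."> whose class list contains class_name."""
    parser = ClassTextExtractor(class_name)
    parser.feed(html)
    parser.close()
    return parser.results


# Sample taken from the question; with Selenium this would be driver.page_source.
SAMPLE = '<div class="gs_a">LR Binford&nbsp;- American antiquity, 1980 - cambridge.org </div>'
print(texts_by_class(SAMPLE, "gs_a"))
```

Note that the `&nbsp;` entity is decoded to a non-breaking space (U+00A0), which is exactly the kind of non-ASCII character the encoding step above has to cope with.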

Solution 2

If At contains more than one line and non-ASCII characters, you can also strip those:

import re

At = [re.sub(r'[^\x00-\x7F]+', '', l)                    # strip non-ASCII characters
      for line in At
      for l in line.text_content().strip().splitlines()  # get text, split into lines
      if l.strip()]                                      # keep only lines that contain characters
print(At)
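
The clean-up step can be tried in isolation, without a browser or lxml. The sketch below is a minimal illustration, assuming the text has already been extracted; `clean_lines` and `NON_ASCII` are hypothetical names and the sample string is made up from the question's data.

```python
import re

# Matches runs of characters outside the 7-bit ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7F]+")


def clean_lines(text):
    """Split text into lines, strip non-ASCII characters, and drop blank lines."""
    cleaned = []
    for line in text.splitlines():
        line = NON_ASCII.sub("", line).strip()
        if line:  # skip lines that are empty after cleaning
            cleaned.append(line)
    return cleaned


# U+00A0 is the non-breaking space that &nbsp; decodes to.
raw = "LR Binford\u00a0- American antiquity, 1980\n   \n  cambridge.org \n"
print(clean_lines(raw))
```

Be aware that this is lossy: it removes the non-breaking space between the author and the dash, and it would equally drop accented letters in author names. On Python 3, printing the untouched str is usually enough.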
sudonym
  • OP explicitly said that they want to get the output ***using selenium in python***, while you suggest using `lxml`, which looks much more complicated than simply adding the `text` property... – Andersson Jun 07 '18 at 06:13
  • My proposed solution requires Python and Selenium (driver.page_source). In fact, that is the first sentence of my answer. I suggest using a different PARSER for performance reasons, and I also suggest a way of text extraction that works in all scenarios, not just in some. – sudonym Jun 07 '18 at 06:16
  • If `text` doesn't work, OP might use `get_attribute("textContent")`. Also, using a third-party library to extract one text value doesn't seem to bring much efficiency or improvement. – Andersson Jun 07 '18 at 06:19
  • I agree with you. As soon as OP decides to scrape more than one value in the future, my code might help more. I benchmarked this and in essence doubled my throughput/s using Selenium's page_source + lxml compared to vanilla Selenium. In the meantime, let's hope his value does not contain any currency symbols. – sudonym Jun 07 '18 at 06:22
It seems you were almost there. Given the HTML and the code you have shared, the output you are seeing is exactly what print(At) is expected to produce.

Explanation

Once the following line of code gets executed:

At = driver.find_elements_by_css_selector ('#gs_res_ccl_mid > div:nth-child(1) > div.gs_ri > div.gs_a')

At refers to a list holding the desired WebElement (a single element in your list). In your next step, print (At) prints the WebElement object itself rather than its text, which looks as follows:

selenium.webdriver.remote.webelement.WebElement (session="9aa956e2bd51f510dd626f6937b01c0e", element="0.6506218589189958-1")

Solution

Now, as per your question, if you want to extract the text LR Binford - American antiquity, 1980 - cambridge.org, you have to read it through the element, using either the text property or the get_attribute() method.

So you need to change the line of code from:

print (At)

To either of the following:

  • Using text (find_elements returns a list, so take its first element):

    print(At[0].text)
    
  • Using get_attribute(attributeName):

    print(At[0].get_attribute("innerHTML"))
    
  • Your own code with minor adjustments:

    # -*- coding: UTF-8 -*-
    from selenium import webdriver
    
    def Author (SearchVar):
    
        options = webdriver.ChromeOptions() 
        options.add_argument("start-maximized")
        options.add_argument('disable-infobars')
        driver=webdriver.Chrome(chrome_options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
        driver.get ("https://scholar.google.com/")
        SearchBox = driver.find_element_by_name("q")
        SearchBox.send_keys(SearchVar)
        SearchBox.submit()
        At = driver.find_elements_by_css_selector ('#gs_res_ccl_mid > div:nth-child(1) > div.gs_ri > div.gs_a')
        for item in At:
            print(item.text)
    
    Author("dog")
    
  • Console Output:

    …, RJ Marles, LS Pellicore, GI Giancaspro, TL Dog - Drug Safety, 2008 - Springer
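
The `find_element_by_*` helpers used above belong to the Selenium 3 API and were removed in later Selenium 4 releases. A hedged sketch of the same flow with the `find_element(By..., ...)` form follows. Assumptions: `author_lines` and `main` are hypothetical names, the selector and the `q` field name are copied from the code above and may no longer match the live page, and `webdriver.Chrome()` assumes Selenium can locate a chromedriver on its own.

```python
RESULT_SELECTOR = "#gs_res_ccl_mid > div:nth-child(1) > div.gs_ri > div.gs_a"


def author_lines(driver, selector=RESULT_SELECTOR):
    """Return the visible text of every element matching the selector."""
    # "css selector" is the string value of Selenium's By.CSS_SELECTOR constant.
    elements = driver.find_elements("css selector", selector)  # always a list
    return [element.text for element in elements]


def main(query):
    # Imported here so author_lines() can be exercised without a browser installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()  # assumes a chromedriver that Selenium can find
    try:
        driver.get("https://scholar.google.com/")
        search_box = driver.find_element(By.NAME, "q")
        search_box.send_keys(query)
        search_box.submit()
        for line in author_lines(driver):
            print(line)
    finally:
        driver.quit()  # close the browser even if a lookup fails
```

Calling `main("dog")` would run the search against a real browser. The key point is unchanged: `find_elements` returns a list, and the text comes from each element's `text` property.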
    
undetected Selenium
  • Well, there's an indentation error in the for loop (at the print) and you don't need the 'div' in the CSS selector. Again: this will throw an error in case there is Unicode in the element of interest – sudonym Jun 07 '18 at 12:45
  • I only see this because YOU (among others) helped me with your contributions on SO in the past – sudonym Jun 07 '18 at 12:50
  • Thanks @sudonym. Keep an eye on my answers from time to time. Your feedback and support always bring out the best in me. Not sure why the indentation doesn't get pasted as it should. However, I corrected it and added a cushion for _Unicode_ as well. But I am not in favour of any change to _OP's approach_ unless it is absolutely necessary. Essentially, that kills OP's innovation. Hence I left the `css_selector` untouched. – undetected Selenium Jun 07 '18 at 12:53
  • Got it. I'll keep monitoring your contributions, trust me on that – sudonym Jun 07 '18 at 12:58
  • Cheers bro, much appreciated. This code will not work in Atom, however; I had to switch to Visual Studio. It throws a Unicode error. – Te Uruti Tau Jun 09 '18 at 01:02
You are printing the element. Print At.text instead of At (since find_elements returns a list, that means At[0].text for the first match).

Monika
  • AFAIK this won't work if you are dealing with unicode (currency symbols etc.). Also, this won't remove whitespace-only lines and similar artefacts – sudonym Jun 07 '18 at 05:45