Webpage formatted in a way that makes selecting text with selenium impossible

Question

This problem is driving me insane: I'm trying to capture the response from a Pandorabot using Selenium. Although I can input text and make the bot reply, its webpage is formatted in a way that makes selecting the output text a nightmare.

This is my code in Python:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from time import sleep

driver = webdriver.Firefox()
driver.get("http://demo.vhost.pandorabots.com/pandora/talk?botid=b0dafd24ee35a477")
elem = driver.find_element_by_name("input")
elem.clear()
elem.send_keys("hello")
elem.send_keys(Keys.RETURN)

line = driver.find_element_by_xpath("(//input)[@name='botcust2']/preceding::font[1]/*")


print(line)
response = line.text
print(response)

driver.close()

which manages to get the first bit of the response ("Chomsky:") but not the rest.

How do I properly capture the response text (ideally excluding the bot name)? Is there a more elegant way to do it (e.g. a jQuery script) that wouldn't break so easily if the webpage gets reformatted?

Many thanks!

Edit

So, after playing around a bit more with jQuery I found a workaround to the problem of any URL text not showing.

I set the whole text string into a variable and then I replace any instances of the name and empty lines with ''. So the jQuery code as pointed out by pguardiario becomes:

# get the response text, then strip the bot name and any blank lines
response = self.browser.execute_script("""
                  var main_str = $('font:has(b:contains("Chomsky:"))').contents().has( "br" ).last().text().trim();
                  main_str = main_str.replace(/Chomsky:/g,'').replace(/^\\s*[\\r\\n]/gm, '');
                  return main_str;
                """)

I'm sure there may be better/more elegant ways to do the whole thing but for now it works.

Many thanks to pguardiario and everyone else for the suggestions!

Is it OK for you to get the text of the font element itself (not of its sub-elements, which would exclude the font element's own text), strip it, and then remove the preceding "Chomsky:" manually? – yyforbidden Dec 18 '19 at 01:25
The page doesn't seem to be making any API calls that I could use. If I could strip the preceding "Chomsky" part programmatically, it would be perfectly fine to get the text straight from the font element, as I just need the plain text to pass to a text-to-speech variable. – whereismycoffee Dec 18 '19 at 08:47
Looks like you're close. The HTML isn't super friendly, but you've caught the parent element with your XPath: (//input)[@name='botcust2']/preceding::font[1]. Now you just need to capture the text node under that one. – DMart Dec 18 '19 at 19:03

score 1 · Answer 1 · answered Dec 18 '19 at 03:24

Since you asked for jQuery:

from requests import get

# download jQuery and inject it into the page so that $ is available
body = get("https://code.jquery.com/jquery-1.11.3.min.js").content.decode('utf8')
driver.execute_script(body)

# get the last child text node
response = driver.execute_script("""
  return $('font:has(b:contains("Chomsky:"))').contents().last().text().trim()
""")

pguardiario

This works great for text, however it breaks when the bot is outputting text together with urls. If you set the question to "tell me about sweden" it will omit the url text (Sweden) and output "is a country in northern europe, bordering Finland and Norway." Similarly if you query "tell me more about sweden" it will not output anything (the bot provides a detailed response with urls and images). – whereismycoffee Dec 18 '19 at 09:22
In that case maybe just get the font text and regex out the "Chomsky:" – pguardiario Dec 19 '19 at 07:39
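
The "get the font text and regex out the name" idea from the comments can also be done on the Python side once the text is in hand. A minimal sketch, assuming the reply text has already been read from the font element (for example via its `.text` attribute); `strip_bot_name` is a hypothetical helper name, not something from Selenium or the page:

```python
import re


def strip_bot_name(text, bot_name="Chomsky"):
    """Remove the "<bot_name>:" label and blank lines from a reply."""
    # Drop every "<bot_name>:" label; re.escape keeps the name literal.
    without_name = re.sub(re.escape(bot_name) + r":\s*", "", text)
    # Trim each line and keep only the non-empty ones.
    lines = [line.strip() for line in without_name.splitlines()]
    return "\n".join(line for line in lines if line)


print(strip_bot_name("Chomsky: Hi! Can I ask you a question?"))
```

This mirrors the two `replace` calls in the question's edit, so it shares their limitation: a reply that itself contains "Chomsky:" would lose that text too.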

score 0 · Answer 2 · answered Dec 22 '19 at 21:33

To capture the response from a Pandorabot using Selenium, note that the response sits within a text node, so you can use the execute_script() method as follows:

Code Block:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.get('http://demo.vhost.pandorabots.com/pandora/talk?botid=b0dafd24ee35a477')
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input[name='input']"))).send_keys("hello")
driver.find_element(By.CSS_SELECTOR, "input[value='Ask Chomsky']").click()
print(driver.execute_script("return arguments[0].lastChild.textContent;", WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//input[@value='Ask Chomsky']//following-sibling::font[last()]//font")))).strip())

Console Output:
```
Hi! Can I ask you a question?
```
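
Because `lastChild` returns only the final node, a reply that mixes text with links would presumably lose the link text, much like the jQuery version discussed above. A minimal sketch of reading the whole element instead, assuming `element` is the font element located by the XPath above; `full_text` is a hypothetical helper name, and the bot name would still need stripping afterwards:

```python
def full_text(driver, element):
    """Return all text inside `element`, including text of nested links.

    `driver` is any object exposing Selenium's execute_script(script, *args);
    `full_text` itself is a hypothetical helper, not part of Selenium.
    """
    # textContent concatenates the text of every descendant node,
    # whereas lastChild.textContent covers only the final child.
    text = driver.execute_script("return arguments[0].textContent;", element)
    return (text or "").strip()
```

It would be called as `full_text(driver, element)` once the wait in the code block above has returned the element.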
