
It is simple to extract all the content in the `p` nodes with lxml. I extract all content from the webpage's `p` nodes and write it into the file /tmp/content1.txt with the following code.

import urllib.request
import lxml.html

url = 'https://www.statnews.com/pharmalot/2020/03/13/gilead-coronavirus-covid19-clinical-trials/'
# Fetch the raw HTML and parse it with lxml.
ob = urllib.request.urlopen(url).read()
root = lxml.html.document_fromstring(ob)
# Grab every <p> element and dump its text content, one per line.
content = root.xpath("//p")
with open('/tmp/content1.txt', 'w') as fh:
    for etxt in content:
        fh.write(etxt.text_content() + '\n')

Now I do the same job with selenium and write the parsed content to /tmp/content2.txt.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

chrome_options = Options()
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument('--headless')
browser = webdriver.Chrome(options=chrome_options, executable_path='/usr/bin/chromedriver')

wait = WebDriverWait(browser, 30)
url = 'https://www.statnews.com/pharmalot/2020/03/13/gilead-coronavirus-covid19-clinical-trials/'
browser.get(url)
# Wait until the document has left the "loading" state,
# then until at least one <p> element is present.
wait.until(lambda d: d.execute_script('return document.readyState') != "loading")
wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "p")))
content = browser.find_elements_by_xpath('//p')
with open('/tmp/content2.txt', 'w') as fh:
    for etxt in content:
        fh.write(etxt.text + '\n')

Following Svetlana Levinsohn's suggestion, I try removing `chrome_options.add_argument("--headless")`.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

chrome_options = Options()
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
# No --headless this time: Chrome runs with its normal UI.
browser = webdriver.Chrome(options=chrome_options, executable_path='/usr/bin/chromedriver')

wait = WebDriverWait(browser, 30)
url = 'https://www.statnews.com/pharmalot/2020/03/13/gilead-coronavirus-covid19-clinical-trials/'
browser.get(url)
wait.until(lambda d: d.execute_script('return document.readyState') != "loading")
wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "p")))
content = browser.find_elements_by_xpath('//p')
with open('/tmp/content3.txt', 'w') as fh:
    for etxt in content:
        fh.write(etxt.text + '\n')

Now compare content1.txt, content2.txt, and content3.txt:

cd /tmp
wc -c content1.txt
11442 content1.txt
wc -c content2.txt
838 content2.txt
wc -c /tmp/content3.txt
12105 /tmp/content3.txt

1. Why do I get more content when I remove `chrome_options.add_argument("--headless")` with selenium? What is the principle behind this?
2. Is there a way to get the same content with selenium as with lxml?

Following supputuri's suggestion, I change the last line to `fh.write(etxt.get_attribute("textContent") + '\n')`, but the issue still remains.

wc -c content1.txt
12402 content1.txt
wc -c content2.txt
12410 content2.txt

Let's check why content2.txt is 8 bytes larger than content1.txt.

diff content1.txt  content2.txt
1c1
< By Ed Silverman @Pharmalot 
---
> By Ed Silverman2 @Pharmalot3 
3,4c3,4
< As anticipation mounts over the prospects for an experimental Gilead Sciences (GILD) drug to combat the novel coronavirus, two Wall Street analysts suggested it remains uncertain whether the antiviral therapy will be successful after assessing a new paper that examined a dozen U.S. patients.
< The paper, published on a preprint server without peer review, described the epidemiology, clinical course, and viral characteristics of the first 12 U.S. patients with Covid-19, only three of whom were treated with remdesivir, which was developed to treat the Ebola virus but shelved after proving less effective than other drugs during testing. The analysis was conducted by the Centers for Disease Control and Prevention Covid-19 response team.
---
> As anticipation mounts over the prospects for an experimental Gilead Sciences (GILD4) drug to combat the novel coronavirus, two Wall Street analysts suggested it remains uncertain whether the antiviral therapy will be successful after assessing a new paper that examined a dozen U.S. patients.
> The paper5, published on a preprint server without peer review, described the epidemiology, clinical course, and viral characteristics of the first 12 U.S. patients with Covid-19, only three of whom were treated with remdesivir, which was developed to treat the Ebola virus but shelved after proving less effective than other drugs during testing. The analysis was conducted by the Centers for Disease Control and Prevention Covid-19 response team.
22,24c22,24
< Coronavirus
< drug development
< research
---
> Coronavirus10
> drug development11
> research12
26c26
<                                   Republish this article
---
>                                   Republish this article13
59c59
< 👍
---
> 

Bytes in content2.txt but not in content1.txt:

line 1: 2, 3
lines 3-4: 4, 5
lines 22-24: 10, 11, 12
line 26: 13

It takes 4 bytes to store the digits 2, 3, 4, 5 and 8 bytes to store 10, 11, 12, 13.

Bytes in content1.txt but not in content2.txt:

line 59: 👍

For 👍, it needs 4 bytes (f0 9f 91 8d) to store.

4 + 8 - 4 = 8 = 12410 - 12402
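
This arithmetic is easy to check in Python (a minimal sketch; the digit strings are the footnote numbers listed above):

# Each ASCII digit is 1 byte in UTF-8; the 👍 emoji takes 4 bytes.
extra_in_content2 = len('2345'.encode('utf-8')) + len('10111213'.encode('utf-8'))  # 4 + 8
extra_in_content1 = len('👍'.encode('utf-8'))                                      # 4
print(extra_in_content2 - extra_in_content1)  # 8 == 12410 - 12402
print('👍'.encode('utf-8').hex())             # f09f918d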

Note: the content served by the site changes dynamically, so you may get different byte counts for content1.txt and content2.txt.

It is time to check another important issue.
The first line in content1.txt, parsed by lxml:

By Ed Silverman @Pharmalot 

The first line in content2.txt, parsed by selenium:

By Ed Silverman2 @Pharmalot3 

Why does selenium add 2 and 3 here? Selenium adds numbers that are not in the original webpage; what do they mean?
And I have never seen any JavaScript code change the DOM tree of the webpage.
How can I prevent selenium from adding the numbers when using get_attribute("textContent")?

Vladimir M noted that all the numbers are in the original site. I made a verification:

import urllib.request
import lxml.html

url = 'https://www.statnews.com/pharmalot/2020/03/13/gilead-coronavirus-covid19-clinical-trials/'
ob = urllib.request.urlopen(url).read()
root = lxml.html.document_fromstring(ob)
# Inspect the raw HTML of the author paragraph.
content = root.xpath("//p[@class='author']")[0]
print(lxml.html.tostring(content))

We get the HTML source code:

 b'<p class="author">
     <em>By</em> 
     <a ...>Ed  Silverman</a> 
     <a ...>@Pharmalot</a> 
   </p>'

It does not contain the `sup` tags that Vladimir M shows:

<p class="author">
  <em>By</em> 
  <a ...>Ed Silverman</a>
  <sup class="footnote">3</sup> 
  <a ...>@Pharmalot</a>
  <sup class="footnote">4</sup> 
</p>

If the original HTML source contained the `sup` tags, `text_content()` in lxml would show them:

import lxml.html as lh

# The author paragraph as it would look if the raw HTML contained the sup tags.
data = """<p class="author"><em>By</em> <a href="https://www.statnews.com/staff/ed-silverman/" \
class="author-name-link author-name author-main">Ed Silverman</a><sup class="footnote">3</sup> \
<a href="https://twitter.com/Pharmalot" class="author-social" target="_blank" rel="noopener">  \
@Pharmalot</a><sup class="footnote">4</sup> </p>"""
doc = lh.fromstring(data)
author = doc.xpath('//p')[0]
print(author.text_content())

It outputs the following:

By Ed Silverman3   @Pharmalot4 

I infer that the two `sup` tags were created by some JavaScript code.
To improve my JavaScript knowledge, the last issue is related to JS:
how do I find out which JS file creates the numbers located in the `<p class="author">` node?
Please answer it and get the 500 points.

showkey
  • try removing `chrome_options.add_argument("--headless")`, it may solve this – Svetlana Levinsohn Mar 16 '20 at 19:21
  • Presence of all elements located returns when at least one is found: https://www.selenium.dev/selenium/docs/api/java/org/openqa/selenium/support/ui/ExpectedConditions.html#presenceOfAllElementsLocatedBy-org.openqa.selenium.By- A standard sleep will be the simplest way here... (btw, you don't need the first wait there, Selenium will wait for ready state after a get) – pcalkins Mar 16 '20 at 22:27
  • If you check the source code, numbers are added to the text if it is actually a link (anchor). – Vladimir M Mar 19 '20 at 09:59

3 Answers

2

I've been playing around with your problem and the site, trying to figure out what exactly is going on. Here is what I have found. (My previous answer was perhaps wrong, or at least incomplete.)

Firstly, selenium does not add lines that are not in the original. They are in the original site; it's just that lxml displays them differently. I don't know much about lxml, so I won't discuss it further.
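
You can confirm this from selenium itself: the rendered DOM really does contain the footnote elements (a minimal sketch; assumes `driver` has already loaded the page, and uses the `sup.footnote` selector from the markup below):

# Count the footnote markers present in the live DOM.
sups = driver.find_elements_by_css_selector('sup.footnote')
print(len(sups), [s.get_attribute('textContent') for s in sups[:5]])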

Secondly, let's figure out what these numbers are. Let's take:

By Ed Silverman3 @Pharmalot4 

The code for it is

<p class="author">
  <em>By</em> 
  <a ...>Ed Silverman</a>
  <sup class="footnote">3</sup> 
  <a ...>@Pharmalot</a>
  <sup class="footnote">4</sup> 
</p>

Notice the numbers? (By the way, those have changed somewhat since your original posting.)

The numbers are there, and there is logic behind when those numbers are displayed.

The next thing to check is this:

https://www.w3schools.com/jsref/prop_node_innertext.asp

Basically, textContent will return all the text inside the element. That's why you get the numbers in your code.

innerText will respect the CSS visibility rules for elements. So, yes, you DO get less text with innerText than with textContent.
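
For example, the two attributes differ on the author paragraph (a minimal sketch; assumes `driver` already holds the loaded page):

author = driver.find_element_by_css_selector('p.author')
print(author.get_attribute('textContent'))  # includes the hidden footnote digits
print(author.get_attribute('innerText'))    # respects CSS visibility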

But now you have to decide what exactly you need to achieve. Using innerText should be the correct way to return visible text, if that's what you need.

content = driver.find_elements_by_xpath('//p')

with open('content0_innerText.txt','w') as fh:
    for etxt in content:
        fh.write(etxt.get_attribute('innerText') + '\n')

However, when I tried it, there were still some numbers visible for some links. Perhaps they are supposed to be visible per the CSS. In any case, you can modify the styles of the elements OR of the page to get the content you want, for example by removing all the elements that contain these numbers:

import time

# Remove every <sup> footnote marker from the DOM before reading the text.
content = driver.find_elements_by_xpath('//sup')
for etxt in content:
    driver.execute_script("return arguments[0].remove();", etxt)

time.sleep(1)

content = driver.find_elements_by_xpath('//p')

with open('content0_innerText_remove.txt','w') as fh:
    for etxt in content:
        fh.write(etxt.get_attribute('innerText') + '\n')

You may also try to modify the styles of the page/elements, but that might be more work than simply removing those.

Hope this helps.

With regards to where these `sup` tags are added:

It is typically not easy to say for sure which file does it. By inspecting the network tab in Chrome after loading this site, I suspect that the functionality is in the

stat-theme.js file. (https://www.statnews.com/wp-content/compiled/js/stat-theme.js?ver=7206f7890c08d8e03e22ec8af0b756cf39f84bae)

Namely, the function processLinks. Since it is 'compiled', it's not very readable, but it seems that what it does is go through all the links, do some pattern matching, and insert a sup element after the href element. I am not going to paste the code here, because it might breach licenses, but you should be able to locate it in that file.

And it seems to be called on init. Judging by the file name, it is part of the WordPress theme or one of its plugins.
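
If you want to track this down yourself, one approach is to download every external script on the page and search it for footnote-related code (a rough sketch; the search strings are assumptions):

import urllib.parse
import urllib.request
import lxml.html

url = 'https://www.statnews.com/pharmalot/2020/03/13/gilead-coronavirus-covid19-clinical-trials/'
root = lxml.html.document_fromstring(urllib.request.urlopen(url).read())
for src in root.xpath('//script/@src'):
    script_url = urllib.parse.urljoin(url, src)
    try:
        js = urllib.request.urlopen(script_url).read().decode('utf-8', 'replace')
    except Exception:
        continue  # some scripts may be unreachable
    if 'footnote' in js or 'processLinks' in js:
        print(script_url)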

Previous version:

I've noticed that the extra numbers happen when the text in question is actually an anchor. After doing some searching, I think you are facing a similar issue to:

Difference between text and innerHTML using Selenium

Perhaps you want to use .get_attribute('innerText') as was suggested.

Vladimir M
  • With `.get_attribute('innerText')`, I get less content than with `get_attribute("textContent")`. – showkey Mar 19 '20 at 11:27
  • @it_is_a_literature Well, as mentioned in the linked question, it often depends on the web driver implementation how the inner text is rendered. You may try something that I used in one of my test automation projects: execute JavaScript with the execute_script method. That will allow you to use the browser directly. The downside is that you lose all the benefits of finding elements in selenium. But you may at least try whether that returns the correct text with the innerText attribute. – Vladimir M Mar 19 '20 at 12:25
  • Please show me the code to get content which contains no numbers with selenium; I have read the material you provided and still can't write proper code. – showkey Mar 21 '20 at 03:43
  • @it_is_a_literature I went and did some experimenting; updated the answer. – Vladimir M Mar 21 '20 at 14:11
  • Please see the content beginning with `Vladimir M` until the end in my updated post. I infer that the `sup` elements were created by some JavaScript code; how can I find out which JS creates the numbers? – showkey Mar 22 '20 at 01:27
  • @it_is_a_literature Added another section to my answer. I think I found the code. Follow the instructions to find it yourself, because posting it here might be against licenses. – Vladimir M Mar 22 '20 at 02:11
1

The difference between content1.txt and content2.txt is due to the way we are getting the text from the source.

In the content1.txt case you are getting text_content(), but for content2.txt you are getting .text, and text is not the same as textContent. Because of this you are missing a number of lines in content2.txt. The fix for the headless run is to change the last line to:

 fh.write(etxt.get_attribute("textContent") + '\n')

When I run with a normal browser, the header at the top has an extra p element whose textContent is TRY STAT PLUS, which is not present in the headless or lxml approaches. Because of this new p element, the file is slightly larger than in the first 2 approaches.

Browser screenshot: (image omitted)
Headless screenshot: (image omitted)

Apart from TRY STAT PLUS / Read Now, the rest of the text content is the same in all 3 approaches.
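
A quick way to confirm this is to diff the two dumps directly in Python (a minimal sketch using only the standard library):

import difflib

# Print a unified diff of the two text dumps.
with open('/tmp/content1.txt') as f1, open('/tmp/content2.txt') as f2:
    diff = difflib.unified_diff(f1.readlines(), f2.readlines(),
                                fromfile='content1.txt', tofile='content2.txt')
    for line in diff:
        print(line, end='')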

supputuri
0

You might not want to use --headless in chrome_options. It does speed up the program, but sometimes it doesn't use the user interface at all. That could be the problem here, from what I can discern.

boi yeet