It is simple to extract all the content of a webpage's p nodes with lxml. I extract the text of every p node and write it into the file /tmp/content1.txt with the following code.
import urllib.request
import lxml.html

url = 'https://www.statnews.com/pharmalot/2020/03/13/gilead-coronavirus-covid19-clinical-trials/'

# fetch the raw HTML and parse it
ob = urllib.request.urlopen(url).read()
root = lxml.html.document_fromstring(ob)

# collect every <p> element and dump its text, one per line
content = root.xpath("//p")
with open('/tmp/content1.txt', 'w') as fh:
    for etxt in content:
        fh.write(etxt.text_content() + '\n')
Now I do the same job with selenium and write the parsed content into /tmp/content2.txt.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

chrome_options = Options()
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument('--headless')

browser = webdriver.Chrome(options=chrome_options, executable_path='/usr/bin/chromedriver')
wait = WebDriverWait(browser, 30)

url = 'https://www.statnews.com/pharmalot/2020/03/13/gilead-coronavirus-covid19-clinical-trials/'
browser.get(url)

# wait until the document has left the "loading" state and <p> elements exist
wait.until(lambda d: d.execute_script('return document.readyState') != "loading")
wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "p")))

content = browser.find_elements_by_xpath('//p')
with open('/tmp/content2.txt', 'w') as fh:
    for etxt in content:
        fh.write(etxt.text + '\n')
Following Svetlana Levinsohn's suggestion, I try removing chrome_options.add_argument("--headless"):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

chrome_options = Options()
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
# note: no --headless this time

browser = webdriver.Chrome(options=chrome_options, executable_path='/usr/bin/chromedriver')
wait = WebDriverWait(browser, 30)

url = 'https://www.statnews.com/pharmalot/2020/03/13/gilead-coronavirus-covid19-clinical-trials/'
browser.get(url)

wait.until(lambda d: d.execute_script('return document.readyState') != "loading")
wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "p")))

content = browser.find_elements_by_xpath('//p')
with open('/tmp/content3.txt', 'w') as fh:
    for etxt in content:
        fh.write(etxt.text + '\n')
Now compare content1.txt, content2.txt, and content3.txt:
cd /tmp
wc -c content1.txt
11442 content1.txt
wc -c content2.txt
838 content2.txt
wc -c content3.txt
12105 content3.txt
1. Why do I get more content when I remove chrome_options.add_argument("--headless") from the selenium script? What is the principle behind this behavior? (See the sketch below.)
2. Is there a way to get the same content with selenium as with lxml?
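One thing worth checking for question 1 (a sketch built on an assumption, not a confirmed explanation): headless Chrome announces itself in its user agent, and some sites serve reduced markup when they see it. Reusing the browser object from the script above:

# print the user agent the site sees; in headless mode it contains
# "HeadlessChrome" instead of "Chrome", which some sites key on
print(browser.execute_script("return navigator.userAgent"))

# if that turns out to be the cause, overriding the user agent before
# starting the browser may help (the UA string here is only an example)
# chrome_options.add_argument("user-agent=Mozilla/5.0 (X11; Linux x86_64) "
#                             "AppleWebKit/537.36 (KHTML, like Gecko) "
#                             "Chrome/80.0.3987.132 Safari/537.36")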
Following supputuri's suggestion, I changed the last line to fh.write(etxt.get_attribute("textContent") + '\n'), but an issue still remains.
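For clarity, here is the write loop after that change; it is the same selenium script as above with only the last line replaced. Unlike the element's .text, which selenium computes from the rendered (visible) text, the textContent attribute also includes text that CSS hides, so it is closer to what lxml's text_content() returns:

content = browser.find_elements_by_xpath('//p')
with open('/tmp/content2.txt', 'w') as fh:
    for etxt in content:
        # textContent includes text hidden by CSS, unlike .text
        fh.write(etxt.get_attribute("textContent") + '\n')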
wc -c content1.txt
12402 content1.txt
wc -c content2.txt
12410 content2.txt
Let's check why content2.txt is 8 bytes larger than content1.txt.
diff content1.txt content2.txt
1c1
< By Ed Silverman @Pharmalot
---
> By Ed Silverman2 @Pharmalot3
3,4c3,4
< As anticipation mounts over the prospects for an experimental Gilead Sciences (GILD) drug to combat the novel coronavirus, two Wall Street analysts suggested it remains uncertain whether the antiviral therapy will be successful after assessing a new paper that examined a dozen U.S. patients.
< The paper, published on a preprint server without peer review, described the epidemiology, clinical course, and viral characteristics of the first 12 U.S. patients with Covid-19, only three of whom were treated with remdesivir, which was developed to treat the Ebola virus but shelved after proving less effective than other drugs during testing. The analysis was conducted by the Centers for Disease Control and Prevention Covid-19 response team.
---
> As anticipation mounts over the prospects for an experimental Gilead Sciences (GILD4) drug to combat the novel coronavirus, two Wall Street analysts suggested it remains uncertain whether the antiviral therapy will be successful after assessing a new paper that examined a dozen U.S. patients.
> The paper5, published on a preprint server without peer review, described the epidemiology, clinical course, and viral characteristics of the first 12 U.S. patients with Covid-19, only three of whom were treated with remdesivir, which was developed to treat the Ebola virus but shelved after proving less effective than other drugs during testing. The analysis was conducted by the Centers for Disease Control and Prevention Covid-19 response team.
22,24c22,24
< Coronavirus
< drug development
< research
---
> Coronavirus10
> drug development11
> research12
26c26
< Republish this article
---
> Republish this article13
59c59
< 👍
---
>
Bytes in content2.txt that are not in content1.txt:

line 1: 2, 3
lines 3-4: 4, 5
lines 22-24: 10, 11, 12
line 26: 13

That is 4 bytes to store 2, 3, 4, 5 and 8 bytes to store 10, 11, 12, 13.
Bytes in content1.txt that are not in content2.txt: line 59 contains 👍, which needs 4 bytes (f0 9f 91 8d) to store.

4 + 8 - 4 = 8 = 12410 - 12402
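A quick sanity check of that arithmetic in Python (the literals below are just the characters counted above):

extra = "2345" + "10111213"         # footnote digits only in content2.txt
emoji = "\U0001F44D"                # the 👍 only in content1.txt

print(len(extra.encode("utf-8")))   # 12 = 4 + 8
print(emoji.encode("utf-8").hex())  # f09f918d
print(len(extra.encode("utf-8")) - len(emoji.encode("utf-8")))  # 8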
Note: the page content changes dynamically, so you may get different byte counts for content1.txt and content2.txt.
It is time to check another important issue. The first line of content1.txt, parsed by lxml, is:

By Ed Silverman @Pharmalot

The first line of content2.txt, parsed by selenium, is:

By Ed Silverman2 @Pharmalot3

Why does selenium add the 2 and 3 here? selenium adds numbers that are not in the original webpage; what do they mean?
And I have never seen any JavaScript code that changes the DOM tree of this webpage. How can I prevent selenium from including these numbers when I call get_attribute("textContent")?
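One idea (a sketch, assuming the numbers come from the sup.footnote elements shown further below; not a confirmed fix): remove those nodes in the browser before reading textContent:

# strip the injected footnote markers, then re-read the paragraphs
# (assumption: they are <sup class="footnote"> elements, as shown below)
browser.execute_script(
    "document.querySelectorAll('sup.footnote').forEach(el => el.remove())"
)
content = browser.find_elements_by_xpath('//p')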
Vladimir M noted that all the numbers are in the original site. I verified this:
import urllib.request
import lxml.html

url = 'https://www.statnews.com/pharmalot/2020/03/13/gilead-coronavirus-covid19-clinical-trials/'
ob = urllib.request.urlopen(url).read()
root = lxml.html.document_fromstring(ob)

# inspect the raw HTML of the author paragraph
content = root.xpath("//p[@class='author']")[0]
print(lxml.html.tostring(content))
We get the HTML source code:
b'<p class="author">
<em>By</em>
<a ...>Ed Silverman</a>
<a ...>@Pharmalot</a>
</p>'
It does not contain the sup tags that Vladimir M showed:
<p class="author">
<em>By</em>
<a ...>Ed Silverman</a>
<sup class="footnote">3</sup>
<a ...>@Pharmalot</a>
<sup class="footnote">4</sup>
</p>
If the original HTML source contained such sup tags, text_content in lxml would show them:
import lxml.html as lh

data = """<p class="author"><em>By</em> <a href="https://www.statnews.com/staff/ed-silverman/" \
class="author-name-link author-name author-main">Ed Silverman</a><sup class="footnote">3</sup> \
<a href="https://twitter.com/Pharmalot" class="author-social" target="_blank" rel="noopener"> \
@Pharmalot</a><sup class="footnote">4</sup> </p>"""

doc = lh.fromstring(data)
para = doc.xpath('//p')[0]
print(para.text_content())
It outputs the following:
By Ed Silverman3 @Pharmalot4
So I infer that the two sup tags were created by some JavaScript code.
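A hedged way to double-check that inference with the tools already used here: count sup elements in the raw HTML versus in the DOM selenium sees after the page's scripts have run (reusing the browser object from the selenium script above):

import urllib.request
import lxml.html

url = 'https://www.statnews.com/pharmalot/2020/03/13/gilead-coronavirus-covid19-clinical-trials/'

# raw HTML, before any JavaScript runs
raw = lxml.html.document_fromstring(urllib.request.urlopen(url).read())
print(len(raw.xpath('//sup')))       # 0 here

# the DOM after the browser executed the page's scripts
rendered = lxml.html.document_fromstring(browser.page_source)
print(len(rendered.xpath('//sup')))  # > 0 if JS injected them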
To improve my JavaScript knowledge, the last issue is related to JS: how can I find out which JS file creates the numbers located in the <p class="author"> node?
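For what it is worth, one way to investigate this (a sketch under several assumptions, not a definitive answer): hook DOM insertion before any page script runs and log a stack trace whenever a sup element is inserted; the stack printed in the browser console names the responsible script file. Setting a "Break on subtree modifications" DOM breakpoint on the p.author node in Chrome DevTools is the manual equivalent.

# install the hook before any page script executes; execute_cdp_cmd is
# Chromium-specific. Only insertBefore is hooked here; appendChild and
# innerHTML assignments would need similar hooks.
hook = """
const orig = Element.prototype.insertBefore;
Element.prototype.insertBefore = function (node, ref) {
    if (node && node.tagName === 'SUP') {
        console.log('sup inserted by:', new Error().stack);
    }
    return orig.call(this, node, ref);
};
"""
browser.execute_cdp_cmd('Page.addScriptToEvaluateOnNewDocument', {'source': hook})
browser.get(url)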
Please answer it and get the 500 points.