1

I want to extract all the paragraphs from this article, but I have only managed to get the first paragraph using Selenium for Python. The article link is: https://nthqibord.com/2019/08/15/pemimpin-pkr-pertahan-tun-mahathir/

I'm doing this as practice but can't extract the whole article.

I already tried the code below to extract the exact portion of the paragraph:

post = driver.find_element_by_xpath("//div[@class='td-ss-main-content']/div[@class='td-post-content']//p")

It resulted in only getting the first paragraph. I need all the paragraphs.

undetected Selenium
nam

5 Answers

0

find_element_by_xpath returns only the first matching element, so use find_elements_by_xpath, which returns a list of all matching elements.

posts = driver.find_elements_by_xpath("//div[@class='td-ss-main-content']/div[@class='td-post-content']//p")
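
Newer Selenium releases (4.3 and later) removed the `find_elements_by_*` helpers in favour of `find_elements(by, value)`. A minimal sketch assuming that newer API follows; `paragraph_texts` is a hypothetical helper name, and the string `"xpath"` is the value of `By.XPATH`, used here so the snippet needs no imports:

```python
def paragraph_texts(driver, xpath):
    """Return the text of every element matching xpath (hypothetical helper)."""
    # "xpath" is the value of selenium.webdriver.common.by.By.XPATH
    elements = driver.find_elements("xpath", xpath)
    # find_elements returns a list, so read .text from each element in turn
    return [element.text for element in elements]
```

With a live driver this could be called as `paragraph_texts(driver, "//div[@class='td-post-content']//p")`.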
supputuri
  • I tried that just now. The code I did `for posts in post: posts = posts.text` and I got only the last row. – nam Aug 19 '19 at 02:17
  • Not sure what you are trying to do with the loop. – supputuri Aug 19 '19 at 02:19
  • I need the loop because I see that if I did find_elements_by_xpath, it gives me a list of the webElements. I need the text of the posts you see. – nam Aug 19 '19 at 02:44
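
The loop quoted in the comments keeps only the last row because each iteration rebinds the same name instead of accumulating the text. A minimal sketch of the difference, using stand-in objects: `FakeElement` is hypothetical, not part of Selenium, and only mimics the `.text` attribute of a WebElement.

```python
from dataclasses import dataclass


@dataclass
class FakeElement:
    """Hypothetical stand-in for a Selenium WebElement; only .text is modelled."""
    text: str


# Stand-in for the list returned by find_elements_by_xpath(...)
elements = [FakeElement("first"), FakeElement("second"), FakeElement("third")]

# Overwriting: the name is rebound on every pass, so only the last text survives.
last_only = None
for element in elements:
    last_only = element.text

# Accumulating: append each text to a list, then join once at the end.
texts = []
for element in elements:
    texts.append(element.text)
article = "\n".join(texts)

print(last_only)  # third
print(article)
```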
0
para = []

for p in driver.find_elements_by_xpath("//div[@class='td-ss-main-content']/div[@class='td-post-content']//p"):
    para.append(p.text)
posts = " ".join(para)
barbsan
nam
0

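Try it like this:

content = ''
# range() is needed to iterate over the indices; XPath positions are 1-based
for i in range(len(driver.find_elements_by_xpath("//div[@class='td-ss-main-content']/div/p"))):
    # find_element (singular) returns one WebElement, which has a .text attribute
    content = content + driver.find_element_by_xpath("(//div[@class='td-ss-main-content']/div/p)[" + str(i+1) + "]").text
print(content)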
Naveen
0

To extract all the paragraphs from the article using Selenium and Python, use WebDriverWait with visibility_of_all_elements_located() so that the elements are visible before you read them. You can use either of the following locator strategies:

  • Using CSS_SELECTOR:

    driver.get("https://nthqibord.com/2019/08/15/pemimpin-pkr-pertahan-tun-mahathir/")
    print([my_elem.text for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.td-post-content p")))])
    
  • Using XPATH:

    driver.get("https://nthqibord.com/2019/08/15/pemimpin-pkr-pertahan-tun-mahathir/")
    print([my_elem.text for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='td-post-content']//p")))])
    
  • Console Output:

    ['DESAKAN pemimpin PKR, Hassan Abdul Karim yang mendesak Perdana Menteri Tun Mahathir Mohamad meletak jawatan ternyata tidak disambut rakan separtinya.', 'Setiasusaha Agung PKR, Datuk Seri Saifuddin Nasution Ismail berkata, Ahli Parlimen Pasir Gudang itu sepatutnya lebih menumpukan isu berkaitan rakyat.', 'Beliau telah menghubungi Hassan sebaik desakan tersebut dibuat semalam dan mahu menghentikan tindakan berkenaan.', 'Beliau juga telah menghubungi Hassan sebaik desakan tersebut dibuat semalam dan mahu menghentikan tindakan berkenaan.', '“Saya telah menghubungi beliau (Hasan) dan minta fokus kepada isu rakyat.', '“Tinggalkan ia kepada barisan kepemimpinan PKR,” katanya ketika ditemui pemberita di sini hari ini.', 'Hassan semalam mencadangkan Dr. Mahathir supaya meletak jawatan selepas apa yang didakwanya Perdana Menteri itu seperti hilang punca dan hilang daya dalam menyelesaikan beberapa isu kritikal negara.', 'Menurut Hassan, beliau adalah antara ahli Parlimen yang turut menandatangani surat sokongan kepada Tun Mahathir untuk dilantik sebagai Perdana Menteri selepas Pakatan Harapan berjaya membentuk kerajaan pada pilihan raya umum lalu.', 'Beliau juga menegaskan sumbangan negarawan berusia 94 tahun itu akan tetap dikenang dan dihormati. – 15 Ogos 2019.']
    
  • Note: You have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    
undetected Selenium
0
paragraph_texts = [p.text for p in driver.find_elements_by_xpath("//div[@class='td-ss-main-content']/div[@class='td-post-content']//p")]
paragraphs = "\n  ".join(paragraph_texts)
  • Please add some explanation of why this answers the question. – Hintham Aug 19 '19 at 14:54
  • `driver.find_elements_by_xpath` returns a WebElement for each paragraph; the list comprehension reads `p.text` from each one and stores the paragraph text in a list. `"\n  ".join(...)` then joins all the paragraphs in the order they appear on the page. – Ronak Patel Aug 20 '19 at 10:57