How to scrape the comments using Selenium and Python?

Question

I want to extract the comments from a website. I tried using Selenium and extracting them with an XPath, but it does not work.

from selenium import webdriver
import pandas as pd
            
driver = webdriver.Chrome()
driver.get('https://finance.detik.com/berita-ekonomi-bisnis/d-5307853/ri-disebut-punya-risiko-korupsi-yang-tinggi?_ga=2.13736693.357978333.1608782559-293324864.1608782559')
            
userid_element = driver.find_elements_by_xpath('//*[@id="cmt66364625"]/div[1]/div[1]/text()')[0]
userid = userid_element.text

This is the result:

IndexError                                Traceback (most recent call last)
<ipython-input-73-151acf07e320> in <module>
----> 1 userid_element = driver.find_elements_by_xpath('//*[@id="cmt66364625"]/div[1]/div[1]/text()')[0]
      2 userid = userid_element.text

IndexError: list index out of range

I tried removing the list index:

userid_element = driver.find_elements_by_xpath('//*[@id="cmt66364625"]/div[1]/div[1]/text()')
userid = userid_element.text

but the result is:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-74-890ba28d7494> in <module>
      1 userid_element = driver.find_elements_by_xpath('//*[@id="cmt66364625"]/div[1]/div[1]/text()')
----> 2 userid = userid_element.text

AttributeError: 'list' object has no attribute 'text'

score 0 · Answer 1 · answered Dec 25 '20 at 09:49

userid = [i.text for i in userid_element]
print(userid)

find_elements_by_xpath returns a list, so you have to iterate over it. The code above reads the text of each element and stores the results in a list.
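The two errors in the question both come from how the returned list is handled. A minimal sketch of the safe patterns, using types.SimpleNamespace objects as hypothetical stand-ins for WebElements so that it runs without a browser; the helper names texts_of and first_text are made up for illustration:

```python
from types import SimpleNamespace

# Hypothetical stand-ins for WebElement objects: anything with a .text attribute.
found = [SimpleNamespace(text="first comment"), SimpleNamespace(text="second comment")]
not_found = []  # what a find_elements call gives back when nothing matches


def texts_of(elements):
    """Return the .text of every element; an empty list gives an empty list."""
    return [element.text for element in elements]


def first_text(elements, default=None):
    """Return the .text of the first element, or default when the list is empty."""
    return elements[0].text if elements else default


print(texts_of(found))        # ['first comment', 'second comment']
print(first_text(not_found))  # None, instead of raising IndexError
```

Indexing with [0] is only safe after checking that the list is not empty, and .text has to be read from each element rather than from the list itself.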

PDHide

score 0 · Answer 2 · answered Dec 25 '20 at 10:32

If you want all the comments, you can do it like this:

comment_elements = driver.find_elements_by_xpath("//div[@class='comment__cmt_box_text___3bK3O comment__cmt_dk_komen___1Yzyg']")
comments = [comment.text for comment in comment_elements]

marco

score 0 · Accepted Answer · answered Dec 25 '20 at 22:42

The comments are inside an <iframe>, so to scrape them you have to:

Induce WebDriverWait for the desired frame to be available and switch to it.
Induce WebDriverWait with visibility_of_all_elements_located() for the comment elements to become visible.
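The two steps can be wrapped in one reusable helper. This is only a sketch: the function name collect_texts_in_frame and its parameters are hypothetical, and switching back to the top-level document afterwards is an addition that the steps above do not mention. Both locators are (By, value) tuples.

```python
def collect_texts_in_frame(driver, frame_locator, item_locator, timeout=20):
    """Switch into a frame, wait for the matching elements, and return their texts."""
    # Imported lazily so that defining this helper does not require selenium
    # to be installed; calling it does.
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    wait = WebDriverWait(driver, timeout)

    # Step 1: wait for the frame and move the driver's context into it.
    wait.until(EC.frame_to_be_available_and_switch_to_it(frame_locator))
    try:
        # Step 2: wait until every matching element is visible.
        elements = wait.until(EC.visibility_of_all_elements_located(item_locator))
        return [element.text for element in elements]
    finally:
        # Go back to the top-level document so later lookups are not
        # confined to the frame.
        driver.switch_to.default_content()
```

If either wait runs out of time, WebDriverWait raises a TimeoutException, which is easier to diagnose than an empty list from a lookup made in the wrong document.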

You can use either of the following Locator Strategies:

Using CSS_SELECTOR:

driver.get('https://finance.detik.com/berita-ekonomi-bisnis/d-5307853/ri-disebut-punya-risiko-korupsi-yang-tinggi?_ga=2.13736693.357978333.1608782559-293324864.1608782559')
# Wait for the comment iframe and switch into it
WebDriverWait(driver, 20).until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR, "iframe.xcomponent-component-frame.xcomponent-visible")))
# Wait until all comment boxes are visible, then print their text
comments = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div[class^='comment__cmt_'][style]")))
print([my_elem.text for my_elem in comments])

Using XPATH:

driver.get('https://finance.detik.com/berita-ekonomi-bisnis/d-5307853/ri-disebut-punya-risiko-korupsi-yang-tinggi?_ga=2.13736693.357978333.1608782559-293324864.1608782559')
# Wait for the comment iframe and switch into it
WebDriverWait(driver, 20).until(EC.frame_to_be_available_and_switch_to_it((By.XPATH, "//iframe[@class='xcomponent-component-frame xcomponent-visible']")))
# Wait until all comment boxes are visible, then print their text
comments = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[starts-with(@class, 'comment__cmt_')][@style]")))
print([my_elem.text for my_elem in comments])

Note: You have to add the following imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

Console Output:

['buzzer pada kmenaa..giliran muhammdiyah ampe 400an komen..dapseeee\nLaporkan\n0BalasBagikan:  ', 'selama korupsi tidak dihukum mati disanalah korupsi masih liar dan ada kalaupun dibuat hukum mati setidaknya bisa mengurangi angka korupsi itu\nLaporkan\n2BalasBagikan:  ', 'kalo terindikasi korupsi, lalu teriak saya pancasila, biar pd takut\nLaporkan\n0BalasBagikan:  ', '1. Hukuman fisik diperberat. Hukuman sosial diadakan.\nLaporkan\n0BalasBagikan:  ', 'Padahal fokus tegakan hukum dan berantas korupsi otomatis ekonomi terangkat. Hukum tegak ekonomi kuat. Bayangkan setingkat RT aja korupsi. Dan herannya koruptor serasa lebih dihormatin dari pelaku kejahatan lain.\nLaporkan\n0BalasBagikan:  ', 'Bikin UU cashless aja Bu. Transaksi cash maks 1jt. Jadi lebih enak ditracing\nLaporkan\n0BalasBagikan:  ', 'Hukum terlalu lemah, yang pernah korupsi malah masih menjabat pemerintahaan dan malah masih mencalonkan diri sebagai bupati atau walikota dan gubernur setelah melakukan korupsi.\nLaporkan\n0BalasBagikan:  ', 'system birokrasi yg lemah, seharusnya mulai mengandalkan teknologi kontrol online untuk mengurangi kesempatan pejabat yg korupsi\nLaporkan\n0BalasBagikan:  ', 'Bukan cuma resiko, emang udah kejadian kaleeee hahahhahahaha\nLaporkan\n0BalasBagikan:  ', 'ga heran jamannya new orba\nLaporkan\n1BalasBagikan:  ']
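The strings in this output still carry the interface labels of each comment box ("Laporkan", "Balas", "Bagikan:"). A minimal sketch for separating the comment text from that trailer, assuming every string has the shape shown above; the name split_comment is hypothetical, and treating the number before "Balas" as a per-comment counter is an assumption that the page itself does not confirm:

```python
import re

# Assumed shape of each raw string, taken from the console output:
# "<comment text>\nLaporkan\n<number>BalasBagikan:  "
_TRAILER = re.compile(r"\nLaporkan\n(\d+)BalasBagikan:\s*$")


def split_comment(raw):
    """Return (comment_text, counter); counter is None if the trailer is missing."""
    match = _TRAILER.search(raw)
    if match is None:
        return raw.strip(), None
    return raw[:match.start()].strip(), int(match.group(1))


raw_comments = [
    'ga heran jamannya new orba\nLaporkan\n1BalasBagikan:  ',
    'kalo terindikasi korupsi, lalu teriak saya pancasila, biar pd takut\nLaporkan\n0BalasBagikan:  ',
]
for text, counter in (split_comment(raw) for raw in raw_comments):
    print(counter, text)
```

Strings that do not end with the expected trailer are returned unchanged apart from surrounding whitespace, so a layout change on the site shows up as None counters rather than as lost text.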
