Below is some of the HTML I'm trying to scrape using Python and Selenium.

<h2 class ="page-title">  
    Strange Video Titles
    <span class="duration">28 min</span>  
    <span class="video-hd-mark">720p</span> 
</h2> 

Below is my code:

title=driver.find_element_by_class_name('page-title').text
print(title)

However, when I run this, it prints everything within the h2 tag, including the text in the span elements. I've tried adding [0] or [1] at the end to select only the first line of text, but that doesn't work. How can I print only the video title, which is located above the span elements?

Edit - I think this is the solution

So I've decided to do the following:

title = driver.find_element_by_class_name('page-title').text
duration = driver.find_element_by_xpath('/html/body/div/div[4]/h2/span[1]').text
vid_quality = driver.find_element_by_xpath('/html/body/div/div[4]/h2/span[2]').text

if duration in title:
    title = title.replace(duration, "")
if vid_quality in title:
    title = title.replace(vid_quality, "")
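
The replace-based approach above can be wrapped in a small helper. This is only a sketch: `strip_child_text` is a hypothetical name, and it assumes the span texts come after the title in the `.text` string, so it removes only the last occurrence of each one and then collapses leftover whitespace.

```python
def strip_child_text(full_text, child_texts):
    """Remove each child element's text from the parent's text, then tidy whitespace."""
    result = full_text
    for child in child_texts:
        if not child:
            continue  # rpartition raises ValueError on an empty separator
        # Drop only the LAST occurrence, so a title that happens to contain
        # "720p" or "28 min" itself is left alone.
        head, sep, tail = result.rpartition(child)
        if sep:
            result = head + tail
    # Collapse runs of whitespace left behind by the removals.
    return " ".join(result.split())


# The text a parent element might report, with the two span texts appended.
print(strip_child_text("Strange Video Titles 28 min 720p", ["28 min", "720p"]))
```

With Selenium, the arguments would be the `title`, `duration` and `vid_quality` strings read above.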

Thank you.

undetected Selenium
shorttriptomars

3 Answers

Use BeautifulSoup's .contents:

spam = """
<h2 class ="page-title">  
    Strange Video Titles
    <span class="duration">28 min</span>  
    <span class="video-hd-mark">720p</span> 
</h2>"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(spam, 'html.parser')
h2 = soup.find('h2')
print(h2.contents[0].strip())

# ALTERNATIVE -  remove the span tags
for span in h2.find_all('span'):
    span.decompose()
print(h2.text.strip())

Output:

Strange Video Titles
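
If BeautifulSoup is not installed, the same idea (keep only the text that sits directly inside the h2 and skip the nested spans) could be sketched with the standard library's `html.parser`. The names `DirectTextParser`, `direct_text`, `VOID_TAGS` and `SAMPLE` are made up for this illustration, and the parser assumes reasonably well-formed markup.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change the nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class DirectTextParser(HTMLParser):
    """Collect text directly inside the first <h2 class="page-title">."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # 0: outside the h2, 1: directly inside, >1: in a child tag
        self.done = False   # set once the target h2 has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done or tag in VOID_TAGS:
            return
        if self.depth == 0:
            classes = (dict(attrs).get("class") or "").split()
            if tag == "h2" and "page-title" in classes:
                self.depth = 1
        else:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.done or self.depth == 0 or tag in VOID_TAGS:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        # Keep only text whose parent is the h2 itself.
        if self.depth == 1 and not self.done:
            self.parts.append(data)


def direct_text(html):
    parser = DirectTextParser()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


SAMPLE = """
<h2 class ="page-title">
    Strange Video Titles
    <span class="duration">28 min</span>
    <span class="video-hd-mark">720p</span>
</h2>"""

print(direct_text(SAMPLE))
```

With Selenium, the HTML string could come from `driver.page_source`, for example `direct_text(driver.page_source)`.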
buran
  • Hi, thank you so much for taking the time to help. I'm using selenium, not bs4, so your solution wouldn't work. However, if you see the edit I made to my question, I changed how to find the span text. However, I'm still having a bit of difficulty. If it's not too much to ask, could you please take a look at the edit I made? Thank you. – shorttriptomars Nov 11 '20 at 17:56
  • Oh, I overlooked the `selenium` part – buran Nov 11 '20 at 17:59
Use WebDriverWait() with visibility_of_element_located() to wait for the h2 element, then use the JavaScript executor to read the textContent of its firstChild, which is the text node holding the title:

element=WebDriverWait(driver,10).until(EC.visibility_of_element_located((By.CSS_SELECTOR,"h2.page-title")))
# firstChild is the text node; strip() removes the surrounding whitespace and newlines
print(driver.execute_script('return arguments[0].firstChild.textContent;', element).strip())

You need the following imports:

from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
KunduK
To print only the video title, i.e. Strange Video Titles, note that it is a text node. Induce WebDriverWait for visibility_of_element_located() and use either of the following locator strategies:

  • Using XPATH, get_attribute() and splitlines():

    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//h2[@class='page-title']"))).get_attribute("innerHTML").splitlines()[1].strip())
    
  • Using CSS_SELECTOR, firstChild and strip():

    print(driver.execute_script('return arguments[0].firstChild.textContent;', WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "h2.page-title")))).strip())
    
  • Note: You have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
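
Reading `firstChild` works for the markup in the question, but it depends on the title being the very first node. As a hedged variation, the script below joins every direct text-node child instead. `JS_DIRECT_TEXT` and `own_text` are hypothetical names; the only Selenium call assumed is `driver.execute_script`, and `element` stands for the h2 WebElement returned by the wait above.

```python
# JavaScript run in the browser: keep only the element's direct text nodes.
JS_DIRECT_TEXT = """
return Array.from(arguments[0].childNodes)
    .filter(node => node.nodeType === Node.TEXT_NODE)
    .map(node => node.textContent)
    .join(' ');
"""


def own_text(driver, element):
    """Return the text directly inside `element`, ignoring text in child tags."""
    raw = driver.execute_script(JS_DIRECT_TEXT, element)
    # Collapse the indentation and newlines that surround the title in the HTML.
    return " ".join(raw.split())
```

It would be called as `print(own_text(driver, element))` once the h2 has been located.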
    

undetected Selenium