I would like to scrape the "Name" & "Address" from the following site:

https://register.fca.org.uk/s/firm?id=001b000000MfNWNAA3

However, I am struggling to reference the correct field within the page and return the results.

I need help with a working solution in which the query grabs the "name" from the webpage and outputs it.

Code:

import string
import pandas as pd
from lxml import html
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
from IPython.core.display import display, HTML

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

Example Reference:

driver = webdriver.Chrome(chrome_options = options, executable_path=r'C:\Downloads\chromedriver.exe')    
driver.get("https://register.fca.org.uk/s/firm?id=001b000000MfNWNAA3")
title = driver.find_elements(By.CSS_SELECTOR,'.slds-media__body h1 > a')
print(title.text)

Looking forward to your help!

Masond3
  • How are you struggling? What issues are you seeing? Are you getting an error? What does the program output? – Captain Jack Sparrow Feb 09 '23 at 15:10
  • @CaptainJackSparrow - I am struggling to reference the relevant field and to see the desired output. If I can get the query to work for one field (i.e. name), then I know how to proceed for the address. So I just need a working solution that returns the required output; I will then use the same concept for the other fields I need to scrape. – Masond3 Feb 09 '23 at 15:19
  • Use this selector for the name: `#profile-header > div.page-container.page-container_x-large_gutters.slds-m-bottom_small > div > div > div > div > div > div.slds-media.slds-media_medium > div.slds-media__body > div > h1`. FYI, if the page structure changes, your code will no longer work. – Captain Jack Sparrow Feb 09 '23 at 15:24

3 Answers


Use WebDriverWait and wait for visibility_of_element_located before reading the text.

driver.get("https://register.fca.org.uk/s/firm?id=001b000000MfNWNAA3")
name = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".slds-media__body h1"))).text
print(name)
address = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "h4[data-aura-rendered-by] ~p:nth-of-type(1)"))).text
print(address)

You need to import the following libraries:

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
KunduK
  • Thank you very much for responding so quickly. On the website, the address is broken into different lines (this is also evident when inspecting the page, as there are `<br>` elements). Is there any way to break this down into individual print statements? – Masond3 Feb 09 '23 at 15:44
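One way to address the follow-up comment is to split the returned text on line breaks. The sketch below assumes that the `.text` value of the address element contains newline characters where the page breaks the address into lines; the helper name `split_address_lines` is made up for illustration, and the sample string is the address shown elsewhere in this thread.

```python
def split_address_lines(address_text):
    """Return the non-empty lines of a multi-line address, stripped of padding."""
    return [line.strip() for line in address_text.splitlines() if line.strip()]


# Sample text standing in for the value returned by `.text` (an assumption).
sample = "Unity Building\n20 Chapel Street\nLiverpool\nMerseyside\nL3 9AG"

# Print each part of the address with its own print call.
for part in split_address_lines(sample):
    print(part)
```

With a live page, the `address` string from the answer above would be passed in place of `sample`.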

To extract the Name and Address, you ideally need to induce WebDriverWait for visibility_of_element_located(), and you can use the following locator strategies:

  • Using Name:

    driver.get('https://register.fca.org.uk/s/firm?id=001b000000MfQU0AAN')
    print(WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.XPATH, "//h1"))).text)
    
  • Using Address:

    driver.get('https://register.fca.org.uk/s/firm?id=001b000000MfQU0AAN')
    print(WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.XPATH, "//h4[.//div[contains(., 'Address')]]//following-sibling::p[1]"))).text)
    
  • Console Output:

    Mason Owen and Partners Ltd
    Unity Building
    20 Chapel Street
    Liverpool
    Merseyside
    L3 9AG
    L 3 9 A G
    
  • Note: You have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    

You can find a relevant discussion in How to retrieve the text of a WebElement using Selenium - Python
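The console output above repeats the postcode as both "L3 9AG" and "L 3 9 A G". A minimal sketch for keeping only the first form, assuming the duplicate always differs from the original purely in whitespace; the function name `drop_spaced_duplicates` is hypothetical and not part of Selenium:

```python
def drop_spaced_duplicates(lines):
    """Keep each line once, treating 'L3 9AG' and 'L 3 9 A G' as the same value."""
    seen = set()
    unique = []
    for line in lines:
        # Compare lines with all whitespace removed and case ignored.
        key = "".join(line.split()).upper()
        if key and key not in seen:
            seen.add(key)
            unique.append(line)
    return unique


# Address text as it appears in the console output above.
raw = "Unity Building\n20 Chapel Street\nLiverpool\nMerseyside\nL3 9AG\nL 3 9 A G"
print("\n".join(drop_spaced_duplicates(raw.splitlines())))
```

Because the first occurrence wins, the normally spaced postcode is kept and the letter-by-letter copy is dropped. Note that this would also drop a genuinely repeated address line.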

undetected Selenium
  • @undetected Selenium - thanks for providing that hyperlink; it gives me some additional context. A question for you: both your solution and KunduK's return "L3 9AG" and "L 3 9 A G" (duplicate values). Is there a way to return only the first entry? – Masond3 Feb 14 '23 at 13:25

In addition to using WebDriverWait and visibility_of_element_located like others are suggesting, it's sometimes necessary to scroll an item into view.

This is a little function to make it more convenient to execute the JavaScript that does it:

def scrollto(element):
    # Scroll the element into the viewport; relies on a module-level `driver`.
    driver.execute_script("return arguments[0].scrollIntoView(true);", element)
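
A variant sketch, assuming you would rather pass the driver in explicitly than rely on a global variable. The name `scroll_to` and the choice to return the element are illustrative, not part of the function above; the only Selenium call used is `execute_script`.

```python
def scroll_to(driver, element):
    """Scroll `element` into view and return it so calls can be chained."""
    # `driver` is any object with Selenium's execute_script(script, *args) method.
    driver.execute_script("arguments[0].scrollIntoView(true);", element)
    return element
```

Returning the element allows a one-liner such as `scroll_to(driver, element).text` once the element has been located with WebDriverWait.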