
This website https://findmasa.com/city/los-angeles/ lists many murals. I want to use Python to extract information from the subpages that open when clicking the address button, such as https://findmasa.com/view/map#b1cc410b. The information I want includes mural id, artist, address, city, latitude, longitude, and link.

When I run the code below, it works for the first four subpages but stops at the fifth, https://findmasa.com/view/map#1456a64a, with the error selenium.common.exceptions.InvalidSelectorException: Message: invalid selector: An invalid or illegal selector was specified (Session info: chrome=114.0.5735.199). Can anyone help me identify the problem and provide a solution? Thank you.

from requests_html import HTMLSession
import warnings
import csv

from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
import selenium.webdriver.support.expected_conditions as EC


warnings.filterwarnings("ignore", category=DeprecationWarning) ## ignore the Deprecation warning message

s = HTMLSession()

## define a function to get mural links from different categories
def get_mural_links(page):
    url = f'https://findmasa.com/city/los-angeles/{page}'
    links = []
    r = s.get(url)
    artworks = r.html.find('ul.list-works-cards div.top p')
    for item in artworks:
        links.append(item.find('a', first=True).attrs['href'])
    return links


## define a function to get interested info from a list of links
def parse_mural(url):

    ## get mural id
    spl = '#'
    id = url.partition(spl)[2]

    ## create a Chrome driver instance
    driver = Chrome()
    driver.get(url)

    # wait for the li element to be present on the page
    li_element = WebDriverWait(driver, 60).until(EC.presence_of_element_located((By.CSS_SELECTOR, f'li#{id}')))

    data_lat = li_element.get_attribute('data-lat')
    data_lng = li_element.get_attribute('data-lng')
    city = li_element.find_elements(By.TAG_NAME, 'p')[2].text
    link = url

    try:
        artist = li_element.find_element(By.TAG_NAME, 'a').text
    except:
        artist = 'No Data'

    try:
        address = li_element.find_elements(By.TAG_NAME, 'p')[1].text
    except:
        address = 'No Data'

    info = {
        'ID': id,
        'ARTIST': artist,
        'LOCATION': address,
        'CITY': city,
        'LATITUDE': data_lat,
        'LONGITUDE': data_lng,
        'LINK': link,
    }
    return info


## define a function to save the results to a csv file
def save_csv(results):
    keys = results[0].keys()

    with open('LAmural_MASA.csv', 'w', newline='') as f: ## newline='' helps remove the blank rows in b/t each mural
        dict_writer = csv.DictWriter(f, keys)
        dict_writer.writeheader()
        dict_writer.writerows(results)

## define the main function for this file to export results
def main():
    results = []
    for x in range(1, 3):
        urls = get_mural_links(x)
        for url in urls:
            results.append(parse_mural(url))
            save_csv(results)


## only run main() when this file is executed directly, not when imported
if __name__ == '__main__':
    main()
Jessie H
1 Answer

As I've answered here,

To fix the InvalidSelectorException that you're getting for some urls, or rather for some id values, use the notation li[id="id_value"] instead of li#id_value. A CSS ID selector (#...) must start with a valid CSS identifier, which can't begin with an unescaped digit, so li#1456a64a is an invalid selector while the attribute form li[id="1456a64a"] always parses.

Use this:

li_element = WebDriverWait(driver, 60).until(EC.presence_of_element_located((By.CSS_SELECTOR, f'li[id="{id}"]')))

Instead of:

li_element = WebDriverWait(driver, 60).until(EC.presence_of_element_located((By.CSS_SELECTOR, f'li#{id}')))
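You can see the difference without a browser: only the attribute form is safe for arbitrary id values. A small helper sketches it (the function name css_id_selector is my own, not part of selenium):

```python
def css_id_selector(element_id, tag='li'):
    """Build a CSS selector that works for any id value.

    The shorthand f'{tag}#{element_id}' is invalid when the id starts
    with a digit (e.g. '1456a64a'); the attribute form always parses.
    """
    return f'{tag}[id="{element_id}"]'

# The failing id from the question:
print(css_id_selector('1456a64a'))  # li[id="1456a64a"]
```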
Ajeet Verma
  • Your answer is very helpful. Thank you, Ajeet! Another problem is some websites don't include the elements (such as this - https://findmasa.com/view/map#e6a65dc6) I was looking for. How should I skip these websites, return a message saying 'no matching element was found', and continue running the project for the rest? – Jessie H Jul 20 '23 at 21:32
  • Also, is it possible to output urls under LINK as actual links that can be opened directly from the csv file? Currently, they are just strings/text. – Jessie H Jul 20 '23 at 22:20
  • to your 1st question, you can skip such websites by using `try-except`. Just put the relevant block of codes inside `try` and use `TimeoutException` to handle it since in those cases, it'll Throw `TimeoutException`. – Ajeet Verma Jul 21 '23 at 01:26
  • and yes, of course, you can easily save the urls in a CSV file and later read them to open directly from the CSV file – Ajeet Verma Jul 21 '23 at 01:29
  • Can you please explain how and where I should use try-except? I tried to revise the code but got the error message and couldn't ensure I will get the correct result... – Jessie H Jul 21 '23 at 03:47
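To the last comment: a minimal sketch of where the try/except goes. Wrap each parse_mural call so that one page missing the element doesn't abort the whole run. The helper below (collect_murals is my own name) takes the exception type as a parameter so the pattern runs without a browser; in the real script skip_on would be (TimeoutException,) from selenium.common.exceptions, which WebDriverWait raises when the li element never appears within the timeout:

```python
def collect_murals(urls, parse, skip_on=(Exception,)):
    """Parse each url, skipping pages where parse raises one of skip_on.

    In the asker's script, parse would be parse_mural and skip_on would
    be selenium's (TimeoutException,); it is parameterized here so the
    sketch is testable without selenium.
    """
    results, skipped = [], []
    for url in urls:
        try:
            results.append(parse(url))
        except skip_on:
            print(f'no matching element was found: {url}')
            skipped.append(url)
    return results, skipped
```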
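As for making the LINK column clickable: a CSV file can only hold text, but spreadsheet apps such as Excel and LibreOffice Calc evaluate a cell containing =HYPERLINK("...") as a clickable formula. One way to get openable links, then, is to wrap each url before writing (save_csv_with_links is an illustrative variant of the asker's save_csv, not a library function):

```python
import csv

def hyperlink_cell(url):
    # Excel/LibreOffice evaluate this as a clickable HYPERLINK formula;
    # any other CSV consumer will just see the literal text.
    return f'=HYPERLINK("{url}")'

def save_csv_with_links(results, path='LAmural_MASA.csv'):
    # Copy each row, replacing the LINK value with the formula form.
    rows = [dict(r, LINK=hyperlink_cell(r['LINK'])) for r in results]
    with open(path, 'w', newline='') as f:
        writer = csv.DictWriter(f, rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)
```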