
I am trying to web-scrape Airbnb with Selenium. However, it has been a seemingly impossible mission.

First, I create a driver, where the argument "executable_path" points to where my chromedriver is installed.

driver = webdriver.Chrome(executable_path=r'C:\directory\directory\directory\chromedriver.exe')

Secondly, I do the other steps:

driver.get('https://www.airbnb.com.br/')
a = driver.find_element(By.CLASS_NAME, "cqtsvk7 dir dir-ltr")
a.click()
a.send_keys('Poland')

Here I received the error: NoSuchWindowException: Message: no such window: target window already closed from unknown error: web view not found

Moreover, when I create the variables to store the html elements, it doesn't work either:

title = driver.find_elements(By.CLASS_NAME, 'a-size-base-plus a-color-base a-text-normal')
place = driver.find_elements(By.ID, 'title_49247685') 
city = driver.find_elements(By.CLASS_NAME, 'f15liw5s s1cjsi4j dir dir-ltr')
price = driver.find_elements(By.CLASS_NAME, 'p11pu8yw dir dir-ltr')

Please, could someone help me? How can I get the place, city, and price of all the results of my travel-destination query on Airbnb? (I know how to store everything in a pandas df; my problem is the use of Selenium. Those "find_elements" calls don't seem to work properly on Airbnb.)

  • Once you do `driver.get('https://www.airbnb.com.br/')` what is the next step you are trying to automate? Where are you trying to invoke `send_keys('Poland')`? – undetected Selenium Jan 18 '23 at 18:45
  • I want that selenium search for me the country where I want to extract the data (Poland, in this case). After that, I want Selenium to go through the result pages and collect the name of the city, the location, and the price of the houses. – Bruhlickd Jan 18 '23 at 18:49
  • _want that selenium search for me the country where I want to extract the data (Poland, in this case)_: What are the manual steps for that? – undetected Selenium Jan 18 '23 at 18:53
  • Have you tried the code above? It gives me an error message, probably because the html element is wrong. I tried a lot of html elements, and even so it doesn't work. Looks like something about airbnb itself. I don't know; because of that I posted the question. – Bruhlickd Jan 18 '23 at 18:57

1 Answer


I received the error: NoSuchWindowException: Message: no such window: target window already closed from unknown error: web view not found

Which line is raising this error? I don't see anything in your snippets that could be causing it, but is there anything in your code [before the included snippet], or some external factor that could be causing the automated window to get closed? You could see if any of the answers to this helps you with the issue, especially if you're using .switch_to.window anywhere in your code.
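
As a quick sanity check (a minimal sketch; it assumes driver is the instance created in your first snippet), you can list the window handles Selenium still knows about and re-attach to one before locating elements:

# minimal sketch: confirm a window is still open before locating elements
handles = driver.window_handles          # handles of the windows still open
if handles:
    driver.switch_to.window(handles[0])  # re-attach to the first open window
    print('attached to', driver.current_url)
else:
    print('no open windows - the browser was closed somewhere else')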


Searching

(You should include screenshots or better descriptions of the fields you are targeting, especially when the issue is that you're having difficulty targeting them.)

Secondly, I do the other stuffs:

driver.get('https://www.airbnb.com.br/')
a = driver.find_element(By.CLASS_NAME, "cqtsvk7 dir dir-ltr")

want that selenium search for me the country where I want to extract the data (Poland, in this case)

If you mean that you're trying to enter "Poland" into this input field, then the class cqtsvk7 in cqtsvk7 dir dir-ltr appears to change. The id attribute might be more reliable; but also, it seems like you need to click on the search area to make the input interactable; and after entering "Poland" you also have to click on the search icon and wait to load the results.

# from selenium.webdriver.common.by import By
# from selenium.webdriver.support.ui import WebDriverWait
# from selenium.webdriver.support import expected_conditions as EC

def search_airbnb(search_for, browsr, wait_timeout=5):
    wait_til = WebDriverWait(browsr, wait_timeout).until
    browsr.get('https://www.airbnb.com.br/')

    wait_til(EC.element_to_be_clickable(
        (By.CSS_SELECTOR, 'div[data-testid="little-search"]')))
    search_area = browsr.find_element(
        By.CSS_SELECTOR, 'div[data-testid="little-search"]')
    search_area.click()
    print('CLICKED search_area')

    wait_til(EC.visibility_of_all_elements_located(
        (By.ID, "bigsearch-query-location-input")))
    a = browsr.find_element(By.ID, "bigsearch-query-location-input")
    a.send_keys(search_for)
    print(f'ENTERED "{search_for}"')

    wait_til(EC.element_to_be_clickable((By.CSS_SELECTOR, 
        'button[data-testid="structured-search-input-search-button"]')))
    search_btn = browsr.find_element(By.CSS_SELECTOR, 
        'button[data-testid="structured-search-input-search-button"]')
    search_btn.click()
    print('CLICKED search_btn')


searchFor = 'Poland'
search_airbnb(searchFor, driver)  # , 15) # adjust wait_timeout if necessary

Notice that for the clicked elements, I used By.CSS_SELECTOR; if unfamiliar with CSS selectors, you can consult this reference. You can also use By.XPATH in these cases; this XPath cheatsheet might help then.
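
For example, the search button located with By.CSS_SELECTOR above could equally be located with an equivalent XPath (just a sketch showing the two styles side by side):

# the same search button, located two ways
btn_css = driver.find_element(
    By.CSS_SELECTOR, 'button[data-testid="structured-search-input-search-button"]')
btn_xpath = driver.find_element(
    By.XPATH, '//button[@data-testid="structured-search-input-search-button"]')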


Scraping Results

How can I get the place, city and price of all of my query of place to travel in airbnb?

Again, you can use CSS selectors [or XPaths] as they're quite versatile. If you use a function like

def select_get(elem, sel='', tAttr='innerText', defaultVal=None, isv=False):
    # returns the tAttr attribute of the first descendant matching sel
    # (or of elem itself if sel is empty); returns defaultVal if anything is missing
    try:
        el = elem.find_element(By.CSS_SELECTOR, sel) if sel else elem
        rVal = el.get_attribute(tAttr)
        if isinstance(rVal, str): rVal = rVal.strip()
        return defaultVal if rVal is None else rVal
    except Exception as e:
        if isv: print(f'failed to get "{tAttr}" from "{sel}"\n', type(e), e)
        return defaultVal

then even if a certain element or attribute is missing in any of the cards, it'll just fill in with defaultVal and all the other cards will still be scraped instead of raising an error and crashing the whole program.
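
While you are still working out the selectors, you can pass isv=True so a failed lookup prints the reason instead of silently returning defaultVal (a small sketch using the result-card and rating selectors from below):

# sketch: inspect one result card with verbose failures
cards = driver.find_elements(By.CSS_SELECTOR, 'div[itemprop="itemListElement"]')
if cards:
    rating = select_get(cards[0], 'div[id^="title_"]~span[aria-label]',
                        'aria-label', defaultVal='no rating', isv=True)
    print('rating of first card:', rating)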

You can get a list of dictionaries in listings by looping through the result cards with a list comprehension like

listings = [{ 
        'name': select_get(el, 'meta[itemprop="name"]', 'content'),  # SAME TEXT AS title_sub
        # 'title_sub': select_get(el, 'div[id^="title_"]+div+div>span'),
        'city_title': select_get(el, 'div[id^="title_"]'),
        'beds': select_get(el, 'div[id^="title_"]+div+div+div>span'),
        'dates': select_get(el, 'div[id^="title_"]+div+div+div+div>span'),
        'price': select_get(el, 'div[id^="title_"]+div+div+div+div+div div+span'),
        'rating': select_get(el, 'div[id^="title_"]~span[aria-label]', 'aria-label')
        # 'url': select_get(el, 'meta[itemprop="url"]', 'content', defaultVal='').split('?')[0],
    } for el in driver.find_elements(
        By.CSS_SELECTOR, 'div[itemprop="itemListElement"]' ## RESULT CARD SELECTOR
    )]
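
Since you mentioned you already know how to store everything in a pandas DataFrame, note that this list of dictionaries can go straight into one (a minimal sketch, assuming pandas is imported as pd):

# import pandas as pd
df = pd.DataFrame(listings)                # one row per result card
print(df[['city_title', 'price']].head())  # quick sanity check of a few fields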

Dealing with Pagination

If you wanted to scrape from multiple pages, you can loop through them. [You can also use while True (instead of a for loop as below) for unlimited pages, but I feel like it's safer like this, even if you set an absurdly high limit like maxPages=5000 or something; either way, it should break out of the loop once it reaches the last page.]

maxPages = 50 # adjust as preferred
wait = WebDriverWait(driver, 3) # adjust timeout as necessary

listings, addedIds = [], []
isFirstPage = True
for pgi in range(maxPages):
    prevLen = len(listings) # just for printing progress

    ## wait to load all the cards ##
    try:
        wait.until(EC.visibility_of_all_elements_located(
            (By.CSS_SELECTOR, 'div[itemprop="itemListElement"]')))
    except Exception as e:
        print(f'[{pgi}] Failed to load listings', type(e), e)
        continue # losing one loop for additional wait time

    ## check current page number according to driver ##
    try:
        pgNum = driver.find_element(
            By.CSS_SELECTOR, 'button[aria-current="page"]'
        ).text.strip() if not isFirstPage else '1'
    except Exception as e:
        print('Failed to find pgNum', type(e), e)
        pgNum = f'?{pgi+1}?'

    ## collect listings ##
    pgListings = [{
        'listing_id': select_get(
            el, 'div[role="group"]>a[target^="listing_"]', 'target',
            defaultVal='').replace('listing_', '', 1).strip(),
        # 'position': 'pg_' + str(pgNum) + '-pos_' + select_get(
            # el, 'meta[itemprop="position"]', 'content', defaultVal=''),
        'name': select_get(el, 'meta[itemprop="name"]', 'content'),
        #####################################################  
        ### INCLUDE ALL THE key-value pairs THAT YOU WANT ###
        #####################################################  
    } for el in driver.find_elements(
        By.CSS_SELECTOR, 'div[itemprop="itemListElement"]'
    )]

    ## [ only checks for duplicates against listings from previous pages ] ##
    listings += [pgl for pgl in pgListings if pgl['listing_id'] not in addedIds]
    addedIds += [l['listing_id'] for l in pgListings]

    ### [OR] check for duplicates within the same page as well ###
    ## for pgl in pgListings:
    ##     if pgl['listing_id'] not in addedIds:
    ##         listings.append(pgl)
    ##     addedIds.append(pgl['listing_id'])

    print(f'[{pgi}] extracted', len(listings)-prevLen, 
          f'listings [of {len(pgListings)} total] from page', pgNum)

    ## go to next page ##
    nxtPg = driver.find_elements(By.CSS_SELECTOR, 'a[aria-label="Próximo"]')
    if not nxtPg:
        print(f'No more next page [{len(listings)} listings so far]\n')
        break ### [OR] START AGAIN FROM page1 WITH:
        ## try: _, isFirstPage = search_airbnb(searchFor, driver), True
        ## except Exception as e: print('Failed to search again', type(e), e)
        ## continue
        ### because airbnb doesn't show all results even across all pages,
        ### so you can get a few more every re-scrape [but not many - less than 5 per page]
    try: _, isFirstPage = nxtPg[0].click(), False
    except Exception as e: print('Failed to click next', type(e), e)

dMsg = f'[reduced from {len(addedIds)} after removing duplicates]'
print('extracted', len(listings), 'listings', dMsg)

[listing_id seems to be the easiest way to ensure that only unique listings are collected. You can also form a link to that listing like f'https://www.airbnb.com.br/rooms/{listing_id}'.]
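
For instance (a small sketch reusing the listing_id field collected above):

# build a direct link to each listing from its id
for l in listings:
    l['url'] = f"https://www.airbnb.com.br/rooms/{l['listing_id']}"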


Combining with Old Data [Load & Save]

If you want to save to CSV and also load previous results from the same file, with old and new data combined without duplicates, you can do something like

# import pandas as pd
# import os

fileName = 'pol_airbnb.csv'
maxPages = 50

try:
    listings = pd.read_csv(fileName).to_dict('records')
    addedIds = [str(l['listing_id']).strip() for l in listings]
    print(f'loaded {len(listings)} previously extracted listings')
except Exception as e:
    print('failed to load previous data', type(e), e)
    listings, addedIds = [], []

#################################################
# for pgi... ## LOOP THROUGH PAGES AS ABOVE #####
#################################################
dMsg = f'[reduced from {len(addedIds)} after removing duplicates]'
print('extracted', len(listings), 'listings', dMsg)

pd.DataFrame(listings).set_index('listing_id').to_csv(fileName)
print('saved to', os.path.abspath(fileName))

Note that keeping the old data might mean that some of the listings are no longer available.

View pol_airbnb.csv for my results with maxPages=999, searching again instead of break-ing in the if not nxtPg branch.
