
I'm trying to get all the events, plus additional metadata for those events, from this web page: https://alando-palais.de/events

My problem is that the resulting HTML doesn't contain the information I'm looking for. I guess it is loaded afterwards by some PHP script behind this URL: 'https://alando-palais.de/wp/wp-admin/admin-ajax.php'

Any idea how to wait until the page is completely loaded, or which method I have to use to get the event information?

This is my script right now :-) :

from bs4 import BeautifulSoup
import requests

if __name__ == '__main__':
    target_url = 'https://alando-palais.de/events'
    #target_url = 'https://alando-palais.de/wp/wp-admin/admin-ajax.php'

    # Fetch the page and parse it
    soup = BeautifulSoup(requests.get(target_url).text, 'html.parser')
    print(soup)

    # Print every link found in the (static) HTML
    links = soup.find_all('a', href=True)
    for x, link in enumerate(links):
        print(x, link['href'])

The expected output would be something like this (an excerpt from the fully rendered page):

<div class="vc_gitem-zone vc_gitem-zone-b vc_custom_1547045488900 originalbild vc-gitem-zone-height-mode-auto vc_gitem-is-link" style="background-image: url(https://alando-palais.de/wp/wp-content/uploads/2019/02/0803_MaiwaiFriends-500x281.jpg) !important;">
    <a href="https://alando-palais.de/event/penthouse-club-special-maiwai-friends" title="Penthouse Club Special: Maiwai &#038; Friends" class="vc_gitem-link vc-zone-link"></a>    <img src="https://alando-palais.de/wp/wp-content/uploads/2019/02/0803_MaiwaiFriends-500x281.jpg" class="vc_gitem-zone-img" alt="">  <div class="vc_gitem-zone-mini">
        <div class="vc_gitem_row vc_row vc_gitem-row-position-top"><div class="vc_col-sm-6 vc_gitem-col vc_gitem-col-align-left">   <div class="vc_gitem-post-meta-field-Datum eventdatum vc_gitem-align-left"> 08.03.2019
    </div>
Xenobiologist
  • what's the expected output? One example perhaps as I'm unsure what you mean by metadata. – QHarr Mar 08 '19 at 20:55
  • I've added some results, I'd like to extract. – Xenobiologist Mar 08 '19 at 21:01
  • if php requests are fired off by javascript, the results will not be available when the base page is loaded, you would have to render it to have the data calls made.. maybe use selenium to render the results and then get the final page out when it is done. – danchik Mar 08 '19 at 21:13
  • Yeah, I thought about using selenium or pyqt to emulate a "real" browser. Could you provide a few lines to get me started with selenium? – Xenobiologist Mar 08 '19 at 21:15
  • this has some info https://stackoverflow.com/questions/29404856/how-can-i-render-javascript-html-to-html-in-python – danchik Mar 08 '19 at 21:15

2 Answers


You could mimic the XHR POST made by the page:

from bs4 import BeautifulSoup
import requests
import pandas as pd

url = 'https://alando-palais.de/wp/wp-admin/admin-ajax.php'

# Form data taken from the XHR request the page fires
data = {
    'action': 'vc_get_vc_grid_data',
    'vc_action': 'vc_get_vc_grid_data',
    'tag': 'vc_basic_grid',
    'data[visible_pages]': 5,
    'data[page_id]': 30,
    'data[style]': 'all',
    'data[action]': 'vc_get_vc_grid_data',
    'data[shortcode_id]': '1551112413477-5fbaaae1-0622-2',
    'data[tag]': 'vc_basic_grid',
    'vc_post_id': '30',
    '_vcnonce': 'cc8cc954a4'  # note: this nonce is page-specific and can expire
}

res = requests.post(url, data=data)
soup = BeautifulSoup(res.content, 'lxml')
dates = [item.text.strip() for item in soup.select('.vc_gitem-zone[style*="https://alando-palais.de"]')]
textInfo = soup.select('.vc_gitem-link')[::2]
imageLinks = [item['src'].strip() for item in soup.select('img')]
titles = [item['title'] for item in textInfo]
links = [item['href'] for item in textInfo]
results = pd.DataFrame(list(zip(titles, dates, links, imageLinks)), columns=['title', 'date', 'link', 'imageLink'])
print(results)
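The `_vcnonce` above is page-specific and can expire (which would make the POST return nothing). A minimal sketch for scraping a fresh nonce from the events page before posting; the `data-vc-public-nonce` attribute name is an assumption, so verify it against the actual page source:

```python
import re

def extract_nonce(html):
    """Pull a fresh _vcnonce out of the page markup.

    Assumes the nonce is embedded in a 'data-vc-public-nonce' attribute;
    that attribute name is a guess, check the real page source.
    """
    match = re.search(r'data-vc-public-nonce="([0-9a-f]+)"', html)
    return match.group(1) if match else None

# Usage sketch with the POST above (network calls omitted here):
#   with requests.Session() as s:
#       page = s.get('https://alando-palais.de/events').text
#       data['_vcnonce'] = extract_nonce(page)
#       res = s.post(url, data=data)

print(extract_nonce('<div data-vc-public-nonce="cc8cc954a4"></div>'))  # -> cc8cc954a4
```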

Or with selenium:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

url = 'https://alando-palais.de/events#'
driver = webdriver.Chrome()
driver.get(url)

# Wait until the AJAX-loaded grid items are present before reading them
dates = [item.text.strip() for item in WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".vc_gitem-zone[style*='https://alando-palais.de']"))) if len(item.text)]
textInfo = driver.find_elements(By.CSS_SELECTOR, '.vc_gitem-link')[::2]
textInfo = textInfo[:len(textInfo) // 2]
imageLinks = [item.get_attribute('src').strip() for item in driver.find_elements(By.CSS_SELECTOR, 'a + img')][::2]
titles = []
links = []

for item in textInfo:
    titles.append(item.get_attribute('title'))
    links.append(item.get_attribute('href'))
results = pd.DataFrame(list(zip(titles, dates, links, imageLinks)), columns=['title', 'date', 'link', 'imageLink'])

print(results)

driver.quit()
QHarr
  • Thanks a lot. The first script returns: Empty DataFrame Columns: [title, date, link, imageLink] Index: [] The other run into errors. I guess, I have to setup selenium first. – Xenobiologist Mar 09 '19 at 12:31
  • Thanks. I did the setup and changed the code above to use Firefox in lieu of Chrome. The script runs and the result looks like this: 0 Da wo der Pfeffi wächst ... https://alando-palais.de/wp/wp-content/uploads... 1 Vodka Vriday ... https://alando-palais.de/wp/wp-content/uploads... 2 Über 40 Party ... https://alando-palais.de/wp/wp-content/uploads... I have to check how that works, why the dates are "..." and so on. Thanks a lot. The script is a great start. – Xenobiologist Mar 11 '19 at 09:30
  • I will have a look. Are you saying dates are all empty? – QHarr Mar 11 '19 at 09:55
  • The selenium script returns: title ... imageLink 0 Da wo der Pfeffi wächst ... https://alando-palais.de/wp/wp-content/uploads... 1 Vodka Vriday ... https://alando-palais.de/wp/wp-content/uploads... 2 ... Über 40 Party ... https://alando-palais.de/wp/wp- 9 Uni Royal ... https://alando-palais.de/wp/wp-content/uploads... [10 rows x 4 columns] – Xenobiologist Mar 11 '19 at 10:13
  • That looks quite good. The BS4 script (first one of your post) doesn't return results – Xenobiologist Mar 11 '19 at 10:16
  • I am guessing an access key has expired which would mean trying to use session and perhaps an additional get to try and obtain it – QHarr Mar 11 '19 at 10:19

I'd rather recommend Selenium, to get around the fact that the content is rendered server-side on demand.


from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("https://alando-palais.de/events")
elems = driver.find_elements(By.XPATH, "//a[@href]")
for elem in elems:
    print(elem.get_attribute("href"))
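Since that prints every link on the page, a small filter keeps only the individual event pages; the `/event/` path segment is taken from the expected output in the question:

```python
def filter_event_links(hrefs):
    """Keep only links that point at individual event pages."""
    return [h for h in hrefs if "/event/" in h]

sample = [
    "https://alando-palais.de/events",
    "https://alando-palais.de/event/penthouse-club-special-maiwai-friends",
    "https://alando-palais.de/wp/wp-content/uploads/2019/02/0803_MaiwaiFriends-500x281.jpg",
]
print(filter_event_links(sample))
# -> ['https://alando-palais.de/event/penthouse-club-special-maiwai-friends']
```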
Pablo Martinez