How to parse information with python from a webpage that uses php and javascript

Question

I'm trying to get all the events and additional metadata to those events from this webpage : https://alando-palais.de/events

My problem is, that the result(html) doesn't contain the information I'm looking for. I guess, they are "hidden" behind some php script. This url: 'https://alando-palais.de/wp/wp-admin/admin-ajax.php'

Any idea, on how to wait until the page is completely loaded, or what kind of method do I have to use, to get the event information?

This is my script right now :-) :

from bs4 import BeautifulSoup
from urllib.request import urlopen, urljoin
from urllib.parse import urlparse
import re
import requests

if __name__ == '__main__':
    target_url = 'https://alando-palais.de/events'
    #target_url = 'https://alando-palais.de/wp/wp-admin/admin-ajax.php'

    soup = BeautifulSoup(requests.get(target_url).text, 'html.parser')
    print(soup)

    links = soup.find_all('a', href=True)
    for x,link in enumerate(links):
        print(x, link['href'])


#    for image in images:
#        print(urljoin(target_url, image))

Expected output would be something like:

Date: 08.03.2019
Title: Penthouse Club Special: Maiwai & Friends
img: https://alando-palais.de/wp/wp-content/uploads/2019/02/0803_MaiwaiFriends-500x281.jpg"

That's something out of this result:

<div class="vc_gitem-zone vc_gitem-zone-b vc_custom_1547045488900 originalbild vc-gitem-zone-height-mode-auto vc_gitem-is-link" style="background-image: url(https://alando-palais.de/wp/wp-content/uploads/2019/02/0803_MaiwaiFriends-500x281.jpg) !important;">
    <a href="https://alando-palais.de/event/penthouse-club-special-maiwai-friends" title="Penthouse Club Special: Maiwai &#038; Friends" class="vc_gitem-link vc-zone-link"></a>    <img src="https://alando-palais.de/wp/wp-content/uploads/2019/02/0803_MaiwaiFriends-500x281.jpg" class="vc_gitem-zone-img" alt="">  <div class="vc_gitem-zone-mini">
        <div class="vc_gitem_row vc_row vc_gitem-row-position-top"><div class="vc_col-sm-6 vc_gitem-col vc_gitem-col-align-left">   <div class="vc_gitem-post-meta-field-Datum eventdatum vc_gitem-align-left"> 08.03.2019
    </div>

what's the expected output? One example perhaps as I'm unsure what you mean by metadata. — QHarr, Mar 08 '19 at 20:55
if php requests are fired off by javascript, the results will not be available when the base page is loaded, you would have to render it to have the data calls made.. maybe use selenium to render the results and then get the final page out when it is done. — danchik, Mar 08 '19 at 21:13
Yeah, I thought about using selenium or pyqt to emulate a "real" browser. Could you provide a few lines to get me started with selenium? — Xenobiologist, Mar 08 '19 at 21:15
this has some info https://stackoverflow.com/questions/29404856/how-can-i-render-javascript-html-to-html-in-python — danchik, Mar 08 '19 at 21:15

QHarr · Accepted Answer · 2019-03-09T06:06:30.580

You could mimic the xhr post made by the page

from bs4 import BeautifulSoup
import requests
import pandas as pd

url = 'https://alando-palais.de/wp/wp-admin/admin-ajax.php'

data = {

  'action': 'vc_get_vc_grid_data',
  'vc_action': 'vc_get_vc_grid_data',
  'tag': 'vc_basic_grid',
  'data[visible_pages]' : 5,
  'data[page_id]' : 30,
  'data[style]' : 'all',
  'data[action]' : 'vc_get_vc_grid_data',
  'data[shortcode_id]' : '1551112413477-5fbaaae1-0622-2',
  'data[tag]' : 'vc_basic_grid',
  'vc_post_id' : '30',
  '_vcnonce' : 'cc8cc954a4'  

}

res = requests.post(url, data = data)
soup = BeautifulSoup(res.content, 'lxml')
dates = [item.text.strip() for item in soup.select('.vc_gitem-zone[style*="https://alando-palais.de"]')]
textInfo = [item for item in soup.select('.vc_gitem-link')][::2]
imageLinks = [item['src'].strip() for item in soup.select('img')]
titles = []
links = []
for item in textInfo:
    titles.append(item['title'])
    links.append(item['href'])
results = pd.DataFrame(list(zip(titles, dates, links, imageLinks)),columns = ['title', 'date', 'link', 'imageLink'])
print(results)

Or with selenium:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

url = 'https://alando-palais.de/events#'
driver = webdriver.Chrome()
driver.get(url)

dates = [item.text.strip() for item in WebDriverWait(driver,10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".vc_gitem-zone[style*='https://alando-palais.de']"))) if len(item.text)]
textInfo = [item for item in driver.find_elements_by_css_selector('.vc_gitem-link')][::2]
textInfo = textInfo[: int(len(textInfo) / 2)]
imageLinks = [item.get_attribute('src').strip() for item in driver.find_elements_by_css_selector('a + img')][::2]
titles = []
links = []

for item in textInfo:
    titles.append(item.get_attribute('title'))
    links.append(item.get_attribute('href'))
results = pd.DataFrame(list(zip(titles, dates, links, imageLinks)),columns = ['title', 'date', 'link', 'imageLink'])

print(results)

driver.quit()

Thanks a lot. The first script returns: Empty DataFrame Columns: [title, date, link, imageLink] Index: [] The other run into errors. I guess, I have to setup selenium first. — Xenobiologist, Mar 09 '19 at 12:31
Thanks. I did the setup and changed the code above to use Firefox in lieu of Chrome. The script runs and te result looks like this:0 Da wo der Pfeffi wächst ... https://alando-palais.de/wp/wp-content/uploads... 1 Vodka Vriday ... https://alando-palais.de/wp/wp-content/uploads... 2 Über 40 Party ... https://alando-palais.de/wp/wp-content/uploads... I have to check how that works. Why the dates are "..." and so on. Thanks a lot. The scipt is a great start. — Xenobiologist, Mar 11 '19 at 09:30
The selenium script retunrs: title ... imageLink 0 Da wo der Pfeffi wächst ... https://alando-palais.de/wp/wp-content/uploads... 1 Vodka Vriday ... https://alando-palais.de/wp/wp-content/uploads... 2 ... Über 40 Party ... https://alando-palais.de/wp/wp- 9 Uni Royal ... https://alando-palais.de/wp/wp-content/uploads... [10 rows x 4 columns] — Xenobiologist, Mar 11 '19 at 10:13
That looks quite good. The BS4 script (first one of your post) doesn't return results — Xenobiologist, Mar 11 '19 at 10:16
I am guessing an access key has expired which would mean trying to use session and perhaps an additional get to try and obtain it — QHarr, Mar 11 '19 at 10:19

Pablo Martinez · Answer 2 · 2019-03-08T21:58:16.450

1

I better recommend you selenium to bypass all the server restrictions.

Edited

from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://alando-palais.de/events")
elems = driver.find_elements_by_xpath("//a[@href]")
for elem in elems:
    print elem.get_attribute("href")

edited Mar 08 '19 at 21:58

answered Mar 08 '19 at 21:15

Pablo Martinez

449
2
6

How would a small starting script look like? – Xenobiologist Mar 08 '19 at 21:17
Thanks, this script shows exactly the same infos like my starting script. I ' ll have to dig deeper into these techniques. – Xenobiologist Mar 11 '19 at 09:35

How to parse information with python from a webpage that uses php and javascript

2 Answers2