2

I would like to use web-scraping to extract information on listing on the student accomodation website uniplaces. Here's an exemplary listing: https://www.uniplaces.com/accommodation/berlin/92342

I would like to extract information such as price, # bathrooms, # roommates,...

However, using different approaches I found online, I have not been able to extract the full html code. There are always sub-sections missing, that include the relevant information. On the website you can open these subsections with a little arrow. I am new to html so I don't understand why this cannot be pulled.

Here's the codes I have tried:

from selenium import webdriver
from bs4 import BeautifulSoup

options = webdriver.ChromeOptions()
options.add_argument('headless')
driver= webdriver.Chrome(chrome_options=options,executable_path=r'path/chromedriver.exe')
driver.get('https://www.uniplaces.com/accommodation/berlin/92342')

html_doc = driver.page_source
soup= BeautifulSoup(html_doc,'lxml')
print (soup.prettify())

and variations of this:

import urllib.request
fp= urllib.request.urlopen("https://www.uniplaces.com/accommodation/berlin/92342")
mybytes = fp.read()

mystr = mybytes.decode("utf8")
fp.close()

print(mystr)

If anyone can help with this, I would highly appreciate any tips & tricks!

All the best, Hannah

Bertrand Martel
  • 42,756
  • 16
  • 135
  • 159
  • the "missing" bits are likely to be things loaded via AJAX requests and then inserted into the HTML after the main page has been loaded (this kind of thing is usually done in response to some user action, as you mentioned. It saves having to refresh the entire page just to update one little part). Since you're just downloading the initial version as a HTML document, there's of course no opportunity for you to cause the extra bits to be loaded. BTW I don't see how this has anything to do with JSON, I will remove that tag. – ADyson Nov 01 '18 at 13:36

1 Answers1

1

This site uses an internal GraphQL API accessible from

https://offer-aggregate-graphql.uniplaces.com/graphql

GraphQL is a query language which lets you choose which fields you want to query. This would be very handy for you since you probably want to access specific info as you suggested in your question.

The following example query for the offer price, the conditions (including max people) and type of accomodation (area, number of bedroom and bathroom) :

import requests

id = "92342"

query = """
    query($id: ID!) { 
        offerAggregate(id: $id) { 
            accommodation_offer {
                reference_price {
                    amount
                    currency_code
                }
                requisites {
                    conditions {
                        cancellation_policy
                        minimum_nights
                        max_guests
                    }
                }
            }
            property_aggregate {
                property {
                    typology {
                        area
                        number_of_bedrooms
                        number_of_bathrooms
                    }
                }
            }
        } 
    }
"""

resp = requests.post(
    'https://offer-aggregate-graphql.uniplaces.com/graphql', 
    json={
        "query": query,
        "variables": {
            "id": id
        }
    }
)

body = resp.json()

print(body)

You can learn more about GraphQL queries here

The initial request that is used in the offer page is quite big, you would need to select only the fields you want to query. Here is the query using :

curl 'https://offer-aggregate-graphql.uniplaces.com/graphql' \
     -H 'content-type: application/json' \
     --data-binary '{"query":"fragment PhotosFragment on Photos {\n  id\n  hash\n  placeholder\n  metadata {\n    internal_label\n    __typename\n  }\n  __typename\n}\n\nfragment PropertyLocationFragment on PropertyLocation {\n  neighborhood_id\n  geo {\n    latitude\n    longitude\n    __typename\n  }\n  address {\n    street\n    city_code\n    number\n    postal_code\n    extra\n    __typename\n  }\n  __typename\n}\n\nfragment PropertyAggregateFragment on PropertyAggregate {\n  property {\n    id\n    external_reference {\n      human_reference\n      api_reference\n      __typename\n    }\n    landlord_resident {\n      gender\n      age_range\n      occupation\n      pets\n      family\n      __typename\n    }\n    features {\n      Code\n      Exists\n      __typename\n    }\n    floors {\n      units {\n        id\n        area\n        photos {\n          id\n          displayable\n          __typename\n        }\n        features {\n          Code\n          Exists\n          __typename\n        }\n        subunits {\n          id\n          type_code\n          features {\n            Code\n            Exists\n            __typename\n          }\n          photos {\n            id\n            displayable\n            __typename\n          }\n          __typename\n        }\n        type_code\n        __typename\n      }\n      __typename\n    }\n    lifecycle {\n      rent_by\n      out_of_platform {\n        out\n        __typename\n      }\n      __typename\n    }\n    location {\n      ...PropertyLocationFragment\n      __typename\n    }\n    main_features {\n      gas_type\n      __typename\n    }\n    metadata {\n      locale_code\n      text\n      main\n      __typename\n    }\n    photos {\n      id\n      displayable\n      __typename\n    }\n    restrictions {\n      occupation\n      origin\n      __typename\n    }\n    rules {\n      code\n      exists\n      __typename\n    }\n    typology {\n      area\n      accommodation_type_code\n      type_code\n      number_of_bedrooms\n      number_of_bathrooms\n      __typename\n    }\n    verification {\n      verified\n      __typename\n    }\n    video {\n      url\n      __typename\n    }\n    __typename\n  }\n  neighborhood {\n    id\n    city_code\n    slug\n    __typename\n  }\n  __typename\n}\n\nfragment AccommodationOfferBillFragment on AccommodationOfferBill {\n  included\n  maximum {\n    ...AccommodationOfferBillMaximumFragment\n    __typename\n  }\n  __typename\n}\n\nfragment AccommodationOfferBillMaximumFragment on AccommodationOfferBillMaximum {\n  capped\n  max {\n    amount\n    currency_code\n    __typename\n  }\n  __typename\n}\n\nfragment AccommodationOfferCostsFragment on AccommodationOfferCosts {\n  bills {\n    maximum {\n      ...AccommodationOfferBillMaximumFragment\n      __typename\n    }\n    water {\n      ...AccommodationOfferBillFragment\n      __typename\n    }\n    electricity {\n      ...AccommodationOfferBillFragment\n      __typename\n    }\n    gas {\n      ...AccommodationOfferBillFragment\n      __typename\n    }\n    internet {\n      ...AccommodationOfferBillFragment\n      __typename\n    }\n    __typename\n  }\n  services {\n    cleaning {\n      periodicity\n      type\n      __typename\n    }\n    __typename\n  }\n  __typename\n}\n\nfragment AccommodationOfferPropertyFragment on AccommodationOfferProperty {\n  unitary\n  number_of_units\n  property_id\n  unit_id\n  photos_unit_id\n  subunit_id\n  __typename\n}\n\nfragment AccommodationOfferContractOptionFragment on AccommodationOfferContractOption {\n  id\n  start_date\n  end_date\n  contract_value {\n    amount\n    currency_code\n    __typename\n  }\n  instalments {\n    date\n    value {\n      amount\n      currency_code\n      __typename\n    }\n    __typename\n  }\n  number_of_instalments\n  __typename\n}\n\nfragment AccommodationOfferContractStandardFragment on AccommodationOfferContractStandard {\n  extra_after\n  penalty {\n    nights_threshold\n    type\n    percentage\n    value {\n      amount\n      currency_code\n      __typename\n    }\n    __typename\n  }\n  extra_per_guest {\n    amount\n    currency_code\n    __typename\n  }\n  rents {\n    amount\n    currency_code\n    __typename\n  }\n  __typename\n}\n\nfragment AccommodationOfferContractFragment on AccommodationOfferContract {\n  type\n  exclusive\n  is_instant_booking\n  commission\n  deposit {\n    pay_to\n    type\n    value {\n      amount\n      currency_code\n      __typename\n    }\n    __typename\n  }\n  admin_fee {\n    exact_value\n    value {\n      amount\n      currency_code\n      __typename\n    }\n    __typename\n  }\n  variable_admin_fee {\n    default_admin_fee {\n      exact_value\n      value {\n        amount\n        currency_code\n        __typename\n      }\n      __typename\n    }\n    levels {\n      exact_value\n      value {\n        amount\n        currency_code\n        __typename\n      }\n      until\n      __typename\n    }\n    __typename\n  }\n  fixed {\n    options {\n      ...AccommodationOfferContractOptionFragment\n      __typename\n    }\n    __typename\n  }\n  fixed_unitary {\n    options {\n      ...AccommodationOfferContractOptionFragment\n      __typename\n    }\n    extra_after\n    extra_per_guest {\n      amount\n      currency_code\n      __typename\n    }\n    __typename\n  }\n  standard {\n    ...AccommodationOfferContractStandardFragment\n    __typename\n  }\n  __typename\n}\n\nfragment AccommodationOfferRequisitesFragment on AccommodationOfferRequisites {\n  requirements {\n    offline_id\n    guarantor\n    contract\n    __typename\n  }\n  conditions {\n    cancellation_policy\n    minimum_nights\n    max_guests\n    __typename\n  }\n  __typename\n}\n\nfragment AccommodationOfferTitleFragment on AccommodationOfferTitle {\n  locale_code\n  text\n  main\n  __typename\n}\n\nfragment AccommodationOfferAvailabilityFragment on AccommodationOfferAvailability {\n  standard_unitary_contract {\n    available_from\n    last_updated_at\n    __typename\n  }\n  standard_contract {\n    available_from\n    last_updated_at\n    __typename\n  }\n  fixed_contract {\n    available_from\n    last_updated_at\n    __typename\n  }\n  __typename\n}\n\nfragment AccommodationOfferAvailabilitiesStandardFragment on AccommodationOfferAvailabilitiesStandard {\n  available_periods {\n    start_date\n    end_date\n    __typename\n  }\n  years {\n    year\n    months {\n      Jan\n      Feb\n      Mar\n      Apr\n      May\n      Jun\n      Jul\n      Aug\n      Sep\n      Oct\n      Nov\n      Dec\n      __typename\n    }\n    __typename\n  }\n  __typename\n}\n\nfragment AccommodationOfferAvailabilitiesStandardUnitaryFragment on AccommodationOfferAvailabilitiesStandardUnitary {\n  available_periods {\n    start_date\n    end_date\n    __typename\n  }\n  blocked_intervals {\n    start_date\n    end_date\n    by\n    extra_info\n    __typename\n  }\n  __typename\n}\n\nfragment AccommodationOfferAvailabilitiesFixedFragment on AccommodationOfferAvailabilitiesFixed {\n  options {\n    id\n    status\n    __typename\n  }\n  __typename\n}\n\nfragment AccommodationOfferAvailabilitiesFragment on AccommodationOfferAvailabilities {\n  standard {\n    ...AccommodationOfferAvailabilitiesStandardFragment\n    __typename\n  }\n  standard_unitary {\n    ...AccommodationOfferAvailabilitiesStandardUnitaryFragment\n    __typename\n  }\n  fixed {\n    ...AccommodationOfferAvailabilitiesFixedFragment\n    __typename\n  }\n  fixed_unitary {\n    ...AccommodationOfferAvailabilitiesFixedFragment\n    __typename\n  }\n  __typename\n}\n\nfragment AccommodationOfferFragment on AccommodationOffer {\n  id\n  version\n  parent\n  accommodation_provider_id\n  property {\n    ...AccommodationOfferPropertyFragment\n    __typename\n  }\n  title {\n    ...AccommodationOfferTitleFragment\n    __typename\n  }\n  costs {\n    ...AccommodationOfferCostsFragment\n    __typename\n  }\n  requisites {\n    ...AccommodationOfferRequisitesFragment\n    __typename\n  }\n  availability_summary_info {\n    ...AccommodationOfferAvailabilityFragment\n    __typename\n  }\n  availabilities {\n    ...AccommodationOfferAvailabilitiesFragment\n    __typename\n  }\n  lifecycle {\n    published {\n      published\n      __typename\n    }\n    __typename\n  }\n  restrictions {\n    gender\n    occupancy\n    __typename\n  }\n  contract {\n    ...AccommodationOfferContractFragment\n    __typename\n  }\n  floor_plan {\n    name\n    __typename\n  }\n  main_photo {\n    id\n    __typename\n  }\n  reference_price {\n    amount\n    currency_code\n    __typename\n  }\n  __typename\n}\n\nfragment AccommodationProviderFragment on AccommodationProvider {\n  id\n  booking {\n    gap_on_booking {\n      soft_maximum\n      hard_maximum\n      __typename\n    }\n    __typename\n  }\n  verifications {\n    email_address\n    phone\n    offline_id\n    __typename\n  }\n  basic_info {\n    preference_settings {\n      locale_code\n      __typename\n    }\n    __typename\n  }\n  account_management {\n    key_account\n    __typename\n  }\n  stats {\n    bookings {\n      accepted {\n        total\n        __typename\n      }\n      requested {\n        total\n        __typename\n      }\n      rejected {\n        total\n        __typename\n      }\n      confirmed {\n        total\n        __typename\n      }\n      __typename\n    }\n    response_time\n    __typename\n  }\n  created {\n    at\n    __typename\n  }\n  __typename\n}\n\nfragment GlobalizationCityFragment on GlobalizationCity {\n  code\n  configuration {\n    slug\n    __typename\n  }\n  metadata {\n    name_translations {\n      locale_code\n      text\n      __typename\n    }\n    __typename\n  }\n  __typename\n}\n\nfragment GlobalizationCountryFragment on GlobalizationCountry {\n  code\n  metadata {\n    name_translations {\n      locale_code\n      text\n      __typename\n    }\n    __typename\n  }\n  __typename\n}\n\nfragment GlobalizationAggregateFragment on GlobalizationAggregate {\n  city {\n    ...GlobalizationCityFragment\n    __typename\n  }\n  country {\n    ...GlobalizationCountryFragment\n    __typename\n  }\n  __typename\n}\n\nquery offerAggregate($id: ID!, $useCache: Boolean) {\n  offerAggregate(id: $id, useCache: $useCache) {\n    id\n    units_sorted {\n      unit_id\n      __typename\n    }\n    photos {\n      ...PhotosFragment\n      __typename\n    }\n    property_aggregate {\n      ...PropertyAggregateFragment\n      __typename\n    }\n    accommodation_offer {\n      ...AccommodationOfferFragment\n      __typename\n    }\n    accommodation_provider {\n      ...AccommodationProviderFragment\n      __typename\n    }\n    globalization_aggregate {\n      ...GlobalizationAggregateFragment\n      __typename\n    }\n    __typename\n  }\n}\n","variables":{"id":"92342"},"operationName":"offerAggregate"}'
Bertrand Martel
  • 42,756
  • 16
  • 135
  • 159
  • Dear Bertrand, I have one more question: how were you able to extract the concrete field structure (field names, field hierarchy) of uniplace's GraphQL in order to query the relevant fields? From how I understand your code, you were already aware of the field structure when writing the curl. – HannahKorts Nov 01 '18 at 15:02
  • @HannahKorts that's the last part of the answer, the curl query is the original one with all the fields and fragments, you can also view the full request if you open developer tools and look for the POST on the graphql endpoint – Bertrand Martel Nov 01 '18 at 15:04
  • @HannahKorts in fact the curl request I've posted is just a "copy as curl" from the network tab in developer tool removing some headers – Bertrand Martel Nov 01 '18 at 15:06
  • @HannahKorts note that the original request has a lot of [graphql fragments](https://graphql.org/learn/queries/#fragments) which are placeholder for sub-fields. I've removed the fragments from the query and just used the fields from the fragments to make it more clear – Bertrand Martel Nov 01 '18 at 15:09
  • Thanks Bertrand, I understand the logic behind it. I am, however, wondering how I could show the schema in an organized manner instead of a string using python (or would I need a developer tool?). Sorry for the many questions and thanks in advance for your time! – HannahKorts Nov 01 '18 at 15:44
  • to prettify the query you can use [this](https://graphql.github.io/swapi-graphql/), replace "\n" with spaces in the query field and copy paste in this tool and click prettify – Bertrand Martel Nov 01 '18 at 15:48
  • @HannahKorts use [this query](https://stackoverflow.com/a/46543255/2614364) to get the full schema – Bertrand Martel Nov 01 '18 at 15:55
  • Dear Bertrand, I was able to truly advance my project based on your code. Now that I am working with a dataframe and trying to normalize the json file (it is very nested) I am often receiving an error, since the resp.json() is a 'dict' object, and thus commands such as pd.read_json cannot be executed. Is there a way to get the response as a true json file instead of a Python object (dict)? All the best, hannah – HannahKorts Nov 06 '18 at 18:20
  • @HannahKorts import json and json.dumps(body) – Bertrand Martel Nov 08 '18 at 01:17