Ok so I am trying to make a script (for my own amusement) that will look through the results of a Kayak.co.uk query and output them using Python. I am using urllib to grab the content of the query result page (example: https://www.kayak.co.uk/flights/DUB-LAX/2018-06-04/2018-06-25/2adults?sort=bestflight_a). However, I need a regular expression to find the prices in £. I have not tried much (because I'm not very good at regular expressions). Also, does urllib retrieve the JS as well as the HTML? I know that some of the information I need is included within the JS. Any help would be much appreciated.

This is what I have so far:

import re
import urllib.request


def urlRead(url):
    """Gets and returns the content of the chosen URL"""
    webpage = urllib.request.urlopen(url)
    page_contents = webpage.read()
    return page_contents


def getPrices(content):
    # Placeholder pattern: only matches the literal string '£435'
    prices = re.findall(r'£435', content.decode())
    print(prices)


def main():
    url = input('Please enter in the kayak url!: ')
    content = urlRead(url)
    getPrices(content)


if __name__ == '__main__':
    main()
  • Related: https://stackoverflow.com/q/1732348 – Mr Lister Dec 11 '17 at 20:00
  • I'd suggest checking out [this page](http://regexr.com) for regex help & experimentation. – Maximilian Burszley Dec 11 '17 at 20:00
  • The web request will only retrieve the initial result of the web request, which will likely just be the HTML for the page. You can theoretically parse the HTML to locate the references to the JS files and also load those, but in all likelihood you will actually need to _execute_ the JS to get the information you want. What you probably want to do is load the page in a headless browser like phantomjs rather than trying to do this with urllib – Hamms Dec 11 '17 at 20:05
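
To illustrate the first half of that last comment, here is a minimal sketch of locating the JS files an HTML page references, using only the standard library. The names `ScriptSrcCollector`, `find_script_urls` and `SAMPLE_PAGE`, and the example.com URLs, are made up for illustration. It only lists the script URLs; it does not execute any JavaScript, so content the page builds at runtime would still be missing.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ScriptSrcCollector(HTMLParser):
    """Collects the absolute URLs of external <script src="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.script_urls = []

    def handle_starttag(self, tag, attrs):
        if tag == 'script':
            src = dict(attrs).get('src')
            if src:  # inline scripts have no src attribute
                self.script_urls.append(urljoin(self.base_url, src))


def find_script_urls(html, base_url):
    """Returns the external script URLs referenced by an HTML string."""
    collector = ScriptSrcCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.script_urls


# Hypothetical page standing in for a real response body
SAMPLE_PAGE = """
<html><head>
<script src="/static/app.js"></script>
<script>var inline = true;</script>
<script src="https://cdn.example.com/lib.js"></script>
</head><body><div id="results"></div></body></html>
"""

print(find_script_urls(SAMPLE_PAGE, 'https://www.example.com/flights'))
```

If the data only appears after those scripts run, a headless browser is probably still the more practical route, as the comment says.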

1 Answer

As @Mr Lister said, you should not try to parse HTML using regular expressions if you can avoid it. Beautiful Soup is an HTML parsing library that could help you do what you need. For example, this pulls a share price out of a Google Finance page:

from urllib.request import urlopen  # Python 3 replacement for urllib2

from bs4 import BeautifulSoup

response = urlopen('https://www.google.com/finance?q=NYSE%3AAAPL')
html = response.read()
soup = BeautifulSoup(html, "lxml")
# The element id and nesting below depend on the page's markup at the time
panel = soup.find(id='price-panel')
aaplPrice = panel.div.span.span.text
aaplVar = panel.div.div.span.find_all('span')[1].string.split('(')[1].split(')')[0]
aapl = aaplPrice + ' ' + aaplVar
Slpk
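
For the regular expression the question asks about, one possible sketch is to let an HTML parser pull out the text first and then run a price pattern over that text, rather than over raw markup. This is not from the answer above: `PRICE_RE`, `TextExtractor`, `find_prices` and `SAMPLE_HTML` are hypothetical names, the sample markup is invented, and the pattern assumes prices are written like £435, £1,299 or £89.50. It uses the standard library's `html.parser` so it runs without extra installs.

```python
import re
from html.parser import HTMLParser

# '£', an optional space, then either comma-grouped digits (1,299)
# or plain digits (435), then optional pence (.50)
PRICE_RE = re.compile(r'£\s?(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d{2})?')


class TextExtractor(HTMLParser):
    """Collects text content, skipping anything inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def find_prices(html):
    """Returns every £ price found in the text of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return PRICE_RE.findall(' '.join(parser.chunks))


# Invented markup; the real Kayak page structure is not known here
SAMPLE_HTML = """
<div class="result"><span class="price">£435</span></div>
<div class="result"><span class="price">&pound;1,299</span></div>
<div class="result"><span class="price">£89.50</span> per person</div>
<script>var hidden = "£999";</script>
"""

print(find_prices(SAMPLE_HTML))  # ['£435', '£1,299', '£89.50']
```

Note that this only sees prices present in the HTML that urllib downloads. If the page fills in its results with JavaScript after loading, the list will likely come back empty.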