I want to scrape the text from the URL "http://www.nycgo.com/venues/thalia-restaurant#menu". The text I'm interested in is in the 'menu' tab on the page. I tried using BeautifulSoup to get all the text on the page, but the output of the following code is missing all the text in the menu.

import urllib2
from bs4 import BeautifulSoup as BS

html = urllib2.urlopen("http://www.nycgo.com/venues/thalia-restaurant#menu")
html = html.read()
soup = BS(html)
print soup.get_text()

When I inspect the menu elements in the browser, the menu content appears to be part of the page's HTML. I did notice, though, that when browsing the page it takes several seconds for the menu to load fully. I'm not sure whether that is why the code above fails to get the menu content.

Any insight would be appreciated.

Camuslu
  • If there isn't any special reason this *has* to be done using a Python script, I'd suggest using [wkhtmltopdf](http://wkhtmltopdf.org/). – amphetamachine Jan 15 '16 at 20:54
  • The content for the page is dynamically loaded with JavaScript. You're not going to be able to get all the content simply by downloading the HTML text. – jumbopap Jan 15 '16 at 21:10
  • @jumbopap thanks, I had the suspicion that something like that could be the reason the return value misses the menu content. Any suggestion how to deal with this? – Camuslu Jan 15 '16 at 21:22
  • @amphetamachine thanks, I tried the tool but the PDF created still misses the menu content :( – Camuslu Jan 15 '16 at 21:22

1 Answer

soup.get_text() returns all of the text in an HTML document (webpage), but the problem here is that the menu is embedded in the page as a PDF, which Beautiful Soup cannot read. The path to the PDF file is defined in the page's JavaScript as follows:

{
    name: "menu",
    show: Boolean(1),
    url: "/assets/files/programs/rw/2016W/thalia-restaurant.pdf"
}

The simplest way to extract this is probably a regular expression. Parsing HTML with regular expressions is generally a bad idea, but here you're looking for one very specific thing: a quoted string that ends in .pdf. The following code finds it and builds the full URL:

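import re
from urllib import urlopen  # Python 2; on Python 3 use: from urllib.request import urlopen

# Fetch the raw page source; the PDF path appears in the page's JavaScript
html = urlopen("http://www.nycgo.com/venues/thalia-restaurant#menu")
html_doc = html.read()

# Find the first double-quoted string ending in .pdf.
# [^"]+ keeps the match inside a single pair of quotes.
match = re.search(br'"([^"]+\.pdf)"', html_doc)
pdf_url = "http://www.nycgo.com" + match.group(1).decode('utf8')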

Now pdf_url is:

u'http://www.nycgo.com/assets/files/programs/rw/2016W/thalia-restaurant.pdf'
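As a rough sketch of the same idea that can be tried without hitting the site, the snippet below runs the search against a copy of the JavaScript fragment shown earlier. It is written for Python 3, and the helper name find_pdf_url is made up for illustration; it returns None when no quoted .pdf path is present and uses urljoin to resolve the path against the page URL instead of concatenating strings.

```python
import re
from urllib.parse import urljoin

# A double-quoted string ending in ".pdf"; [^"] keeps the match
# inside a single pair of quotes.
PDF_PATH = re.compile(rb'"([^"]+\.pdf)"')


def find_pdf_url(html_bytes, page_url):
    """Return the absolute URL of the first quoted .pdf path, or None.

    Hypothetical helper for illustration; not part of any library.
    """
    match = PDF_PATH.search(html_bytes)
    if match is None:
        return None
    # Resolve the (possibly relative) path against the page URL.
    return urljoin(page_url, match.group(1).decode("utf-8"))


# Stand-in for the downloaded page: the fragment quoted in this answer.
sample = b'''{
    name: "menu",
    show: Boolean(1),
    url: "/assets/files/programs/rw/2016W/thalia-restaurant.pdf"
}'''

print(find_pdf_url(sample, "http://www.nycgo.com/venues/thalia-restaurant#menu"))
# → http://www.nycgo.com/assets/files/programs/rw/2016W/thalia-restaurant.pdf
```

Checking for None before calling group() avoids an AttributeError if the page layout changes and the PDF link disappears.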

However, extracting the text from the PDF is a little trickier. You can download the file first:

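from urllib import urlretrieve  # Python 2; on Python 3 use: from urllib.request import urlretrieve

# Save the PDF to the current directory as download.pdf
urlretrieve(pdf_url, "download.pdf")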
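If the server answers with an error page instead of the file, the download still succeeds and the problem only shows up later during text extraction. One possible guard, sketched here for Python 3 with a made-up helper name download_pdf, is to check that the response starts with the %PDF- header that well-formed PDF files begin with before saving it. The 30-second timeout is an arbitrary choice.

```python
from pathlib import Path
from urllib.request import urlopen

PDF_MAGIC = b"%PDF-"  # header at the start of a well-formed PDF file


def download_pdf(url, dest, timeout=30):
    """Save the resource at `url` to `dest`, or raise ValueError if it is not a PDF.

    Hypothetical helper for illustration; not part of any library.
    """
    with urlopen(url, timeout=timeout) as response:
        data = response.read()
    if not data.startswith(PDF_MAGIC):
        raise ValueError(f"response does not look like a PDF: {data[:20]!r}")
    Path(dest).write_bytes(data)
    return dest
```

It would be called the same way as urlretrieve above, for example download_pdf(pdf_url, "download.pdf").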

Then extract the text using the convert_pdf_to_txt function defined in this answer to another question:

text = convert_pdf_to_txt("download.pdf")
print(text)

This prints:

NEW YOUR CITY 
RESTAURANT WEEK

WINTER 2016

MONDAY - FRIDAY
828 Eighth Avenue
New York City, 10019

Tel: 212.399.4444

www.restaurantthalia.com

LUNCH $25
FIRST COURSE
CREAMY POLENTA
fricassee of truffle mushrooms

...
mfitzp