
I am trying to scrape the address from the link below:

https://www.yelp.com/biz/rollin-phatties-houston

But I am getting only the first part of the address (i.e. 1731 Westheimer Rd) instead of the complete, comma-separated address:

1731 Westheimer Rd, Houston, TX 77098

Can anyone help me with this? Please find my code below:

import bs4 as bs
import urllib.request as url

source = url.urlopen('https://www.yelp.com/biz/rollin-phatties-houston')
soup = bs.BeautifulSoup(source, 'html.parser')

mains = soup.find_all("div", {"class": "secondaryAttributes__09f24__3db5x arrange-unit__09f24__1gZC1 border-color--default__09f24__R1nRO"})
address = []
for main in mains:
    try:
        address.append(main.address.find("p").text)
    except AttributeError:
        # No <address> or <p> element inside this block
        address.append("")

print(address)
# 1731 Westheimer Rd
Goutam
    See how to create a [mcve]. The URL doesn't matter, just the content of the HTML. Make the example as small as possible, and still exhibit the problem you can't solve. – Peter Wood Dec 19 '20 at 01:09

3 Answers

import requests
import re
from ast import literal_eval


def main(url):
    r = requests.get(url)
    match = literal_eval(
        re.search(r'addressLines.+?(\[.+?])', r.text).group(1))
    print(*match)


main('https://www.yelp.com/biz/rollin-phatties-houston')

Output:

1731 Westheimer Rd Houston, TX 77098
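
The same idea can be tried offline, without requesting the page. The following is only a minimal sketch: `SAMPLE_TEXT` is a hypothetical fragment standing in for the downloaded source, `extract_address_lines` is an illustrative helper name, and `json.loads` is swapped in for `literal_eval` on the assumption that the captured array is valid JSON.

```python
import json
import re

# Hypothetical fragment standing in for the downloaded page source.
SAMPLE_TEXT = (
    'window.data = {"addressLines": '
    '["1731 Westheimer Rd", "Houston, TX 77098"], "x": 1};'
)

# Capture the first [...] array that follows the "addressLines" key.
# Assumption: no address line contains a "]" character.
ADDRESS_RE = re.compile(r'"addressLines"\s*:\s*(\[.*?\])', re.DOTALL)


def extract_address_lines(text):
    """Return the address lines as a list, or None if the key is absent."""
    match = ADDRESS_RE.search(text)
    if match is None:
        return None
    return json.loads(match.group(1))


lines = extract_address_lines(SAMPLE_TEXT)
print(", ".join(lines))
# prints: 1731 Westheimer Rd, Houston, TX 77098
```

Returning `None` when nothing matches avoids the `AttributeError` that calling `.group(1)` on a failed `re.search` would raise.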

There is no need to find the address by inspecting the rendered element: the data is already embedded in the page, inside a script element. You can extract it with the following code:

import chompjs
import bs4 as bs
import urllib.request as url

source = url.urlopen('https://www.yelp.com/biz/rollin-phatties-houston')
soup = bs.BeautifulSoup(source, 'html.parser')

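# Index 16 matched this page when the answer was written; it may differ if the page layout changes
javascript = soup.select("script")[16].string
# Parse the JavaScript object in the script into a Python dict
data = chompjs.parse_js_object(javascript)
print(data['bizDetailsPageProps']['bizContactInfoProps']['businessAddress'])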
Jerry An
    Here is another example showing how to parse JavaScript objects into a dict: https://stackoverflow.com/a/65272779/10153574 – Jerry An Dec 19 '20 at 02:56

The business address shown on the webpage is generated dynamically. If you view the page source of the URL, you will find that the restaurant's address is stored in a script element, so you need to extract it from there.

from bs4 import BeautifulSoup
import requests
import json

page = requests.get('https://www.yelp.com/biz/rollin-phatties-houston')
htmlpage = BeautifulSoup(page.text, 'html.parser')

# Collect every <script type="application/json"> element on the page
scriptelements = htmlpage.find_all('script', attrs={'type': 'application/json'})

# The third such element held the business data for this page
scriptcontent = scriptelements[2].text

# The JSON is wrapped in an HTML comment; strip the markers before parsing
scriptcontent = scriptcontent.replace('<!--', '')
scriptcontent = scriptcontent.replace('-->', '')

jsondata = json.loads(scriptcontent)
print(jsondata['bizDetailsPageProps']['bizContactInfoProps']['businessAddress'])

The same approach should work for other business pages, provided they embed the data in the same script element with the same key structure.
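
A hard-coded index such as `scriptelements[2]` depends on the page layout, so one possible variation is to scan every JSON script element and keep the first one that contains the expected keys. The sketch below is only an illustration under stated assumptions: `SAMPLE_HTML` is a hypothetical, heavily simplified stand-in for the page source, `find_business_address` is an illustrative helper name, and a crude regular expression replaces BeautifulSoup's `find_all` so the example runs with the standard library alone.

```python
import json
import re

# Hypothetical, heavily simplified stand-in for the page source.
SAMPLE_HTML = """
<html><body>
<script type="application/json"><!--{"locale": "en_US"}--></script>
<script type="application/json"><!--{"bizDetailsPageProps": {"bizContactInfoProps": {"businessAddress": "1731 Westheimer Rd, Houston, TX 77098"}}}--></script>
</body></html>
"""

# Crude stdlib-only substitute for find_all('script', type='application/json').
SCRIPT_RE = re.compile(
    r'<script[^>]*type="application/json"[^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)

# Key path taken from the answer above.
KEY_PATH = ("bizDetailsPageProps", "bizContactInfoProps", "businessAddress")


def find_business_address(html):
    """Return the value at KEY_PATH from the first script that has it, else None."""
    for match in SCRIPT_RE.finditer(html):
        raw = match.group(1).strip()
        # Strip the HTML comment markers that wrap the JSON payload.
        if raw.startswith("<!--"):
            raw = raw[4:]
        if raw.endswith("-->"):
            raw = raw[:-3]
        try:
            node = json.loads(raw)
        except json.JSONDecodeError:
            continue  # Not valid JSON; try the next script element.
        # Walk down the key path, giving up on this script if a key is missing.
        for key in KEY_PATH:
            if not isinstance(node, dict) or key not in node:
                node = None
                break
            node = node[key]
        if node is not None:
            return node
    return None


print(find_business_address(SAMPLE_HTML))
# prints: 1731 Westheimer Rd, Houston, TX 77098
```

Searching by key rather than by position means the lookup keeps working if script elements are added or reordered, as long as the key structure itself stays the same.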