How to scrape latitude longitude in beautiful soup

Question

I am fairly new to BeautifulSoup4 and am having trouble extracting the latitude and longitude values from the HTML response returned by the code below.

import requests
from bs4 import BeautifulSoup

url = 'http://cinematreasures.org/theaters/united-states?page=1'
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
links = soup.find_all("tr")
print(links)

This code prints many rows like the following one (truncated here):

<tr class="even location theater" data="{id: 0, point: {lng: -94.1751038, lat: 36.0848965}

Full tr response

<tr>\n
  <th id="theater_name"><a href="/theaters/united-states?sort=name&amp;order=desc">\u2191 Name</a>
  </th>\n
  <th id="theater_location"><a href="/theaters/united-states?sort=location&amp;order=asc">Location</a>
  </th>\n
  <th id="theater_status"><a href="/theaters/united-states?sort=open&amp;order=desc">Status</a>
  </th>\n
  <th id="theater_screens"><a href="/theaters/united-states?sort=screens&amp;order=asc">Screens</a>
  </th>\n</tr>,
<tr class="even location theater" data="{id: 0, point: {lng: -94.1751038, lat: 36.0848965}, category: 'open'}">\n
  <td class="name">\n
    <a class="map-link" href="/theaters/8775">
      <img alt="112 Drive-In" height="48" src="http://photos.cinematreasures.org/production/photos/22137/1313612883/thumb.JPG?1313612883" width="48" />
    </a>\n<a class="map-link" href="/theaters/8775">112 Drive-In</a>\n
    <div class="info-box">\n
      <div class="photo" style="float: left;">
        <a href="/theaters/8775">
          <img alt="thumb" height="48" src="http://photos.cinematreasures.org/production/photos/22137/1313612883/thumb.JPG?1313612883" width="48" />
        </a>
      </div>\n
      <p style="min-width: 200px !important;">\n<strong><a href="/theaters/8775">112 Drive-In</a></strong>\n
        <br>\n 3352 Highway 112 North
        <br>Fayetteville, AR 72702
        <br>United States
        <br>479.442.4542
        <br>\n</br>
        </br>
        </br>
        </br>
        </br>
      </p>\n</div>\n</td>\n
  <td class="location">\n Fayetteville, AR, United States\n</td>\n
  <td class="status">\n Open\n</td>\n
  <td class="screens">\n 1\n</td>\n</tr>

How would I go about getting just the lng and lat values out of this response?

Thank you in advance.

Can you give us the URL that you're trying to scrape? Or at least the full content of the ``? — wpercy, Feb 25 '16 at 21:38
On top of what @wilbur said you'll need to use regex to grab the individual values from the table row in the example provided. — bmcculley, Feb 25 '16 at 21:40
When attempting to use regex to parse HTML, I always refer to this post. http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — Wondercricket, Feb 25 '16 at 21:48
@wilbur oops, I wasn't clear about what I was saying. I didn't mean to suggest using regex to parse the html. I was thinking to use BeautifulSoup to grab the data in the data tag and then use regex from there. — bmcculley, Feb 26 '16 at 01:11

score 2 · Answer 1 · answered Feb 25 '16 at 22:37

Here is my approach:

import requests
import demjson
from bs4 import BeautifulSoup

url = 'http://cinematreasures.org/theaters/united-states?page=1'
page = requests.get(url)
soup = BeautifulSoup(page.text, "html.parser")

to_plain_coord = lambda d: (d['point']['lng'], d['point']['lat'])
# Grabbing theater coords if `data` attribute exists
coords = [
    to_plain_coord(demjson.decode(t.attrs['data']))
    for t in soup.select('.theater')
    if 'data' in t.attrs]

print(coords)

I don't use any string manipulation. Instead, I decode the contents of the data attribute as JSON. Unfortunately it is not quite valid JSON here (the keys are unquoted), so I use the more lenient demjson library to parse it.

pip install demjson
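
If installing demjson is not an option, the same idea can be approximated with the standard library alone. The sketch below is an alternative and not part of the answer above: the helper name parse_js_object is hypothetical, and it assumes the attribute only ever contains bare identifier keys and simple single-quoted strings with no embedded quotes, as in the sample row from the question.

```python
import json
import re


def parse_js_object(text):
    """Turn a simple JS-style object literal into a Python dict (hypothetical helper)."""
    # Wrap bare keys in double quotes: {id: ...} becomes {"id": ...}
    quoted = re.sub(r'([{,]\s*)([A-Za-z_]\w*)\s*:', r'\1"\2":', text)
    # Assumption: single quotes only ever delimit plain string values
    quoted = quoted.replace("'", '"')
    return json.loads(quoted)


# Sample value of the data attribute, copied from the question
sample = "{id: 0, point: {lng: -94.1751038, lat: 36.0848965}, category: 'open'}"

parsed = parse_js_object(sample)
lng_lat = (parsed["point"]["lng"], parsed["point"]["lat"])
print(lng_lat)
```

Real pages may contain values that break these substitutions (for example, an apostrophe inside a string), in which case a lenient parser such as demjson is likely the safer choice.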

Nice! I hadn't heard of demjson before, I really like this solution. — bmcculley, Feb 26 '16 at 01:16

score 1 · Accepted Answer · edited May 23 '17 at 12:23

Okay, so you are grabbing all the <tr> elements correctly; now we just need to get the data attribute from each of them.

import re
import requests
from bs4 import BeautifulSoup

url = 'http://cinematreasures.org/theaters/united-states?page=1'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
theaters = soup.find_all("tr", class_="theater")
data = [t.get('data') for t in theaters if t.get('data')]
print(data)

Unfortunately this gives you a list of strings, not the dictionary objects one might have hoped for. We can use a regular expression on each data string to convert it to a dict whose values are still strings (thanks RootTwo):

coords = []
for d in data:
    c = dict(re.findall(r'(lat|lng):\s*(-?\d{1,3}\.\d+)', d))
    coords.append(c)
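
Since re.findall returns text, the dictionaries built this way hold the coordinates as strings. A minimal sketch of converting them to floats follows; it runs offline on the sample data string from the question, and the names COORD_RE and extract_coords are illustrative rather than part of the original answer.

```python
import re

# Same pattern as in the answer: captures the key name and its numeric value
COORD_RE = re.compile(r'(lat|lng):\s*(-?\d{1,3}\.\d+)')


def extract_coords(data_attr):
    """Return {'lng': float, 'lat': float} for one data attribute string."""
    return {name: float(value) for name, value in COORD_RE.findall(data_attr)}


# Stand-in for the list built from t.get('data') in the scraping code
data = ["{id: 0, point: {lng: -94.1751038, lat: 36.0848965}, category: 'open'}"]

coords = [extract_coords(d) for d in data]
print(coords)
```

With float values, the coordinates can be passed directly to mapping or distance calculations without further conversion.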


answered Feb 25 '16 at 21:56

wpercy


Yea, this is nice, but a dictionary would be ideal. Thanks Wilbur. – sbell423 Feb 25 '16 at 22:02
`dict(re.findall(r'(lat|lng):\s*(-?\d{1,3}\.\d+)', data))` will return a dict. – RootTwo Feb 26 '16 at 07:30
@RootTwo _THANK YOU_! I'm not a talented enough regexer, but that's exactly right. – wpercy Feb 26 '16 at 13:33
@sbell423 this is a much nicer way to get a list of dictionaries as a result – wpercy Feb 26 '16 at 13:35
@RootTwo Thank you so much guys, this is exactly the result I was looking to get! Great work! – sbell423 Feb 26 '16 at 13:44

score -1 · Answer 3 · answered Feb 25 '16 at 21:54

If you're expecting only a single response do:

print(links[0])

rye
