1

I am fairly new to BeautifulSoup4 and am having trouble extracting the latitude and longitude values from the HTML response returned by the code below.

import requests
from bs4 import BeautifulSoup

url = 'http://cinematreasures.org/theaters/united-states?page=1'
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
links = soup.findAll("tr")
print(links)

This code prints a row like the following many times, once per theater:

<tr class="even location theater" data="{id: 0, point: {lng: -94.1751038, lat: 36.0848965}

Full tr response

<tr>\n
  <th id="theater_name"><a href="/theaters/united-states?sort=name&amp;order=desc">\u2191 Name</a>
  </th>\n
  <th id="theater_location"><a href="/theaters/united-states?sort=location&amp;order=asc">Location</a>
  </th>\n
  <th id="theater_status"><a href="/theaters/united-states?sort=open&amp;order=desc">Status</a>
  </th>\n
  <th id="theater_screens"><a href="/theaters/united-states?sort=screens&amp;order=asc">Screens</a>
  </th>\n</tr>,
<tr class="even location theater" data="{id: 0, point: {lng: -94.1751038, lat: 36.0848965}, category: 'open'}">\n
  <td class="name">\n
    <a class="map-link" href="/theaters/8775">
      <img alt="112 Drive-In" height="48" src="http://photos.cinematreasures.org/production/photos/22137/1313612883/thumb.JPG?1313612883" width="48" />
    </a>\n<a class="map-link" href="/theaters/8775">112 Drive-In</a>\n
    <div class="info-box">\n
      <div class="photo" style="float: left;">
        <a href="/theaters/8775">
          <img alt="thumb" height="48" src="http://photos.cinematreasures.org/production/photos/22137/1313612883/thumb.JPG?1313612883" width="48" />
        </a>
      </div>\n
      <p style="min-width: 200px !important;">\n<strong><a href="/theaters/8775">112 Drive-In</a></strong>\n
        <br>\n 3352 Highway 112 North
        <br>Fayetteville, AR 72702
        <br>United States
        <br>479.442.4542
        <br>\n</br>
        </br>
        </br>
        </br>
        </br>
      </p>\n</div>\n</td>\n
  <td class="location">\n Fayetteville, AR, United States\n</td>\n
  <td class="status">\n Open\n</td>\n
  <td class="screens">\n 1\n</td>\n</tr>

How would I go about getting just the lng and lat values out of this response?

Thank you in advance.

sbell423
  • Can you give us the URL that you're trying to scrape? Or at least the full content of the `<tr>`? – wpercy Feb 25 '16 at 21:38
  • On top of what @wilbur said you'll need to use regex to grab the individual values from the table row in the example provided. – bmcculley Feb 25 '16 at 21:40
  • I edited the original post, regex is the only way? – sbell423 Feb 25 '16 at 21:42
  • No! Don't use regex. I'm writing an answer now – wpercy Feb 25 '16 at 21:44
  • When attempting to use regex to parse HTML, I always refer to this post. http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Wondercricket Feb 25 '16 at 21:48
  • @wilbur oops, I wasn't clear about what I was saying. I didn't mean to suggest using regex to parse the html. I was thinking to use BeautifulSoup to grab the data in the data tag and then use regex from there. – bmcculley Feb 26 '16 at 01:11

3 Answers

2

Here is my approach:

import requests
import demjson
from bs4 import BeautifulSoup

url = 'http://cinematreasures.org/theaters/united-states?page=1'
page = requests.get(url)
soup = BeautifulSoup(page.text, "html.parser")

def to_plain_coord(d):
    """Return a (lng, lat) tuple from a decoded `data` dict."""
    return (d['point']['lng'], d['point']['lat'])

# Collect coordinates from every theater row that has a `data` attribute
coords = [
    to_plain_coord(demjson.decode(t.attrs['data']))
    for t in soup.select('.theater')
    if 'data' in t.attrs
]

print(coords)

This avoids string manipulation: the data attribute is parsed as JSON instead. Unfortunately the attribute is not quite valid JSON (the keys are unquoted and the string value uses single quotes), so I use the demjson library, which accepts non-strict JSON.

pip install demjson
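
If installing an extra library is not an option, the same idea can be sketched with the standard library alone. This is a rough illustration, not a general parser: it assumes the attribute always looks like the sample row in the question (bare identifier keys, single-quoted strings with no embedded quotes). The names `parse_data_attr` and `sample` are hypothetical and exist only for this example.

```python
import json
import re


def parse_data_attr(raw):
    """Convert the JavaScript-style object literal into a dict.

    Assumption: keys are bare identifiers and strings are single-quoted
    without embedded quotes, as in the sample row from the question.
    """
    # Wrap bare keys in double quotes, e.g. {id: 0  ->  {"id": 0
    quoted_keys = re.sub(r'([{,]\s*)([A-Za-z_]\w*)\s*:', r'\1"\2":', raw)
    # JSON requires double-quoted strings, so swap the single quotes.
    return json.loads(quoted_keys.replace("'", '"'))


# The `data` attribute value copied from the question.
sample = "{id: 0, point: {lng: -94.1751038, lat: 36.0848965}, category: 'open'}"
parsed = parse_data_attr(sample)
coord = (parsed['point']['lng'], parsed['point']['lat'])
print(coord)
```

Because `json.loads` does the actual parsing, the coordinates come back as floats rather than strings. If the site ever emits values containing quotes or colons, this sketch would likely break, which is the case a tolerant parser such as demjson handles.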
irvind
1

Okay, so you're grabbing all the <tr>s correctly; now we just need to get the data attribute from each of them.

import re
import requests
from bs4 import BeautifulSoup

url = 'http://cinematreasures.org/theaters/united-states?page=1' 
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
theaters = soup.findAll("tr", class_="theater")
data = [t.get('data') for t in theaters if t.get('data')]
print(data)

Unfortunately this gives you a list of strings, not the dictionaries one might have hoped for. We can run a regular expression over each data string to convert it to a dict of lat and lng values, which are still strings at this point (thanks RootTwo):

coords = []
for d in data:
    c = dict(re.findall(r'(lat|lng):\s*(-?\d{1,3}\.\d+)', d))
    coords.append(c)
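
As a small offline check of the regex step, the sketch below runs the same pattern against the data string shown in the question and converts the captured values to floats. The helper name `extract_coords` and the `sample` list are made up for this illustration; only the pattern comes from the code above.

```python
import re

# Same pattern as above: captures the key and its signed decimal value.
COORD_RE = re.compile(r'(lat|lng):\s*(-?\d{1,3}\.\d+)')


def extract_coords(data_strings):
    """Return one {'lng': float, 'lat': float} dict per data string."""
    coords = []
    for d in data_strings:
        pairs = COORD_RE.findall(d)  # e.g. [('lng', '-94.1751038'), ...]
        coords.append({key: float(value) for key, value in pairs})
    return coords


# The `data` attribute value copied from the question.
sample = ["{id: 0, point: {lng: -94.1751038, lat: 36.0848965}, category: 'open'}"]
result = extract_coords(sample)
print(result)
```

Converting to float here is a choice made for the example; keeping the strings, as the original loop does, may be preferable if the values only need to be written out again.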
wpercy
-1

If you're expecting only a single response, do:

print(links[0])
rye