I am trying to scrape the distance in km and the travel time from Google Maps. However, when I run my code, the result is two empty lists, like this: [] []

What am I doing wrong? You can see my code below. By the way, I am using Python 3.5.1. I hope you can help me. Thanks, Anna.

import urllib.request

import re

import ssl


url2 = "https://www.google.dk/maps/dir/Aarhus+Kommune/Horsens+Municipality/@56.0321212,9.6926376,10z/am=t/data=!4m17!4m16!1m5!1m1!1s0x464c4cb9541ed4a9:0xe58661230cfb55d!2m2!1d10.1373728!2d56.1683931!1m5!1m1!1s0x464c721bbef053d9:0xd089bdc7f76375ab!2m2!1d9.7844165!2d55.9267709!2m3!6e1!7e2!8j1465804800"

context1 = ssl.SSLContext(ssl.PROTOCOL_TLSv1)

htmlfile = urllib.request.urlopen(url2, context=context1)

htmltext = htmlfile.read()

regex = b'<span jstcache="1146">(.+?) km</span>'

regex2 = b'<span jstcache="1145" class="delay-light" jsan="7.delay-light">(.+?)</span>'

pattern = re.compile(regex)

pattern2 = re.compile(regex2)

distance_km = re.findall(pattern,htmltext)

distance_time = re.findall(pattern2,htmltext)

print(distance_km)
print(distance_time)
  • Sorry, I don't know why my regexes suddenly look different. They are: regex = b'(.+?) km' regex2 = b'(.+?)' – Anna Hviid Heickendorff Jun 06 '16 at 11:40
  • [Don't parse HTML with regex](http://stackoverflow.com/a/1732454/2482744). Share the actual version of Python as shown by `python --version`, don't just say 'latest'. If something is wrong with the question, then edit it; don't comment. – Alex Hall Jun 06 '16 at 11:44
  • Make your life easier: https://developers.google.com/maps/documentation/distance-matrix/intro – Vasili Syrakis Jun 06 '16 at 11:55
  • The string ` – Simon Fraser Jun 06 '16 at 11:56
  • @AnnaHviidHeickendorff while there are other choices for parsing html, the simple way to make your code work is to fix the regex pattern string, please check the answer below. – lulyon Jun 09 '16 at 17:53

1 Answer

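In Python regular expressions, certain characters are metacharacters with a special meaning; `.` is one of them. To match such a character literally, escape it with a preceding `\`. Note that `<` and `>` are not metacharacters in Python's `re` module, so escaping them is harmless but does not change what the pattern matches.

So the following patterns, regex and regex2: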

regex = b'<span jstcache="1146">(.+?) km</span>'

regex2 = b'<span jstcache="1145" class="delay-light" jsan="7.delay-light">(.+?)</span>'

should be:

# raw bytes literals (rb'...') pass the backslashes through to re unchanged

regex = rb'\<span jstcache="1146">(.+?) km\</span\>'

regex2 = rb'\<span jstcache="1145" class="delay-light" jsan="7\.delay-light">(.+?)\</span\>'
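One way to check patterns like these without hitting the network is to run them against a small hand-written sample. The sketch below does that for both the unescaped and the escaped forms. The `sample` bytes and the `extract` helper are made up for illustration and are not real Google Maps output.

```python
import re

# Hypothetical sample, hand-written to mimic the markup the patterns expect.
# It is NOT real Google Maps output.
sample = (
    b'<div><span jstcache="1146">48.2 km</span>'
    b'<span jstcache="1145" class="delay-light" jsan="7.delay-light">41 min</span></div>'
)

# Patterns as written in the question (no escaping).
plain_km = re.compile(rb'<span jstcache="1146">(.+?) km</span>')
plain_time = re.compile(
    rb'<span jstcache="1145" class="delay-light" jsan="7.delay-light">(.+?)</span>'
)

# The same patterns with backslash escapes, written as raw bytes literals.
escaped_km = re.compile(rb'\<span jstcache="1146">(.+?) km\</span\>')
escaped_time = re.compile(
    rb'\<span jstcache="1145" class="delay-light" jsan="7\.delay-light">(.+?)\</span\>'
)


def extract(pattern, html):
    """Return every captured group that the compiled pattern finds in html (bytes)."""
    return pattern.findall(html)


print(extract(plain_km, sample))      # [b'48.2']
print(extract(escaped_km, sample))    # [b'48.2']
print(extract(plain_time, sample))    # [b'41 min']
print(extract(escaped_time, sample))  # [b'41 min']
```

If both forms find the values in a sample like this but the live page still gives empty lists, the downloaded `htmltext` probably does not contain those exact spans. Checking whether `b'jstcache="1146"' in htmltext` is a quick way to confirm that.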
lulyon