I'm trying to scrape data from two pages (a main URL and a second URL) and export it to CSV. This is my setup:

from bs4 import BeautifulSoup
import urllib.request
from urllib.parse import urlparse, parse_qs
import csv

def get_page(url):
    # Download the page and return its HTML decoded as UTF-8 text.
    request = urllib.request.Request(url)
    response = urllib.request.urlopen(request)
    mainpage = response.read().decode('utf-8')
    return mainpage

# Parse both pages with the same parser.
mainpage = get_page('www.website1.com')
mainpage_parser = BeautifulSoup(mainpage,'html.parser')
secondpage = get_page('www.website2.com')
secondpage_parser = BeautifulSoup(secondpage,'html.parser')

Both pages follow the same pattern (Title, Address, and so on), so I use "find" or "find_all" on each class. For example:

name = None
try:
    name = page_parser.find("h1",{"class":"xxx"}).find("a").get_text()
except AttributeError:
    # find() returned None, so the element was not on the page.
    pass
print(name)

That works. However, I couldn't get the "lat" and "lon" values from the URL in the src attribute of this element:

<img class="aaa" alt="map" data-track-id="static-map" width="97" height="142" src="https://www.website.com/aaaaaaa;height=284&amp;lat=18.111&amp;lon=98.111&amp;level=15&amp;returnImage=true">

The code I'm using to get the latitude and longitude is:

for gps in secondpage_parser.find_all('img',{"class":"aaa"}, src=True):
    parsed_url = urlparse(gps['src'])
    mykeys = ['lat', 'lon']
    gpslocation = [parse_qs(parsed_url.query)[k][0] for k in mykeys]
print(gpslocation)

But the line "gpslocation = [parse_qs(parsed_url.query)[k][0] for k in mykeys]" raises "KeyError: 'lat'".

I would like to know where my mistake is and how I should fix it. Please help.

Meekao
  • @StevenRumbalski Thank you so much. Since I'm quite new to Python: do you mean I should replace my print(gpslocation) with print(gps['src'], parse_qs(parsed_url.query))? I got the same error (KeyError: 'lat'). – Meekao Sep 13 '18 at 15:44
  • @StevenRumbalski You are right. The full URL doesn't contain any "?", so "query" doesn't apply in this case, right? I'm not sure, but if I understand correctly, do you mean I should try string handling that works with "&"? – Meekao Sep 13 '18 at 16:00

1 Answer

This URL has no query string, but it does have parameters (see "What is the difference between URL parameters and query strings?"). So when you try to parse the query string, you get an empty dictionary, hence the KeyError.

"https://www.website.com/aaaaaaa;height=284&amp;lat=18.111&amp;lon=98.111&amp;level=15&amp;returnImage=true"
#                               ^--- semicolon, not question mark

Result of print(parsed_url)

ParseResult(
    scheme='https', 
    netloc='www.website.com', 
    path='/aaaaaaa',
    params='height=284&amp;lat=18.111&amp;lon=98.111&amp;level=15&amp;returnImage=true',
    query='', 
    fragment='')
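To see the failure in isolation, here is a minimal sketch that parses the same URL and compares the two fields. It assumes the src value has already been entity-decoded (so it contains "&" rather than "&amp;"), which is what an HTML parser normally hands back for attribute values; the names query_values and param_values are only illustrative.

```python
from urllib.parse import urlparse, parse_qs

# Assumption: the attribute value is already entity-decoded ("&", not "&amp;").
url = ("https://www.website.com/aaaaaaa;height=284&lat=18.111"
       "&lon=98.111&level=15&returnImage=true")
parsed_url = urlparse(url)

# There is no "?" in the URL, so the query component is an empty string
# and parse_qs() has nothing to turn into keys.
query_values = parse_qs(parsed_url.query)

# Everything after the ";" ends up in the params component instead.
param_values = parse_qs(parsed_url.params)

print(query_values)            # {}
print("lat" in query_values)   # False, so query_values['lat'] raises KeyError
print("lat" in param_values)   # True
```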

The key here is to parse the parameters. To fix your code, change parsed_url.query to parsed_url.params:

gpslocation = [parse_qs(parsed_url.params)[k][0] for k in mykeys]
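If some of the image URLs might lack these values, or might use a regular "?" query string, a slightly more defensive version could look like the sketch below. The helper name extract_lat_lon is hypothetical, and the replace("&amp;", "&") step is an assumption that only matters when the string is copied from raw HTML instead of coming from a parser.

```python
from urllib.parse import urlparse, parse_qs

def extract_lat_lon(src):
    """Return (lat, lon) as strings, or None if either value is missing."""
    # Assumption: undo HTML escaping in case src came from raw markup.
    parsed_url = urlparse(src.replace("&amp;", "&"))
    # Look in the ";params" part first, then fall back to the "?query" part.
    values = parse_qs(parsed_url.params) or parse_qs(parsed_url.query)
    try:
        return values["lat"][0], values["lon"][0]
    except KeyError:
        return None

src = ("https://www.website.com/aaaaaaa;height=284&amp;lat=18.111"
       "&amp;lon=98.111&amp;level=15&amp;returnImage=true")
print(extract_lat_lon(src))
```

Returning None instead of raising lets the scraping loop skip images without coordinates and keep going.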
Steven Rumbalski
  • Thank you. Your answer really makes sense. However, I still have the problem. I've put the result, with the full URL from the src attribute, here: https://docs.google.com/document/d/1h55yVubv4N0LHhGaVLVW7gfWp46hrKw9J-Mj02PkBnc/edit?usp=sharing It might help clarify the problem. Sorry if sharing it in .doc format is inconvenient. – Meekao Sep 13 '18 at 16:29
  • @Meekao: Yeah, I'm not going to open a .doc. – Steven Rumbalski Sep 13 '18 at 16:33
  • I do apologize if it was not appropriate. I found out that the KeyError occurred because I hadn't removed "_all" (I was supposed to use .find, not .find_all). So now I'm following your suggestion, but the result shows as {}. Could you please suggest an approach or any learning resources so I can try to figure it out? – Meekao Sep 13 '18 at 17:10
  • @Meekao: I didn't know the answer when you asked the question. But I would consider adding a `print(parsed_url)` so you can see how the url has been parsed. That should give you an angle to start seeking answers. – Steven Rumbalski Sep 13 '18 at 17:17
  • I'll try. Thank you so much again. – Meekao Sep 13 '18 at 17:30