I want to extract every hostPageDisplayUrl value from a text file I have. The first few lines of the file are shown below:

{"instrumentation": {"pageLoadPingUrl": "https://www.bingapis.com/api/ping/pageload?IG=D4C9291ACBE848AE9C3C4C6EAED1E6AC&CID=0FD7B016F5926C780285BA44F4806D9E&Type=Event.CPT&DATA=0"}, "_type": "Images", "displayRecipeSourcesBadges": true, "value": [{"contentUrl": "http://www.bing.com/cr?IG=D4C9291ACBE848AE9C3C4C6EAED1E6AC&CID=0FD7B016F5926C780285BA44F4806D9E&rd=1&h=QWSSSaNP6OdarVmpdZ2TGGupNBCF0-Ue_w2zKVqczwk&v=1&r=http%3a%2f%2fphotos.wikimapia.org%2fp%2f00%2f02%2f91%2f36%2f73_big.jpg&p=DevEx,5008.1", "accentColor": "2B3C71", "height": 375, "hostPageDisplayUrl": "wikimapia.org/1649944/Bahawalpur-Railway-Station", "name": "Bahawalpur Railway Station - Bahawalpur (\u0628\u06c1\u0627\u0648\u0644\u067e\u0648\u0631)", "width": 500, "imageId": "5464C96913992D44983D02E302F166C57BC6DA26", "imageInsightsToken": "ccid_CUojXAsn*mid_5464C96913992D44983D02E302F166C57BC6DA26*simid_608054236795568956", "datePublished": "2010-02-21T22:19:00", "encodingFormat": "jpeg", "webSearchUrl": "https://www.bing.com/cr?IG=D4C9291ACBE848AE9C3C4C6EAED1E6AC&CID=0FD7B016F5926C780285BA44F4806D9E&rd=1&h=Fbz9jxTPMT44aF3aWlDgNwU7Zhr3qYbOco653N9vnIc&v=1&r=https%3a%2f%2fwww.bing.com%2fimages%2fsearch%3fview%3ddetailv2%26FORM%3dOIIRPO%26q%3dBahawalpurRailwayStation%26id%3d5464C96913992D44983D02E302F166C57BC6DA26%26simid%3d608054236795568956&p=DevEx,5006.1", "hostPageUrl": "http://www.bing.com/cr?IG=D4C9291ACBE848AE9C3C4C6EAED1E6AC&CID=0FD7B016F5926C780285BA44F4806D9E&rd=1&h=MVElDiTqkKkcRJKEQxgr1yxRbwh-DpMNfT7lA6g1ivg&v=1&r=http%3a%2f%2fwikimapia.org%2f1649944%2fBahawalpur-Railway-Station&p=DevEx,5007.1", "thumbnailUrl": "https://tse1.mm.bing.net/th?id=OIP.CUojXAsnV5KRBVF6-RIlLwEsDh&pid=Api", "thumbnail": {"width": 300, "height": 225}, "contentSize": "38571 B"}, {"contentUrl": 
"http://www.bing.com/cr?IG=D4C9291ACBE848AE9C3C4C6EAED1E6AC&CID=0FD7B016F5926C780285BA44F4806D9E&rd=1&h=yrOFma0zG8eUzUVY0l7jt_KfBAXPyuTyuXa9jJjeFR0&v=1&r=http%3a%2f%2fstatic.panoramio.com%2fphotos%2flarge%2f84118355.jpg&p=DevEx,5014.1", "accentColor": "A36728", "height": 768, "hostPageDisplayUrl": "panoramio.com/photo/84118355", "name": "Panoramio - Photo of Bahawalpur railway station", "width": 1024, "imageId": "FE04EA82163F27DC0A8449CF2086E4DA4F359DF7", "imageInsightsToken": "ccid_1683LeSg*mid_FE04EA82163F27DC0A8449CF2086E4DA4F359DF7*simid_608010054465029867", "datePublished": "2013-01-01T12:00:00", "encodingFormat": "jpeg", "webSearchUrl": "https://www.bing.com/cr?IG=D4C9291ACBE848AE9C3C4C6EAED1E6AC&CID=0FD7B016F5926C780285BA44F4806D9E&rd=1&h=0NEX-sC8BaLrZ9HDkSbA_7kztZ1BoVoihkkvnL2tGiQ&v=1&r=https%3a%2f%2fwww.bing.com%2fimages%2fsearch%3fview%3ddetailv2%26FORM%3dOIIRPO%26q%3dBahawalpurRailwayStation%26id%3dFE04EA82163F27DC0A8449CF2086E4DA4F359DF7%26simid%3d608010054465029867&p=DevEx,5012.1", "hostPageUrl": "http://www.bing.com/cr?IG=D4C9291ACBE848AE9C3C4C6EAED1E6AC&CID=0FD7B016F5926C780285BA44F4806D9E&rd=1&h=l9wqPINQPoe9u5N_qiFUtBQ6PrxdwEPiwObrCwBTQ2U&v=1&r=http%3a%2f%2fpanoramio.com%2fphoto%2f84118355&p=DevEx,5013.1", "thumbnailUrl": "https://tse2.mm.bing.net/th?id=OIP.1683LeSgJHoFhxX-tKhGSAEsDh&pid=Api", "thumbnail": {"width": 300, "height": 225}, "contentSize": "125011 B"}, {"contentUrl": "http://www.bing.com/cr?IG=D4C9291ACBE848AE9C3C4C6EAED1E6AC&CID=0FD7B016F5926C780285BA44F4806D9E&rd=1&h=1OS0LXGeQJbC9gOsRy00e-ae0535j7iNl4qiaNTTG0I&v=1&r=http%3a%2f%2fphotos.wikimapia.org%2fp%2f00%2f05%2f21%2f47%2f89_big.jpg&p=DevEx,5020.1", "accentColor": "5B4F36", "height": 361, "hostPageDisplayUrl": "wikimapia.org/1649944/Bahawalpur-Railway-Station", "name": "Bahawalpur Railway Station - Bahawalpur (\u0628\u06c1\u0627\u0648\u0644\u067e\u0648\u0631)", "width": 500, "imageId": "5464C96913992D44983D6D8CBD36CB6E679FEA3C", "imageInsightsToken": 
"ccid_JhLSwAc0*mid_5464C96913992D44983D6D8CBD36CB6E679FEA3C*simid_607998234704153808", "datePublished": "2016-12-09T20:58:00", "encodingFormat": "jpeg", "webSearchUrl": "https://www.bing.com/cr?IG=D4C9291ACBE848AE9C3C4C6EAED1E6AC&CID=0FD7B016F5926C780285BA44F4806D9E&rd=1&h=IJTtTeRFNBA0xr1DyZcz6AMb43pJFV25m3WrDfLhQls&v=1&r=https%3a%2f%2fwww.bing.com%2fimages%2fsearch%3fview%3ddetailv2%26FORM%3dOIIRPO%26q%3dBahawalpurRailwayStation%26id%3d5464C96913992D44983D6D8CBD36CB6E679FEA3C%26simid%3d607998234704153808&p=DevEx,5018.1", "hostPageUrl": "http://www.bing.com/cr?IG=D4C9291ACBE848AE9C3C4C6EAED1E6AC&CID=0FD7B016F5926C780285BA44F4806D9E&rd=1&h=MVElDiTqkKkcRJKEQxgr1yxRbwh-DpMNfT7lA6g1ivg&v=1&r=http%3a%2f%2fwikimapia.org%2f1649944%2fBahawalpur-Railway-Station&p=DevEx,5019.1", "thumbnailUrl": "https://tse1.mm.bing.net/th?id=OIP.JhLSwAc0HwFeWsHjAUYStgEsDY&pid=Api", "thumbnail": {"width": 300, "height": 216}, "contentSize": "28945 B"}, {"contentUrl": "http://www.bing.com/cr?IG=D4C9291ACBE848AE9C3C4C6EAED1E6AC&CID=0FD7B016F5926C780285BA44F4806D9E&rd=1&h=t6oOsr-23sNP-TFFzn39BVuagjYmXknVGiIWYD_tJv0&v=1&r=http%3a%2f%2fnativepakistan.com%2fwp-content%2fuploads%2fPhoto-of-Bahawalpur-RailwayS-tation-Photos-of-Bahawalpur.jpg&p=DevEx,5026.1", "accentColor": "49418A", "height": 347, "hostPageDisplayUrl": "nativepakistan.com/photos-of-bahawalpur", "name": "Photo of Bahawalpur Railway Station - Photos of Bahawalpur", "width": 500, "imageId": "7A05E50C94144666BFEB7BEECE6FB3DFC3313E18", "imageInsightsToken": "ccid_wS0pep46*mid_7A05E50C94144666BFEB7BEECE6FB3DFC3313E18*simid_607992170213084482", "datePublished": "2012-09-21T23:07:00", "encodingFormat": "jpeg", "webSearchUrl": 
"https://www.bing.com/cr?IG=D4C9291ACBE848AE9C3C4C6EAED1E6AC&CID=0FD7B016F5926C780285BA44F4806D9E&rd=1&h=2kFu0Xn07bcJKuZI03iY3Ihq99ZiKFOvd0PXvVWqt94&v=1&r=https%3a%2f%2fwww.bing.com%2fimages%2fsearch%3fview%3ddetailv2%26FORM%3dOIIRPO%26q%3dBahawalpurRailwayStation%26id%3d7A05E50C94144666BFEB7BEECE6FB3DFC3313E18%26simid%3d607992170213084482&p=DevEx,5024.1", "hostPageUrl": "http://www.bing.com/cr?IG=D4C9291ACBE848AE9C3C4C6EAED1E6AC&CID=0FD7B016F5926C780285BA44F4806D9E&rd=1&h=ht8SkbUIRgMkFq4yXvbHpmsINok4VTcxu0FiwMayk9A&v=1&r=http%3a%2f%2fnativepakistan.com%2fphotos-of-bahawalpur%2f&p=DevEx,5025.1", "thumbnailUrl": "https://tse3.mm.bing.net/th?id=OIP.wS0pep46eEsGSSY39RNxLQEsDQ&pid=Api", "thumbnail": {"width": 300, "height": 20

I am using this code, but it does not give accurate results:

start = 0
while True:                                                       
  p = data[start:].find('hostPageDisplayUrl')                         
  if p == -1: buffer                                            
  q = data[start+p+12:].find('hostPageDisplayUrl')                           
  r = data[start+p+q+12:].find('.')                             
  print (data[start+p+q+12:start+p+q+r+12] , file = log)        
  start = start+p+q+r+12
Ali Jafar

2 Answers

As mentioned, your data looks like JSON, but the excerpt you posted is not complete, valid JSON. After checking that the full file is indeed valid JSON, you can do something like this:

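import json

def _finditems(obj, key):  # adapted from http://stackoverflow.com/a/14962509/2585092
    """Collect every value stored under `key`, searching nested dicts and lists."""
    found = []
    if isinstance(obj, dict):
        for k, v in obj.items():
            if k == key:
                found.append(v)
            else:
                found.extend(_finditems(v, key))
    elif isinstance(obj, list):
        # The entries live in the "value" list, so lists must be searched too.
        for item in obj:
            found.extend(_finditems(item, key))
    return found

def get_urls(file_name):
    try:
        with open(file_name) as file:
            data = json.load(file)
    except FileNotFoundError:
        return None

    return _finditems(data, 'hostPageDisplayUrl')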

Alternatively using regular expressions:

def find_urls(text):
    import re

    pattern = r'\"hostPageDisplayUrl\":\s*"([^"]*)"'
    return re.findall(pattern, text)

print(find_urls(test))  # `test` is a string holding the contents of your file

Result for your example:
['wikimapia.org/1649944/Bahawalpur-Railway-Station', 'panoramio.com/photo/84118355', 'wikimapia.org/1649944/Bahawalpur-Railway-Station', 'nativepakistan.com/photos-of-bahawalpur']

Warning: This only works while your URLs do not contain (escaped) double quotation marks "!
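
If escaped quotes are a concern, one possible variation is to let the pattern step over backslash escapes and then decode each captured value with json.loads. This is only a sketch: the name find_urls_escaped and the sample string are made up for illustration, and it assumes the text around each match is otherwise well-formed JSON.

```python
import json
import re

# Match the key, optional whitespace, then a JSON string body in which
# any backslash escape (such as \") is consumed as a two-character unit.
PATTERN = re.compile(r'"hostPageDisplayUrl":\s*"((?:[^"\\]|\\.)*)"')

def find_urls_escaped(text):
    # Re-wrap each capture in quotes so json.loads can decode the escapes.
    return [json.loads('"' + raw + '"') for raw in PATTERN.findall(text)]

# Hypothetical sample containing an escaped double quote inside the value.
sample = r'{"hostPageDisplayUrl": "example.org/a\"b", "width": 1}'
print(find_urls_escaped(sample))
```

Like the simpler pattern, this works on the raw text, so it does not require the whole file to parse as JSON.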


Edit: to get only the base URLs:

def find_urls(text):
    import re

    pattern = r'\"hostPageDisplayUrl\":\s*"([^"]*)"'
    return re.findall(pattern, text)

def base_url(url):
    import re

    return re.search(r'(https?://)?(www\.)?([^/]*)', url)[3]

print([base_url(u) for u in find_urls(test)])  # `test` is a string holding the contents of your file

Result for your example:
['wikimapia.org', 'panoramio.com', 'wikimapia.org', 'nativepakistan.com']

Regular expression explanation:

\"hostPageDisplayUrl\":\s*"([^"]*)"

We search for a string enclosed in double quotes and capture what lies between them: "([^"]*)"
Before it, separated by any amount of whitespace \s*, we require the exact string "hostPageDisplayUrl":

(https?://)?(www\.)?([^/]*)

Ignoring any leading http(s):// and www., we want the part of the URL before the first / and capture it as group 3: ([^/]*)
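
The same extraction can also be sketched with the standard library's urllib.parse instead of a regular expression. This is an alternative, not the approach used above; the helper name base_host is hypothetical, and it assumes that values without a scheme (as in hostPageDisplayUrl) begin with the host name.

```python
from urllib.parse import urlsplit

def base_host(url):
    """Return the host part of `url` without a leading "www."."""
    # urlsplit only recognises the host when the URL has a scheme or starts
    # with "//", so prefix scheme-less values such as "wikimapia.org/...".
    if "://" not in url:
        url = "//" + url
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host

urls = ["wikimapia.org/1649944/Bahawalpur-Railway-Station",
        "http://www.panoramio.com/photo/84118355"]
print([base_host(u) for u in urls])
```

Note that urlsplit lowercases the host name and drops any port number, which the regular expression does not do.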

yspreen
  • Using regular expressions, what do I have to do if I want to fetch only the base URLs, so that the output looks like this: wikimapia.org panoramio.com wikimapia.org nativepakistan.com? – Ali Jafar Apr 09 '17 at 14:44
  • Added that to my answer. – yspreen Apr 09 '17 at 14:49
  • `def base_url(url): import re return re.search(r'(https?://)?(www\.)?([^/]*)', url)[3]` followed by `print([base_url(u) for u in find_urls(test)])` is not working. Can you tell me how to do this when the URLs are fetched from a text file? – Ali Jafar Apr 09 '17 at 18:56
  • replace `test` with `open("json_file.txt").read()`: `print([base_url(u) for u in find_urls(open("json_file.txt").read())])` or just do `test = open("json_file.txt").read()` before the last line – yspreen Apr 09 '17 at 18:59

From your comment I understand that the file contains JSON saved as a text file. In that case you can load the JSON directly from the file and read the values from it. Your code should look like this:

import json

with open("json_file.txt") as f:
    json_data = json.load(f)
for item in json_data["value"]:  # the entries are stored in the "value" list
    print(item["hostPageDisplayUrl"])  # this will print all the URLs

I posted this because a programming language is meant to get the job done efficiently and in as few lines of code as possible.
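
A minimal, self-contained sketch of that idea follows. It assumes the file has the same shape as the excerpt in the question, that is, a top-level object whose "value" key holds a list of image entries. The inline sample string is a shortened, hypothetical stand-in for the real file, and display_urls is an illustrative name.

```python
import json

# Shortened stand-in for the file contents; the real file has many more fields.
sample = '''{"_type": "Images", "value": [
  {"hostPageDisplayUrl": "wikimapia.org/1649944/Bahawalpur-Railway-Station", "width": 500},
  {"hostPageDisplayUrl": "panoramio.com/photo/84118355", "width": 1024},
  {"name": "entry without a display URL"}
]}'''

def display_urls(json_text):
    """Return the hostPageDisplayUrl of every entry in the top-level "value" list."""
    data = json.loads(json_text)
    # The membership test skips entries that lack the key instead of raising KeyError.
    return [entry["hostPageDisplayUrl"]
            for entry in data.get("value", [])
            if "hostPageDisplayUrl" in entry]

print(display_urls(sample))
```

This only works when the whole file parses as JSON; a truncated file like the excerpt in the question makes json.loads raise an error.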

Mani