
I have a JSON file containing several dictionaries, each with lots of information about a specific website. I would like to write a program that iterates through the dictionaries and outputs only the HTML code in each one, which is accessed as data["p80"]["http"]["get"]["body"].

Below is an example of two of the dictionaries in the JSON file.

{"p80":{"http":{"get":{"body": "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\"\n\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">\n<html xmlns=\"http://www.w3.org/1999/xhtml\">\n\t<head>\n\t\t<title>Motormax</title>\n                    <meta name=viewport content=\"width=device-width, initial-scale=1.0\" />\r\n<meta name=\"google-site-verification\" content=\"wqSGgrJPlLskInflNQPXn9oY25etuJYuRQonZ0k0I_o\" />\r\n<link href='https://fonts.googleapis.com/css?family=Lato:400,700,900' rel='stylesheet' type='text/css'>\r\n        \t\t<meta name=\"description\" content=\"\" /> \n\t\t<meta name=\"keywords\" content=\"Motormaax, Renault, Chevrolet, Nissan, Peugeot, Volkswagen, Ford, Planes de ahorro, financiaci\u00f3n, cuotas, autos en cuotas\" /> \n\t\t<meta http-equiv=\"Content-type\" content=\"text/html; charset=UTF-8\" />\n\t\t\n        <script src=\"/processedjs/kms427.js\" type=\"text/javascript\"></script>        <link rel=\"stylesheet\" type=\"text/css\" href=\"/processedcss/kms427.css\" />\n\t\t\n\t\t<script type=\"text/javascript\">\n\t\t\tvar dataLayer = [];\n\t\t</script>\n        <script type=\"text/javascript\">(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':\r\nnew Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],\r\nj=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src=\r\n'//www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);\r\n})(window,document,'script','dataLayer','GTM-582XL3');</script>\n\t\t\n\t\t\n\n\n\t</head>\n\t<body>\n\t<div style=\"visibility: hidden; display: none;\"></div>\r\n<div class=\"main\">\r\n\t\t\t<p><img src=\"/templatepagina/template_246/images/logo_motormax.png\" alt=\"Motormax\" /></p>\r\n\t\t\t<h1>TE ACOMPA\u00d1AMOS EN LA COMPRA DE TU <b>NUEVO AUTO</b></h1>\r\n\t\t\t<p id=\"line\"></p>\r\n\t\t\t\r\n\t\r\n<ul class=\"marcas\">\t\t\t\r\n<a href=\"/peugeot\"><li id=\"peugeot\"><p>Peugeot</p></li></a>\r\n\t\t\t\t<a href=\"/fiat\"><li 
id=\"fiat\"><p>fiat</p></li></a>\r\n\t\t\t\t<a href=\"/ford\"><li id=\"ford\"><p>ford</p></li></a>\r\n\t\t\t\t<a href=\"/renault\"><li id=\"renault\"><p>renault</p></li></a>\r\n                                <a href=\"/volkswagen\"><li id=\"vw\"><p>vw</p></li></a>\r\n\t\t\t\r\n\t\r\n\t\t\t\t<!-- <li id=\"nissan\"><p>nissan</p></li> -->\r\n\t\t\t</ul>\r\n\t\t</div>\t\r\n</body>\n</html>", "body_sha256": "fEHZCw9VEdmwVabOd0g8TntigYiA9AsL+sKicdipejU=", "headers": {"cache_control": "post-check=0, pre-check=0", "content_length": "2118", "content_type": "text/html; charset=UTF-8", "expires": "Thu, 19 Nov 1981 08:52:00 GMT", "pragma": "no-cache", "server": "Apache/2.4.6 (CentOS) OpenSSL/1.0.1e-fips PHP/5.4.16", "unknown": [{"key": "date", "value": "Mon, 07 Nov 2016 16:36:25 GMT"}], "x_powered_by": "PHP/5.4.16"}, "metadata": {"description": "Apache httpd 2.4.6", "manufacturer": "Apache", "product": "httpd", "version": "2.4.6"}, "status_code": 200, "status_line": "200 OK", "title": "Motormax", "timestamp":"2016-11-09 12:28:36"}}}}
{"p80":{"http":{"get":{"body": " \n<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\"\n\"http://www.w3.org/TR/html4/loose.dtd\">\n<html>\n<head>\n<title>Kody pocztowe - wyszukiwarka</title>\n<META HTTP-EQUIV=\"Content-Type\" CONTENT=\"text/html; charset=iso-8859-2\">\n<META NAME=\"Keywords\" CONTENT=\"kody pocztowe, kod pocztowy, Poczta Polska, przesy\ufffdki, listy\">\n<META NAME=\"Description\" CONTENT=\"Na tej stronie mo\ufffdesz wyszuka\ufffd kody pocztowe dowolnych miejscowo\ufffdci w Polsce. Podaj miasto, ulic\ufffd i znajd\ufffd potrzebny Ci kod pocztowy. Jest on niezb\ufffddny, je\ufffdli list lub inna przesy\ufffdka ma dotrze\ufffd do adresata na terenie Polski.\">\n<META HTTP-EQUIV=\"Content-Language\" CONTENT=\"PL\">\n<META NAME=\"distribution\" CONTENT=\"Global\">\n<META NAME=\"revisit-after\" CONTENT=\"2 days\">\n<META NAME=\"robots\" CONTENT=\"INDEX,FOLLOW\">\n<style type=\"text/css\">body, td {\nfont-family:arial;\nfont-size:12px;\nmargin:10px 0 10px 0;\ncolor:#000000;\n}\n\n.row { padding: 4px 10px 4px 0; text-align:left}\ninput { }\nimg { border:0;}\n.thead {\ncolor:#FFFFFF; font-size:10px;\nbackground-image:url(http://00-000.pl/gfx/lay/box_top_bg.gif);\npadding:0;\n}\n.pltd{\npadding-right:40px;\ntext-align:right;\nbackground-image:url(http://00-000.pl/gfx/lay/box_bg.gif);\n\ncolor:#000000;\nfont-family:arial;\nfont-size:13px;\nfont-weight:bold;\n}\n.zera{\ncolor:#f26624;\nfont-family:arial;\nfont-size:30px;\n}\n.zeras{\ncolor:#f26624;\nfont-family:arial;\nfont-size:20px;\n}\n.top_right{\nbackground-image:url(http://00-000.pl/gfx/top_bg.gif);\ntext-align:right;\nwidth:auto;\ncolor:#FFFFFF; font-weight:bold; padding-right:20px;}\n.top_bar{\nbackground-color:#eeeeee;\npadding:0 0px 0 8px;\nfont-size:10px;\n\n}\n\na:link{\ntext-decoration:underline;\ncolor:#000000;\n}\na:visited{\ntext-decoration:underline;\ncolor:#000000;\n}\na:hover{ 
color:#FF0000;\ntext-decoration:none;\n}\na:link.white{\ncolor:#ffffff;\ntext-decoration:none;\n\n}\na:visited.white{\ncolor:#ffffff;\ntext-decoration:none;\n\n}\na:hover.white{ color:#FF3300;\ntext-decoration:underline;\n\n}\n\na:link.head{\ncolor:#ffffff;\ntext-decoration:none;\nfont-weight:bold;\n}\na:visited.head{\ncolor:#ffffff;\ntext-decoration:none;\nfont-weight:bold;\n}\na:hover.head{ color:#FFFF00;\ntext-decoration:underline;\nfont-weight:bold;\n}\n\nli {\nlist-style-type:square;\nlist-style-position:inside;\n}\nh1{\nfont-family:arial;\nfont-size:25px;\nmargin:0 0 5px 0;\n}\nh3{\nfont-size:15px;\ncolor:#993300;\nmargin:0 0 10px 0;\npadding:0;\n\n}\na:link.linkbox{\ncolor:#009900;\ntext-decoration:none;\n}\na:visited.linkbox{\ncolor:#009900;\ntext-decoration:none;\n}\na:hover.linkbox{\ncolor:#009900;\ntext-decoration:underline;\n}\n\n\n.top_box_orange {\nbackground-image:url(http://00-000.pl/gfx/lay/box_top_bg_orange.gif);\nborder-bottom:1px solid #ffffff; \nfont-weight:bold; padding-left:9px;\nheight:21px;\ncolor:#FFFFFF;\n}\n.top_box_grey {\nbackground-image:url(http://00-000.pl/gfx/lay/box_top_bg_grey.gif);\nborder-bottom:1px solid #ffffff; \nfont-weight:bold; height:21px; padding-left:9px;\ncolor:#FFFFFF;\n}\n.top_box_grey_k {background-color:#999999;\nborder-bottom:1px solid #ffffff; \nfont-weight:bold; height:21px; padding-left:9px;\ncolor:#FFFFFF;\n}\n\n.box{\nbackground-image:url(http://00-000.pl/gfx/lay/box_bg.gif);\npadding:15px 10px 20px 10px;\nline-height:15px\n}\n\n.form_ok {\nmargin:10px 0 10px 0;\nbackground-color:#FFFFCC;\ncolor:#99CC00;\nfont-size:14px;\nfont-weight:bold;\npadding:20px;\ntext-align:left;\nborder: 1px solid #009900;\n}\n.form_bad {\nmargin:10px 0 10px 0;\nbackground-color:#FFFFCC;\ncolor:#CC0000;\nfont-size:14px;\nfont-weight:bold;\npadding:20px;\ntext-align:left;\nborder: 1px solid #990000;\n}\n\na.button {\ndisplay:block;\nbackground-color:#f26623;\ncolor:#fff;\npadding:5px 10px;\n width:150px;\nmargin:0 10px 0 
10px;\nfloat:right;\ntext-align:center;\ntext-decoration:none;\n}\na:visited.button { color:#fff;}\na:hover.button {\ntext-decoration:underline;\ncolor:#000;\n\n}\n</style>\n</head>\n<body>\n\n<table cellpadding=\"0\" cellspacing=\"0\" width=\"80%\" align=\"center\" >\n<tr><td align=\"left\" width=\"190\" colspan=\"2\"><a href=\"http://00-000.pl\"><img src=\"http://00-000.pl/gfx/logo.gif\" border=\"0\" width=\"190\" height=\"70\"></a></td>\n<td width=\"100%\" class=\"top_right\" colspan=\"2\">wyszukiwarka kod\ufffdw pocztowych</Td>\n<td width=\"4\"><img src=\"http://00-000.pl/gfx/top_right.gif\" border=\"0\" width=\"4\" height=\"70\"></td>\n</tr>\n\n<tr>\n<td width=\"4\"><img src=\"http://00-000.pl/gfx/lay/top_bar_left.gif\" border=\"0\" width=\"4\" height=\"21\"></td>\n<Td width=\"186\" class=\"top_bar\">Ostatnia aktualizacja: ", "body_sha256": "/OYNeyTKqqDQNpmG1rmKfK8OYAKfUDP1l8jGUnVlyR8="}}}}

Here's my code so far.

import json
from pprint import pprint
import sys

if __name__ == "__main__":
    file = open('sample101.json', 'r')

    for dict in file:
        for key, value in file.items():
            pprint(file["p80"]["http"]["get"]["body"])

    file.close()

Any help would be greatly appreciated as I am new to Python. Thank you so much!

  • As the code stands now, `file` is not a dict. You will need to read in the file and then convert the JSON to a dict. Searching SO for an answer can be very helpful. An example: http://stackoverflow.com/questions/19483351/converting-json-string-to-dictionary-not-list – Stephen Rauch Jan 09 '17 at 06:46
  • I think a more useful example would be the original file but with most of its stuff removed. The first two dicts, but most of the dict contents removed and an extremely simple html document. – tdelaney Jan 09 '17 at 07:05
  • `pprint` isn't going to pretty print html... it only sees a very long string and doesn't know tags and etc... – tdelaney Jan 09 '17 at 07:06

2 Answers


json.load(fp, *, cls=None, object_hook=None, parse_float=None, parse_int=None, parse_constant=None, object_pairs_hook=None, **kw)

Deserialize fp (a .read()-supporting file-like object containing a JSON document) to a Python object using this conversion table.

import json

# The with block closes the file once it has been parsed.
with open('sample101.json', 'r') as file:
    py_dict = json.load(file)
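
A minimal sketch of the call above, using `io.StringIO` as a stand-in for the opened file (the sample strings are made up for illustration). It also shows a limitation worth knowing: `json.load` expects the whole input to be a single JSON document, so input with one object per line raises `json.JSONDecodeError`.

```python
import io
import json

# Stand-in for a file containing exactly one JSON document.
single = io.StringIO('{"p80": {"http": {"get": {"body": "<html></html>"}}}}')
py_dict = json.load(single)
body = py_dict["p80"]["http"]["get"]["body"]

# Stand-in for a file with one JSON object per line: not a single document.
multi = io.StringIO('{"a": 1}\n{"b": 2}\n')
try:
    json.load(multi)
    multi_ok = True
except json.JSONDecodeError:
    # The parser stops after the first object and rejects the rest.
    multi_ok = False
```

If the file turns out to hold one object per line, each line has to be parsed on its own with `json.loads`.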
宏杰李

If I've got this right, you have a json file that holds a list of dictionaries and you want to extract html from the dictionaries. In that case, you need to parse the entire file as json and then the extraction is simple. Don't name a variable dict because it masks the built-in dict class, but otherwise this should do.

import json
from pprint import pprint

if __name__ == "__main__":
    # Parse the whole file as one JSON document (a list of dictionaries).
    with open('sample101.json', encoding='utf-8') as f:
        for data_dict in json.load(f):
            pprint(data_dict["p80"]["http"]["get"]["body"])

If you are worried about bad data, you could wrap this all in a try/except block and grab the items one at a time.

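with open('sample101.json', encoding='utf-8') as f:
    for data_dict in json.load(f):
        # Descend one key at a time so a failure names the missing level.
        for key in "p80", "http", "get", "body":
            try:
                data_dict = data_dict[key]
            except (TypeError, KeyError):
                print("Error at", key)
                print(repr(data_dict))
                raise  # or replace with break to move on to the next item
        else:
            print(data_dict)  # all four keys were found; this is the body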
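
The same key-by-key walk can be wrapped in a small helper that returns a default instead of raising. This is only a sketch: `dig` is a hypothetical name, not part of the `json` module, and the sample records are made up.

```python
def dig(obj, *keys, default=None):
    """Follow keys into nested containers; return default if any step fails."""
    for key in keys:
        try:
            obj = obj[key]
        except (TypeError, KeyError, IndexError):
            # TypeError covers indexing a str or None with a string key.
            return default
    return obj


record = {"p80": {"http": {"get": {"body": "<html></html>"}}}}
found = dig(record, "p80", "http", "get", "body")

# Here "http" maps to a string, so indexing it with "get" fails.
missing = dig({"p80": {"http": "not a dict"}}, "p80", "http", "get", "body")
```

Returning a default keeps the loop going over bad records, at the cost of no longer reporting which key was missing.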

UPDATE

Suppose it's not a JSON file but a file with one JSON string per line. Then we rework the loop a bit (and stop calling it xxx.json!).

with open('sample101.json', encoding='utf-8') as f:
    for line in f:
        data_dict = json.loads(line)  # each line is its own JSON document
        for key in "p80", "http", "get", "body":
            try:
                data_dict = data_dict[key]
            except (TypeError, KeyError):
                print("Error at", key)
                print(repr(data_dict))
                raise  # or replace with break to move on to the next line
        else:
            print(data_dict)  # all four keys were found; this is the body
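
For the one-object-per-line case, a self-contained sketch might look like the following. `iter_bodies` is a hypothetical helper name, and the `io.StringIO` sample stands in for the real file; it assumes every good line has the `p80`/`http`/`get`/`body` path from the question.

```python
import io
import json


def iter_bodies(lines):
    """Yield the HTML body from each line that holds one JSON object."""
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            record = json.loads(line)
            yield record["p80"]["http"]["get"]["body"]
        except (json.JSONDecodeError, TypeError, KeyError) as exc:
            # Report and skip lines that are malformed or lack the path.
            print(f"Skipping line {lineno}: {exc!r}")


# Made-up sample: two good records, one blank line, one record without "get".
sample = io.StringIO(
    '{"p80": {"http": {"get": {"body": "<html>one</html>"}}}}\n'
    '\n'
    '{"p80": {"http": {}}}\n'
    '{"p80": {"http": {"get": {"body": "<html>two</html>"}}}}\n'
)
bodies = list(iter_bodies(sample))
for html in bodies:
    print(html)  # print, not pprint, so newlines in the HTML are rendered
```

With the real file, pass the open file object instead of `sample`, for example inside `with open('sample101.json', encoding='utf-8') as f:`, since iterating a text file yields one line at a time.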
tdelaney
  • I believe that `json.load` requires that you pass a file-like object, not a `str`. – mgilson Jan 09 '17 at 16:39
  • That works, but I'm unable to parse the correct value. I am getting a `TypeError: string indices must be integers` – Tommy Searle Jan 09 '17 at 20:07
  • On which line? It sounds like the data isn't always in the format you want. You could add an exception handler that prints the data when things go wrong. – tdelaney Jan 09 '17 at 20:11
  • Perhaps you could update your question with the stack trace. Is it in `pprint` itself or `data_dict["p80"]["http"]["get"]["body"]`? – tdelaney Jan 09 '17 at 20:13
  • Never mind I fixed that error. The error I am still having is that `json.load` only works with one dictionary (like the example I gave in the original question). How can I get it to go through the several dictionaries contained in the file? – Tommy Searle Jan 09 '17 at 21:10
  • @TommySearle - you haven't specified how those dicts are contained in the file. I guessed it was one big JSON file with an outer list of dictionaries. It may be that there is one JSON string per line, and you handle that differently. I asked you to post a trimmed-down example of the original file... and this is why. – tdelaney Jan 09 '17 at 21:15
  • Sorry about that @tdelaney . I updated the original question with two examples now. There is one JSON string per line. I trimmed down the individual strings as well. Thank you for your help. – Tommy Searle Jan 09 '17 at 21:23