
I can see that HTML is embedded in the response to a data request I made to a website (shown below). How can I convert it into HTML, or is there a better way to parse this information with Python? For context, I'm trying to scrape specific pieces of information from the site, but pagination makes it a bit more complicated.

{"results": "    <div class=\"submission-row\"><div class=\"sub-result AC\">\n    <div class=\"score\">99 / 99</div>\n    <div class=\"state\">\n        <span title=\"Accepted\" class=\"status\">AC</span>\n        |\n        <span class=\"language\">PY3</span>\n    </div>\n</div>\n\n<div class=\"sub-main\">\n    <div class=\"sub-info\">\n        <div class=\"name\">\n            <a href=\"/problem/ccc08s1\">CCC &#39;08 S1 - It&#39;s Cold Here!</a>\n        </div>\n        <div class=\"time\"><span data-iso=\"2021-02-25T16:02:06+00:00\" class=\"time-with-rel\" title=\"Feb. 25, 2021, 11:02 a.m.\"\n          data-format=\"{time}\">\n        on Feb. 25, 2021, 11:02 a.m.\n    </span></div>\n    </div>\n</div>\n\n<div class=\"sub-pp sub-usage\">\n    <div class=\"pp\">\n        <a href=\"/submission/3451674\">5pp</a>\n    </div>\n    <div class=\"pp-weighted\">\n            weighted <b>100%</b> (5.0pp)\n    </div>\n</div></div>\n    <div class=\"submission-row\"><div class=\"sub-result AC\">\n    <div class=\"score\">50 / 50</div>\n    <div class=\"state\">\n        <span title=\"Accepted\" class=\"status\">AC</span>\n        |\n        <span class=\"language\">PY3</span>\n    </div>\n</div>\n\n<div class=\"sub-main\">\n    <div class=\"sub-info\">\n        <div class=\"name\">\n            <a href=\"/problem/ccc00s1\">CCC &#39;00 S1 - Slot Machines</a>\n        </div>\n        <div class=\"time\"><span data-iso=\"2021-02-20T18:07:10+00:00\" class=\"time-with-rel\" title=\"Feb. 20, 2021, 1:07 p.m.\"\n          data-format=\"{time}\">\n        on Feb. 
20, 2021, 1:07 p.m.\n    </span></div>\n    </div>\n</div>\n\n<div class=\"sub-pp sub-usage\">\n    <div class=\"pp\">\n        <a href=\"/submission/3436991\">5pp</a>\n    </div>\n    <div class=\"pp-weighted\">\n            weighted <b>95%</b> (4.8pp)\n    </div>\n</div></div>\n    <div class=\"submission-row\"><div class=\"sub-result AC\">\n    <div class=\"score\">100 / 100</div>\n    <div class=\"state\">\n        <span title=\"Accepted\" class=\"status\">AC</span>\n        |\n        <span class=\"language\">PY3</span>\n    </div>\n</div>\n\n<div class=\"sub-main\">\n    <div class=\"sub-info\">\n        <div class=\"name\">\n            <a href=\"/problem/ccc03s1\">CCC &#39;03 S1 - Snakes and Ladders</a>\n        </div>\n        <div class=\"time\"><span data-iso=\"2021-02-20T16:53:42+00:00\" class=\"time-with-rel\" title=\"Feb. 20, 2021, 11:53 a.m.\"\n          data-format=\"{time}\">\n        on Feb. 20, 2021, 11:53 a.m.\n    </span></div>\n    </div>\n</div>\n\n<div class=\"sub-pp sub-usage\">\n    <div class=\"pp\">\n        <a href=\"/submission/3436734\">5pp</a>\n    </div>\n    <div class=\"pp-weighted\">\n            weighted <b>90%</b> (4.5pp)\n    </div>\n</div></div>\n    <div class=\"submission-row\"><div class=\"sub-result AC\">\n    <div class=\"score\">50 / 50</div>\n    <div class=\"state\">\n        <span title=\"Accepted\" class=\"status\">AC</span>\n        |\n        <span class=\"language\">PY3</span>\n    </div>\n</div>\n\n<div class=\"sub-main\">\n    <div class=\"sub-info\">\n        <div class=\"name\">\n            <a href=\"/problem/coci14c5p1\">COCI &#39;14 Contest 5 #1 Funghi</a>\n        </div>\n        <div class=\"time\"><span data-iso=\"2022-09-15T16:35:41+00:00\" class=\"time-with-rel\" title=\"Sept. 15, 2022, 12:35 p.m.\"\n          data-format=\"{time}\">\n        on Sept. 
15, 2022, 12:35 p.m.\n    </span></div>\n    </div>\n</div>\n\n<div class=\"sub-pp sub-usage\">\n    <div class=\"pp\">\n        <a href=\"/submission/4842240\">3pp</a>\n    </div>\n    <div class=\"pp-weighted\">\n            weighted <b>86%</b> (2.6pp)\n    </div>\n</div></div>\n    <div class=\"submission-row\"><div class=\"sub-result AC\">\n    <div class=\"score\">100 / 100</div>\n    <div class=\"state\">\n        <span title=\"Accepted\" class=\"status\">AC</span>\n        |\n        <span class=\"language\">PY3</span>\n    </div>\n</div>\n\n<div class=\"sub-main\">\n    <div class=\"sub-info\">\n        <div class=\"name\">\n            <a href=\"/problem/vmss7wc15c2p1\">VM7WC &#39;15 #2 Bronze - Recruits!</a>\n        </div>\n        <div class=\"time\"><span data-iso=\"2022-09-15T13:54:42+00:00\" class=\"time-with-rel\" title=\"Sept. 15, 2022, 9:54 a.m.\"\n          data-format=\"{time}\">\n        on Sept. 15, 2022, 9:54 a.m.\n    </span></div>\n    </div>\n</div>\n\n<div class=\"sub-pp sub-usage\">\n    <div class=\"pp\">\n        <a href=\"/submission/4842032\">3pp</a>\n    </div>\n    <div class=\"pp-weighted\">\n            weighted <b>81%</b> (2.4pp)\n    </div>\n</div></div>\n    <div class=\"submission-row\"><div class=\"sub-result AC\">\n    <div class=\"score\">100 / 100</div>\n    <div class=\"state\">\n        <span title=\"Accepted\" class=\"status\">AC</span>\n        |\n        <span class=\"language\">PY3</span>\n    </div>\n</div>\n\n<div class=\"sub-main\">\n    <div class=\"sub-info\">\n        <div class=\"name\">\n            <a href=\"/problem/tsoc15c1p1\">TSOC &#39;15 Contest 1 #1 - Molecular or Non-Molecular?</a>\n        </div>\n        <div class=\"time\"><span data-iso=\"2022-09-15T00:36:46+00:00\" class=\"time-with-rel\" title=\"Sept. 14, 2022, 8:36 p.m.\"\n          data-format=\"{time}\">\n        on Sept. 
14, 2022, 8:36 p.m.\n    </span></div>\n    </div>\n</div>\n\n<div class=\"sub-pp sub-usage\">\n    <div class=\"pp\">\n        <a href=\"/submission/4840848\">3pp</a>\n    </div>\n    <div class=\"pp-weighted\">\n            weighted <b>77%</b> (2.3pp)\n    </div>\n</div></div>\n    <div class=\"submission-row\"><div class=\"sub-result AC\">\n    <div class=\"score\">30 / 30</div>\n    <div class=\"state\">\n        <span title=\"Accepted\" class=\"status\">AC</span>\n        |\n        <span class=\"language\">PY3</span>\n    </div>\n</div>\n\n<div class=\"sub-main\">\n    <div class=\"sub-info\">\n        <div class=\"name\">\n            <a href=\"/problem/p118ex5\">BlueBook - Max is Last</a>\n        </div>\n        <div class=\"time\"><span data-iso=\"2022-09-14T23:52:29+00:00\" class=\"time-with-rel\" title=\"Sept. 14, 2022, 7:52 p.m.\"\n          data-format=\"{time}\">\n        on Sept. 14, 2022, 7:52 p.m.\n    </span></div>\n    </div>\n</div>\n\n<div class=\"sub-pp sub-usage\">\n    <div class=\"pp\">\n        <a href=\"/submission/4840707\">3pp</a>\n    </div>\n    <div class=\"pp-weighted\">\n            weighted <b>74%</b> (2.2pp)\n    </div>\n</div></div>\n    <div class=\"submission-row\"><div class=\"sub-result AC\">\n    <div class=\"score\">30 / 30</div>\n    <div class=\"state\">\n        <span title=\"Accepted\" class=\"status\">AC</span>\n        |\n        <span class=\"language\">PY3</span>\n    </div>\n</div>\n\n<div class=\"sub-main\">\n    <div class=\"sub-info\">\n        <div class=\"name\">\n            <a href=\"/problem/p129ex5\">BlueBook - Find the Character</a>\n        </div>\n        <div class=\"time\"><span data-iso=\"2022-04-03T00:34:24+00:00\" class=\"time-with-rel\" title=\"April 2, 2022, 8:34 p.m.\"\n          data-format=\"{time}\">\n        on April 2, 2022, 8:34 p.m.\n    </span></div>\n    </div>\n</div>\n\n<div class=\"sub-pp sub-usage\">\n    <div class=\"pp\">\n        <a href=\"/submission/4474590\">3pp</a>\n    
</div>\n    <div class=\"pp-weighted\">\n            weighted <b>70%</b> (2.1pp)\n    </div>\n</div></div>\n    <div class=\"submission-row\"><div class=\"sub-result AC\">\n    <div class=\"score\">10 / 10</div>\n    <div class=\"state\">\n        <span title=\"Accepted\" class=\"status\">AC</span>\n        |\n        <span class=\"language\">PY3</span>\n    </div>\n</div>\n\n<div class=\"sub-main\">\n    <div class=\"sub-info\">\n        <div class=\"name\">\n            <a href=\"/problem/p287ex5\">BlueBook - Digits</a>\n        </div>\n        <div class=\"time\"><span data-iso=\"2022-04-03T00:27:40+00:00\" class=\"time-with-rel\" title=\"April 2, 2022, 8:27 p.m.\"\n          data-format=\"{time}\">\n        on April 2, 2022, 8:27 p.m.\n    </span></div>\n    </div>\n</div>\n\n<div class=\"sub-pp sub-usage\">\n    <div class=\"pp\">\n        <a href=\"/submission/4474522\">3pp</a>\n    </div>\n    <div class=\"pp-weighted\">\n            weighted <b>66%</b> (2.0pp)\n    </div>\n</div></div>\n    <div class=\"submission-row\"><div class=\"sub-result AC\">\n    <div class=\"score\">17 / 17</div>\n    <div class=\"state\">\n        <span title=\"Accepted\" class=\"status\">AC</span>\n        |\n        <span class=\"language\">PY3</span>\n    </div>\n</div>\n\n<div class=\"sub-main\">\n    <div class=\"sub-info\">\n        <div class=\"name\">\n            <a href=\"/problem/wc17c4j1\">WC &#39;17 Contest 4 J1 - Fight or Flight</a>\n        </div>\n        <div class=\"time\"><span data-iso=\"2022-03-27T18:03:32+00:00\" class=\"time-with-rel\" title=\"March 27, 2022, 2:03 p.m.\"\n          data-format=\"{time}\">\n        on March 27, 2022, 2:03 p.m.\n    </span></div>\n    </div>\n</div>\n\n<div class=\"sub-pp sub-usage\">\n    <div class=\"pp\">\n        <a href=\"/submission/4452136\">3pp</a>\n    </div>\n    <div class=\"pp-weighted\">\n            weighted <b>63%</b> (1.9pp)\n    </div>\n</div></div>\n", "has_more": true}
MattDMo
02fentym
  • It's a dict, possibly created from JSON. – MattDMo Sep 20 '22 at 00:40
  • The format is JSON, with the html in the `"results"` dictionary value. – Grismar Sep 20 '22 at 00:41
  • That's JSON with a `results` key with a string value that looks like raw HTML. Just take `value['results']` and throw it into an HTML parser. – metatoaster Sep 20 '22 at 00:41
  • @metatoaster Thanks for the response. Would BeautifulSoup be able to parse HTML? I'm unfamiliar with a lot of this stuff, but I can figure it out if pointed in the right direction. Thanks. – 02fentym Sep 20 '22 at 00:50
  • Yes, if you searched around you would have found plenty of resources online for this, or threads like [this](https://stackoverflow.com/questions/15576652/parsing-html-with-beautifulsoup-in-python), [this](https://stackoverflow.com/questions/62807158/python-beautifulsoup-get-html-from-dynamic-page), [this](https://stackoverflow.com/questions/40775930/using-beautifulsoup-to-modify-html), or most relevant, [this](https://stackoverflow.com/questions/67738504/how-to-extract-html-from-json-response). – metatoaster Sep 20 '22 at 00:53
  • `html_obj = BeautifulSoup(value['results'], "html.parser")`, then `div_list = html_obj.find_all('div')` – rachel_hong Sep 20 '22 at 01:31
  • That's JSON format. – Joel Wembo Sep 20 '22 at 02:10
  • `dictionary = json.loads(your_string)` and later `html = dictionary["results"]`. But if you get it using `requests`, i.e. with `response = requests.get()`, then you can use `dictionary = response.json()` (no string argument inside `json()`). – furas Sep 20 '22 at 03:32
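
Putting the comments together, here is a minimal sketch of the two steps: decode the JSON, then hand the `"results"` string to an HTML parser. It uses only the standard library (`json` and `html.parser`) so it runs without installing anything; BeautifulSoup, as suggested above, would express the same extraction more compactly with `find_all`. The `raw` string is a shortened stand-in for the response in the question, and the `SubmissionParser` class and its field names are hypothetical, chosen for illustration.

```python
import json
from html.parser import HTMLParser

# Shortened stand-in for the response body in the question (one row only).
raw = json.dumps({
    "results": (
        '<div class="submission-row"><div class="sub-result AC">\n'
        '    <div class="score">99 / 99</div>\n'
        '</div>\n'
        '<div class="sub-main"><div class="name">\n'
        '    <a href="/problem/ccc08s1">CCC &#39;08 S1 - It&#39;s Cold Here!</a>\n'
        '</div></div></div>\n'
    ),
    "has_more": True,
})


class SubmissionParser(HTMLParser):
    """Collect score, problem link and problem name for each submission row."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &#39; into characters.
        super().__init__(convert_charrefs=True)
        self.rows = []
        self._field = None  # field that the next text chunk belongs to

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        href = attrs.get("href") or ""
        if tag == "div" and "submission-row" in classes:
            # A new row starts: add an empty record to fill in.
            self.rows.append({"score": "", "name": "", "href": None})
        elif not self.rows:
            return
        elif tag == "div" and "score" in classes:
            self._field = "score"
        elif tag == "a" and href.startswith("/problem/"):
            self.rows[-1]["href"] = href
            self._field = "name"

    def handle_endtag(self, tag):
        # Any closing tag ends the text we were collecting.
        self._field = None

    def handle_data(self, data):
        if self._field:
            self.rows[-1][self._field] += data.strip()


payload = json.loads(raw)        # a dict with "results" and "has_more" keys
parser = SubmissionParser()
parser.feed(payload["results"])  # the value is an ordinary HTML string
parser.close()

for row in parser.rows:
    print(row["score"], row["href"], row["name"])
print("more pages:", payload["has_more"])
```

When the data comes from `requests`, `response.json()` returns the same kind of dict as `json.loads(raw)` here. The `has_more` flag presumably indicates that another page is available, but how to request that page depends on the site and is not shown in the question.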
