How to save dictionaries built in a loop from a beautifulsoup4 scraping result to a JSON file?

Question

I am scraping data from a web page using beautifulsoup4 and saving each scraping result in a list containing a dictionary, like this:

DATA = [
    TITLE, {
        'IMAGES': IMAGE,
        'URL_VIDEOS': URL_VIDEOS,
        'DESCRIPTIONS': DESCRIPTIONS,
        'SYNOPSIS': SYNOPSIS
    }
]

where IMAGE, URL_VIDEOS, DESCRIPTIONS and SYNOPSIS are variables holding the scraped values.

I then try to save DATA to a .json file with this code:

import json

with open('result.json', 'w') as json_file:
    json.dump(DATA, json_file)

I got the result like this:

["Action Fruits", {"IMAGES": "http://animeindo.video/wp-content/uploads/2017/07/rsz_heroin.jpg", "URL_VIDEOS": "http://www.mp4upload.com/embed-q7xxgge1yu1c.html", "DESCRIPTIONS": {"Japanese": " \u30a2\u30af\u30b7\u30e7\u30f3\u30d2\u30ed\u30a4\u30f3 \u30c1\u30a2\u30d5\u30eb\u30fc\u30c4", "\nProducer": " Diomedea", "\nType": " TV Series", "\nStatus": " Ongoing", "\nGenre": " Comedy, School, Slice of Life", "\nDurasi": " 24 min", "\nEpisode": " \u2013", "\nRating": " 6.11", "\nAdded On": " July 12th, 2017"}, "SYNOPSIS": "Japanese: \u30a2\u30af\u30b7\u30e7\u30f3\u30d2\u30ed\u30a4\u30f3 \u30c1\u30a2\u30d5\u30eb\u30fc\u30c4\nProducer: Diomedea\nType: TV Series\nStatus: Ongoing\nGenre: Comedy, School, Slice of Life\nDurasi: 24 min\nEpisode: \u2013\nRating: 6.11\nAdded On: July 12th, 2017\nSinopsis:\nPerjuangan pahlawan lokal dalam menyelamatkan daerahnya.\n"}]

But as the scrape loops, the content of the .json file is always overwritten. The new data is not added to the file; it simply replaces the previous data, like this:

["Happy", {"IMAGES": "https://1.bp.blogspot.com/-SUq5_dpoIlM/VwpKqqsEzNI/AAAAAAAAM50/H81MUyDLZA0ctj8zo8JbuUVPPz4sxQulw/s1600/77219__1460292250_36.80.228.117.jpg", "URL_VIDEOS": "http://www.mp4upload.com/embed-ptj9hmeefar8.html", "DESCRIPTIONS": {"Japanese": " \u3042\u3093\u30cf\u30d4\u266a", "\nProducer": " Silver Link", "\nType": " TV Series", "\nStatus": " Ongoing", "\nGenre": " Comedy, School, Slice of Life", "\nDurasi": " 23 min. per ep.", "\nEpisode": " 12", "\nRating": " 7.06", "\nAdded On": " April 10th, 2016"}, "SYNOPSIS": "Japanese: \u3042\u3093\u30cf\u30d4\u266a\nProducer: Silver Link\nType: TV Series\nStatus: Ongoing\nGenre: Comedy, School, Slice of Life\nDurasi: 23 min. per ep.\nEpisode: 12\nRating: 7.06\nAdded On: April 10th, 2016\nSinopsis:\nMenceritakan kelas 1-7 di Akademi Tennomifune, di mana semua murid yang suka sial berkumpul. Hibari, salah satu murid di kelas ini, bertemu dengan si sial Hanako di hari pertama sekolah, dan bersama-sama mereka berjuang mencari hidup bahagia di sekolah mereka.\n"}]

The next result overwrites this one as well.

I want to add the new data each time and save all of the scraping results in one .json file. How can I do that?

score 2 · Answer 1 · answered Dec 21 '18 at 00:12

The 'w' file mode truncates the file every time you open it, so each write replaces the previous content.

The 'a' mode would not work here either, because appending one JSON document after another produces an invalid JSON file.

What you should do is collect the results while you scrape (into a list, for example) and then dump them to the JSON file once, after you are done looping through the data.
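
The "collect first, dump once" idea could look roughly like the sketch below. The `scrape_page` function is a hypothetical stand-in for the asker's BeautifulSoup code (which is not shown in the question), and `scrape_all` and its parameters are illustrative names, not part of any library.

```python
import json


def scrape_page(url):
    """Hypothetical placeholder for the real scraping code; returns (title, details)."""
    return "Title for " + url, {
        "IMAGES": url + "/cover.jpg",
        "URL_VIDEOS": url + "/embed.html",
    }


def scrape_all(urls, out_path="result.json"):
    results = []  # accumulate one entry per scraped page
    for url in urls:
        title, details = scrape_page(url)
        results.append([title, details])  # same [TITLE, {...}] shape as DATA

    # Open the file once, after the loop, so no entry overwrites another.
    with open(out_path, "w", encoding="utf-8") as json_file:
        json.dump(results, json_file, ensure_ascii=False, indent=2)
    return results
```

Passing `ensure_ascii=False` is optional; it keeps the Japanese titles readable in the file instead of writing `\uXXXX` escapes.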

alecxe

Hey @alecxe, can you give more details about it? – Tri Dec 21 '18 at 00:21
@Tri well, sure, but for that I would need to see more of your code - especially the part where you are setting the `DATA` values. – alecxe Dec 21 '18 at 00:22

score 1 · Answer 2 · answered Dec 21 '18 at 10:02

How do you select IMAGES or URL_VIDEOS by TITLE? I think your JSON structure is not correct, because the title is a value, not a key. Maybe it should be in this format:

{
  "title A" : {"IMAGES" : "IMAGE A"},
  "title B" : {"IMAGES" : "IMAGE B"}
}

Or

[
  {"Title" : "title A", "IMAGES" : "IMAGE A"},
  {"Title" : "title B", "IMAGES" : "IMAGE B"}
]

Let's try the first format. You need to read the previous JSON and update() it with the new data. First, make sure to delete the existing result.json.

import json
import os.path

# ....
DATA = {"Action Fruits": {"IMAGES": "a.jpg", "URL_VIDEOS": "http://a.mp4"}}

OLD_DATA = {}  # stays empty if the file does not exist yet
if os.path.isfile('result.json'):
    with open('result.json', 'r') as f:
        OLD_DATA = json.load(f)
        # e.g. {"Happy": {"IMAGES": "b.jpg", "URL_VIDEOS": "http://b.mp4"}}

# merge old and new data (for a title present in both, the old entry wins)
DATA.update(OLD_DATA)
with open('result.json', 'w') as f:
    json.dump(DATA, f)

result.json

{
  "Action Fruits": {"IMAGES": "a.jpg", "URL_VIDEOS": "http://a.mp4"}, 
  "Happy": {"IMAGES": "b.jpg", "URL_VIDEOS": "http://b.mp4"}
}
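
The second format (a list of records) can be handled with the same read, modify, rewrite pattern. The following is only a sketch under that assumption; `append_record` is a hypothetical helper name, and it expects that any existing file already contains a JSON list.

```python
import json
import os


def append_record(record, path="result.json"):
    """Load the existing list (if any), append one record, and rewrite the file."""
    records = []
    if os.path.isfile(path):
        with open(path, "r", encoding="utf-8") as f:
            records = json.load(f)  # assumed to be a JSON list

    records.append(record)

    # Rewrite the whole file so it always holds one valid JSON document.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
    return records
```

Calling `append_record({"Title": "title A", "IMAGES": "IMAGE A"})` once per scraped page would grow the list by one entry each time, at the cost of rereading and rewriting the file on every call.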

Tri · Answer 3 · 2018-12-21T00:04:37.217

0

I resolved this with the following code:

with open('result.json', 'a') as outfile:
    outfile.write(json.dumps(DATA, sort_keys=True, indent=4))

I got the answer from here.

edited Dec 21 '18 at 00:04

answered Dec 21 '18 at 00:03

Tri

Are you sure you gonna end up with a *valid JSON* at the end? – alecxe Dec 21 '18 at 00:04
Ah, I was happy too early. I looked at the _JSON_ result again, and I got a few red highlights in PyCharm – Tri Dec 21 '18 at 00:07
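
The red highlights mentioned in the comments can be reproduced without touching a file. This small sketch simulates what append mode leaves behind (two JSON documents back to back) using shortened sample values; `is_valid_json` is a hypothetical helper written for this illustration.

```python
import json

first = json.dumps(["Action Fruits", {"IMAGES": "a.jpg"}])
second = json.dumps(["Happy", {"IMAGES": "b.jpg"}])

# Mode 'a' writes the second document directly after the first one.
appended = first + second


def is_valid_json(text):
    """Return True if the whole text parses as a single JSON document."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# Wrapping both entries in one outer list gives a single valid document.
combined = json.dumps([json.loads(first), json.loads(second)])
```

Here `is_valid_json(appended)` is false because the parser finds extra data after the first list, while `is_valid_json(combined)` is true.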
