I have a 10 GB JSON file that I want to read, but I can't use json.load because it raises an OSError due to the file's size.

I want to create a generator that yields one dictionary at a time. How can I do this?

The file is a list of dictionaries. I tried the following, but it yielded all of the dictionaries at once:

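import json

def json_iterator(filename):
    # Parses every line as a standalone JSON document, so it only works
    # when the file stores exactly one object per line.
    with open(filename, 'r', encoding='utf-8') as f:
        for jsonline in f:
            # json.loads no longer accepts an encoding argument (removed in Python 3.9)
            yield json.loads(jsonline)

filename = 'tweets.json'
read_file = json_iterator(filename)
len(next(read_file))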

This is a data sample:

sample = [{'fullname': '종잇-쟝',
  'html': '<p class="TweetTextSize js-tweet-text tweet-text" data-aria-label-part="0" lang="ko">근데 <strong>사드</strong>가 북한만을 위해 존재하는건지, 미국이 <strong>중국</strong>, 러시아 까지 같이 견제하기 위한건지 생각해야 한다.\n\n내가 보기엔 <strong>사드</strong>배치는, 그냥 미국 시다바리짓 하는 것 같다.</p>',
  'id': '752291220806733824',
  'likes': '0',
  'replies': '0',
  'retweets': '0',
  'text': '근데 사드가 북한만을 위해 존재하는건지, 미국이 중국, 러시아 까지 같이 견제하기 위한건지 생각해야 한다.\n\n내가 보기엔 사드배치는, 그냥 미국 시다바리짓 하는 것 같다.',
  'timestamp': '2016-07-10T23:59:38',
  'url': '/__repap_eht/status/752291220806733824',
  'user': '__repap_eht'},
 {'fullname': '구르는 돌',
  'html': '<p class="TweetTextSize js-tweet-text tweet-text" data-aria-label-part="0" lang="ko">우리 음성의 개돼지들은 오늘 오후 2시 떼거리로 모여 <strong>사드</strong>반대를 외칠겁니다. 그런데 음성 배치 반대가 아니라 대한민국 <strong>사드</strong> 배치를 반대합니다. 미국 무기가 왜 우리나라에 배치되어야 합니까? 북한 때문이라고요? 그러면 그동안 왜 침묵하고있었나요?</p>',
  'id': '752291209511510018',
  'likes': '2',
  'replies': '1',
  'retweets': '10',
  'text': '우리 음성의 개돼지들은 오늘 오후 2시 떼거리로 모여 사드반대를 외칠겁니다. 그런데 음성 배치 반대가 아니라 대한민국 사드 배치를 반대합니다. 미국 무기가 왜 우리나라에 배치되어야 합니까? 북한 때문이라고요? 그러면 그동안 왜 침묵하고있었나요?',
  'timestamp': '2016-07-10T23:59:35',
  'url': '/hatstone121/status/752291209511510018',
  'user': 'hatstone121'},
 {'fullname': '\xa0뉴스 공유하는 스카니아\xa0',
  'html': '<p class="TweetTextSize js-tweet-text tweet-text" data-aria-label-part="0" lang="ko">中 "<strong>사드</strong> 배치 지역과 교류 끊고 기업들 시장 진입도 막겠다" | 다음 뉴스 <a class="twitter-timeline-link" data-expanded-url="http://v.media.daum.net/v/20160711045125484?f=m" dir="ltr" href="h" rel="nofollow noopener" target="_blank" title="http://v.media.daum.net/v/20160711045125484?f=m"><span class="tco-ellipsis"></span><span class="invisible">http://</span><span class="js-display-url">v.media.daum.net/v/201607110451</span><span class="invisible">25484?f=m</span><span class="tco-ellipsis"><span class="invisible">\xa0</span>…</span></a></p>',
  'id': '752291178603683840',
  'likes': '0',
  'replies': '1',
  'retweets': '30',
  'text': '中 "사드 배치 지역과 교류 끊고 기업들 시장 진입도 막겠다" | 다음 뉴스 http://v.media.daum.net/v/20160711045125484?f=m\xa0…',
  'timestamp': '2016-07-10T23:59:28',
  'url': '/sakota_2nd/status/752291178603683840',
  'user': 'sakota_2nd'}]
halo09876
  • Does each line have a self-contained JSON that can be read and parsed on its own? This is important. – cs95 Dec 04 '18 at 10:19
  • @coldspeed No, the file is a list of dictionaries – halo09876 Dec 04 '18 at 10:20
  • Ouch, if it was 10GB, might've been a little easier to save each dictionary on a separate line. – cs95 Dec 04 '18 at 10:21
  • How about a data sample of a couple of lines, at least? – cs95 Dec 04 '18 at 10:21
  • @coldspeed I added the sample – halo09876 Dec 04 '18 at 10:24
  • Hmmm, did you copy paste the data verbatim from your file? I'm trying to understand if there's some way of parsing line by line, or using regex (slow, but possible). – cs95 Dec 04 '18 at 10:31
  • @coldspeed yes that's a snippet of the data – halo09876 Dec 04 '18 at 10:33
  • You mean to tell me the first line is exactly this: `sample = [{'fullname': '종잇-쟝',`? – cs95 Dec 04 '18 at 10:34
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/184682/discussion-between-song0089-and-coldspeed). – halo09876 Dec 04 '18 at 10:36
  • Not quite... can't guarantee prompt replies as I am a little busy. @ me here and I'll reply when I can :-) – cs95 Dec 04 '18 at 10:44
  • @coldspeed Yes the JSON file is a list so the first element is [ and the rest is the same as the sample. – halo09876 Dec 04 '18 at 10:47
  • Okay, then iterate over each line, accumulating data after each successive occurrence of the first "{" character you see... is that enough of a start? – cs95 Dec 04 '18 at 10:49
  • @coldspeed but what's line here? – halo09876 Dec 04 '18 at 10:50
  • @coldspeed I'm trying to use one of the answers https://stackoverflow.com/questions/6886283/how-i-can-i-lazily-read-multiple-json-values-from-a-file-stream-in-python?rq=1 a StreamJsonListLoader, but I'm not following how to load the file – halo09876 Dec 04 '18 at 10:51
  • See if you can modify that solution to read N chunks of bytes at a time and try tentatively parsing the text. – cs95 Dec 04 '18 at 11:06
  • I should mention that this isn't JSON; it's a literal string dump of a Python object, which is bad design. – cs95 Dec 04 '18 at 11:07
  • There should be a way to set '{' as a start marker and '}' as an end marker and read everything between them as a dictionary element. Then yield it and iterate for remaining chunks? – ParvBanks Dec 04 '18 at 11:12
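
The chunked "read some text, tentatively parse, read more on failure" idea from the comments can be sketched with the standard library's json.JSONDecoder.raw_decode. This is only a minimal sketch under assumptions: the file is valid JSON (double-quoted strings), the top level is a single array, and every element is an object. The names iter_json_array and chunk_size are made up for illustration.

```python
import json

_SKIP = " \t\r\n,"  # whitespace and the commas between array elements


def iter_json_array(path, chunk_size=1 << 20):
    """Yield the objects of a top-level JSON array one at a time."""
    decoder = json.JSONDecoder()
    with open(path, encoding="utf-8") as f:
        buf = f.read(chunk_size).lstrip()
        if not buf.startswith("["):
            raise ValueError("expected the file to start with '['")
        pos = 1  # index of the next unparsed character in buf
        while True:
            while pos < len(buf) and buf[pos] in _SKIP:
                pos += 1
            if pos < len(buf) and buf[pos] == "]":
                return  # end of the array
            try:
                # Tentatively parse one value starting at pos.
                obj, end = decoder.raw_decode(buf, pos)
            except json.JSONDecodeError:
                # Most likely the object is cut off at the chunk boundary:
                # drop the consumed prefix, read more text, and retry.
                chunk = f.read(chunk_size)
                if not chunk:
                    raise ValueError("file ended before the closing ']'") from None
                buf = buf[pos:] + chunk
                pos = 0
                continue
            pos = end
            yield obj
```

Usage would look like `for tweet in iter_json_array('tweets.json'): ...`, with only about one chunk of text held in memory at a time. If the file really is a Python literal dump with single quotes, as one comment suggests, json will reject it and this sketch does not apply as-is. The third-party ijson package offers similar streaming through `ijson.items(f, 'item')`.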
