
I have a large JSON file (10 GB) and I only need the values of a few specific fields (text, date, geo). Right now, I'm reading the file line by line and writing each line to a new JSON file. I don't think this is very efficient. Is there a better way to do this? This is what I've done so far.

import json

def writeJsonFile(jsonData, tweetCount):
    filenumber = "tweets/tweet%s.json" % tweetCount
    with open(filenumber, 'w') as f:
        json.dump(jsonData, f)


def getJsonLine(filename):
    count = 0
    with open(filename) as infp:
        for line in infp:
            if line.strip():
                count +=1  
                jsonData = json.loads(line) 
                writeJsonFile(jsonData,count)

def readJsonFile(filename):
    with open (filename) as f:
        data = json.load(f)
        print(data['id'])



if __name__ == '__main__':
    getJsonLine("largeJsonFile.json")
    readJsonFile("outputJsonFile1.json")

This is what the large JSON file looks like.

{"created_at":"Thu Oct 04 11:16:37 +0000 2018","id":1047807698782375937,"id_str":"1047807698782375937","text":"Thursday of the Sixth-Week in Ordinary Time. The Feast of Saint Francis of Assisi\nGospel: Luke 10: 1 - 12\n1After th\u2026 https:\/\/t.co\/DUyHZrQfxZ","source":"\u003ca href=\"http:\/\/instagram.com\" rel=\"nofollow\"\u003eInstagram\u003c\/a\u003e","truncated":true,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":67864420,"id_str":"67864420","name":"baby0811","screen_name":"baby0811","location":"Philippines","url":null,"description":null,"translator_type":"none","protected":false,"verified":false,"followers_count":1030,"friends_count":2424,"listed_count":15,"favourites_count":5185,"statuses_count":79832,"created_at":"Sat Aug 22 10:42:20 +0000 2009","utc_offset":null,"time_zone":null,"geo_enabled":true,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"C0DEED","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_tile":false,"profile_link_color":"1DA1F2","profile_sidebar_border_color":"C0DEED","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/1011954946974015488\/XR0aeW-X_normal.jpg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/1011954946974015488\/XR0aeW-X_normal.jpg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/67864420\/1466209205","default_profile":true,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":{"type":"Point","coordinates":[14.57338204,121.04939121]},"coordinates":{"type":"Point","coordinates":[121.04939121,14.57338204]},"place":{"id":"005de
1fe214f002d","url":"https:\/\/api.twitter.com\/1.1\/geo\/id\/005de1fe214f002d.json","place_type":"city","name":"Mandaluyong City","full_name":"Mandaluyong City, National Capital Region","country_code":"PH","country":"Republic of the Philippines","bounding_box":{"type":"Polygon","coordinates":[[[121.016761,14.567448],[121.016761,14.602063],[121.061760,14.602063],[121.061760,14.567448]]]},"attributes":{}},"contributors":null,"is_quote_status":false,"extended_tweet":{"full_text":"Thursday of the Sixth-Week in Ordinary Time. The Feast of Saint Francis of Assisi\nGospel: Luke 10: 1 - 12\n1After this the Lord appointed seventy others, and sent them on ahead of Him, two\u2026 https:\/\/t.co\/C3B3SFaa4o","display_text_range":[0,212],"entities":{"hashtags":[],"urls":[{"url":"https:\/\/t.co\/C3B3SFaa4o","expanded_url":"https:\/\/www.instagram.com\/p\/BogloILAcnB\/?utm_source=ig_twitter_share&igshid=s1adebjetnj3","display_url":"instagram.com\/p\/BogloILAcnB\/\u2026","indices":[189,212]}],"user_mentions":[],"symbols":[]}},"quote_count":0,"reply_count":0,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"urls":[{"url":"https:\/\/t.co\/DUyHZrQfxZ","expanded_url":"https:\/\/twitter.com\/i\/web\/status\/1047807698782375937","display_url":"twitter.com\/i\/web\/status\/1\u2026","indices":[117,140]}],"user_mentions":[],"symbols":[]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"low","lang":"en","timestamp_ms":"1538651797372"}

{"created_at":"Thu Oct 04 11:16:37 +0000 2018","id":1047807699528761344,"id_str":"1047807699528761344","text":"Pinipilit na akong umuwi ng nanay ko next weekend","source":"\u003ca href=\"http:\/\/twitter.com\/download\/iphone\" rel=\"nofollow\"\u003eTwitter for iPhone\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":99445624,"id_str":"99445624","name":"Estoy","screen_name":"jedz2dmax","location":"Makati City, National Capital Region","url":null,"description":"Negrense - Hiligaynon | Padawan | Scorpio | ESTJ | Altiora Quaero - Animo *food adventure *the beach *green tea *Side #LoveFood  Nahanap na siya. \ud83d\ude0a","translator_type":"none","protected":false,"verified":false,"followers_count":2024,"friends_count":1326,"listed_count":10,"favourites_count":31444,"statuses_count":33872,"created_at":"Sat Dec 26 06:41:38 +0000 2009","utc_offset":null,"time_zone":null,"geo_enabled":true,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"FFF04D","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme19\/bg.gif","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme19\/bg.gif","profile_background_tile":false,"profile_link_color":"204BD9","profile_sidebar_border_color":"FFF8AD","profile_sidebar_fill_color":"F6FFD1","profile_text_color":"333333","profile_use_background_image":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/1029940958790483973\/i3FVGKvS_normal.jpg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/1029940958790483973\/i3FVGKvS_normal.jpg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/99445624\/1528704582","default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":{"id":"017
a4afa29d71c65","url":"https:\/\/api.twitter.com\/1.1\/geo\/id\/017a4afa29d71c65.json","place_type":"city","name":"Makati City","full_name":"Makati City, National Capital Region","country_code":"PH","country":"Republic of the Philippines","bounding_box":{"type":"Polygon","coordinates":[[[120.998880,14.513482],[120.998880,14.579517],[121.067544,14.579517],[121.067544,14.513482]]]},"attributes":{}},"contributors":null,"is_quote_status":false,"quote_count":0,"reply_count":0,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"urls":[],"user_mentions":[],"symbols":[]},"favorited":false,"retweeted":false,"filter_level":"low","lang":"tl","timestamp_ms":"1538651797550"}

{"created_at":"Thu Oct 04 11:16:38 +0000 2018","id":1047807702448070656,"id_str":"1047807702448070656","text":"@shnmndza Weh sa mindoro ka na ba","display_text_range":[10,33],"source":"\u003ca href=\"http:\/\/twitter.com\/download\/android\" rel=\"nofollow\"\u003eTwitter for Android\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":1047807434473988096,"in_reply_to_status_id_str":"1047807434473988096","in_reply_to_user_id":1712695734,"in_reply_to_user_id_str":"1712695734","in_reply_to_screen_name":"shnmndza","user":{"id":2199151322,"id_str":"2199151322","name":"meggy","screen_name":"meganflaviano","location":"f\u0113i l\u00f9 b\u012bn","url":"https:\/\/curiouscat.me\/meganflaviano","description":null,"translator_type":"none","protected":false,"verified":false,"followers_count":1721,"friends_count":1822,"listed_count":3,"favourites_count":15651,"statuses_count":25058,"created_at":"Sun Nov 17 08:17:17 +0000 2013","utc_offset":null,"time_zone":null,"geo_enabled":true,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"C0DEED","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_tile":false,"profile_link_color":"1DA1F2","profile_sidebar_border_color":"C0DEED","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/1045911436596133888\/wm5qfm6R_normal.jpg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/1045911436596133888\/wm5qfm6R_normal.jpg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/2199151322\/1534124612","default_profile":true,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":{"id":"017a4afa29d71c65","url":"https:\/\/api.twitter.com\/1.1\/geo\
/id\/017a4afa29d71c65.json","place_type":"city","name":"Makati City","full_name":"Makati City, National Capital Region","country_code":"PH","country":"Republic of the Philippines","bounding_box":{"type":"Polygon","coordinates":[[[120.998880,14.513482],[120.998880,14.579517],[121.067544,14.579517],[121.067544,14.513482]]]},"attributes":{}},"contributors":null,"is_quote_status":false,"quote_count":0,"reply_count":0,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"urls":[],"user_mentions":[{"screen_name":"shnmndza","name":"Call me by your name","id":1712695734,"id_str":"1712695734","indices":[0,9]}],"symbols":[]},"favorited":false,"retweeted":false,"filter_level":"low","lang":"tl","timestamp_ms":"1538651798246"}

{"created_at":"Thu Oct 04 11:16:36 +0000 2018","id":1047807696135577600,"id_str":"1047807696135577600","text":"\ud83d\udc9a Phoenix\n\ud83d\udc9b Fleming\n\u2764\ufe0f Positron\n\ud83d\udc99 Becquerel\n\n\ud83d\ude2d\ud83d\udc96\ud83d\udc96 \n\u00a9\ufe0f Sa lahat ng owners \ud83d\udc9e\ud83d\ude18 https:\/\/t.co\/23dN8U8iqA","display_text_range":[0,74],"source":"\u003ca href=\"http:\/\/twitter.com\/download\/android\" rel=\"nofollow\"\u003eTwitter for Android\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":2823661652,"id_str":"2823661652","name":"baby nice","screen_name":"Amie_025","location":"x- becquerel","url":null,"description":"Moonlight \/\/ A.G.","translator_type":"none","protected":false,"verified":false,"followers_count":285,"friends_count":253,"listed_count":0,"favourites_count":23249,"statuses_count":7853,"created_at":"Sun Sep 21 07:33:00 +0000 
2014","utc_offset":null,"time_zone":null,"geo_enabled":true,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"000000","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_tile":false,"profile_link_color":"000000","profile_sidebar_border_color":"000000","profile_sidebar_fill_color":"000000","profile_text_color":"000000","profile_use_background_image":false,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/1046059645268373507\/TCuIxSDN_normal.jpg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/1046059645268373507\/TCuIxSDN_normal.jpg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/2823661652\/1537706766","default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":{"id":"07d9f64958883000","url":"https:\/\/api.twitter.com\/1.1\/geo\/id\/07d9f64958883000.json","place_type":"poi","name":"City Of Mandaluyong Science High School","full_name":"City Of Mandaluyong Science High School","country_code":"PH","country":"Republic of the 
Philippines","bounding_box":{"type":"Polygon","coordinates":[[[121.035349,14.568841],[121.035349,14.568841],[121.035349,14.568841],[121.035349,14.568841]]]},"attributes":{}},"contributors":null,"is_quote_status":false,"quote_count":0,"reply_count":0,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"urls":[],"user_mentions":[],"symbols":[],"media":[{"id":1047807608189382657,"id_str":"1047807608189382657","indices":[75,98],"media_url":"http:\/\/pbs.twimg.com\/media\/DoqPd9cUcAEtkes.jpg","media_url_https":"https:\/\/pbs.twimg.com\/media\/DoqPd9cUcAEtkes.jpg","url":"https:\/\/t.co\/23dN8U8iqA","display_url":"pic.twitter.com\/23dN8U8iqA","expanded_url":"https:\/\/twitter.com\/Amie_025\/status\/1047807696135577600\/photo\/1","type":"photo","sizes":{"thumb":{"w":150,"h":150,"resize":"crop"},"medium":{"w":1024,"h":768,"resize":"fit"},"large":{"w":1024,"h":768,"resize":"fit"},"small":{"w":680,"h":510,"resize":"fit"}}}]},"extended_entities":{"media":[{"id":1047807608189382657,"id_str":"1047807608189382657","indices":[75,98],"media_url":"http:\/\/pbs.twimg.com\/media\/DoqPd9cUcAEtkes.jpg","media_url_https":"https:\/\/pbs.twimg.com\/media\/DoqPd9cUcAEtkes.jpg","url":"https:\/\/t.co\/23dN8U8iqA","display_url":"pic.twitter.com\/23dN8U8iqA","expanded_url":"https:\/\/twitter.com\/Amie_025\/status\/1047807696135577600\/photo\/1","type":"photo","sizes":{"thumb":{"w":150,"h":150,"resize":"crop"},"medium":{"w":1024,"h":768,"resize":"fit"},"large":{"w":1024,"h":768,"resize":"fit"},"small":{"w":680,"h":510,"resize":"fit"}}},{"id":1047807628154261504,"id_str":"1047807628154261504","indices":[75,98],"media_url":"http:\/\/pbs.twimg.com\/media\/DoqPfH0UUAA9X25.jpg","media_url_https":"https:\/\/pbs.twimg.com\/media\/DoqPfH0UUAA9X25.jpg","url":"https:\/\/t.co\/23dN8U8iqA","display_url":"pic.twitter.com\/23dN8U8iqA","expanded_url":"https:\/\/twitter.com\/Amie_025\/status\/1047807696135577600\/photo\/1","type":"photo","sizes":{"thumb":{"w":150,"h":150,"resize":"crop"},"large":{"w":
2048,"h":1536,"resize":"fit"},"medium":{"w":1200,"h":900,"resize":"fit"},"small":{"w":680,"h":510,"resize":"fit"}}},{"id":1047807655043989514,"id_str":"1047807655043989514","indices":[75,98],"media_url":"http:\/\/pbs.twimg.com\/media\/DoqPgr_VAAocbBU.jpg","media_url_https":"https:\/\/pbs.twimg.com\/media\/DoqPgr_VAAocbBU.jpg","url":"https:\/\/t.co\/23dN8U8iqA","display_url":"pic.twitter.com\/23dN8U8iqA","expanded_url":"https:\/\/twitter.com\/Amie_025\/status\/1047807696135577600\/photo\/1","type":"photo","sizes":{"thumb":{"w":150,"h":150,"resize":"crop"},"large":{"w":1728,"h":816,"resize":"fit"},"medium":{"w":1200,"h":567,"resize":"fit"},"small":{"w":680,"h":321,"resize":"fit"}}},{"id":1047807680344023040,"id_str":"1047807680344023040","indices":[75,98],"media_url":"http:\/\/pbs.twimg.com\/media\/DoqPiKPU4AA8g5p.jpg","media_url_https":"https:\/\/pbs.twimg.com\/media\/DoqPiKPU4AA8g5p.jpg","url":"https:\/\/t.co\/23dN8U8iqA","display_url":"pic.twitter.com\/23dN8U8iqA","expanded_url":"https:\/\/twitter.com\/Amie_025\/status\/1047807696135577600\/photo\/1","type":"photo","sizes":{"thumb":{"w":150,"h":150,"resize":"crop"},"medium":{"w":1024,"h":768,"resize":"fit"},"small":{"w":680,"h":510,"resize":"fit"},"large":{"w":1024,"h":768,"resize":"fit"}}}]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"low","lang":"tl","timestamp_ms":"1538651796741"}
mgchco

1 Answer


The program you show only splits the file into one file per line and then reads one of those files back. If no processing is involved, you don't need to parse the JSON and then serialize it again. Split the file as text instead:

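import json
import os

def writeJsonFile(json, tweetCount):
    # `json` here is the raw line (a str), written unchanged:
    # no parsing and no re-serializing
    filenumber = "tweets/tweet%s.json" % tweetCount
    with open(filenumber, 'w') as f:
        f.write(json)

def getJsonLine(filename):
    count = 0
    os.makedirs("tweets", exist_ok=True)  # the output directory must exist
    with open(filename) as infp:
        for line in infp:
            if line.strip():  # skip blank lines
                count += 1
                writeJsonFile(line, count)

def readJsonFile(filename):
    with open(filename) as f:
        data = json.load(f)
        print(data['id'])

if __name__ == '__main__':
    getJsonLine("largeJsonFile.json")
    # read back one of the files written above
    readJsonFile("tweets/tweet1.json")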

Additionally, if you process the input file only once and need just one specific line from it, there is no need to write every JSON object to its own file. You can find that line and parse only it.
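As a rough sketch of that idea, the function below skips to the wanted line without parsing the earlier ones, then keeps only text, date and geo. It assumes the file holds one complete JSON object per line (as the question's code does) and that the date is the `created_at` field; `read_tweet` is a hypothetical name, not part of the original code.

```python
import json
from itertools import islice

def read_tweet(filename, line_number):
    """Parse only the line_number-th non-blank line (1-based) of the file."""
    with open(filename) as infp:
        non_blank = (line for line in infp if line.strip())
        # islice consumes the earlier lines without parsing them as JSON
        line = next(islice(non_blank, line_number - 1, None), None)
    if line is None:
        raise IndexError("file has fewer than %d tweets" % line_number)
    tweet = json.loads(line)
    # keep only the fields of interest; .get() returns None for a missing key
    return {
        "text": tweet.get("text"),
        "date": tweet.get("created_at"),  # assumption: "date" means created_at
        "geo": tweet.get("geo"),
    }
```

The file is still read sequentially up to the requested line, so this suits a single lookup rather than many repeated ones.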

If you need to do this multiple times, you can scan the file once and record the offset at which each line begins. Then you can call f.seek(offset) to jump to the line you want and parse only that line. This saves both the time needed to write the individual JSON files and the disk space they would take.
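A minimal sketch of the offset approach, again assuming one JSON object per line. The names `build_offsets` and `load_tweet` are hypothetical. The file is opened in binary mode so that the recorded positions are exact byte offsets that `seek` can use directly.

```python
import json

def build_offsets(filename):
    """Scan the file once; return the byte offset where each non-blank line starts."""
    offsets = []
    position = 0
    with open(filename, "rb") as infp:  # binary mode: offsets are plain byte counts
        for line in infp:
            if line.strip():
                offsets.append(position)
            position += len(line)
    return offsets

def load_tweet(filename, offsets, index):
    """Seek straight to the index-th tweet (0-based) and parse only that line."""
    with open(filename, "rb") as infp:
        infp.seek(offsets[index])
        # json.loads accepts UTF-8 encoded bytes (Python 3.6+)
        return json.loads(infp.readline())
```

The offsets list holds one integer per tweet, so it is far smaller than the file itself and could be saved for later runs if the input file does not change.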

  • Thanks! I will try this. I also tried json.loads() on each line. I'm not really sure if that's what I'm looking for. – mgchco Nov 28 '18 at 16:07
  • Hi! I tried your code and I'm getting an error. TypeError: write() argument must be str, not dict – mgchco Dec 12 '18 at 11:06
  • Probably you haven't applied all the changes, or you have modified this code. In the above example the `json` argument is a string (that's the point of my answer: don't parse the string into a dict, just write it to the file directly as a string), so you need to check that you really removed the parsing. Also note that this code only demonstrates the idea; I haven't tested it. – Roman-Stop RU aggression in UA Dec 12 '18 at 11:39