
I have a 1GB JSON file with very long lines. When I try to load a line from the file, I get this error in the PyCharm console:

Traceback (most recent call last):
  File "C:\Program Files\JetBrains\PyCharm 2017.3.3\helpers\pydev\pydev_run_in_console.py", line 53, in run_file
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "......... .py", line 26, in <module>
    for line in f:
MemoryError
PyDev console: starting.
Python 2.7.14 (v2.7.14:84471935ed, Sep 16 2017, 20:19:30) [MSC v.1500 32 bit (Intel)] on win32

I'm on a Windows Server machine with 64GB of RAM.

My code is:

import numpy as np
import json
import sys
import re

idRegEx = re.compile(r".*ID=")
endElRegEx = re.compile(r"'.*")

ratingsFile = sys.argv[1]
tweetsFile = sys.argv[2]
outputFile = sys.argv[3]

tweetsMap = {}

# Index every tweet by its key; this keeps every parsed tweet in memory.
with open(tweetsFile, "r") as f:
    for line in f:
        tweetData = json.loads(line)
        tweetsMap[tweetData["key"]] = tweetData

output = open(outputFile, "w")

with open(ratingsFile, "r") as f:
    header = f.next()

    for line in f:
        topicData = line.split("\t")

        topicKey = topicData[0]
        topicTerms = topicData[1]
        ratings = topicData[2]
        reasons = topicData[3]

        # Strip the brackets and quotes from the list-like ratings string
        # and convert each rating to an int.
        ratings = map(lambda x: int(x.strip().replace("'", "")),
                      ratings.replace("[", "").replace("]", "").split(","))
        ratings = np.array(ratings)

        tweetsMap[topicKey]["ratings"] = ratings.tolist()
        tweetsMap[topicKey]["mean"] = ratings.mean()

        topicMap = tweetsMap[topicKey]

        print topicMap["key"], topicMap["mean"]

        json.dump(topicMap, output, sort_keys=True)
        output.write("\n")

output.close()

Line 26 in the error message refers to

tweetData = json.loads(line)

while line 53 refers to

json.dump(topicMap, output, sort_keys=True)

The strange thing is that I forked this code from GitHub, so I would expect it to work.

ocram
  • Did you fork the data as well? Your data is too big. Why on earth do you need 1GB of JSON tweets in one file? – Patrick Artner Mar 10 '18 at 17:18
  • Possible duplicate of [Is there a memory efficient and fast way to load big json files in python?](https://stackoverflow.com/questions/2400643/is-there-a-memory-efficient-and-fast-way-to-load-big-json-files-in-python) – Martin Sand Christensen Mar 10 '18 at 17:19
  • @PatrickArtner: there may be many good reasons to read such large JSON objects, but in the case of tweets, I'd certainly prefer to put one tweet per line and read them separately rather than serialising and unserialising a great honking list of them. – Martin Sand Christensen Mar 10 '18 at 17:23
  • Yes, I forked the exact same data as well. I prefer to have a single JSON file to be consistent with the original project that I forked, and because I need to start from it as the base for my work. – ocram Mar 10 '18 at 17:25

1 Answer


It looks like you're using a 32-bit version of Python:

Python 2.7.14 (...) [MSC v.1500 32 bit (Intel)] on win32

A 32-bit process is limited to 2GB of address space on Windows, which is why you're getting a MemoryError even though the machine has plenty of RAM. Switching to the 64-bit version of Python should fix the issue if you don't want to change your script.
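You can confirm which build you're running from the interpreter itself; on a 32-bit build sys.maxsize is 2**31 - 1 and a pointer is 4 bytes wide:

import struct
import sys

# 2**31 - 1 (2147483647) on a 32-bit build, 2**63 - 1 on a 64-bit build.
print sys.maxsize

# Pointer size in bits: prints 32 or 64.
print struct.calcsize("P") * 8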
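If you'd rather keep the 32-bit interpreter, the script itself can be restructured so that it never holds every tweet in memory at once. This is only a sketch of that idea, assuming the ratings file is much smaller than the tweets file and that writing the output in tweets-file order (instead of ratings-file order) is acceptable:

import json
import sys

import numpy as np

ratingsFile = sys.argv[1]
tweetsFile = sys.argv[2]
outputFile = sys.argv[3]

# First pass: keep only the parsed ratings per topic key.
ratingsMap = {}
with open(ratingsFile, "r") as f:
    f.next()  # skip the header line
    for line in f:
        topicData = line.split("\t")
        topicKey = topicData[0]
        ratings = topicData[2]
        ratings = map(lambda x: int(x.strip().replace("'", "")),
                      ratings.replace("[", "").replace("]", "").split(","))
        ratingsMap[topicKey] = np.array(ratings)

# Second pass: stream the tweets file and write each enriched tweet
# immediately instead of building a map of all of them.
with open(tweetsFile, "r") as f, open(outputFile, "w") as output:
    for line in f:
        tweetData = json.loads(line)
        ratings = ratingsMap.get(tweetData["key"])
        if ratings is None:
            continue  # this tweet has no ratings row
        tweetData["ratings"] = ratings.tolist()
        tweetData["mean"] = ratings.mean()
        print tweetData["key"], tweetData["mean"]
        json.dump(tweetData, output, sort_keys=True)
        output.write("\n")

Only ratingsMap stays in memory; each tweet is parsed, enriched, written and discarded.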

devius
  • Thank you, I'll try! Can I install it alongside the 32-bit version, or is it better to uninstall it first? – ocram Mar 10 '18 at 17:27
  • Check out [this Stack Overflow question](https://stackoverflow.com/questions/10187072/how-do-i-install-python-2-7-3-32-bit-and-64-bit-on-windows-side-by-side) for that info, although I don't see any reason to keep 32-bit Python around if the 64-bit version is installed. – devius Mar 10 '18 at 17:30