
I have a text file with a huge dictionary - and it looks like this:

{"0_3":[(80.10858194902539,-175.29917925596237,1)   ],"10_10":[(50.610770881175995,-57.17018913477659,1)    , (52.946319971233606,-66.9017181918025,1)].........}

It's approximately 138 MB in size, and I need to use this dictionary and access values in my Python code. So, I have the following code fragment (diction.txt is the file, and I want the dictionary in my spots variable):

with open("diction.txt","r") as myfile:
    data = myfile.read().replace('\n','')

exec("spots = " + data)

But when I run this, I get a memory error, and I am not sure if this is because of the size of the file or something else. If the size is the problem, how can I make it work?

Thanks for your help!

edit: SOLUTION:

The solution, as pointed out by @DrV in the comments, was to get rid of the parentheses in my file, since JSON does not recognize tuples, with the following code:

import json

with open("diction.txt", "r") as myfile:
    # Strip newlines and parentheses so each tuple flattens into the
    # enclosing JSON list.
    data = myfile.read().replace('\n', '').replace('(', '').replace(')', '')
spots = json.loads(data)

And then changing the rest of my code to accommodate the fact that I changed the format from tuples to one continuous list per key.
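For example (using the key "10_10" from the sample above), a point that used to be spots["10_10"][1] now comes back from the flat list in slices of three; the variable names here are just illustrative:

# Each key now maps to one flat list: [x1, y1, 1, x2, y2, 1, ...],
# so the former tuples are recovered in strides of three.
flat = spots["10_10"]
second_point = tuple(flat[3:6])  # formerly spots["10_10"][1]
points = [tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)]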

aishpr
  • Where do you get the exception? I think you have to use mmap (https://docs.python.org/2/library/mmap.html) to work with the file. – Christian Berendt Jun 26 '14 at 17:52
  • @Christian Berendt: I do not think a 138 MB file is a big one, i.e. the problem is most probably with `exec`. Switching to 64-bit Python would probably remove the exception, but `MemoryError` is a warning sign about something using memory in the gigabyte range. Memory mapping is a handy trick with random-access files, but this one should be reasonably parseable with linear sequential access, where the operating system helps a lot. – DrV Jun 26 '14 at 17:56
  • Just a small comment: If you are not using `data` anywhere else, you do not need to assign it to a new variable, and you do not need to replace the newlines. That will make it a bit faster and less memory-intensive (the 138 MB will stay allocated until you get rid of `data` in your code). Just: `spots = json.loads(myfile.read().replace('(','[').replace(')',']'))` – DrV Jun 26 '14 at 20:17
  • Oh, okay! Since json.loads basically ignores all whitespace? Thanks! – aishpr Jun 26 '14 at 20:19

2 Answers


Using `exec` and `eval` is always a bit dangerous and best avoided. It seems that your data structure could be evaluated with:

import ast

with open("diction.txt", "r") as myfile:
    data = myfile.read().replace('\n', '')

# Evaluates the string as a Python literal (dicts, lists, tuples, numbers)
# without running it as code.
mydata = ast.literal_eval(data)

The difference here is that `ast.literal_eval` does not treat your data as program code but as data. The procedure is much lighter and safer.
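As a quick illustration on a miniature version of the data (the sample string here is made up from the question), `ast.literal_eval` parses Python tuples, which JSON cannot:

import ast

sample = '{"0_3": [(80.10858194902539, -175.29917925596237, 1)]}'
mini = ast.literal_eval(sample)  # parsed as a Python literal, never executed
print(mini["0_3"][0])            # (80.10858194902539, -175.29917925596237, 1)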

However, others have reported problems even with `ast.literal_eval`; it is still heavier machinery than you need here:

Loading 41MB file by ast.literal_eval causes MemoryError

If you have any possibility of changing the format of the file to be JSON-compliant, you could use the `json` module for both writing and reading it. JSON data is, after all, more common than Python dictionary dumps. Your data seems to be JSON apart from the use of tuples; if you change them into lists, you should be good to go.
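As a rough sketch of what the writing side could look like (the filename diction.json is just an example), `json.dump` serializes tuples as JSON lists automatically:

import json

# Hypothetical writing side: if the code that produces the file can use
# json.dump, the tuples are stored as JSON lists automatically.
spots = {"0_3": [(80.10858194902539, -175.29917925596237, 1)]}
with open("diction.json", "w") as outfile:
    json.dump(spots, outfile)

# Reading back is then a plain json.load; the tuples return as lists.
with open("diction.json", "r") as infile:
    spots = json.load(infile)
print(spots["0_3"][0])  # [80.10858194902539, -175.29917925596237, 1]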

For a bit more discussion on these options, see:

python eval vs ast.literal_eval vs JSON decode

If (and probably when) you end up with JSON, there are several libraries for it. If the standard Python `json` module is too slow in the decoding phase, you may use, e.g., `ujson`, which is advertised as being very fast.
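If you try that route, the call is essentially a drop-in replacement for the standard module (assuming ujson is installed, e.g. via pip install ujson):

import ujson  # third-party; pip install ujson

with open("diction.json", "r") as myfile:
    spots = ujson.load(myfile)  # same call shape as the stdlib json.load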

DrV
  • I tried this, but I got this error: Traceback (most recent call last): File "tilegen_buckets.py", line 436, in spots = ast.literal_eval(data) File "/usr/lib/python2.7/ast.py", line 49, in literal_eval node_or_string = parse(node_or_string, mode='eval') File "/usr/lib/python2.7/ast.py", line 37, in parse return compile(source, filename, mode, PyCF_ONLY_AST) MemoryError – aishpr Jun 26 '14 at 18:32
  • @aishpr: See my edit, you are not alone... Probably the same underlying reason. It is strange, because your data is small, maybe 50 MB in binary form. So JSON looks like the route to take. If your file continues as it starts, just replace parentheses with square brackets. – DrV Jun 26 '14 at 18:52

It sounds like you can maybe just do:

with open("diction.txt","r") as myfile:
    data = json.load(myfile)

It may raise some errors (it's hard to tell), but if you can encode your big file as JSON instead, that will probably help a lot.

Joran Beasley
  • Yes, this is giving me errors as well. It says no JSON object could be decoded when I tried the huge file, but when I tried it on a smaller one, it worked. Can you give me pointers as to how my input file must be formatted in JSON? – aishpr Jun 26 '14 at 18:35
  • @aishpr: Actually, if you can replace the parentheses () with square brackets [] in your tuples, the file could pass as JSON. JSON does not have tuples, only lists. At least the data you show can be loaded with `json.loads(s.replace('(','[').replace(')',']'))`. This is ugly and memory-consuming, but maybe worth a try. – DrV Jun 26 '14 at 18:42
  • Yeah, I did this a while back, and I changed the rest of the code to adjust to this, and just tried it, and it works! Glad someone else thought about the same thing as me! :) Thanks guys! – aishpr Jun 26 '14 at 19:12