
I made a big mistake when I chose how to dump my data. Now I have a text file that consists of

{ "13234134": ["some", "strings", ...]}{"34545345": ["some", "strings", ...]} ..so on

How can I read it into Python?

edit: I have tried the json module. When I manually add curly braces at the beginning and end of the file, I get "ValueError: Expecting property name:"; maybe the "13234134" string is invalid for JSON, but I do not know how to avoid it.

edit 1: this is the code that produced the file:

with open('new_file.txt', 'w') as outfile:
    for index, user_id in enumerate(users):
        json.dump(get_user_tweets(user_id), outfile)
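For future dumps, a common fix (an assumption on my part, not something from the question) is to write one JSON object per line, the "JSON Lines" convention, so the file can be read back line by line. `users` and `get_user_tweets` below are stand-ins for the question's own names:

```python
import json

def dump_users(users, get_user_tweets, path):
    # Write one JSON object per line ("JSON Lines"), so each line
    # is independently parseable with json.loads.
    with open(path, 'w') as outfile:
        for user_id in users:
            json.dump({user_id: get_user_tweets(user_id)}, outfile)
            outfile.write('\n')

def load_users(path):
    # Read the file back, one object per non-empty line.
    with open(path) as infile:
        return [json.loads(line) for line in infile if line.strip()]
```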
Kirill Golikov

2 Answers


It looks like what you have is an undelimited stream of JSON objects. As if you'd called json.dump over and over on the same file, or ''.join(json.dumps(…) for …). And, in fact, the first one is exactly what you did. :)

So, you're in luck. JSON is a self-delimiting format, which means you can read up to the end of the first JSON object, then read from there up to the end of the next JSON object, and so on. The raw_decode method essentially does the hard part.

There's no stdlib function that wraps it up, and I don't know of any library that does it, but it's actually very easy to do yourself:

def loads_multiple(s):
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(s):
        # raw_decode returns the parsed object and the index just past it
        obj, pos = decoder.raw_decode(s, pos)
        # skip any whitespace between objects (e.g. a trailing newline)
        while pos < len(s) and s[pos].isspace():
            pos += 1
        yield obj

So, instead of doing this:

obj = json.loads(s)
do_stuff_with(obj)

… you do this:

for obj in loads_multiple(s):
    do_stuff_with(obj)

Or, if you want to combine all the objects into one big list:

objs = list(loads_multiple(s))
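As a self-contained check, here is the generator applied to sample data shaped like the question's file, with an extra merging step (my addition, not part of the answer), since each object is a single-entry dict keyed by a user id:

```python
import json

def loads_multiple(s):
    # Repeatedly parse one JSON object and resume from where it ended.
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(s):
        obj, pos = decoder.raw_decode(s, pos)
        yield obj

s = '{"13234134": ["some", "strings"]}{"34545345": ["more", "strings"]}'

# Merge the single-key dicts into one dict keyed by user id.
merged = {}
for obj in loads_multiple(s):
    merged.update(obj)

print(merged["13234134"])  # → ['some', 'strings']
```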
abarnert
  • Neat! Well explained too. Thanks. – Oliver W. Apr 17 '15 at 22:04
  • In case you're wondering, I've written this before because JSON-RPC over TCP is effectively your format (except you also have to deal with the possibility that you only have part of an object), and surprisingly nobody had really implemented it properly. At least in Python you can use `raw_decode`; for JavaScript and Ruby, I couldn't find a library that had an equivalent, and had to write my own… – abarnert Apr 17 '15 at 22:06

Consider simply rewriting it into something that is valid JSON. If your bad data really only contains the format you've shown (a series of JSON structures that are not comma-separated), then just add commas and square brackets:

import json

with open('/tmp/sto/junk.csv') as f:
    data = f.read()

print(data)
# insert commas between adjacent objects and wrap the whole thing in a JSON array
s = "[ {} ]".format(data.strip().replace("}{", "},{"))
print(s)
data = json.loads(s)
print(type(data))

Output:

{ "13234134": ["some", "strings"]}{"34545345": ["some", "strings", "like", "this"]}

[ { "13234134": ["some", "strings"]},{"34545345": ["some", "strings", "like", "this"]} ]
<class 'list'>
Oliver W.
  • This is somewhat brittle; e.g., `}{` can appear in the middle of a string, in which case you'll break the string. – abarnert Apr 17 '15 at 22:04
  • @abarnert, entirely correct. I was assuming the strings were well-behaved. That'd get me fired if I were working with databases in real life. I do hope the OP accepts your answer, you already have my upvote. Nevertheless I'll leave mine, because the situation will get repeated of course. And in x months from now, who will add the correct comment, pointing out the potential flaw? – Oliver W. Apr 17 '15 at 22:14
  • @OliverW., I will use this solution, because I am only interested in each user's normalized words: my next step is deleting all non-letter symbols and building a vocabulary. It doesn't matter whether the curly braces are there or not, since they will be deleted. I can't say anything about the correctness of this approach in the general case. Thank you! – Kirill Golikov Apr 17 '15 at 22:31