
I made a big mistake when I chose how to dump my data. Now I have a text file that consists of

{ "13234134": ["some", "strings", ...]}{"34545345": ["some", "strings", ...]} ..so on

How can I read it into Python?

edit: I have tried the json module. When I manually add curly braces at the beginning and end of the file, I get "ValueError: Expecting property name:"; maybe the "13234134" string is invalid for JSON, but I do not know how to avoid it.

edit 1: this is the code that produced the file:

with open('new_file.txt', 'w') as outfile:
    for index, user_id in enumerate(users):
        json.dump(get_user_tweets(user_id), outfile)
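For future dumps, a common fix (an assumption on my part, not something from the question) is to write one JSON object per line, the "JSON Lines" convention, so the file can be read back line by line. `users` and `get_user_tweets` below are stand-ins for the question's own names:

```python
import json

def dump_users(users, get_user_tweets, path):
    # Write one JSON object per line ("JSON Lines"), so each line
    # is independently parseable with json.loads.
    with open(path, 'w') as outfile:
        for user_id in users:
            json.dump({user_id: get_user_tweets(user_id)}, outfile)
            outfile.write('\n')

def load_users(path):
    # Read the file back, one object per non-empty line.
    with open(path) as infile:
        return [json.loads(line) for line in infile if line.strip()]
```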
Kirill Golikov

2 Answers


It looks like what you have is an undelimited stream of JSON objects. As if you'd called json.dump over and over on the same file, or ''.join(json.dumps(…) for …). And, in fact, the first one is exactly what you did. :)

So, you're in luck. JSON is a self-delimiting format, which means you can read up to the end of the first JSON object, then read from there up to the end of the next JSON object, and so on. The raw_decode method essentially does the hard part.

There's no stdlib function that wraps it up, and I don't know of any library that does it, but it's actually very easy to do yourself:

def loads_multiple(s):
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(s):
        # raw_decode returns the parsed object and the index just past it
        obj, pos = decoder.raw_decode(s, pos)
        # skip any whitespace between objects (e.g. a trailing newline)
        while pos < len(s) and s[pos].isspace():
            pos += 1
        yield obj

So, instead of doing this:

obj = json.loads(s)
do_stuff_with(obj)

… you do this:

for obj in loads_multiple(s):
    do_stuff_with(obj)

Or, if you want to combine all the objects into one big list:

objs = list(loads_multiple(s))
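As a self-contained check, here is the generator applied to sample data shaped like the question's file, with an extra merging step (my addition, not part of the answer), since each object is a single-entry dict keyed by a user id:

```python
import json

def loads_multiple(s):
    # Repeatedly parse one JSON object and resume from where it ended.
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(s):
        obj, pos = decoder.raw_decode(s, pos)
        yield obj

s = '{"13234134": ["some", "strings"]}{"34545345": ["more", "strings"]}'

# Merge the single-key dicts into one dict keyed by user id.
merged = {}
for obj in loads_multiple(s):
    merged.update(obj)

print(merged["13234134"])  # → ['some', 'strings']
```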
abarnert
  • Neat! Well explained too. Thanks. – Oliver W. Apr 17 '15 at 22:04
  • In case you're wondering, I've written this before because JSON-RPC over TCP is effectively your format (except you also have to deal with the possibility that you only have part of an object), and surprisingly nobody had really implemented it properly. At least in Python you can use `raw_decode`; for JavaScript and Ruby, I couldn't find a library that had an equivalent, and had to write my own… – abarnert Apr 17 '15 at 22:06

Consider simply rewriting it into something that is valid JSON. If your bad data really only contains the format you've shown (a series of JSON structures that are not comma-separated), then just add commas and square brackets:

import json

with open('/tmp/sto/junk.csv') as f:
    data = f.read()

print(data)
# insert commas between adjacent objects and wrap the whole thing in a JSON array
s = "[ {} ]".format(data.strip().replace("}{", "},{"))
print(s)
data = json.loads(s)
print(type(data))

Output:

{ "13234134": ["some", "strings"]}{"34545345": ["some", "strings", "like", "this"]}

[ { "13234134": ["some", "strings"]},{"34545345": ["some", "strings", "like", "this"]} ]
<class 'list'>
Oliver W.
  • This is somewhat brittle; e.g., `}{` can appear in the middle of a string, in which case you'll break the string. – abarnert Apr 17 '15 at 22:04
  • @abarnert, entirely correct. I was assuming the strings were well-behaved. That'd get me fired if I were working with databases in real life. I do hope the OP accepts your answer, you already have my upvote. Nevertheless I'll leave mine, because the situation will get repeated of course. And in x months from now, who will add the correct comment, pointing out the potential flaw? – Oliver W. Apr 17 '15 at 22:14
  • @OliverW., I will use this solution, because I am only interested in each user's normalized words: my next step is deleting all non-letter symbols and building a vocabulary. It doesn't matter whether the curly braces are there or not, since they will be deleted. I can't say anything about the correctness of this approach in the general case. Thank you! – Kirill Golikov Apr 17 '15 at 22:31