10

I have a file in which each line contains text like this (representing the cast of a film):

[{'cast_id': 23, 'character': "Roger 'Verbal' Kint", 'credit_id': '52fe4260c3a36847f8019af7', 'gender': 2, 'id': 1979, 'name': 'Kevin Spacey', 'order': 5, 'profile_path': '/x7wF050iuCASefLLG75s2uDPFUu.jpg'}, {'cast_id': 27, 'character': 'Edie's Finneran', 'credit_id': '52fe4260c3a36847f8019b07', 'gender': 1, 'id': 2179, 'name': 'Suzy Amis', 'order': 6, 'profile_path': '/b1pjkncyLuBtMUmqD1MztD2SG80.jpg'}]

I need to convert it into a valid JSON string, converting only the necessary single quotes to double quotes (e.g. the single quotes around the word Verbal must not be converted, and any apostrophes in the text should not be converted either).

I am using Python 3.x. I need a regular expression that converts only the right single quotes to double quotes, so that the whole text becomes a valid JSON string. Any ideas?

revy
  • 5
    What produced the file? The right thing to do is parse it as a list of dictionaries, then encode it with `json.dump`. A regular expression is right out; this is not a regular language. – chepner Dec 05 '17 at 17:57
  • 1
    `import json;json.dumps(your_dict)` – Amit Tripathi Dec 05 '17 at 17:57
  • 1
    @AmitTripathi It's not a `dict` yet; it's a string in a file. – chepner Dec 05 '17 at 17:57
  • the string as shown above has a syntax error in the first place. –  Dec 05 '17 at 18:01
  • 2
    You have a serious problem with that input: the value `Edie's Finneran` is enclosed in single quotes; no parser is going to be able to tell that the apostrophe is not a closing quote. You're going to have to fix whatever is producing that file, in which case you may as well have it output JSON in the first place. – chepner Dec 05 '17 at 18:01
  • @chepner yeah, right. `json.dumps` can't be used here. – Amit Tripathi Dec 05 '17 at 18:02
  • 1
    you still haven't answered the question: where does this string come from? why is it not already json compatible? how much of it is there? –  Dec 05 '17 at 18:05
  • When you go to the doctor's do you want them to prescribe you medication to help your symptoms (but mask the overall problem) or do you want them to prescribe medication that will fix whatever is causing the symptoms in the first place? i.e. Do you want the doctor to fix your cough or do you want them to cure you of your cold? – ctwheels Dec 05 '17 at 18:15
  • @hop Here it is just a line. The whole file is about 11000 rows – revy Dec 05 '17 at 18:50
  • and you still haven't answered most of the questions… –  Dec 05 '17 at 21:36

5 Answers

13

First of all, the line you gave as an example is not parsable! … 'Edie's Finneran' … contains a syntax error, no matter what.

Assuming that you have control over the input, you could simply use eval() to read in the file. (Although, in that case one would wonder why you can't produce valid JSON in the first place…)

>>> f = open('list.txt', 'r')
>>> s = f.read().strip()
>>> l = eval(s)

>>> import pprint
>>> pprint.pprint(l)
[{'cast_id': 23,
  'character': "Roger 'Verbal' Kint",
  ...
  'profile_path': '/b1pjkncyLuBtMUmqD1MztD2SG80.jpg'}]

>>> import json
>>> json.dumps(l)
'[{"cast_id": 23, "character": "Roger \'Verbal\' Kint", "credit_id": "52fe4260c3a36847f8019af7", "gender": 2, "id": 1979, "name": "Kevin Spacey", "order": 5, "profile_path": "/x7wF050iuCASefLLG75s2uDPFUu.jpg"}, {"cast_id": 27, "character": "Edie\'s Finneran", "credit_id": "52fe4260c3a36847f8019b07", "gender": 1, "id": 2179, "name": "Suzy Amis", "order": 6, "profile_path": "/b1pjkncyLuBtMUmqD1MztD2SG80.jpg"}]'

If you don't have control over the input, this is very dangerous, as it opens you up to code injection attacks.

I cannot emphasize enough that the best solution would be to produce valid JSON in the first place.
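
If the file was written by a Python script that called `str()` on a list of dicts (an assumption; the question never says what produced it), the fix on the producing side could be as small as swapping that call for `json.dumps`. A minimal sketch, using a made-up `cast` record shaped like the data in the question:

```python
import json

# Hypothetical record, shaped like the data in the question.
cast = [{"cast_id": 27, "character": "Edie's Finneran", "name": "Suzy Amis"}]

# str() gives a Python literal: mostly single-quoted, but double-quoted
# when a string contains an apostrophe. This is not JSON.
python_repr = str(cast)

# json.dumps() gives valid JSON: every string double-quoted, apostrophe kept as-is.
json_text = json.dumps(cast)

# The JSON text round-trips without any quote fixing.
assert json.loads(json_text) == cast
print(python_repr)
print(json_text)
```

Note that `str()` itself would never emit `'Edie's Finneran'`, which suggests the sample line in the question was altered somewhere after it was written.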

3

If you do not have control over the JSON data, do not eval() it!

I created a simple JSON correction mechanism, as that is more secure:

def correctSingleQuoteJSON(s):
    rstr = ""
    escaped = False

    for c in s:
    
        if c == "'" and not escaped:
            c = '"' # replace single with double quote
        
        elif c == "'" and escaped:
            rstr = rstr[:-1] # remove escape character before single quotes
        
        elif c == '"':
            c = '\\' + c # escape existing double quotes
   
        escaped = (c == "\\") and not escaped # an unpaired backslash escapes the next character; "\\\\" is a literal backslash
        rstr += c # append the correct json
    
    return rstr

You can use the function in the following way:

import json

singleQuoteJson = "[{'cast_id': 23, 'character': 'Roger \\'Verbal\\' Kint', 'credit_id': '52fe4260c3a36847f8019af7', 'gender': 2, 'id': 1979, 'name': 'Kevin Spacey', 'order': 5, 'profile_path': '/x7wF050iuCASefLLG75s2uDPFUu.jpg'}, {'cast_id': 27, 'character': 'Edie\\'s Finneran', 'credit_id': '52fe4260c3a36847f8019b07', 'gender': 1, 'id': 2179, 'name': 'Suzy Amis', 'order': 6, 'profile_path': '/b1pjkncyLuBtMUmqD1MztD2SG80.jpg'}]"

correctJson = correctSingleQuoteJSON(singleQuoteJson)
print(json.loads(correctJson))
finnmglas
1

Here is code that produces the desired output:

import ast
def getJson(filepath):
    with open(filepath, 'r') as fr:  # close the file once it has been read
        raw_lines = fr.readlines()
    lines = []
    for line in raw_lines:
        line_split = line.split(",")
        set_line_split = []
        for i in line_split:
            i_split = i.split(":")
            i_set_split = []
            for split_i in i_split:
                set_split_i = ""
                rev = ""
                i = 0
                for ch in split_i:
                    if ch in ['\"','\'']:
                        set_split_i += ch
                        i += 1
                        break
                    else:
                        set_split_i += ch
                        i += 1
                i_rev = (split_i[i:])[::-1]
                state = False
                for ch in i_rev:
                    if ch in ['\"','\''] and state == False:
                        rev += ch
                        state = True
                    elif ch in ['\"','\''] and state == True:
                        rev += ch+"\\"
                    else:
                        rev += ch
                i_rev = rev[::-1]
                set_split_i += i_rev
                i_set_split.append(set_split_i)
            set_line_split.append(":".join(i_set_split))
        line_modified = ",".join(set_line_split)
        lines.append(ast.literal_eval(str(line_modified)))
    return lines
lines = getJson('test.txt')
for i in lines:
    print(i)
Tilak Putta
0

Apart from eval() (mentioned in user3850's answer), you can use ast.literal_eval

This has been discussed in the thread: Using python's eval() vs. ast.literal_eval()?
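
As a rough sketch of that approach (the helper name `python_literal_to_json` and the two sample lines are made up for illustration), each line can be parsed with `ast.literal_eval` and re-encoded with `json.dumps`. Lines that are not valid Python literals, such as the `'Edie's Finneran'` one from the question, raise an exception instead of being silently mangled:

```python
import ast
import json

def python_literal_to_json(line):
    """Parse one line holding a Python literal and re-encode it as JSON."""
    # literal_eval accepts only literals (strings, numbers, lists, dicts, ...);
    # it raises ValueError or SyntaxError instead of running arbitrary code.
    return json.dumps(ast.literal_eval(line))

good = """[{'cast_id': 23, 'character': "Roger 'Verbal' Kint", 'gender': 2}]"""
bad = "[{'character': 'Edie's Finneran'}]"  # unbalanced quote, as in the question

print(python_literal_to_json(good))

try:
    python_literal_to_json(bad)
except (ValueError, SyntaxError) as exc:
    # Such lines have to be repaired at the source or handled by hand.
    print("unparsable line:", type(exc).__name__)
```

Unlike `eval()`, this does not execute function calls or attribute access found in the input, so an untrusted file cannot run code through it.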

You can also look at the following discussion threads from a Kaggle competition whose data is similar to what the OP describes:

https://www.kaggle.com/c/tmdb-box-office-prediction/discussion/89313#latest-517927
https://www.kaggle.com/c/tmdb-box-office-prediction/discussion/80045#latest-518338

Kaushik Acharya
0
import ast
import json

# row['prod_cat'] is assumed to hold the single-quoted, Python-style string
json_dat = json.dumps(ast.literal_eval(row['prod_cat']))
dict_dat = json.loads(json_dat)