I have multiple JSON files filled with strings that can get up to several hundred lines. I'll only have three lines in my example of the file, but on average there are about 200-500 of these "phrases":

{
   "version": 1,
   "data": {
       "phrases":[
           "A few words that's it.",
           "This one, has a comma in it!",
           "hyphenated-sentence example"
        ]
   }
}

I need a script to go into the file (we can call it ExampleData.json) and remove all punctuation (specifically these characters: ,.?!'-) without removing the commas outside of the double quotation marks. Essentially, so that this:

"A few words that's it.",
"This one, has a comma in it!",
"hyphenated-sentence example."

Becomes this:

"A few words that's it",
"This one has a comma in it",
"hyphenated sentence example"

Also note how all the punctuation gets removed except for the hyphen. That gets replaced with a space.


I've found a nearly identical question posed for CSV files here, but I haven't been able to translate the CSV version into something that works with JSON.

The closest I've gotten with Python was with a string, via someone else's answer on a different thread:

input_str = 'please, remove all the commas between quotes,"like in here, here, here!"'

quotes = False

def noCommas(string):
    quotes = False
    output = ''
    for char in string:
        if char == '"':
            quotes = True
        if quotes == False:
            output += char
        if char != ',' and quotes == True:
            output += char
    return output

print noCommas(input_str)

(Sorry, I don't know how to put code blocks in a quote)
But it only handles a single character at a time, and adding any extra rules causes the text outside the quotes to be doubled (please becomes pplleeaassee).
One last thing: I have to do this in Python 2.7.5, which, from what I've put together searching around, makes this a bit more difficult.
I'm sorry that I'm still this new to Python and have to do something this non-trivial right away, but it wasn't really my choice.
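For reference, here is a minimal sketch of how the quote-tracking loop above could be extended to several characters. In the snippet above, `quotes` is set to `True` at the first double quote and never switched back, and each extra independent `if` can append the same character a second time. The sketch below toggles the flag on every quote and uses a single `if`/`elif` chain so each character is written at most once. The names `strip_inside_quotes` and `PUNCTUATION` are made up for illustration, and the sketch assumes the strings contain no escaped quotes (`\"`).

```python
# Characters to delete when they appear between double quotes (from the question)
PUNCTUATION = set(",.?!'")


def strip_inside_quotes(text):
    """Remove listed punctuation inside double quotes; hyphens become spaces."""
    in_quotes = False
    output = []
    for char in text:
        if char == '"':
            # Toggle on every quote so the flag is reset at the closing quote
            in_quotes = not in_quotes
            output.append(char)
        elif in_quotes and char == '-':
            output.append(' ')
        elif in_quotes and char in PUNCTUATION:
            continue  # drop this character
        else:
            output.append(char)
    return ''.join(output)


line = '"This one, has a comma in it!",'
print(strip_inside_quotes(line))
# "This one has a comma in it",
```

The single-argument `print(...)` form runs on both Python 2.7 and Python 3.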

bonzo
  • try to load your json as dict, then process your strings to remove unwanted characters using `re.sub` or `str.translate` as this answer suggests (https://stackoverflow.com/a/3939381/8053370), and then save it again into your file. – VictorGalisson Nov 06 '19 at 16:03
  • I've been able to BS my way through most of the logic. I `open(` the .json file as `fin` and apply `data = fin.read()` `data = data.replace('?','')` to all my applicable characters, except the comma. All that's left is to figure out how to decide whether a comma is inside the double quotes or not. The approaches I can think of are: if the comma is next to a `\n`, if it's next to a double quote, or if it resides inside two quotes. Still don't know if one of these or another route is the better option. – bonzo Nov 06 '19 at 22:54

1 Answer

Option 1: This should work. Load the JSON, clean each phrase in place, then write the file back.

import re
import json

# Load the JSON file into a dict
with open('C:/test/data.json') as json_file:
    data = json.load(json_file)

phrases = data['data']['phrases']
for idx, phrase in enumerate(phrases):
    # Replace hyphens with spaces first, then drop all remaining punctuation
    phrase = re.sub(r'-', ' ', phrase)
    phrases[idx] = re.sub(r'[^\w\s]', '', phrase)

# Write the cleaned data back to the same file
with open('C:/test/data.json', 'w') as outfile:
    json.dump(data, outfile, indent=4)
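As a small worked example of the two substitutions, here is a sketch that runs them on the three sample phrases from the question. It differs from the code above in one deliberate way: instead of `[^\w\s]`, which removes every character that is not a word character or whitespace, it deletes only the characters listed in the question. The helper name `clean_phrase` is made up for illustration. Because it follows the question's character list, the apostrophe is removed as well.

```python
import re


def clean_phrase(phrase):
    """Hypothetical helper: hyphens become spaces, then , . ? ! ' are deleted."""
    phrase = phrase.replace('-', ' ')
    return re.sub(r"[,.?!']", '', phrase)


phrases = [
    "A few words that's it.",
    "This one, has a comma in it!",
    "hyphenated-sentence example",
]

for phrase in phrases:
    print(clean_phrase(phrase))
# A few words thats it
# This one has a comma in it
# hyphenated sentence example
```

One thing to keep in mind on Python 2.7: without the `re.UNICODE` flag, `\w` only matches `[a-zA-Z0-9_]`, so the `[^\w\s]` pattern would also strip accented letters from the phrases. An explicit character list like the one in this sketch avoids that.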

Option 2:

Load the JSON as a string, then use a regex to find every substring between double quotes. Remove the punctuation from each of those substrings, then write the result back to file:

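import re
import json

with open('C:/test/data.json') as json_file:
    data = json.load(json_file)

# Serialize back to one string so the regex can scan the quoted substrings
data = json.dumps(data)

# Every substring between double quotes (this includes the keys)
strings = re.findall(r'"([^"]*)"', data)

for each in strings:
    new_str = re.sub(r'-', ' ', each)
    # str.strip() only trims the ends, so remove the punctuation with a regex
    new_str = re.sub(r"[,.?!']", '', new_str)

    data = data.replace('"%s"' % each, '"%s"' % new_str)

# Parse the cleaned string again so it is written as JSON, not as one quoted string
with open('C:/test/data_output.json', 'w') as outfile:
    json.dump(json.loads(data), outfile, indent=4)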
chitown88
  • Sorry, I should have specified in the OP that these are separate JSON files that can get up to several hundred lines of strings. So I wouldn't have the actual JSON in the python script, but as an individual file that I am editing. I will update my post to better reflect that. – bonzo Nov 06 '19 at 17:05
  • Ah ok. I’ll give it a look tomorrow morning with a solution that should work. Just to clarify, these phrases are all throughout? You basically want the punctuation removed from all values? Regardless of its key? – chitown88 Nov 06 '19 at 22:53
  • Yeah. So those three example phrases, pretend there are around 500 of them. That's the entire file. Also I should note that in my .json files, the `data =` before the `{` doesn't exist. I'd change it if I could but I'm not the one generating these files. – bonzo Nov 06 '19 at 22:56
  • ya the `data =` is just where you'd read in the file. is it possible to email me the json file (if it's not sensitive information)? – chitown88 Nov 07 '19 at 10:10
  • @chitown88 imo your option 1 is fine; you can just load the json like you do in option 2 and that's all. – VictorGalisson Nov 07 '19 at 13:28
  • @VictorGalisson, I agree (I mentioned that in the comments, but actually just went and edited it into the solution). But I was starting to think about whether there are nested values, or whether the json doesn't specifically have `"phrases"` as a key. Without knowing exactly what the json looks like, I was just trying to give a more robust way. I know it still has flaws, but at least it gives another option to work with. – chitown88 Nov 07 '19 at 13:33
  • @chitown88 makes sense, I was going to suggest a more robust solution but you did it already :) – VictorGalisson Nov 07 '19 at 13:41
  • @VictorGalisson if you have a solution, post it! I always like to see as many different ways to attack the same problem. It's the best way to learn! – chitown88 Nov 07 '19 at 14:24
  • @chitown88, I can't email any of the files for security reasons. :\ I can tell you that other than the actual content of the sentences, the structure of the files is identical. I'm accepting your response, but I'd like to clarify a thing or two. I put together a "solution" last night where I have `data = data.replace('?','')` for each character, for the comma problem, I replace all `",`, to `"@`, then change all commas in the file to '', then change `"@` back to `",`. I have the strong feeling that this is an inefficient way of doing it and risks having something replace incorrectly. – bonzo Nov 07 '19 at 14:50
  • Ran out of time/characters and wanted to clarify a few more things. I tried out your option 2 and as far as I can tell I copied everything the same but it outputs as a single line showing the `\n` regex as well. Your first option works perfectly however. Bringing me to my other point. for `phrases` in `for idx, v in enumerate(data['data']['phrases']):` would there be a way to have that set as a variable that the user then sets at runtime? I want to play around with these scripts and try and make it user-friendly. It's nbd if it can't/would be a whole new question. Just wondering. – bonzo Nov 07 '19 at 15:24
  • @bonzo. Sorry, I missed one thing on the `json.dump()` part of option 2. I edited it so now it should correctly write to file. I'm not sure I fully understand what you want to do with the user input. But yes it can be done. Shoot me an email, might be easier to chat "offline" jason.schvach@gmail.com – chitown88 Nov 07 '19 at 17:23
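Following up on the two open points in the comments (nested values, and choosing the key at runtime), here is a rough sketch of one way it could be done. It walks the parsed JSON recursively and cleans only the strings found under a key name supplied by the caller. The names `clean_under_key`, `clean_phrase` and `text_type` are made up for illustration, and the extra `"title"` field in the sample exists only to show that other values are left alone. In a real script the key could come from `sys.argv[1]` or from `raw_input()` on Python 2.7.

```python
import json
import re

try:
    text_type = unicode  # Python 2: json.load returns unicode strings
except NameError:
    text_type = str      # Python 3


def clean_phrase(phrase):
    """Hyphens become spaces, then , . ? ! ' are deleted."""
    return re.sub(r"[,.?!']", '', phrase.replace('-', ' '))


def clean_under_key(node, key, inside=False):
    """Return a copy of node with every string under `key` cleaned."""
    if isinstance(node, dict):
        return dict(
            (k, clean_under_key(v, key, inside or k == key))
            for k, v in node.items()
        )
    if isinstance(node, list):
        return [clean_under_key(v, key, inside) for v in node]
    if inside and isinstance(node, text_type):
        return clean_phrase(node)
    return node  # numbers, booleans, None, and strings outside `key`


doc = json.loads(
    '{"version": 1, "data": {"phrases": ["This one, has a comma in it!"],'
    ' "title": "Keep, this."}}'
)
cleaned = clean_under_key(doc, "phrases")
print(json.dumps(cleaned, indent=4, sort_keys=True))
```

Only the values are touched, never the keys, so the structure of the file stays the same. Passing `inside=True` on the first call would clean every string value regardless of its key.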