I have multiple JSON files filled with strings that can get up to several hundred lines. I'll only have three lines in my example of the file, but on average there are about 200-500 of these "phrases":
{
    "version": 1,
    "data": {
        "phrases": [
            "A few words that's it.",
            "This one, has a comma in it!",
            "hyphenated-sentence example"
        ]
    }
}
I need a script that goes into the file (we can call it ExampleData.json) and removes all punctuation (specifically these characters: ,.?!'-) from the phrases, without removing the commas outside of the double quotation marks. Essentially so that this:
"A few words that's it.",
"This one, has a comma in it!",
"hyphenated-sentence example."
Becomes this:
"A few words thats it",
"This one has a comma in it",
"hyphenated sentence example"
Also note how all the punctuation gets removed except for the hyphen. That gets replaced with a space.
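Because the file is valid JSON, one way to sidestep the "inside vs. outside the quotes" problem is to parse it and only touch the decoded strings. The following is a minimal sketch under that assumption, not a definitive solution. The function names clean_phrase and clean_document are made up for illustration, and the rules follow the character list exactly as written above, so the apostrophe in "that's" is dropped as well. It uses only json and re, and should run on Python 2.7 as well as Python 3.

```python
import json
import re

def clean_phrase(text):
    # Hyphens become spaces; the other listed characters are simply dropped.
    text = text.replace("-", " ")
    return re.sub(r"[,.?!']", "", text)

def clean_document(doc):
    # Assumes the layout shown above: doc["data"]["phrases"] is a list of strings.
    doc["data"]["phrases"] = [clean_phrase(p) for p in doc["data"]["phrases"]]
    return doc

raw = '''{"version": 1, "data": {"phrases": [
    "A few words that's it.",
    "This one, has a comma in it!",
    "hyphenated-sentence example"
]}}'''

doc = clean_document(json.loads(raw))
print(json.dumps(doc, indent=2))
```

The commas that separate the phrases never need special handling here, since they are part of the JSON syntax rather than of the strings, and the json module writes them back out itself.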
I've found a nearly identical question about CSV files, but I haven't been able to translate the CSV approach into something that works with JSON.
The closest I've gotten with Python works on a single string, adapted from someone else's answer on a different thread:
    input_str = 'please, remove all the commas between quotes,"like in here, here, here!"'
    quotes = False

    def noCommas(string):
        quotes = False
        output = ''
        for char in string:
            if char == '"':
                quotes = True
            if quotes == False:
                output += char
            if char != ',' and quotes == True:
                output += char
        return output

    print noCommas(input_str)
(Sorry, I don't know how to put code blocks in a quote)
But it only handles a single character (the comma), and adding any extra rules causes the text outside the quotes to be doubled (please becomes pplleeaassee).
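One plausible cause of the doubling is that each rule is a separate if statement, so more than one of them can append the same character. The sketch below is a hedged rewrite of the character-by-character idea, not the original answer's code: it uses a single if/elif chain so each character is appended at most once, and it toggles the flag on every double quote instead of only ever setting it to True. The names STRIP and strip_inside_quotes are hypothetical, and escaped quotes (\") inside a string are not handled.

```python
STRIP = set(",.?!'")

def strip_inside_quotes(line):
    in_quotes = False
    out = []
    for ch in line:
        if ch == '"':
            # Entering or leaving a quoted section; keep the quote itself.
            in_quotes = not in_quotes
            out.append(ch)
        elif in_quotes and ch == "-":
            # Hyphens inside quotes become spaces.
            out.append(" ")
        elif in_quotes and ch in STRIP:
            # Other listed punctuation inside quotes is dropped.
            continue
        else:
            # Everything else, including commas outside quotes, is kept.
            out.append(ch)
    return "".join(out)

print(strip_inside_quotes('"This one, has a comma in it!",'))
```

Applied line by line to the file this would leave the structural commas alone, but it treats every double quote as a delimiter, so it is more fragile than parsing the JSON.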
One last thing: I have to do this in Python 2.7.5, which, from what I've pieced together searching around, makes this a bit more difficult.
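For what it's worth, the json module ships with Python 2.7, so reading and rewriting the file does not need anything outside the standard library. Below is a rough sketch of an in-place rewrite, assuming the file layout shown earlier. The names rewrite_phrases and demo_transform are invented for this example, and the demo writes to a throwaway temporary file rather than a real ExampleData.json.

```python
import json
import os
import re
import tempfile
from collections import OrderedDict

def rewrite_phrases(path, transform):
    # OrderedDict keeps "version" and "data" in their original order on 2.7.
    with open(path) as fh:
        doc = json.load(fh, object_pairs_hook=OrderedDict)
    doc["data"]["phrases"] = [transform(p) for p in doc["data"]["phrases"]]
    # Explicit separators avoid trailing spaces when indent is used on 2.7.
    with open(path, "w") as fh:
        json.dump(doc, fh, indent=2, separators=(",", ": "))

def demo_transform(text):
    # Hypothetical rule set: hyphen -> space, other listed characters removed.
    return re.sub(r"[,.?!']", "", text.replace("-", " "))

# Demo on a temporary copy so no real data is touched.
path = os.path.join(tempfile.mkdtemp(), "ExampleData.json")
with open(path, "w") as fh:
    fh.write('{"version": 1, "data": {"phrases": ["This one, has a comma in it!"]}}')

rewrite_phrases(path, demo_transform)
with open(path) as fh:
    result = json.load(fh)
```

With the default settings json.dump escapes any non-ASCII characters as \uXXXX sequences, which keeps the output plain ASCII on both Python 2 and 3.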
I'm sorry that I'm still this new to Python and have to do something this non-trivial right away, but it wasn't really my choice.