I'm downloading files from S3 that contain JSON(-like) data, which I intend to parse into a Pandas DataFrame using pd.read_json.
My problem is that the files dumped into the S3 bucket use an 'octal escape' format for non-English characters, but Python/Pandas objects to the fact that an escape for the \ character is also included.
An example would be the string: "destination":"Provence-Alpes-C\\303\\264te d\'Azur"
Which prints as: Provence-Alpes-C\303\264te d'Azur
If I manually remove one of the \ characters, then Python happily interprets the string and it prints as: Provence-Alpes-Côte d'Azur
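To make it clearer why removing one backslash helps, here is a minimal sketch. It assumes Python 3 and that the octal codes are the UTF-8 bytes of the accented character; the variable names are only illustrative.

```python
# The example value typed as a Python literal with the doubled backslashes:
# each "\\" is ONE literal backslash, so the digits stay as plain text.
escaped = "Provence-Alpes-C\\303\\264te d\'Azur"
print(escaped)  # Provence-Alpes-C\303\264te d'Azur

# With a single backslash, \303 and \264 are octal escapes. In a bytes
# literal they become the two bytes 0xC3 0xB4, which is UTF-8 for "ô".
raw_bytes = b"Provence-Alpes-C\303\264te d'Azur"
decoded = raw_bytes.decode("utf-8")
print(decoded)  # Provence-Alpes-Côte d'Azur
```

So the goal is really to turn each backslash-plus-three-digits group back into a byte and then decode the bytes as UTF-8.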
There is some good stuff in this thread, and although .decode('string_escape') works well on an individual snippet, it doesn't work when the snippet is part of a much longer string comprising thousands of records.
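For reference, .decode('string_escape') only exists in Python 2. A rough Python 3 counterpart for a single snippet might look like the sketch below; the helper name string_escape_py3 is hypothetical, and it assumes the text contains only Latin-1 characters and that the octal codes are UTF-8 bytes.

```python
def string_escape_py3(text):
    # Assumption: text holds only Latin-1 characters; anything else makes
    # encode("latin-1") raise UnicodeEncodeError.
    interpreted = text.encode("latin-1").decode("unicode_escape")
    # Each octal escape is now one character in the range 0-255; turn those
    # back into bytes and read the bytes as UTF-8.
    return interpreted.encode("latin-1").decode("utf-8")

snippet = "Provence-Alpes-C\\303\\264te d\\'Azur"
print(string_escape_py3(snippet))  # Provence-Alpes-Côte d'Azur
```

Note that unicode_escape rewrites every backslash escape it finds, including \n and \", so running it over a whole JSON document can change the structure of the JSON itself, not just the octal codes.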
I believe that I need a clever way to replace the \\ with \ but, for well-documented reasons, .replace('\\', '\') doesn't work.
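As a small illustration of the quoting problem (a sketch only, with made-up variable names): in '\' the backslash escapes the closing quote, so the literal never ends, and '\\' is just one backslash. A spelling that does parse needs four backslashes for the search text and two for the replacement.

```python
two_backslashes = "\\\\"  # a str containing two backslash characters
one_backslash = "\\"      # a str containing one backslash character

# Raw literal: the text really has two backslashes before each octal code.
text = r"C\\303\\264te"
fixed = text.replace(two_backslashes, one_backslash)
print(fixed)  # C\303\264te
```

Even then the result is still the plain text \303\264 rather than the character it stands for, so the replace on its own does not decode anything.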
In order to get the files to work at all, I used a regex to remove every \ followed by a number: re.sub(r'\\(?=[0-9])', '', g). I'm thinking that an adaptation of this might be the way forward, but the number needs to be dynamic as I don't know what it will be (i.e. using \3 and \2 for the example above isn't going to work).
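One possible adaptation of that regex, sketched under a few assumptions: Python 3, every escape is exactly three octal digits in the range 000-377, and consecutive escapes together form valid UTF-8. The function name decode_octal_escapes is hypothetical. Instead of hard-coding the digits, it captures whatever digits follow the backslash(es) and converts them with a replacement function.

```python
import re

# One or more backslashes followed by three octal digits (000-377),
# repeated so that multi-byte characters are handled as one run.
OCTAL_RUN = re.compile(r"(?:\\+[0-3][0-7]{2})+")

def decode_octal_escapes(text):
    """Replace runs of octal escapes with the UTF-8 text they encode."""
    def run_to_text(match):
        codes = re.findall(r"[0-7]{3}", match.group(0))
        # int(code, 8) turns "303" into 195; bytes() collects the run.
        # Assumption: the run is valid UTF-8, otherwise decode() raises.
        return bytes(int(code, 8) for code in codes).decode("utf-8")
    return OCTAL_RUN.sub(run_to_text, text)

sample = "Provence-Alpes-C\\303\\264te d'Azur"
print(decode_octal_escapes(sample))  # Provence-Alpes-Côte d'Azur
```

Because the pattern accepts one or more backslashes, it covers both the single and the doubled form. It would also rewrite a genuine backslash that happens to be followed by three such digits, so it is only safe if that never occurs in the data.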
Help appreciated.