How do I keep only ASCII and discard non-ASCII characters, NBSP, etc. when calling json.dumps?

Question

I read CSV files with csv.reader and then convert the rows into a JSON file via a list of dictionaries.
In doing so, I want the keys to contain only letters and digits, with no non-ASCII characters or non-breaking spaces (NBSP). I am trying to do it like this:

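import csv
import json

# Python 2 style: csv.reader expects the file opened in binary mode
with open('/file', 'rb') as file_Read:
    reader = csv.reader(file_Read)
    lis = []
    for r in reader:
        # some_val is a placeholder for the value taken from the row
        di = {r[0].strip(): [some_val]}
        lis.append(di)

with open('/file1', 'wb') as file_Dumped:
    list_to_be_written = json.dumps(lis)
    file_Dumped.write(list_to_be_written)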

When I read the output file, the keys are followed by sequences like \xa0\xa0\xa0\xa0.
Ex - {"name \xa0\xa0\xa0\xa0":[9]}
If I use json.dumps(lis, ensure_ascii=False) instead, I see blank spaces around the keys.
Ex - {"name ":[9]}
How do I remove everything except letters and digits?

seems like duplicate of https://stackoverflow.com/questions/8689795/how-can-i-remove-non-ascii-characters-but-leave-periods-and-spaces-using-python — Harish Kumar, Jan 30 '20 at 09:07
`import string` `printable = set(string.printable)` `''.join(filter(lambda x: x in printable, list_to_be_written))` — Harish Kumar, Jan 30 '20 at 09:19
@HarishKumar That's mighty helpful, Sir. I added strip() and it gave me the desired result. — Mr.President, Jan 30 '20 at 09:19
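The filter suggested in the comments keeps everything in string.printable, which still includes spaces and punctuation. A minimal sketch of a stricter variant, assuming Python 3 and assuming that "letters and digits" means ASCII letters and digits only; the names ALLOWED and keep_alnum are made up for illustration:

```python
import string

# Hypothetical allow-list: ASCII letters and digits only
ALLOWED = set(string.ascii_letters + string.digits)

def keep_alnum(text):
    """Drop every character that is not an ASCII letter or digit."""
    return "".join(ch for ch in text if ch in ALLOWED)

print(keep_alnum("name \xa0\xa0\xa0\xa0"))  # name
```

Because spaces are not in the allow-list, no separate strip() call is needed with this variant.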

Dmitry Shevchenko · Answer 1 · 2020-01-30T09:26:07.383

1

If the unwanted spaces are only at the start or end of the string, .strip() is enough. If you need to keep the single spaces between ASCII characters and drop only the doubled ones, you can use something like this:

my_string.replace('  ', '').strip()

To remove non-ascii characters, try this:

my_string = 'name  \xa0\xa0\xa0\xa0'
# On Python 3, encode() returns bytes, so decode back to str before stripping
my_string.encode('ascii', 'ignore').decode('ascii').strip()

edited Jan 30 '20 at 09:26

answered Jan 30 '20 at 09:00


Thanks for replying, Sir. I have already removed trailing/leading whitespaces there (first line in for loop). Consider this - `s = '\xef\xbb\xbf name1'`. If you type print s, on Python idle, output would be `name1`. If you type s output would be `'\xef\xbb\xbf name1'`. How do I remove that `'\xef\xbb\xbf '` ? – Mr.President Jan 30 '20 at 09:08
Try something like this: `my_string = 'name \xa0\xa0\xa0\xa0'` `my_string.encode('ascii', 'ignore').strip()` – Dmitry Shevchenko Jan 30 '20 at 09:21
It gives this error - UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 5: ordinal not in range(128) – Mr.President Jan 30 '20 at 09:29
Please look at a solution to a similar problem here (https://stackoverflow.com/questions/46154561/remove-zero-width-space-unicode-character-from-python-string) – Dmitry Shevchenko Jan 30 '20 at 09:52
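The bytes \xef\xbb\xbf at the start of the string are the UTF-8 byte order mark (BOM). A rough sketch of one way to handle it, assuming Python 3 and assuming the CSV file is UTF-8 encoded: decoding with the "utf-8-sig" codec drops a leading BOM, after which the encode/ignore approach from this answer works on a real text string. The function names and the sample bytes are invented for illustration:

```python
import csv
import io
import json

def ascii_only(text):
    """Drop non-ASCII characters, then trim surrounding whitespace."""
    return text.encode("ascii", "ignore").decode("ascii").strip()

def rows_to_json(raw_bytes):
    # "utf-8-sig" decodes UTF-8 and removes a leading BOM if one is present
    text = raw_bytes.decode("utf-8-sig")
    reader = csv.reader(io.StringIO(text, newline=""))
    # First column becomes the cleaned key, the rest of the row the value
    records = [{ascii_only(row[0]): row[1:]} for row in reader if row]
    return json.dumps(records)

# Hypothetical file content: a BOM, a key padded with NBSPs, and a value
sample = b"\xef\xbb\xbf name1\xc2\xa0\xc2\xa0,9\r\nname2 ,10\r\n"
print(rows_to_json(sample))  # [{"name1": ["9"]}, {"name2": ["10"]}]
```

When reading from disk, the same effect can be had by opening the file in text mode with open(path, newline="", encoding="utf-8-sig") and passing it straight to csv.reader.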

score -1 · Answer 2 · edited Feb 10 '20 at 08:46

You can try this:

import pandas as pd
import json
# Read the csv file using pandas
df = pd.read_csv("YourInputCSVFile")

# Convert all columns to str so the character filter below can be applied
df = df.astype(str)

# In every column, replace control and non-ASCII characters with a space
for column in df:
    df[column] = df[column].apply(lambda x: ''.join([" " if ord(i) < 32 or ord(i) > 126 else i for i in x]))

# Convert the dataframe to a dictionary for JSON conversion
df_dict = df.to_dict()

# Save the dictionary contents to a JSON file
with open('data.json', 'w') as fp:
    json.dump(df_dict, fp)
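Note that the lambda above substitutes a space for each filtered character rather than deleting it, so padding such as NBSP turns into ordinary trailing spaces. A small pandas-free sketch of the same per-character rule, assuming Python 3, with the helper name blank_non_printable invented for illustration:

```python
def blank_non_printable(text):
    """Same rule as the lambda: anything outside 32..126 becomes a space."""
    return "".join(" " if ord(ch) < 32 or ord(ch) > 126 else ch for ch in text)

padded = blank_non_printable("name\xa0\xa0")
print(repr(padded))          # 'name  '
print(repr(padded.strip()))  # 'name'
```

If the goal is to drop the characters entirely, as the question asks, a trailing .strip() or an empty-string replacement instead of " " would be needed.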
