I want to search for a pattern in a string, then search the matched text for invalid characters and either remove them or replace them with valid ones.

I have some sample dictionaries, e.g. sample_dict = {"randomId":"123y" uhnb\n g", "desc": ["sample description"]}

In this case I want to find a dictionary value, say "123y" uhnb\n g", and remove the invalid characters in it, such as ", \t and \n. What I have tried: I stored all the dictionaries in a file, read the file, and matched a pattern against the dictionary values. That only gives me a list of the matching substrings. I can also compile the pattern, but I am not sure how to perform the replacement in the original dictionary value so that my final output is: {"randomId":"123y uhnb g", "desc": ["sample description"]}

pattern = re.findall("\":\"(.+?)\"", sample_dict)

expected result:

{"randomId":"123y uhnb g", "desc": ["sample description"]}

actual result:

['123y" uhnb\n g']
Alec
Pradeep
  • Don't parse JSON with regex, use a JSON parser – miken32 Apr 20 '19 at 05:47
  • Possible duplicate of [Parse JSON in Python](https://stackoverflow.com/questions/7771011/parse-json-in-python) – miken32 Apr 20 '19 at 05:48
  • @miken32: I can use json parser but in that case as well I need to remove those invalid characters else it won't work, so in order to remove those characters I am using regex. – Pradeep Apr 20 '19 at 06:06
  • How did you end up with this strange `sample_dict` to begin with? Perhaps you can avoid that already earlier so that you do not need to replace or remove the strange characters. – JohanL Apr 20 '19 at 06:32
  • Why are you using re.findall instead of re.sub (in some capacity)? – FailSafe Apr 20 '19 at 06:40
  • Secondly, why are you passing a dictionary to findall as opposed to a string? Are you implying that you convert the regex to a string, or are you inserting the randomid key's value? – FailSafe Apr 20 '19 at 06:42
  • As suggested, using re.sub is easier to replace non-alphanumeric characters, check my answer below to see how! @Pradeep – Devesh Kumar Singh May 03 '19 at 12:39

1 Answer

You can strip the non-alphanumeric characters from each value with re.sub, as below:

import re

dct = {"randomId": "123y uhnb\n g", "desc": ["sample description"]}

# Matches anything that is not a letter, a digit, or a space
invalid = re.compile(r"[^a-zA-Z0-9 ]")

for key, value in dct.items():
    # If the value is a string, substitute directly
    if isinstance(value, str):
        dct[key] = invalid.sub("", value)
    # If the value is a list, substitute in every item of the list
    elif isinstance(value, list):
        dct[key] = [invalid.sub("", str(item)) for item in value]
    # Values of any other type are left unchanged

print(dct)
# {'randomId': '123y uhnb g', 'desc': ['sample description']}
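
If the data is still raw text from the file (so the stray quote makes it invalid JSON and it cannot be loaded into a dict yet), one possible approach is to pass a function to re.sub so that only the matched value is cleaned and the rest of the line is left alone. This is only a sketch under some assumptions: the name clean_raw_values and both patterns are hypothetical, a value is assumed to start right after ":" and to end at a quote followed by "," or "}", and \r is added to the invalid characters listed in the question. A value that legitimately contains '",' would break this.

```python
import json
import re

# Assumed raw text as read from the file: the "randomId" value contains a
# stray double quote and a real newline, so json.loads would reject it as-is.
raw = '{"randomId":"123y" uhnb\n g", "desc": ["sample description"]}'

# Assumption: a value starts right after ":" plus an opening quote and ends
# at a quote that is followed by optional whitespace and then "," or "}".
VALUE = re.compile(r'(?<=":").*?(?="\s*[,}])', re.DOTALL)

# Characters treated as invalid inside a value (from the question, plus \r).
INVALID = re.compile(r'["\t\n\r]')


def clean_raw_values(text):
    """Return text with invalid characters removed from each matched value."""
    # re.sub calls the function once per match and splices its return
    # value back into the surrounding text.
    return VALUE.sub(lambda m: INVALID.sub("", m.group(0)), text)


cleaned = clean_raw_values(raw)
print(cleaned)
# {"randomId":"123y uhnb g", "desc": ["sample description"]}
print(json.loads(cleaned))
# {'randomId': '123y uhnb g', 'desc': ['sample description']}
```

Once the text parses, json.loads gives a real dict, and the loop above can be applied to it if further cleanup is needed.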
Devesh Kumar Singh