
To clean a string, I need to remove a substring that contains special (non-ASCII) UTF-8 characters.

Example:

source = "Skoda"
to_be_clean = "Škoda Rapid"

I need to remove the string source from to_be_clean. The catch is that to_be_clean contains a special character ("Š" instead of "S"), so a plain match fails. Is there a simple way to do this? Here is how I am doing it today:

output = to_be_clean.replace(source + ' ', '')

I was thinking about using a regular expression, but then I would need to list all the possible characters.

tripleee
Michael
    It's *really* not clear what you want. Are you hoping to find a way to make `"Škoda"` equal to `"Skoda"` so that you can then remove it? There are many questions about removing accents from Unicode; have you googled those? – tripleee Feb 21 '18 at 15:01

1 Answer


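The `unicodedata` module should solve your problem. Normalizing to NFKD splits an accented letter into its base letter plus a combining mark, and encoding to ASCII with `'ignore'` then drops the mark.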

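import unicodedata

to_be_clean = "Škoda Rapid"

# NFKD decomposes "Š" into "S" + a combining caron; the ASCII encode with
# 'ignore' discards the caron, and decode() turns the bytes back into str.
ascii_bytes = unicodedata.normalize('NFKD', to_be_clean).encode('ASCII', 'ignore')
print(ascii_bytes.decode('ASCII'))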

Output:

Skoda Rapid
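
To finish the task from the question (removing `source` once the accents are gone), something along these lines may work. This is only a sketch: `strip_accents` is a hypothetical helper name, not part of `unicodedata`, and it assumes it is acceptable for the rest of the string to lose its accents too.

```python
import unicodedata


def strip_accents(text):
    """Hypothetical helper: drop combining marks after NFKD decomposition."""
    decomposed = unicodedata.normalize("NFKD", text)
    # unicodedata.combining() returns a non-zero class for combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


source = "Skoda"
to_be_clean = "Škoda Rapid"

# Normalize first, then remove the now-matching substring.
output = strip_accents(to_be_clean).replace(source + " ", "")
print(output)  # Rapid
```

Unlike the `encode('ASCII', 'ignore')` approach, this keeps characters that have no decomposition (for example "ł" or "ø") instead of silently dropping them, so such characters would still not match a plain ASCII `source`.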
Rakesh