
I am scraping text from a webpage in Python.

The text contains all kinds of special unicode chars such as hearts, smilies and other wild stuff.

By using content.encode('ascii', 'ignore') I am able to convert everything to ASCII but that means all accented chars and mutated vowels such as 'ä' or 'ß' are gone as well.
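For example, with a hypothetical sample string (assumed for illustration):

```python
# encode('ascii', 'ignore') drops the accented characters along with the emoji
content = "Magst du Nägel? ❤"
stripped = content.encode('ascii', 'ignore').decode('ascii')
print(stripped)  # "Magst du Ngel? " – the 'ä' is gone too
```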

How can I leave the "normal" chars such as 'ä' or 'é' intact but remove all the other stuff?

(I must admit I am quite a newbie in Python and I never really understood the magic behind character encodings.)

LordOfTheSnow
  • Can you provide an example input and expected output, and also show what you tried so far? – Lefty G Balogh Apr 23 '18 at 17:54
  • Why can't you use unicode? – cwallenpoole Apr 23 '18 at 17:54
  • Possible duplicate of [How to replace unicode characters by ascii characters in Python (perl script given)?](https://stackoverflow.com/questions/2700859/how-to-replace-unicode-characters-by-ascii-characters-in-python-perl-script-giv) – Derek Brown Apr 23 '18 at 17:58
  • `content.encode('latin1','ignore')` will keep the common Western European accented characters. You'll still lose Russian, Japanese, Chinese, etc. – Mark Tolonen Apr 23 '18 at 18:03
  • @cwallenpoole: I would like to create a wordcloud in R later in that process and don't want all these wild characters. – LordOfTheSnow Apr 23 '18 at 18:04
  • @MarkTolonen with 'latin1' now I get something like `N\xe4gel` where it should be `Nägel` (German word for 'nails') – LordOfTheSnow Apr 23 '18 at 18:08
  • @LeftyGBalogh something like this: `# create translation map for non-bmp characters non_bmp_map = dict.fromkeys(range(0x10000, sys.maxunicode + 1), 0xfffd) question = "Magst du Nägel?" fileQuestions = open(filenameQuestions, "w", encoding='utf-8') fileQuestions.write("{}\n".format(question.encode('ascii', 'ignore')))` gives me `'Magst du Ngel?'` but I would like to keep the 'ä' but not the smiley. – LordOfTheSnow Apr 23 '18 at 18:19
  • Jesus... how can I add line breaks in comments? – LordOfTheSnow Apr 23 '18 at 18:24
  • @JörgF. you can't. Edit your question instead of creating illegible comments. – lenz Apr 23 '18 at 20:17
  • Is this Python 2 or 3? In the former case, `u'N\xe4gel'` is exactly what you want. – lenz Apr 23 '18 at 20:19
  • You mean `'N\xe4gel'` (with quotes)? That's a debug representation. `print('N\xe4gel')` will display correctly. Python 2 only shows ASCII in a debug representation and escape codes for non-ASCII. See `repr()` vs. `str()`. Switch to Python 3 and it will display debug representations with non-ASCII as well, and `ascii()` can be used for the old presentation. – Mark Tolonen Apr 23 '18 at 23:26
  • Technically, my last comment about print will only work on a terminal configured for latin1. Use `sys.stdout.encoding` or decode the latin1 byte string back to Unicode before printing in Python 2. – Mark Tolonen Apr 23 '18 at 23:40

3 Answers


It's not entirely clear from your question where you draw the line between the “good” and the “bad” characters, but you probably don't know that yet, either. Unicode contains a lot of different kinds of characters, and you might not be aware of the diversity.

Unicode assigns a category to each character, such as “Letter, lowercase” or “Punctuation, final quote” or “Symbol, other”. Python's std-lib module unicodedata gives you convenient access to this information:

>>> import unicodedata as ud
>>> ud.category('ä')
'Ll'
>>> ud.category('♥')
'So'

From your examples it seems like you think letters are good, while symbols are bad. But you'll have to sort out the rest too. You probably want to keep blanks (“separators”) and punctuation as well. And you might need the marks too, as they include the combining characters.
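A minimal sketch of such a filter, assuming you keep letters, marks, separators, and punctuation (the category prefixes 'L', 'M', 'Z', 'P') and drop symbols; the sample string is made up for illustration:

```python
import unicodedata as ud

def keep_char(c):
    # Keep letters (L*), marks (M*), separators (Z*) and punctuation (P*);
    # drop symbols (S*) and everything else.
    return ud.category(c)[0] in 'LMZP'

text = "Magst du Nägel? ❤"
cleaned = ''.join(c for c in text if keep_char(c))
print(cleaned)  # "Magst du Nägel? " – the heart (category 'So') is dropped
```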

lenz

A few steps:

First, you should normalize the Unicode with unicodedata.normalize('NFC', my_text). This is not strictly part of the question, but it gives you common ground: the same character always gets the same representation.
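To see why this matters: 'ä' can be a single code point or 'a' plus a combining diaeresis, and NFC folds both into the same form (illustrative example):

```python
import unicodedata

decomposed = 'a\u0308'  # 'a' followed by a combining diaeresis
composed = unicodedata.normalize('NFC', decomposed)
print(composed == '\u00e4')            # True: both now represent 'ä'
print(len(decomposed), len(composed))  # 2 1
```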

Then you should check every character to see if you allow it or not:

import unicodedata

new_text = []
for c in my_normalized_text:
    if ord(c) < 128:
        # optional: keep ASCII characters as they are
        # (you may want to tokenize instead; see how punctuation is replaced below)
        new_text.append(c)
        continue
    cat = unicodedata.category(c)
    if cat in {'Lu', 'Ll', 'Lt', 'Lm', 'Lo', 'Nd'}:
        new_text.append(c)
    elif cat in {'Mc', 'Pc', 'Pd', 'Ps', 'Pe', 'Pi', 'Pf', 'Po', 'Zs', 'Zl', 'Zp'}:
        # replace with a space: this also tokenizes
        new_text.append(' ')
    # else: do not append. You may still append ' ' and drop the check above.
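Putting the two steps together as a function (a sketch; the sample string is assumed from the question's comments):

```python
import unicodedata

def clean(my_text):
    # step 1: normalize so equal characters share one encoding
    my_normalized_text = unicodedata.normalize('NFC', my_text)
    new_text = []
    for c in my_normalized_text:
        if ord(c) < 128:
            new_text.append(c)  # keep ASCII as-is
            continue
        cat = unicodedata.category(c)
        if cat in {'Lu', 'Ll', 'Lt', 'Lm', 'Lo', 'Nd'}:
            new_text.append(c)  # keep letters and digits
        elif cat in {'Mc', 'Pc', 'Pd', 'Ps', 'Pe', 'Pi', 'Pf', 'Po',
                     'Zs', 'Zl', 'Zp'}:
            new_text.append(' ')  # punctuation/separators become spaces
    return ''.join(new_text)

print(clean("Magst du Nägel? ❤"))  # "Magst du Nägel? "
```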

You should adapt this according to your subsequent processing steps: see the Python Unicode HOWTO and the linked page on Unicode character categories.

Giacomo Catenazzi

Well, I finally used this:

    import sys

    # create translation map for non-bmp characters
    non_bmp_map = dict.fromkeys(range(0x10000, sys.maxunicode + 1), 0xfffd)

    # strip unwanted unicode images
    question = question.translate(non_bmp_map)

    # convert to latin-1 to remove all remaining unwanted unicode characters
    # you may want to adapt this to your personal needs
    #
    # encoding to bytes with 'ignore' drops every character that has no
    # latin-1 equivalent; decoding back with latin-1 yields a plain str again
    bQuestion = question.encode('latin-1', 'ignore')
    question = bQuestion.decode('latin-1', 'ignore')
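Run against an illustrative sample string (the astral-plane emoji is first mapped to U+FFFD, which the latin-1 round-trip then drops):

```python
import sys

# illustrative sample; the emoji U+1F600 lies outside the BMP
question = "Magst du Nägel? 😀"

non_bmp_map = dict.fromkeys(range(0x10000, sys.maxunicode + 1), 0xfffd)
question = question.translate(non_bmp_map)  # 😀 becomes U+FFFD
question = question.encode('latin-1', 'ignore').decode('latin-1')  # drops U+FFFD
print(repr(question))  # 'Magst du Nägel? '
```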

Thanks to everybody who answered.

LordOfTheSnow