Python Filtering String Text to Remove Picture Characters

Question

What is the best way to filter text in Python so that it keeps only digits, upper- and lower-case letters, punctuation, and whitespace characters such as newline and tab?

For example, given the text below, I want to remove the picture characters (emoji) but keep the links, punctuation, letters, and numbers:

Episode 19 is OUT NOW! Pasta Go Go Food Review Candle Light Dinner in the Car! PASTA LA VISTA Click Link B…

I have looked at regular expressions, but I am not sure how they would work here. I was trying re.match.

It looks like translation tables might be the way to go, but they don't seem to work by exclusion. I would like to define the set of characters I want and remove anything else.

You can use the `unicodedata` module as in https://stackoverflow.com/a/62401725/642070 to do it. — tdelaney, Jun 20 '20 at 07:22

score 2 · Accepted Answer · answered Jun 20 '20 at 07:32

The unicodedata module reports each character's Unicode general category, as listed here: https://unicodebook.readthedocs.io/unicode.html#categories. Most emoji are in category "So" (Symbol, other). There may be other categories you want to filter, but as a starting point:

>>> import unicodedata
>>> text = "Episode 19 is OUT NOW! Pasta Go Go Food Review Candle Light Dinner in the Car! PASTA LA VISTA Click Link B…"
>>> filtered = "".join(c for c in text if unicodedata.category(c) != "So")
>>> filtered
'Episode 19 is OUT NOW! Pasta Go Go Food Review Candle Light Dinner in the Car! PASTA LA VISTA Click Link B…'
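If "So" alone is not enough, the same idea extends to a set of categories. The following is only a sketch: `drop_categories` is a hypothetical helper name, and the choice of extra categories is an assumption about what you want gone. For instance, emoji skin-tone modifiers are "Sk" rather than "So", but "Sk" also contains the ASCII characters "^" and "`", so this version always keeps ASCII.

```python
import unicodedata

def drop_categories(text, blocked=("So",)):
    """Drop non-ASCII characters whose general category is in `blocked`.

    Hypothetical helper, not part of the unicodedata module.
    """
    return "".join(
        c for c in text
        # ASCII is always kept, so blocking "Sk" cannot remove "^" or "`".
        if ord(c) < 128 or unicodedata.category(c) not in blocked
    )

# Spaghetti emoji, an ellipsis, and a thumbs-up with a skin-tone modifier.
sample = "Dinner \U0001F35D in the car\u2026 \U0001F44D\U0001F3FD"

only_so = drop_categories(sample)                          # modifier survives
so_and_sk = drop_categories(sample, blocked=("So", "Sk"))  # modifier removed
```

Note that removing a character leaves the surrounding spaces in place, so you may want to collapse repeated whitespace afterwards.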

score 0 · Answer 2 · Iain Shelvington · 2020-06-20T07:26:01.337

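A quick and dirty solution is to encode the string as ASCII, ignoring every non-ASCII character, and then decode the result back to a str:

unicode_string.encode('ascii', 'ignore').decode('ascii')

This only works for English text: it drops every non-ASCII character, including accented letters and the "…" in the example above.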
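For the "define the set of characters I want and remove anything else" approach from the question, a whitelist can be sketched as below. The function names are hypothetical, and using `string.printable` (ASCII digits, letters, punctuation, and whitespace) as the allowed set is an assumption about what should be kept. The ASCII round trip is shown next to it for comparison; in Python 3, `encode` returns bytes, hence the `decode`.

```python
import string

def ascii_only(text):
    """Keep only ASCII characters via an encode/decode round trip."""
    return text.encode("ascii", "ignore").decode("ascii")

# Assumed whitelist: ASCII digits, letters, punctuation, and whitespace.
ALLOWED = frozenset(string.printable)

def keep_allowed(text, allowed=ALLOWED):
    """Keep only the characters that appear in `allowed`."""
    return "".join(c for c in text if c in allowed)

sample = "caf\u00e9 \U0001F35D ok\u2026"
round_trip = ascii_only(sample)
whitelisted = keep_allowed("a\tb\n\U0001F35D!")
```

The whitelist version is easy to extend: add the extra characters you want to keep (for example "…" or accented letters) to `allowed`.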

edited Jun 20 '20 at 07:26

answered Jun 20 '20 at 07:21

Iain Shelvington
