-1

What is the best way to filter text in Python so that it contains only numbers, upper- and lower-case letters, all punctuation, and whitespace characters such as newline and tab?

For example, I might have the text below and want to get rid of the pictures (emoji), while keeping the links, punctuation, letters, and numbers:

Episode 19 is OUT NOW! Pasta Go Go Food Review Candle Light Dinner in the Car! PASTA LA VISTA Click Link B…

I have looked at regular expressions, but I am not sure how they would work here. I was trying re.match.

It looks like translation tables might be the way to go, but they don't seem to work by exclusion. I would like to define the set of characters I want and remove anything else.

Wiktor Stribiżew
OptimusPrime

2 Answers

2

The unicodedata module reports the Unicode general category of each character; the categories are listed here: https://unicodebook.readthedocs.io/unicode.html#categories. Most emoji are in category "So" (Symbol, other). There may be other categories you want to filter, but as a starting point:

>>> import unicodedata
>>> text = "Episode 19 is OUT NOW! Pasta Go Go Food Review Candle Light Dinner in the Car! PASTA LA VISTA Click Link B…"
>>> filtered = "".join(c for c in text if unicodedata.category(c) != "So")
>>> filtered
'Episode 19 is OUT NOW! Pasta Go Go Food Review Candle Light Dinner in the Car! PASTA LA VISTA Click Link B…'
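
The same idea extends to a set of categories. The following is a minimal sketch, not a definitive list: the function name strip_categories and the choice of categories are assumptions for illustration. It drops "So" plus "Cf" (format characters such as the zero-width joiner used inside emoji sequences). "Sk" is deliberately left out because the ASCII characters "^" and "`" belong to it.

```python
import unicodedata

# Assumed drop set: "So" covers most emoji, "Cf" covers invisible format
# characters such as U+200D ZERO WIDTH JOINER. Adjust to taste.
DROP_CATEGORIES = {"So", "Cf"}

def strip_categories(text, drop=DROP_CATEGORIES):
    """Return text without characters whose general category is in drop."""
    return "".join(c for c in text if unicodedata.category(c) not in drop)

# Tab and newline are category "Cc", so they pass through untouched.
print(repr(strip_categories("Pasta \U0001F35D Go Go\tFood\n")))
```

Removing a character leaves the surrounding spaces in place, so a double space can remain where an emoji used to be.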
tdelaney
0

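A quick and dirty solution is to encode the string to ASCII, ignoring all non-ASCII characters, and decode it back to a str:

unicode_string.encode('ascii', 'ignore').decode('ascii')

This only works for plain English text: accented letters and non-ASCII punctuation such as "…" are removed along with the emoji.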
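
If you would rather spell out the allowed set yourself and remove everything else, a regular expression with a negated character class does that by exclusion. This is a sketch under the assumption that "allowed" means printable ASCII plus tab, newline, and carriage return; keep_printable_ascii is a hypothetical name.

```python
import re

# Matches any character that is NOT tab, newline, carriage return,
# or printable ASCII (space through tilde).
_NOT_ALLOWED = re.compile(r"[^\t\n\r\x20-\x7E]")

def keep_printable_ascii(text):
    """Remove every character outside the allowed set."""
    return _NOT_ALLOWED.sub("", text)

print(repr(keep_printable_ascii("Caf\u00e9 \U0001F35D ok!\n")))
```

Unlike the encode/decode round trip, this also strips ASCII control characters such as NUL, and the allowed set can be widened by editing the character class.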

Iain Shelvington