
I have a large pandas dataframe and would like to perform a thorough text cleaning on it. For this, I have written the code below, which checks whether a character is an emoji, a number, a Roman numeral, or a currency symbol, and replaces it with its Unicode name from the `unicodedata` package.

The code uses a double for loop, though, and I believe there must be far more efficient solutions, but I haven't yet figured out how to implement this in a vectorized manner.

My current code is as follows:

from unicodedata import name as unicodename, category

def clean_text(text):
    newtext = ""
    for item in text:
        for char in item:
            # Simple space
            if char == ' ':
                newtext += char
            # Letters
            elif category(char)[0] == 'L':
                newtext += char
            # Other symbols: emojis
            elif category(char) == 'So':
                newtext += f" {unicodename(char)} "
            # Decimal numbers
            elif category(char) == 'Nd':
                newtext += f" {unicodename(char).replace('DIGIT ', '').lower()} "
            # Letterlike numbers, e.g. Roman numerals
            elif category(char) == 'Nl':
                newtext += f" {unicodename(char)} "
            # Currency symbols
            elif category(char) == 'Sc':
                newtext += f" {unicodename(char).replace(' SIGN', '').lower()} "
            # Punctuation, invisibles (separator, control chars), maths symbols...
            else:
                newtext += " "
    return newtext

At the moment I am using this function on my dataframe with an apply:

df['Texts'] = df['Texts'].apply(clean_text)

Sample data:

import pandas as pd

l = [
    "thumbs ups should be replaced: ",
    "hearts also should be replaced:  ❤️️❤️️❤️️❤️️",
    "also other emojis: ☺️☺️",
    "numbers and digits should also go: 40/40",
    "Ⅰ, Ⅱ, Ⅲ these are roman numerals, change 'em"
]
df = pd.DataFrame(l, columns=['Texts'])
lazarea
  • save/load file with encoding='utf-8-sig', will that help? – Yadnesh Salvi Jan 31 '22 at 17:30
  • 1
    If you have a large dataframe, you might be better off using a numpy based solution which is vectorized. if you could include a representative sample of your dataframe with your various special characters, or even simply include `df.head(20).to_dict()` in your question, that would help us run your function, and also run some performance tests – Derek O Jan 31 '22 at 18:13
  • 1
    Hi @DerekO, I added a small sample dataset of five rows to represent the special character types I am currently replacing. – lazarea Jan 31 '22 at 18:25
  • @YadneshSalvi what do you mean by that? My question isn't about loading the file but processing the texts in it. Could you please elaborate on where that would fit in the process? – lazarea Jan 31 '22 at 18:27
  • 2
    @DerekO A mapping function like this doesn't really vectorize very well, I'm afraid... – AKX Jan 31 '22 at 18:53

1 Answer


A good start would be to not do as much work:

  1. once you've resolved the representation for a character, cache it. (lru_cache() does that for you)
  2. don't call category() and name() more times than you need to
from functools import lru_cache
from unicodedata import name as unicodename, category


@lru_cache(maxsize=None)
def map_char(char: str) -> str:
    if char == " ":  # Simple space
        return char

    cat = category(char)

    if cat[0] == "L":  # Letters
        return char

    name = unicodename(char, "")  # the "" default avoids a ValueError for unnamed characters (e.g. control chars)

    if cat == "So":  # Other symbols: emojis
        return f" {name} "
    if cat == "Nd":  # Decimal numbers
        return f" {name.replace('DIGIT ', '').lower()} "
    if cat == "Nl":  # Letterlike numbers e.g. Roman numerals
        return f" {name} "
    if cat == "Sc":  # Currency symbols
        return f" {name.replace(' SIGN', '').lower()} "
    # Punctuation, invisibles (separator, control chars), maths symbols...
    return " "


def clean_text(text):
    for item in text:
        new_text = "".join(map_char(char) for char in item)
        # ...
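Going a step further (my own sketch, not part of the answer): since every replacement depends only on a single character, the same mapping can be handed to `str.translate`, which performs the per-character dispatch in C. `str.translate` accepts any object that maps code points to replacement strings, so a `dict` subclass with `__missing__` can build the table lazily; the `CharTable` name is mine:

```python
from functools import lru_cache
from unicodedata import name as unicodename, category


@lru_cache(maxsize=None)
def map_char(char: str) -> str:
    # Same per-character rules as map_char in the answer above
    if char == " ":
        return char
    cat = category(char)
    if cat[0] == "L":
        return char
    # The "" default avoids a ValueError for unnamed characters (e.g. control chars)
    name = unicodename(char, "")
    if cat == "So":  # other symbols: emojis
        return f" {name} "
    if cat == "Nd":  # decimal digits
        return f" {name.replace('DIGIT ', '').lower()} "
    if cat == "Nl":  # letterlike numbers, e.g. Roman numerals
        return f" {name} "
    if cat == "Sc":  # currency symbols
        return f" {name.replace(' SIGN', '').lower()} "
    return " "


class CharTable(dict):
    # str.translate() looks characters up by code point; compute each
    # replacement on first sight and keep it in the dict for next time
    def __missing__(self, codepoint: int) -> str:
        replacement = self[codepoint] = map_char(chr(codepoint))
        return replacement


TABLE = CharTable()


def clean_text(text: str) -> str:
    return text.translate(TABLE)
```

With this, `df['Texts'].apply(clean_text)` still runs the Python-level mapping only once per distinct character; every repeat is handled inside the C-level translation loop.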
AKX
  • Very neat solution @AKX, thanks a ton! So if the function is repeatedly called and is passed the same character a second time, `lru_cache` will ensure that `category(char)` and `unicodename(char)` don't need to be evaluated again, since the result has already been cached? Do I understand the role of `lru_cache` well? – lazarea Jan 31 '22 at 19:21
  • 1
    `@lru_cache()` simply bypasses the entire function body if it's called with the same argument(s) as it has been called with before, and returns the output from that time. This is called memoization. Of course it only works correctly with pure functions - caching `time.time()` would be ill-advised. :) – AKX Jan 31 '22 at 19:28
  • 1
    And of course `lru_cache()` comes with a memory cost - I assumed you have enough memory, so the maximum size is set to `None`, i.e. unlimited, but if you run out of memory, that might be a thing to tweak. – AKX Jan 31 '22 at 19:30
  • Fantastic! Thanks a lot again, I learned a lot from this solution. :) – lazarea Jan 31 '22 at 19:32
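The memoization behaviour described in these comments is easy to observe with `cache_info()` (a small illustrative sketch; `cached_category` is a made-up wrapper, not part of the answer):

```python
from functools import lru_cache
from unicodedata import category


@lru_cache(maxsize=None)
def cached_category(char: str) -> str:
    return category(char)


cached_category("€")   # miss: computed and stored
cached_category("€")   # hit: returned from the cache, category() not called
cached_category("a")   # miss: a new character

info = cached_category.cache_info()
print(info)  # CacheInfo(hits=1, misses=2, maxsize=None, currsize=2)
```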