Remove accents from BeautifulSoup strings

Question

I'm using BeautifulSoup to parse names from multiple HTML files. Some files contain celebrities' names in other languages, such as Spanish, and these occasionally have accents.

I've tried two functions to strip the accents, and both seem to work properly on a plain string ('Jesús' -> 'Jesus'). But when I call them on data gathered with BeautifulSoup, I don't get the same result ('Jesús' -> 'JesAos').

My code:

import unicodedata
from bs4 import BeautifulSoup

def strip_accents_1(text):
    text = unicodedata.normalize('NFD', text)\
           .encode('ascii', 'ignore')\
           .decode("utf-8")
    return str(text)
    
def strip_accents_2(text):
   return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')

# This works correctly
print(strip_accents_1('Jesús'))

def html_imp(h):

    soup = BeautifulSoup(open(h), features = "lxml")

    tdTags = []
    values =[]

    sort = []

    for i in range(4,40):
     for i in soup.find_all('td'):
         tdTags.append(i.text) 

    for i in [7,25,9,15,5,11]:
     values.append(tdTags[i])

    # Original name with accent
    sort.append(values[3])  
    
    # Strip accents
    sort.append(strip_accents_1(values[3]))
    
    sort.append(strip_accents_2(values[3]))
    

    print(sort)
    
    return sort

Output:

Jesus
['Jesús', 'JesAs', 'JesAos']

HTML fragment:

<TD WIDTH="80" ALIGN="center">Jesús</TD>

What's keeping the strip_accents functions from working while handling the HTML fragment?

How is the HTML fragment stored on disk? If it's in a UTF-8 file, I'd expect the `open(h)` to be `open(h, encoding="utf-8")`; otherwise you'll end up using some other encoding and run into decoding errors. – Anon Coward, Feb 18 '21 at 18:35
Possible duplicate of https://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-normalize-in-a-python-unicode-string – tripleee, Feb 18 '21 at 18:43
@bad_coder If you read my question, you'll notice that I've used the same function as the solution you are linking. But when I pass it a bs4 value instead of a regular string, I don't get the same result. – Yassin, Feb 18 '21 at 18:48

Matt · Answer 1 · 2021-02-18T18:50:23.447

1

I know you may not be looking for yet another package to install...

But I find that Gensim has a great accent remover that works really well:

from gensim.utils import deaccent

deaccent("Šéf chomutovských komunistů dostal poštou bílý prášek")
# returns 'Sef chomutovskych komunistu dostal postou bily prasek'

It turns out that the source code is super simple, just a few lines long, and relies only on the standard unicodedata module. Maybe you could just replicate that; check out the Gensim source.
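For reference, the idea can be sketched with only the standard library. This is a minimal sketch, not Gensim's actual source: it assumes the technique is to decompose with NFD, drop combining marks (Unicode category `Mn`), and recompose with NFC. The name `strip_marks` is hypothetical.

```python
import unicodedata

def strip_marks(text):
    """Drop combining marks (hypothetical helper, not Gensim's code)."""
    # Split each accented letter into base letter + combining mark(s).
    decomposed = unicodedata.normalize("NFD", text)
    # Keep everything except the combining marks (category "Mn").
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Recompose whatever is left into its canonical form.
    return unicodedata.normalize("NFC", kept)

print(strip_marks("Jes\u00fas"))  # Jesus
```

Like any accent stripper, this only helps when the input string was decoded correctly in the first place.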

edited Feb 18 '21 at 18:50

answered Feb 18 '21 at 18:29


I'm getting ['JesA�s Castellano Calero'] when I use deaccent(values[3]) in my code. BeautifulSoup is probably interfering somehow. – Yassin Feb 18 '21 at 18:52

score 1 · Accepted Answer · answered Feb 18 '21 at 19:00

It would appear the crux of the issue is that `open(h)` falls back to Python's platform-dependent default encoding rather than the encoding of the file in question.
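A quick way to see which encoding a bare `open()` picks on a given machine is to inspect the file object's `encoding` attribute. The sketch below is only illustrative: it writes a throwaway file (the name `probe.html` is an assumption) and compares the default with an explicit `utf-8`.

```python
import os
import tempfile

# Hypothetical throwaway file, used only to inspect the file objects.
path = os.path.join(tempfile.mkdtemp(), "probe.html")
with open(path, "w", encoding="utf-8") as f:
    f.write("<td>Jes\u00fas</td>\n")

with open(path) as f:                      # no encoding given
    default_encoding = f.encoding          # platform/locale dependent
with open(path, encoding="utf-8") as f:    # encoding stated explicitly
    explicit_encoding = f.encoding

print("default: ", default_encoding)
print("explicit:", explicit_encoding)
```

If the first line prints something other than a UTF-8 name (for example `cp1252` on some Windows setups), a UTF-8 file is being decoded with the wrong codec before BeautifulSoup ever sees it.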

I simplified your code a bit to debug the issue; hopefully it demonstrates the core problem:

import unicodedata
from bs4 import BeautifulSoup

def strip_accents(text):
    # Just a preference change: one step per line
    text = unicodedata.normalize('NFD', text)
    text = text.encode('ascii', 'ignore')
    return text.decode("utf-8")
    
# A simplified version of your code
def html_imp_old(h):
    soup = BeautifulSoup(open(h), features = "lxml")
    tags = []
    for tag in soup.find_all('td'):
        tags.append(strip_accents(tag.text)) 
    print(tags)

# Same as _old, just specifying an encoding when reading the file
def html_imp_new(h):
    soup = BeautifulSoup(open(h, encoding="utf-8"), features = "lxml")
    tags = []
    for tag in soup.find_all('td'):
        tags.append(strip_accents(tag.text)) 
    print(tags)

# Make a self-contained snippet, so write out the HTML to disk
with open("temp.html", "wt", encoding="utf-8") as f:
    f.write("<TD WIDTH=\"80\" ALIGN=\"center\">Jes\u00fas</TD>\n")
# This works correctly, outputs "Jesus"
print(strip_accents('Jes\u00fas'))
# This doesn't work: it outputs "JesAs" for me, though I assume the behavior is OS-dependent
html_imp_old("temp.html")
# This works correctly, outputs "Jesus"
html_imp_new("temp.html")
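
The exact garbage in the question can also be reproduced without any files. The sketch below assumes the HTML was saved as UTF-8 and then decoded as cp1252 (a common Windows default; the asker's actual default encoding is not stated). The helper name `strip_ascii` is hypothetical.

```python
import unicodedata

def strip_ascii(text, form):
    """Normalize with the given form, then drop every non-ASCII character."""
    normalized = unicodedata.normalize(form, text)
    return normalized.encode("ascii", "ignore").decode("ascii")

original = "Jes\u00fas"

# Assumption: the UTF-8 bytes on disk are decoded with cp1252 instead.
misread = original.encode("utf-8").decode("cp1252")

print(misread)                       # JesÃºs
print(strip_ascii(misread, "NFD"))   # JesAs
print(strip_ascii(misread, "NFKD"))  # JesAos
print(strip_ascii(original, "NFD"))  # Jesus
```

Under that assumption, the two UTF-8 bytes of `ú` are read as the two characters `Ã` and `º`. NFD reduces `Ã` to `A` and the ASCII step drops `º`, while NFKD additionally maps `º` to `o`, which matches the `JesAs` and `JesAos` in the question's output.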
