I'm using BeautifulSoup to parse names from multiple HTML files. Some files contain celebrities' names in other languages, such as Spanish, and these occasionally have accents.

I've tried two functions to strip the accents, and both seem to work properly on a plain string ('Jesús' -> 'Jesus'). But when I call them on data gathered with BeautifulSoup, I don't get the same result ('Jesús' -> 'JesAos').

My code:

import unicodedata
from bs4 import BeautifulSoup

def strip_accents_1(text):
    text = unicodedata.normalize('NFD', text)\
           .encode('ascii', 'ignore')\
           .decode("utf-8")
    return str(text)

def strip_accents_2(text):
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')

# This works correctly
print(strip_accents_1('Jesús'))

def html_imp(h):

    soup = BeautifulSoup(open(h), features = "lxml")

    tdTags = []
    values =[]

    sort = []

    # Collect the text of every <td> cell once
    for td in soup.find_all('td'):
        tdTags.append(td.text)

    # Pick out the cells of interest by position
    for i in [7,25,9,15,5,11]:
        values.append(tdTags[i])

    # Original name with accent
    sort.append(values[3])  
    
    # Strip accents
    sort.append(strip_accents_1(values[3]))
    
    sort.append(strip_accents_2(values[3]))
    

    print(sort)
    
    return sort

Output:

Jesus
['Jesús', 'JesAs', 'JesAos']

HTML fragment:

<TD WIDTH="80" ALIGN="center">Jesús</TD>

What's keeping the strip_accents functions from working while handling the HTML fragment?

Yassin
  • How is the HTML fragment stored on disk? If it's in a utf-8 file, I'd expect the `open(h)` to be `open(h, encoding="utf-8")`, otherwise, you'll end up using some other encoding, and run into decode errors. – Anon Coward Feb 18 '21 at 18:35
  • Possible duplicate of https://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-normalize-in-a-python-unicode-string – tripleee Feb 18 '21 at 18:43
  • @bad_coder If you read my question, you'll notice that I've used the same function as the solution you are linking. But when I pass it a bs4 value instead of a regular string, I don't get the same result. – Yassin Feb 18 '21 at 18:48

2 Answers


I know you may not be looking for yet another package to install...

But I find that Gensim has a great accent remover that works really well:

from gensim.utils import deaccent

deaccent("Šéf chomutovských komunistů dostal poštou bílý prášek")
# returns 'Sef chomutovskych komunistu dostal postou bily prasek'

It turns out that the source code is super simple, a few lines long, and just uses unicodedata. Maybe you could just replicate that; check it out here.
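
For reference, here is a minimal sketch of that approach using only the standard library. The function name `deaccent_sketch` is made up for illustration, and this is an approximation of the idea (decompose, drop combining marks, recompose), not a copy of Gensim's source:

```python
import unicodedata

def deaccent_sketch(text):
    """Remove accents by dropping combining marks (hypothetical helper)."""
    # NFD splits each accented letter into a base letter plus combining marks
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" (Mark, nonspacing) covers the combining accents
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Recompose whatever is left into canonical form
    return unicodedata.normalize("NFC", stripped)

print(deaccent_sketch("Jes\u00fas"))  # Jesus
```

Unlike the encode('ascii', 'ignore') approach, this keeps non-ASCII characters that have no accent to strip, so it will not silently delete them.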

Matt
  • I'm getting ['JesAºs Castellano Calero'] by using deaccent(values[3]) in my code. There's probably some sort of interference from BeautifulSoup – Yassin Feb 18 '21 at 18:52

It would appear the crux of the issue is that open(h) falls back to Python's default text encoding, which depends on your platform, rather than using the encoding of the file in question.
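
To make that concrete, here is a small sketch of what probably happens to the string before the accent stripping even runs. It assumes the file is UTF-8 and the default encoding is cp1252 (common on Windows); neither is confirmed in the question, and the helper name `strip` is just for illustration:

```python
import unicodedata

original = "Jes\u00fas"                  # the name as it should be read
utf8_bytes = original.encode("utf-8")    # how it is stored in a UTF-8 file
misread = utf8_bytes.decode("cp1252")    # what a cp1252 read produces (assumed default)

def strip(text, form):
    # Same technique as the question: normalize, then drop non-ASCII characters
    return unicodedata.normalize(form, text).encode("ascii", "ignore").decode("ascii")

print(strip(original, "NFD"))   # Jesus
print(strip(misread, "NFD"))    # JesAs
print(strip(misread, "NFKD"))   # JesAos
```

The two bytes of the UTF-8 'ú' are read as two separate characters, 'Ã' and 'º'. Stripping then turns 'Ã' into 'A', while 'º' is dropped under NFD and becomes 'o' under NFKD, which matches the 'JesAs' and 'JesAos' in the question's output.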

I simplified your code a bit in an attempt to debug the issue. Hopefully it demonstrates the core problem:

import unicodedata
from bs4 import BeautifulSoup

def strip_accents(text):
    # Same logic split into separate steps, just a preference change
    text = unicodedata.normalize('NFD', text)
    text = text.encode('ascii', 'ignore')
    return text.decode("utf-8")
    
# A simplified version of your code
def html_imp_old(h):
    soup = BeautifulSoup(open(h), features = "lxml")
    tags = []
    for tag in soup.find_all('td'):
        tags.append(strip_accents(tag.text)) 
    print(tags)

# Same as _old, just specifying an encoding when reading the file
def html_imp_new(h):
    soup = BeautifulSoup(open(h, encoding="utf-8"), features = "lxml")
    tags = []
    for tag in soup.find_all('td'):
        tags.append(strip_accents(tag.text)) 
    print(tags)

# Make a self-contained snippet, so write out the HTML to disk
with open("temp.html", "wt", encoding="utf-8") as f:
    f.write("<TD WIDTH=\"80\" ALIGN=\"center\">Jes\u00fas</TD>\n")
# This works correctly, outputs "Jesus"
print(strip_accents('Jes\u00fas'))
# This doesn't work, outputs "JesAs" for me, though I assume this will be OS dependent behavior
html_imp_old("temp.html")
# This works correctly, outputs "Jesus"
html_imp_new("temp.html")
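
If you want to check what your own system defaults to, something like the following should work. It is a sketch using only the standard library; the temporary file and the choice of cp1252 as the "wrong" encoding are assumptions for illustration:

```python
import locale
import os
import tempfile

# Encoding that open() falls back to in text mode when encoding= is omitted
default_encoding = locale.getpreferredencoding(False)
print("default text encoding:", default_encoding)

# Write a UTF-8 file, then read it back with two explicit encodings
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "temp.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write("<td>Jes\u00fas</td>")
    with open(path, encoding="utf-8") as f:
        right = f.read()
    with open(path, encoding="cp1252") as f:
        wrong = f.read()

# ascii() keeps the output readable on any console
print(ascii(right))
print(ascii(wrong))
```

If the first line prints something other than a UTF-8 variant, a bare open(h) will misread UTF-8 files the same way the cp1252 read does here.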
Anon Coward