I have searched extensively, and while there are plenty of resources on this question, I can't find a workable answer. I have watched Ned Batchelder's talk on Unicode (https://nedbatchelder.com/text/unipain.html) and read many answers on S.O., but I'm still at a loss.

I'm using Python 3 and BeautifulSoup 4 to scrape and parse a table from Wikipedia. I have a list called fighter_B:

    print(type(fighter_B))
    <class 'list'>

    print(type(fighter_B[0]))
    <class 'bs4.element.NavigableString'>

The second and third items in the list are names containing non-English letters, for example Fabrício Werdum. When I try to print one of these names, I get this error:

    print(fighter_B[1])
    UnicodeEncodeError: 'ascii' codec can't encode character '\xed' in position 4: ordinal not in range(128)

I've tried various encoding and decoding calls, but I always end up with an error, usually the same one:

    [fighter.encode('utf-8') for fighter in fighter_B]
    print(fighter_B[1])
    UnicodeEncodeError: 'ascii' codec can't encode character '\xed' in position 4: ordinal not in range(128)

    for i in fighter_B:
        i.encode('utf-8')
    print(fighter_B[1])
    UnicodeEncodeError: 'ascii' codec can't encode character '\xed' in position 4: ordinal not in range(128)

    [fighter.decode('utf-8') for fighter in fighter_B]
    AttributeError: 'NavigableString' object has no attribute 'decode'

    [str(fighter).decode('utf-8') for fighter in fighter_B]
    AttributeError: 'str' object has no attribute 'decode'

    [fighter.encode('ascii') for fighter in fighter_B]
    UnicodeEncodeError: 'ascii' codec can't encode character '\xed' in position 4: ordinal not in range(128)

All the answers I have seen simply suggest encoding the variable to 'utf-8'. I'm not sure why that isn't working here, and I wonder whether it is because each item in the list is of type 'bs4.element.NavigableString'. Any tips would be greatly appreciated, as I'm totally stumped at this point.

buchmayne

1 Answer

Preliminary answer:

I've run into a similar problem when iterating over a block of HTML to pull out one or more values, where the children turn out to be a mix of types:

    >>> for elem in li:
    ...     type(elem)
    ...
    <class 'bs4.element.Tag'>
    <class 'bs4.element.NavigableString'>
    <class 'bs4.element.Tag'>
    <class 'bs4.element.Tag'>
    <class 'bs4.element.NavigableString'>
    <class 'bs4.element.Tag'>
    <class 'bs4.element.NavigableString'>
    <class 'bs4.element.Tag'>
    <class 'bs4.element.NavigableString'>
    <class 'bs4.element.Tag'>
    <class 'bs4.element.NavigableString'>
    <class 'bs4.element.Tag'>
    <class 'bs4.element.NavigableString'>
    <class 'bs4.element.Tag'>
    <class 'bs4.element.NavigableString'>

In those cases, you can't treat every element the same way, because Tag and NavigableString objects expose different methods. It can therefore make sense to add another, more specific findAll call (for example, on a tag name) so that every result is a Tag.
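A minimal sketch of that idea, using a made-up HTML snippet rather than the real Wikipedia markup (the snippet and the variable names are assumptions for illustration only). It shows the mixed child types and two ways to end up with Tag objects only:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup, not the actual Wikipedia page structure.
html = (
    "<ul><li><a href='#a'>Stipe Miocic</a> vs. "
    "<a href='#b'>Fabr\u00edcio Werdum</a> (UFC)</li></ul>"
)
soup = BeautifulSoup(html, "html.parser")
li = soup.find("li")

# Iterating over a Tag yields its direct children, which are a mix of
# Tag objects (the <a> elements) and NavigableString objects (loose text).
child_types = [type(child).__name__ for child in li]

# Option 1: filter the children by type.
tags_only = [child for child in li if isinstance(child, Tag)]

# Option 2: search for a specific tag name, which only ever returns Tags.
link_texts = [a.get_text() for a in li.find_all("a")]
```

Either option gives a list whose items all support the same methods, so a single loop body works for every item.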

Does the following execute for you?

    import requests
    from bs4 import BeautifulSoup

    url = r'https://en.wikipedia.org/wiki/List_of_male_mixed_martial_artists'

    # Download the page and parse it with the built-in HTML parser
    html = requests.get(url)
    soup = BeautifulSoup(html.text, 'html.parser')

    names = []

    for li in soup.findAll('li'):
        # Keep the text of the second link in each list item
        for i, link in enumerate(li.findAll('a')):
            if i == 1:
                names.append(link.getText())

Does the expression 'Fabrício Werdum' in names evaluate to True?
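If it does, the scraped data is fine and the failure happens only when printing. My guess, which the question does not confirm, is that your output stream is set to ASCII (for example because of the terminal's locale settings). The sketch below uses a stand-in list instead of the scraped one and a hypothetical ASCII-only stream to reproduce the error. It also shows why the encode attempts change nothing: str.encode returns a new bytes object and leaves the list untouched.

```python
import io
import sys

# Stand-in for the scraped list. NavigableString subclasses str, so plain
# strings behave the same way for this purpose. "\u00ed" is the letter í.
names = ["Stipe Miocic", "Fabr\u00edcio Werdum"]

# The membership test works on the text itself and needs no encoding.
found = "Fabr\u00edcio Werdum" in names

# encode() returns new bytes objects; the original list still holds str.
encoded = [name.encode("utf-8") for name in names]
still_str = isinstance(names[1], str)
is_bytes = isinstance(encoded[1], bytes)

# print() encodes text with the stream's encoding. An ASCII-only stream
# (assumed here to mimic the asker's terminal) rejects the accented letter.
ascii_out = io.TextIOWrapper(io.BytesIO(), encoding="ascii")
try:
    print(names[1], file=ascii_out)
    error_message = None
except UnicodeEncodeError as exc:
    error_message = str(exc)

# The same string goes through a UTF-8 stream without trouble.
utf8_buffer = io.BytesIO()
utf8_out = io.TextIOWrapper(utf8_buffer, encoding="utf-8")
print(names[1], file=utf8_out)
utf8_out.flush()

# ascii() gives an escaped representation that any stream can print.
safe_repr = ascii(names[1])

# The encoding your real output stream uses (may be None for some streams).
stdout_encoding = getattr(sys.stdout, "encoding", None)
print(stdout_encoding)
```

If the last line prints something like ascii or ANSI_X3.4-1968 on your machine, setting the environment variable PYTHONIOENCODING=utf-8 before starting Python, or calling sys.stdout.reconfigure(encoding='utf-8') on Python 3.7+, should let print(fighter_B[1]) work without any encode or decode calls.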

Jarad