How to encode a bs4 NavigableString in a Python list?

Question

I have searched extensively and while there are tons of resources available for answering this question, I just can't seem to get any workable answer. I have watched this talk from Ned Batchelder on Unicode (https://nedbatchelder.com/text/unipain.html) and read through lots of answers on S.O. but I'm still at a loss.

I'm using Python 3 and BeautifulSoup 4 to scrape and parse a table from Wikipedia. I have a list called fighter_B:

    print(type(fighter_B))
    <class 'list'>

    print(type(fighter_B[0]))
    <class 'bs4.element.NavigableString'>

The second and third items in the list contain names with non-English letters, for example Fabrício Werdum, and these throw an error. When I try to print the fighter name I get this error:

    print(fighter_B[1])
    UnicodeEncodeError: 'ascii' codec can't encode character '\xed' in position 4: ordinal not in range(128)

I've tried various encoding functions but I always end up throwing the same error.

    [fighter.encode('utf-8') for fighter in fighter_B]
    print(fighter_B[1])
    UnicodeEncodeError: 'ascii' codec can't encode character '\xed' in position 4: ordinal not in range(128)

    for i in fighter_B:
        i.encode('utf-8')
    print(fighter_B[1])
    UnicodeEncodeError: 'ascii' codec can't encode character '\xed' in position 4: ordinal not in range(128)

    [fighter.decode('utf-8') for fighter in fighter_B]
    AttributeError: 'NavigableString' object has no attribute 'decode'

    [str(fighter).decode('utf-8') for fighter in fighter_B]
    AttributeError: 'str' object has no attribute 'decode'

    [fighter.encode('ascii') for fighter in fighter_B]
    UnicodeEncodeError: 'ascii' codec can't encode character '\xed' in position 4: ordinal not in range(128)

All the various answers I have seen have simply suggested encoding the variable to 'utf-8'. I'm not sure why the encoding isn't working here and I am wondering if it is due to the fact that each item in the list is of type 'bs4.element.NavigableString'. Any tips would be greatly appreciated as I feel totally stumped at this point.

What is your Default Source Encoding? If it's utf-8 it should be working. — Vinícius Figueiredo, Jul 17 '17 at 01:21
print(sys.getdefaultencoding()) produces "utf-8" which is worrisome if that means it should be working — buchmayne, Jul 17 '17 at 01:41
I just tried running the script from the terminal and it executes fine without any errors! Does this mean that the issue has to do with Sublime Text? Or whatever the interpreter on Sublime Text is doing? — buchmayne, Jul 17 '17 at 01:48
This might be helpful: https://stackoverflow.com/questions/16195871/how-do-i-see-the-current-encoding-of-a-file-in-sublime-text-2 — Vinícius Figueiredo, Jul 17 '17 at 01:49

score 1 · Answer 1 · answered Jul 17 '17 at 19:56

Preliminary answer:

I've run into a similar problem: when you iterate through a block of HTML to pull out one or more values, the children turn out to be a mix of types, like this:

    >>> for elem in li:
    ...     type(elem)
    ...
    <class 'bs4.element.Tag'>
    <class 'bs4.element.NavigableString'>
    <class 'bs4.element.Tag'>
    <class 'bs4.element.Tag'>
    <class 'bs4.element.NavigableString'>
    <class 'bs4.element.Tag'>
    <class 'bs4.element.NavigableString'>
    <class 'bs4.element.Tag'>
    <class 'bs4.element.NavigableString'>
    <class 'bs4.element.Tag'>
    <class 'bs4.element.NavigableString'>
    <class 'bs4.element.Tag'>
    <class 'bs4.element.NavigableString'>
    <class 'bs4.element.Tag'>
    <class 'bs4.element.NavigableString'>

In those cases, you can't treat every item the same way, because Tag and NavigableString objects expose different methods. It can therefore make sense to add another findAll call that narrows the selection to the specific elements you want.
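As a rough illustration of the point above, the sketch below separates the two kinds of children with isinstance and, alternatively, narrows the search to the a tags. The HTML snippet is invented for the example and is not the actual Wikipedia markup.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical markup, made up for this example (not the real Wikipedia page).
html = "<ul><li><a href='#a'>Fabrício Werdum</a> (Brazil) <a href='#b'>UFC</a></li></ul>"
soup = BeautifulSoup(html, "html.parser")
li = soup.find("li")

# Option 1: split the direct children by type before calling type-specific methods.
tags = [child for child in li.children if isinstance(child, Tag)]
strings = [str(child) for child in li.children if isinstance(child, NavigableString)]

# Option 2: ask only for the elements of interest, so every item is a Tag.
link_texts = [a.get_text() for a in li.find_all("a")]
```

With either option, the resulting lists hold items of a single kind, so the same method can be called on each one.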

Does the following execute for you?

    import requests
    from bs4 import BeautifulSoup

    url = r'https://en.wikipedia.org/wiki/List_of_male_mixed_martial_artists'

    # Download the page and parse it with the built-in HTML parser
    html = requests.get(url)
    soup = BeautifulSoup(html.text, 'html.parser')

    names = []

    # For every list item, keep the text of its second link (index 1)
    for li in soup.findAll('li'):
        for i, link in enumerate(li.findAll('a')):
            if i == 1:
                names.append(link.getText())

Does 'Fabrício Werdum' in names return True?
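The comments under the question suggest the error depends on where the script runs (it works from the terminal but not from Sublime Text), which points at the encoding of the output stream rather than at the strings themselves. The sketch below is an assumption-laden illustration of that idea, not a confirmed diagnosis: it uses in-memory streams to stand in for an ASCII-only and a UTF-8 console, and a plain str in place of a NavigableString. It also shows that calling encode inside a list comprehension or loop does not change the original list.

```python
import io

names = ["Fabrício Werdum"]  # stand-in for the scraped list

# encode() returns new bytes objects; the original list is left untouched
# unless the result is assigned back to it.
encoded = [n.encode("utf-8") for n in names]

# Simulate a console whose stream encoding is ASCII (assumed here to resemble
# the Sublime Text build environment; this was not verified).
ascii_out = io.TextIOWrapper(io.BytesIO(), encoding="ascii")
try:
    print(names[0], file=ascii_out)
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# The same print succeeds when the stream encoding is UTF-8.
utf8_buf = io.BytesIO()
utf8_out = io.TextIOWrapper(utf8_buf, encoding="utf-8")
print(names[0], file=utf8_out)
utf8_out.flush()

# Possible remedies to try (not executed here):
#   import sys; print(sys.stdout.encoding)          # inspect the real stream
#   sys.stdout.reconfigure(encoding="utf-8")        # Python 3.7+
#   or set the environment variable PYTHONIOENCODING=utf-8 before launching
```

Note that sys.getdefaultencoding() reporting "utf-8" does not settle the matter, since print uses the encoding of sys.stdout, which can differ from one environment to another.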
