I can extract information from a website with Python and BeautifulSoup. However, I get an error when a path contains a special character.

Italian has accented characters such as à, è, ì, ò and ù. If I manually replace them with a, e, i, o and u, parsing works. However, if I let BeautifulSoup parse the text automatically, I get an error. Do you know how I can convert these characters into plain vowels?

I put the following settings at the beginning of my code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
AER
  • Are you looking [to strip diacritics](http://stackoverflow.com/q/517923/364696)? The various accent marks you're talking about are diacritics, it's just unclear if that's the goal. – ShadowRanger Dec 08 '16 at 22:47

1 Answer


Use the `unidecode` package, which transliterates Unicode text to its closest ASCII equivalent. The sample below shows how to use it:

from unidecode import unidecode as ud
italian_string = "L'italiano è classificato al 21º"
ud(italian_string)

The last line returns:

=> "L'italiano e classificato al 21o"
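
If installing a third-party package is not an option, the diacritic-stripping approach linked in the comments on the question can be sketched with the standard library alone. This is a minimal sketch assuming Python 3; `strip_diacritics` is a hypothetical helper name, not part of any library. Unlike `unidecode`, it only removes combining accent marks and leaves characters such as `º` untouched.

```python
import unicodedata


def strip_diacritics(text):
    """Return text with accent marks removed (hypothetical helper)."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep everything except the combining marks themselves.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_diacritics("à è ì ò ù"))
```

Plain ASCII input passes through unchanged, so the function is safe to apply to every scraped string.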
AER
  • Well, the problem is that I am doing web scraping. The letter è was returned like this: "Ã©". If I use your approach, those characters become "A(c)". – all_key_the Dec 09 '16 at 09:42
  • Works perfectly on this: https://repl.it/languages/python3 . What is the string encoded as? – AER Dec 10 '16 at 05:28
  • If you get `"Ã©"` instead of `"è"`, your data is UTF-8 encoded. – cco Dec 10 '16 at 08:28
  • @cco So what do I have to use instead? – all_key_the Dec 14 '16 at 12:21
  • Please post a complete example of what you're trying to do. When you say 'path', do you mean the path to a file, a path to an element in the document, or the trailing components of a URL? Each of these would have a different answer (and some could have more than one). Showing what you've tried and what you want to do will be a big help. – cco Dec 14 '16 at 23:12
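
Following up on the encoding point raised in the comments above: an accented letter that shows up as two odd characters usually means UTF-8 bytes were decoded with a single-byte codec. The sketch below assumes Python 3 and assumes the wrong codec was Latin-1, which the thread does not confirm. `repair_mojibake` is a hypothetical helper name, not a BeautifulSoup or `unidecode` function.

```python
def repair_mojibake(text):
    """Undo UTF-8 text that was mistakenly decoded as Latin-1.

    Hypothetical helper; returns the input unchanged when the
    round trip is not possible, i.e. the text was probably fine.
    """
    try:
        # Recover the original bytes, then decode them as UTF-8.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


print(repair_mojibake("perchÃ©"))
```

Where the scraping code has access to the raw response bytes, decoding them as UTF-8 up front (or handing the bytes to BeautifulSoup with `from_encoding="utf-8"`) is likely the cleaner fix, since the text never gets mis-decoded in the first place. After that, `unidecode` should see the real accented letters.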