
How to decode a string to recognize French characters in Python? For example, 'Pneus Ã©tÃ©' should become 'Pneus été'.

I tried this, but it does not work:

var = 'Pneus Ã©tÃ©'
print(var.decode('utf-8'))

This is my original code:

from bs4 import BeautifulSoup
import os
import math
import requests
import pandas as pd
import helpers

if __name__ == '__main__':
    soup = BeautifulSoup(open(os.getcwd()+"/Desktop/Pneus auto _ Michelin FR.html"), 'html.parser')
    tyre_category = soup.find_all('div', class_='tyre')
    for category in tyre_category:
        tyre_name = category.img['alt']
        tyre_season = category.find('span', class_='season-icon')['title']
        url_for_tyre_details = category.find('a', class_='tyre-detail')['href']

        print(tyre_name, tyre_season, url_for_tyre_details, sep=",")

OUTPUT:

MICHELIN Primacy 4,Pneus Ã©tÃ©,https://www.michelin.fr/pneus/michelin-primacy-4
MICHELIN Pilot Sport 4,Pneus Ã©tÃ©,https://www.michelin.fr/pneus/michelin-pilot-sport-4
MICHELIN Pilot Sport 4 S,Pneus Ã©tÃ©,https://www.michelin.fr/pneus/michelin-pilot-sport-4-s
MICHELIN Pilot Sport Cup 2,Pneus Ã©tÃ©,https://www.michelin.fr/pneus/michelin-pilot-sport-cup-2
MICHELIN CrossClimate+,toutes saisons,https://www.michelin.fr/pneus/michelin-crossclimateplus
MICHELIN Alpin 6,Pneus Hiver,https://www.michelin.fr/pneus/michelin-alpin-6
MICHELIN Pilot Alpin 5,Pneus Hiver,https://www.michelin.fr/pneus/michelin-pilot-alpin-5
MICHELIN Pilot Alpin PA4,Pneus Hiver,https://www.michelin.fr/pneus/michelin-pilot-alpin-pa4

Please note that the variable tyre_season gets printed as 'Pneus Ã©tÃ©', and I want it to be printed as 'Pneus été'.

AbdeAMNR
  • 2
    Well, that string doesn't contain French accented characters. It contains [garbage](https://en.wikipedia.org/wiki/Mojibake). How did that string get like this in the first place? – deceze May 22 '18 at 15:45
  • Possible duplicate of [Python - detect charset and convert to utf-8](https://stackoverflow.com/questions/6707657/python-detect-charset-and-convert-to-utf-8) – Giacomo Catenazzi May 22 '18 at 15:50
  • In fact, I tried to scrape a webpage, and when I grab the sentence "Pneus été" and print it, I get "Pneus Ã©tÃ©". I would like to get the correct sentence. @GiacomoCatenazzi – AbdeAMNR May 22 '18 at 23:46
  • 1
    You’re just opening the file using the wrong/unspecified encoding. – deceze May 23 '18 at 09:44
  • 1
    Add an `encoding='utf-8'` named parameter to the `open` call; that should do it. – Arminius May 23 '18 at 10:01
  • `soup = BeautifulSoup(open(os.getcwd()+"/Desktop/Pneus auto _ Michelin FR.html", encoding='utf-8'), 'lxml')`
    **this works for me**
    – AbdeAMNR May 23 '18 at 10:07
  • 1
    Would you like to accept my answer then? Great to know it solved your problem! – Arminius May 24 '18 at 16:10

1 Answer


The string in your question is the UTF-8 encoding of the Unicode string 'Pneus été', with the bytes then misread using a single-byte encoding. You can check the first half of this like so:

s = 'Pneus été'
s.encode(encoding='utf-8')

This results in the encoded bytes b'Pneus \xc3\xa9t\xc3\xa9'

Or the other way round: if you take the bytes and decode them as UTF-8:

s = b'Pneus \xC3\xA9t\xC3\xA9'
s.decode('utf-8')

You get 'Pneus été' as a Unicode string.
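
To see how the garbled form arises, here is a minimal sketch. It assumes the bytes were misread as Windows-1252 (`cp1252`), which is only a guess based on the Windows path mentioned in the comments; the helper name `repair_mojibake` is made up for illustration.

```python
# UTF-8 bytes of the correct text.
raw = 'Pneus été'.encode('utf-8')

# Decoding them with the wrong codec (assumed here: cp1252) produces the mojibake.
garbled = raw.decode('cp1252')
print(garbled)  # Pneus Ã©tÃ©


def repair_mojibake(text, wrong_codec='cp1252'):
    """Hypothetical helper: undo one round of UTF-8 bytes read as wrong_codec."""
    return text.encode(wrong_codec).decode('utf-8')


print(repair_mojibake(garbled))  # Pneus été
```

This round trip only works when every byte survived the wrong decoding, so reading the data with the right encoding in the first place is the more reliable fix.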

So, somewhere in your code you have read UTF-8 encoded bytes without decoding them properly.
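
In the question's code, the likely place is the `open(...)` call: without an `encoding` argument, Python falls back to a platform default, which on Windows is often cp1252. The sketch below reproduces this with a small stand-in file; the sample markup and the temporary file are assumptions for illustration, not the real Michelin page.

```python
import os
import tempfile

# Hypothetical one-line stand-in for the saved HTML page.
html = '<span class="season-icon" title="Pneus été"></span>'

# Write the stand-in file as UTF-8.
with tempfile.NamedTemporaryFile('w', encoding='utf-8', suffix='.html',
                                 delete=False) as f:
    f.write(html)
    path = f.name

try:
    # Wrong: decoding UTF-8 bytes as cp1252 yields mojibake.
    with open(path, encoding='cp1252') as f:
        wrong = f.read()

    # Right: state the file's real encoding explicitly.
    with open(path, encoding='utf-8') as f:
        right = f.read()
finally:
    os.remove(path)

print('Ã©' in wrong)   # True
print('été' in right)  # True
```

The same idea applies when handing the file to BeautifulSoup: open it with `encoding='utf-8'` and pass the resulting file object to the parser.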

Arminius
  • In fact, I tried to scrape a webpage, and when I grab the sentence "Pneus été" and print it, I get "Pneus Ã©tÃ©". Why does that happen? – AbdeAMNR May 22 '18 at 23:49
  • 1
    Please post how you read the response from the web server; it looks like this is the place where the encoding gets mangled. – Arminius May 23 '18 at 05:33
  • `if __name__== '__main__': soup = BeautifulSoup(open("C:/Users/usrname/Desktop/Pneus auto _ Michelin FR.html"), 'lxml') tyre_category = soup.find_all('div', class_='tyre') for category in tyre_category: tyre_name = category.img['alt'] tyre_season = category.find('span', class_='season-icon')['title'] url_for_tyre_details = category.find('a', class_='tyre-detail')['href'] print(tyre_name, tyre_season, url_for_tyre_details, sep=",") print()` Please check the formatted code above. – AbdeAMNR May 23 '18 at 09:20