
When web scraping, I sometimes need to get data from Persian webpages, but when I decode the page and look at the extracted data, the result is not what I expect.

Here is the step-by-step code that reproduces the problem:

1. Getting data from a Persian website

import urllib2

data = urllib2.urlopen('http://cafebazar.ir').read() # this is a persian website

2. Detecting the encoding

import chardet
chardet.detect(data)
# in this case the result is:
# {'confidence': 0.6567038227597763, 'encoding': 'ISO-8859-2'}

3. Decoding and encoding

final = data.decode(chardet.detect(data)['encoding']).encode('ascii', 'ignore')

But the final result is not in Persian at all!

Uncle
  • ASCII doesn't have Persian characters. I reckon you want 'utf-8' instead. Also, that page seems to be encoded in UTF-8 already (which makes sense, because ISO-8859-2 doesn't have Persian characters either). – Bart Friederichs Sep 24 '16 at 10:47
  • That page has a `<meta>` tag stating that it's UTF-8, and the HTTP headers from the server also state that it's UTF-8. From the `import urllib2` I assume you're using Python 2, but you should always give the correct Python version tag for Unicode questions, since Python 2 & Python 3 have quite different Unicode handling. You may find this article helpful: [Pragmatic Unicode](http://nedbatchelder.com/text/unipain.html), which was written by SO veteran Ned Batchelder. – PM 2Ring Sep 24 '16 at 10:52
  • @BartFriederichs so why result shows that this encoding is ISO ? – Uncle Sep 24 '16 at 11:16
  • @PM2Ring thank you. I just added my exact python version. would you mind helping me with code ? what should I do so ? – Uncle Sep 24 '16 at 11:17
  • Why are you encoding and not just decoding? Also, how on earth could ASCII show Persian chars? Your encode would remove all Persian or non-ASCII chars. You can see the charset in the meta tag and `Content-Type: text/html; charset=utf-8` in the headers, so that is a pretty good clue that it is UTF-8 encoded. – Padraic Cunningham Sep 24 '16 at 11:19
  • @Uncle I do not know that `chardet` package, but it is saying there is only a 60% confidence. And I think that is because most of the data in that string is HTML, which looks like ISO-8859-2. There are better ways to get the character encoding of a webpage. It is often in the HTTP header, or in the page's meta tags. – Bart Friederichs Sep 24 '16 at 11:21
  • You can't decode if lost `byte position`. Another point i try this url without any encoding errors. – dsgdfg Sep 24 '16 at 11:26
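
To make the comments above concrete, here is a minimal sketch of taking the charset from the `Content-Type` header and decoding with it. It is written for Python 3, not the Python 2 of the question. The helper name `charset_from_content_type` and the sample bytes in `raw` are assumptions for illustration; they stand in for the real page data and are not taken from the site.

```python
from email.message import Message


def charset_from_content_type(header_value, default="utf-8"):
    """Return the charset parameter of a Content-Type header value.

    Hypothetical helper; falls back to `default` when no charset is given.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset(default)


# Stand-in for the bytes returned by urlopen(...).read(); not real page data.
raw = "سلام".encode("utf-8")

# Use the charset the server declares instead of a heuristic guess.
charset = charset_from_content_type("text/html; charset=utf-8")
text = raw.decode(charset)

# Decoding the same bytes as ISO-8859-2 does not fail, but it yields
# one wrong character per byte instead of the Persian text.
wrong = raw.decode("iso-8859-2")
```

With a real response, the header value would come from the response object rather than a literal string.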

3 Answers


The fundamental problem is that character-set detection is not a completely deterministic problem. chardet, and every program like it, is a heuristic detector. There is no guarantee or expectation that it will guess correctly all the time, and your program needs to cope with that.

If your problem is a single web site, simply inspect it and hard-code the correct character set.

If you are dealing with a constrained set of sites, with a restricted and somewhat predictable set of languages, most heuristic detectors have tweaks and settings you can pass in to improve the accuracy by constraining the possibilities.

In the most general case, there is no single solution which works correctly for all the sites in the world.

Many sites lie: they give you well-defined and helpful Content-Type: headers and lang tags that totally misrepresent what's actually there. Sometimes this is because of admin error; sometimes because they use a CMS which forces them to pretend their site is in a single language when in reality it isn't; and often because there is no language support in the back end, and something along the way "helpfully" adds a tag or header when it would be more correct, and actually helpful, to say you don't know when you don't know.

What you can do is to code defensively. Maybe try chardet, then fall back to whatever the site tells you, then fall back to UTF-8, then maybe Latin-1? The jury is out while the world keeps on changing...
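
One possible shape for that fallback chain, as a rough sketch in Python 3 rather than a definitive recipe. The function name `decode_defensively`, its parameters, and the `min_confidence` threshold are my assumptions, not part of chardet. The threshold is there because a wrong single-byte guess such as ISO-8859-2 decodes without raising, so a low-confidence guess would otherwise win silently.

```python
def decode_defensively(data, declared=None, detector=None, min_confidence=0.9):
    """Decode bytes by trying several encodings in order.

    Order: detector guess (only if confident), the encoding the site
    declares, UTF-8, then Latin-1. Returns (text, encoding_used).
    `detector` is any callable returning a dict with 'encoding' and
    'confidence' keys, which is the shape chardet.detect returns.
    """
    candidates = []
    if detector is not None:
        guess = detector(data)
        confidence = guess.get("confidence") or 0
        if guess.get("encoding") and confidence >= min_confidence:
            candidates.append(guess["encoding"])
    if declared:
        candidates.append(declared)
    candidates += ["utf-8", "latin-1"]

    for name in candidates:
        try:
            return data.decode(name), name
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown codec name: try the next candidate.
            continue
    # Not reached in practice: Latin-1 maps every byte to a character.
    raise ValueError("no candidate encoding could decode the data")
```

With chardet installed, it could be called as `decode_defensively(data, declared="utf-8", detector=chardet.detect)`. Note that the Latin-1 step always "succeeds", so a result labelled `latin-1` should be treated as a last resort, not as a confirmed encoding.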

tripleee

I had this problem, and none of the answers above worked well for me.

So I worked out an answer myself, and this code helped me:

// In this section we enter the data
string message = "سلام دو.ستان من یک فارسی زبان هستم";
byte[] unicodeBytes = Encoding.UTF8.GetBytes(message);

Encoding ascii = Encoding.ASCII;
Encoding unicode = Encoding.Unicode;
// Convert the bytes to ASCII (characters that ASCII cannot represent become '?')
byte[] asciiBytes = Encoding.Convert(unicode, ascii, unicodeBytes);
// Create a buffer for the ASCII characters
char[] asciiChars = new char[ascii.GetCharCount(asciiBytes, 0, asciiBytes.Length)];
// Decode the ASCII bytes into the character buffer
ascii.GetChars(asciiBytes, 0, asciiBytes.Length, asciiChars, 0);
// Rebuild the original string from the UTF-8 bytes
string asciiString = Encoding.UTF8.GetString(unicodeBytes);

This code helped me; I hope it will be useful for you as well.

A complete program follows. In this example, we first convert a string to bytes and then reconstruct the same string from those bytes.

using System;
using System.Net;
using System.Security.Cryptography;
using System.Text;

namespace ConsoleApp1
{
    class Program
    {
        static void Main(string[] args)
        {
            string unicodeString = "سلام این یک تست می باشد ";
            Encoding ascii = Encoding.ASCII;
            Encoding unicode = Encoding.Unicode;
            byte[] unicodeBytes = Encoding.UTF8.GetBytes(unicodeString);

            byte[] asciiBytes = Encoding.Convert(unicode, ascii, unicodeBytes);
            char[] asciiChars = new char[ascii.GetCharCount(asciiBytes, 0, asciiBytes.Length)];
            ascii.GetChars(asciiBytes, 0, asciiBytes.Length, asciiChars, 0);
            string asciiString = Encoding.UTF8.GetString(unicodeBytes);
        }
    }
}

The linked post, "Displaying Arabic characters in C# console application", also explains how to write Persian text to the console. If you have not applied those settings yet, you must do so first.

a.salehkho
  • Please don't post only code as an answer, but also provide an explanation of what your code does and how it solves the problem of the question. Answers with an explanation are usually more helpful and of better quality, and are more likely to attract upvotes. – Boken Dec 04 '20 at 11:03
  • I did my best to explain as easily as possible. If you have any questions, please ask. – a.salehkho Dec 04 '20 at 14:03
  • Just to spell out the slightly unobvious, the question is about Python, but your answer seems to be in C#. – tripleee Dec 15 '20 at 05:21

Instead of encoding into ASCII, you should encode into something else, for example UTF-8:

final = data.decode(chardet.detect(data)['encoding']).encode('utf-8')

To view it, though, you should write it to a file, as most terminals do not display non-ASCII characters correctly:

import io  # io.open accepts encoding= on Python 2 as well as Python 3

with io.open("temp_file.txt", "w", encoding="utf-8") as myfile:
    myfile.write(data.decode(chardet.detect(data)['encoding']))
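
As a small self-contained check of the points in this answer, the sketch below (Python 3) shows what `encode('ascii', 'ignore')` does to Persian text compared with UTF-8, and then round-trips the text through a file. The sample string and the temporary file location are assumptions for illustration, not data from the site.

```python
import io
import os
import tempfile

text = "سلام دنیا"  # sample Persian text, a stand-in for the decoded page

# 'ignore' silently drops every character ASCII cannot represent;
# only the space between the two words survives.
ascii_bytes = text.encode("ascii", "ignore")

# UTF-8 can represent every character, so the round trip is lossless.
utf8_bytes = text.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")

# Write the text to a file with an explicit encoding and read it back.
path = os.path.join(tempfile.mkdtemp(), "temp_file.txt")
with io.open(path, "w", encoding="utf-8") as myfile:
    myfile.write(text)
with io.open(path, "r", encoding="utf-8") as myfile:
    from_file = myfile.read()
```

Opening the resulting file in an editor that understands UTF-8 should show the Persian text unchanged.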
Bharel
  • this is the result when I try to write data to a text file : Traceback (most recent call last): File "", line 2, in myfile.write(data.decode(chardet.detect(data)['encoding'])) UnicodeEncodeError: 'ascii' codec can't encode characters in position 131-138: ordinal not in range(128) – Uncle Sep 24 '16 at 16:46
  • it is utf-8 at first so why I encode it again to utf-8 ? – Uncle Sep 24 '16 at 16:51
  • @Uncle The question used a dynamic source of encoding, not necessarily utf-8. – Bharel Sep 24 '16 at 16:54
  • After some research I found that if I print this data in C/C++ it gives me the right data. So how can I use printf-style printing in Python? – Uncle Sep 30 '16 at 12:16