
Edit: Added some new information to make the question clearer.

In MATLAB versions before R2012b, urlread returns a string decoded with the wrong charset if the web content's charset is not UTF-8. (This has been improved somewhat in MATLAB R2012b.)

For example:

% a Chinese website whose content is encoded in gb2312
url = 'http://www.cnbeta.com/articles/213618.htm'; 
html = urlread(url)

Because MATLAB decodes the HTML as UTF-8 instead of gb2312, the Chinese characters in html are not displayed correctly.

If I read a Chinese website that is encoded in UTF-8, everything works fine:

% a Chinese website whose content is encoded in UTF-8
url = 'http://www.baidu.com/'; 
html = urlread(url)

So is there any way to reconstruct the string correctly from html? I tried the following, but it did not work:

>> bytes = unicode2native(html,'utf8');
>> str = native2unicode(bytes,'gb2312')

However, I do know one way to fix urlread's problem: type edit urlread.m in the console and then replace the code near line 108 (in MATLAB R2011b):

output = native2unicode(typecast(byteArrayOutputStream.toByteArray','uint8'),'UTF-8');

by:

output = native2unicode(typecast(byteArrayOutputStream.toByteArray','uint8'),'gb2312');

Save the file, and urlread now works for websites encoded in gb2312. This fix also shows why urlread sometimes fails: it always uses the UTF-8 charset to decode the downloaded bytes, even if the content is not encoded in UTF-8.
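The mismatch can be reproduced without any network access, which also suggests why the unicode2native/native2unicode round trip on html cannot work. The snippet below is only a minimal sketch: the two sample characters and the variable names are chosen for illustration and are not taken from the page above. GB2312 bytes are generally not valid UTF-8 sequences, so the decoder has to substitute them, and the original bytes are lost.

```matlab
% Two Chinese characters (U+4E2D, U+6587), built from code points so the
% example does not depend on the encoding of the source file.
original = char([20013 25991]);

% The bytes a gb2312 server would send: [214 208 206 196]
bytes = unicode2native(original, 'gb2312');

% What urlread does in R2011b: decode those bytes as UTF-8.
% The bytes are not valid UTF-8, so the decoder substitutes them.
wrong = native2unicode(bytes, 'UTF-8');

% Trying to undo the damage: re-encode as UTF-8 and compare.
back = unicode2native(wrong, 'UTF-8');
roundTripOk = isequal(back, bytes);   % false: the original bytes are gone

% Decoding the raw bytes with the right charset recovers the text.
right = native2unicode(bytes, 'gb2312');
decodedOk = isequal(right, original);
```

Since the information is lost at decoding time, the fix has to be applied where the raw bytes are still available, that is, inside urlread or a replacement for it.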

Eastsun

1 Answer

It seems that you already have the solution: instead of editing urlread.m itself, create a function called urlread_gb that decodes the content as gb2312.
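One way such a function could look is sketched below. This is not a copy of MathWorks' urlread.m: it is a hypothetical, stripped-down helper that assumes a plain GET request is enough and skips the timeout, POST parameter, and error handling of the real function. It only uses standard Java classes (java.net.URL, java.io.ByteArrayOutputStream) plus the same typecast/native2unicode call quoted in the question.

```matlab
function str = urlread_gb(url)
% URLREAD_GB  Hypothetical sketch: read a URL and decode it as gb2312.
%   Save as urlread_gb.m. Assumes a simple GET without authentication.

    urlObj = java.net.URL(url);
    in = urlObj.openStream();
    closer = onCleanup(@() in.close());   % close the stream on exit or error

    % Copy the response into a byte buffer. Reading one byte at a time is
    % slow, but it keeps the sketch free of internal MathWorks classes.
    out = java.io.ByteArrayOutputStream();
    b = in.read();
    while b ~= -1
        out.write(b);
        b = in.read();
    end

    % Java returns signed bytes (int8); reinterpret them as uint8 and
    % decode with the charset the site actually uses.
    bytes = typecast(out.toByteArray', 'uint8');
    str = native2unicode(bytes, 'gb2312');
end
```

Usage would then be html = urlread_gb('http://www.cnbeta.com/articles/213618.htm'). Keeping the helper separate means the original urlread stays untouched for UTF-8 sites.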

Dennis Jaheruddin