6

Can ReadList[..., Word] read UTF-8 (or otherwise) encoded text files, or is it ASCII-only? If it is ASCII-only, can the encoding of the already-read data be "fixed" with good performance (i.e. while preserving the performance advantage of ReadList over Import)?

Import[..., CharacterEncoding -> "UTF8"] works, but it is quite a bit slower than ReadList. Setting $CharacterEncoding has no effect on ReadList.

Download a sample UTF-8 encoded file here.

For testing performance on a large input, see the test file in this question.


Here are the timings of the answers on a large-ish text file:

Import

In[2]:= Timing[
 data = Import[file, "Text"];
 ]

Out[2]= {5.234, Null}

Heike

In[4]:= Timing[
 data = ReadList[file, String];
 FromCharacterCode[ToCharacterCode[data], "UTF8"];
 ]

Out[4]= {4.328, Null}

Mr. Wizard

In[5]:= Timing[
 string = FromCharacterCode[BinaryReadList[file], "UTF-8"];
 ]

Out[5]= {2.281, Null}
Szabolcs

2 Answers

6

This seems to work:

FromCharacterCode[ToCharacterCode[ReadList["raw.php.txt", Word]], "UTF-8"]
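A small sketch of why the round trip can work, assuming ReadList hands back each byte of the file as one character with a code in the range 0 to 255. The sample string is only an illustration and is built in memory rather than read from a file:

```mathematica
(* UTF-8 bytes of a single non-ASCII character *)
bytes = ToCharacterCode["é", "UTF-8"]       (* {195, 169} *)

(* What a byte-per-character read would produce: two characters *)
raw = FromCharacterCode[bytes]              (* "Ã©" *)

(* Recover the bytes and decode them as UTF-8 *)
FromCharacterCode[ToCharacterCode[raw], "UTF-8"]   (* "é" *)
```

ToCharacterCode and FromCharacterCode also accept lists of strings and lists of code lists, which is why the one-liner above can be applied directly to the list returned by ReadList.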

The timings I get for the linked test file are:

FromCharacterCode[ToCharacterCode[ReadList["test.txt", Word]], "UTF-8"]; // Timing

(* ==> {0.000195, Null} *)

Import["test.txt", "Text"]; // Timing

(* ==> {0.01784, Null} *)
Heike
  • Thanks! And indeed, `To/FromCharacterCode` was the key! (See the other answer and my edit with the timings.) – Szabolcs Nov 24 '11 at 10:38
  • @Szabolcs and Heike, I did not see this answer, even after I posted my update! I even refreshed the page in impatience to see if Szabolcs had tried my new answer. Perhaps I am that blind, but it seems like a glitch. Sorry for any toes I trod on. – Mr.Wizard Nov 24 '11 at 10:44
  • As I think about it I've been using a script blocker (NoScript). I'll bet it interfered with the page update somehow. – Mr.Wizard Nov 24 '11 at 10:54
4

If I leave out Word, this works:

$CharacterEncoding = "UTF-8";

ReadList["UTF8.txt"]

However, this is a failure, because without a type argument ReadList reads the data as expressions rather than as strings.

Please try this on a larger file and report its performance:

FromCharacterCode[BinaryReadList["UTF8.txt"], "UTF-8"]
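Note that this returns a single string rather than a list of words. A minimal sketch of getting words out of it, assuming that splitting on any whitespace is an acceptable stand-in for what ReadList[..., Word] does (the two may differ in how they treat separators). The bytes are built in memory here to stand in for the result of BinaryReadList:

```mathematica
(* Stand-in for BinaryReadList["UTF8.txt"]: the UTF-8 bytes of a short text *)
bytes = ToCharacterCode["naïve café\nüber", "UTF-8"];

(* Decode all bytes at once, then split on whitespace *)
text = FromCharacterCode[bytes, "UTF-8"];
StringSplit[text]    (* {"naïve", "café", "über"} *)
```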
Mr.Wizard
  • 1
    ["Shamelessly stealing from each other to arrive to the best possible solution"](http://stackoverflow.com/questions/8041703/remove-white-background-from-an-image-and-make-it-transparent) :-) Who shall I award the answer to? ;-) – Szabolcs Nov 24 '11 at 10:38
  • Your first version fooled me too. I really need to turn on showing those quotation marks ... – Szabolcs Nov 24 '11 at 10:43