I have a UTF-8 encoded .csv file that I load into H2O.ai in Python 3.7 using

h2o.load_dataset("my.csv")

The Scandinavian characters do not display correctly. The problem persists if I save my H2OFrame to disk and open the file in an editor set to UTF-8. How can I make H2O.ai understand UTF-8?

Many thanks.

rize
  • check this post: https://stackoverflow.com/questions/36462852/how-to-read-utf-8-files-with-pandas – Jessica Dec 21 '18 at 16:06
  • Can you post an example of what your special characters look like and how the code breaks when you run h2o-3? See this question for how someone made a reproducible example: https://stackoverflow.com/questions/53863717/chinese-text-for-h2o-dataframe-in-python. Thanks! – Lauren Jan 03 '19 at 21:25
  • @Lauren Thanks! The code doesn't break. The only problem is that Scandinavian characters are displayed incorrectly, as described above, and the problem persists when I write my data to a .csv on disk. – rize Jan 04 '19 at 13:09
  • I totally edited the question as it seems the problem is specifically with H2O.ai loading utf-8 encoded text. – rize Jan 04 '19 at 13:26
  • @rize thanks! could you post a sentence with Scandinavian so i can save it as a file and try to reproduce the issue? – Lauren Jan 04 '19 at 14:52
  • @Lauren Yeah, sure: ”Tässä vähän tekstiä åäö.” This sentence contains now all the Scandinavians that Swedish and Finnish have and those are all I need. – rize Jan 05 '19 at 17:12
  • @rize I tried to test out your issue and since it involves some code I posted it as an answer, does the example code below work for you? Or are you still seeing parsing issue when you run the test code I posted. Thanks! – Lauren Jan 09 '19 at 23:51

1 Answer

I ran a quick test using the characters you provided and was able to get everything to display correctly on H2O-3 version 3.20.0.8 with Python 3.5, so hopefully newer versions also work.

In [7]: dd = ["Tässä vähän tekstiä åäö"]

In [8]: h2o.H2OFrame(dd)
Parse progress: |█████████████████████████████████████████████████████████████████████████████| 100%
Out[8]:
C1
-----------------------
Tässä vähän tekstiä åäö

[1 row x 1 column]

I also created a csv with the string as the first cell and it seemed to display correctly.

In [12]: hhf = h2o.import_file('Scandinavians.csv', header=-1)
Parse progress: |████████████████████████████████████████████████████████████████████████████| 100%

In [13]: hhf
Out[13]:
C1      C2     C3       C4
------  -----  -------  ----
Tässä  vähän  tekstiä  åäö

[1 row x 4 columns]
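
If the characters still come out wrong on your side, one thing worth ruling out is that the file on disk is not actually UTF-8, or that whatever displays it is decoding it with a different encoding. Below is a minimal sketch using only the Python standard library. The helper names (`write_utf8_csv`, `is_valid_utf8`, `read_first_row`) are my own and not part of H2O, and the temporary path is just for illustration. It writes the test sentence as UTF-8, checks that the raw bytes decode cleanly, and shows what the same bytes look like when they are read as Latin-1 by mistake.

```python
import csv
import os
import tempfile

SENTENCE = "Tässä vähän tekstiä åäö"


def write_utf8_csv(path, words):
    """Write one row of words to `path`, explicitly encoded as UTF-8."""
    with open(path, "w", encoding="utf-8", newline="") as f:
        csv.writer(f).writerow(words)


def is_valid_utf8(path):
    """Return True if the raw bytes of the file decode cleanly as UTF-8."""
    with open(path, "rb") as f:
        raw = f.read()
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def read_first_row(path, encoding):
    """Read the first CSV row, decoding the file with the given encoding."""
    with open(path, "r", encoding=encoding, newline="") as f:
        return next(csv.reader(f))


# Hypothetical location; any writable path works.
path = os.path.join(tempfile.mkdtemp(), "Scandinavians.csv")
write_utf8_csv(path, SENTENCE.split())

utf8_row = read_first_row(path, "utf-8")      # decoded correctly
latin1_row = read_first_row(path, "latin-1")  # decoded with the wrong encoding

print(is_valid_utf8(path))
print(utf8_row)
print(latin1_row)
```

If `is_valid_utf8` returns False for your own file, the file was probably saved in another encoding (Latin-1 or Windows-1252, for example) and would need to be re-saved as UTF-8 before importing. If it returns True but the text looks like the third printed row, the bytes are fine and the problem is in how they are being decoded or displayed.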

(If these code snippets don't help, I can try to update my response.)

Lauren