
I have the below CSV content

123,Out,true,2014-09-30,東芝ライフスタイル 株式会社,null,Web,1234,false,2014-09-21T22:48:28.000+0000

I loaded the CSV using spark-csv

> df <- read.df(sqlContext, "japanese_char_file.csv", source = "com.databricks.spark.csv", inferSchema = "true")
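The escaped bytes in the error (\xa4\xbf and friends) look like a non-UTF-8 Japanese encoding such as EUC-JP rather than UTF-8, so spark-csv's charset option may be relevant. A sketch of re-reading with an explicit charset (EUC-JP is only a guess at the file's real encoding):

# Re-read with an explicit charset; spark-csv defaults to UTF-8.
# "EUC-JP" here is an assumption about the file's actual encoding.
df <- read.df(sqlContext, "japanese_char_file.csv",
              source = "com.databricks.spark.csv",
              inferSchema = "true",
              charset = "EUC-JP")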

I am trying to convert this SparkR dataframe into an R data.frame using

temp2bd <- SparkR::collect(df)

It gives me the below error

Error in rawToChar(string) : embedded nul in string: 'q\x9d\xe9\xa4չ\xbf\xa4\xeb\0*\017\032>'

Below is the showDF output

+---+---+----+----------+--------------+----+---+----+-----+--------------------+
| C0| C1|  C2|        C3|            C4|  C5| C6|  C7|   C8|                  C9|
+---+---+----+----------+--------------+----+---+----+-----+--------------------+
|123|Out|true|2014-09-30|q\x9d\xe9\xa4չ\xbf\xa4\xeb\0*\017\032>|null|Web|1234|false|2014-09-21T22:48:...|
+---+---+----+----------+--------------+----+---+----+-----+--------------------+

The Japanese characters are converted into q\x9d\xe9\xa4չ\xbf\xa4\xeb\0*\017\032>, and the embedded \0 byte in that mangled string seems to be what is causing this issue.
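That \0 is exactly what rawToChar chokes on; the same error can be reproduced in plain R, independent of Spark:

# R refuses to build a character string containing a nul (0x00) byte,
# which is what collect() hits via rawToChar().
rawToChar(as.raw(c(0x71, 0x00, 0x2a)))
# Error in rawToChar(string) : embedded nul in string: 'q\0*'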

I came across 'Embedded nul in string' error when importing csv with fread, but the suggestions there didn't work for me.

Do I need to change the way I am reading the CSV content? Or is there a way to run something like sed on a Spark dataframe? Or is this an issue in SparkR?
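On the sed idea: SparkR does expose regexp_replace, so in principle the nul bytes could be stripped on the Spark side before collecting. A sketch, assuming C4 is the only affected column:

# Strip nul bytes before collect(). The pattern "\\x00" is evaluated by
# Java's regex engine on the JVM side, so no actual nul byte ever has to
# be embedded in an R string (which R forbids).
df$C4 <- regexp_replace(df$C4, "\\x00", "")
temp2bd <- SparkR::collect(df)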

I am using Spark 1.5.2 in standalone mode.

  • Both problems you experience (this one, and slow reads) have been resolved in 1.6.0, and the questions are duplicates. – zero323 Mar 18 '16 at 13:32
  • You can try to cherry-pick patches from the repo; at least the one for reading should work just fine in 1.5, but it is a better idea to update. – zero323 Mar 18 '16 at 13:40
  • Yeah, I missed the duplicate for this one. But the slow read is not a duplicate; I've updated the question. – sag Mar 18 '16 at 13:45
  • @zero323 - I will use Spark 1.6 to solve this one. Is the slow processing fixed in Spark 1.6 as well? – sag Mar 18 '16 at 13:46
  • Should be. You'll find a link to my PR in the answer. – zero323 Mar 18 '16 at 13:48
  • @zero323 - This issue has been solved in Spark 1.6. But the other question http://stackoverflow.com/questions/36085619/any-operations-on-sparkr-dataframe-created-using-r-data-frame-is-very-slow is still valid; it is slow in Spark 1.6.1 as well. A simple collect(df) took more than 30 secs. I've created a JIRA ticket https://issues.apache.org/jira/browse/SPARK-14037 for it. – sag Mar 22 '16 at 05:26

0 Answers