
I have the below CSV content

123,Out,true,2014-09-30,東芝ライフスタイル 株式会社,null,Web,1234,false,2014-09-21T22:48:28.000+0000

I loaded the CSV using spark-csv

> df <- read.df(sqlContext, "japanese_char_file.csv", source = "com.databricks.spark.csv", inferSchema = "true")
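The escaped bytes in the error (\xa4\xbf and friends) look like a non-UTF-8 Japanese encoding such as EUC-JP rather than UTF-8, so spark-csv's charset option may be relevant. A sketch of re-reading with an explicit charset (EUC-JP is only a guess at the file's real encoding):

# Re-read with an explicit charset; spark-csv defaults to UTF-8.
# "EUC-JP" here is an assumption about the file's actual encoding.
df <- read.df(sqlContext, "japanese_char_file.csv",
              source = "com.databricks.spark.csv",
              inferSchema = "true",
              charset = "EUC-JP")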

I am trying to convert this SparkR dataframe into an R data.frame using

temp2bd <- SparkR::collect(df)

It gives me the below error

Error in rawToChar(string) : embedded nul in string: 'q\x9d\xe9\xa4չ\xbf\xa4\xeb\0*\017\032>'

Below is the showDF output

+---+---+----+----------+--------------+----+---+----+-----+--------------------+
| C0| C1|  C2|        C3|            C4|  C5| C6|  C7|   C8|                  C9|
+---+---+----+----------+--------------+----+---+----+-----+--------------------+
|123|Out|true|2014-09-30|q\x9d\xe9\xa4չ\xbf\xa4\xeb\0*\017\032>|null|Web|1234|false|2014-09-21T22:48:...|
+---+---+----+----------+--------------+----+---+----+-----+--------------------+

The Japanese characters are converted into q\x9d\xe9\xa4չ\xbf\xa4\xeb\0*\017\032>, and the embedded \0 byte in that mangled string seems to be what is causing this issue.
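That \0 is exactly what rawToChar chokes on; the same error can be reproduced in plain R, independent of Spark:

# R refuses to build a character string containing a nul (0x00) byte,
# which is what collect() hits via rawToChar().
rawToChar(as.raw(c(0x71, 0x00, 0x2a)))
# Error in rawToChar(string) : embedded nul in string: 'q\0*'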

I came across 'Embedded nul in string' error when importing csv with fread, but the suggestions there didn't work for me.

Do I need to change the way I am reading the CSV content? Or is there a way to run something like sed on a Spark dataframe? Or is this an issue in SparkR?
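On the sed idea: SparkR does expose regexp_replace, so in principle the nul bytes could be stripped on the Spark side before collecting. A sketch, assuming C4 is the only affected column:

# Strip nul bytes before collect(). The pattern "\\x00" is evaluated by
# Java's regex engine on the JVM side, so no actual nul byte ever has to
# be embedded in an R string (which R forbids).
df$C4 <- regexp_replace(df$C4, "\\x00", "")
temp2bd <- SparkR::collect(df)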

I am using Spark 1.5.2 in standalone mode.

  • Both problems you experience (this one, and slow reads) have been resolved in 1.6.0, and the questions are duplicates. – zero323 Mar 18 '16 at 13:32
  • You can try to cherry-pick patches from the repo; at least the one for reading should work just fine in 1.5, but it is a better idea to update. – zero323 Mar 18 '16 at 13:40
  • Yeah, I missed the duplicate for this one. But the slow read is not a duplicate; I've updated the question. – sag Mar 18 '16 at 13:45
  • @zero323 - I will use Spark 1.6 to solve this one. Is the slow processing fixed in Spark 1.6 as well? – sag Mar 18 '16 at 13:46
  • Should be. You'll find a link to my PR in the answer. – zero323 Mar 18 '16 at 13:48
  • @zero323 - This issue has been solved in Spark 1.6. But the other question http://stackoverflow.com/questions/36085619/any-operations-on-sparkr-dataframe-created-using-r-data-frame-is-very-slow is still valid; it is slow in Spark 1.6.1 as well. A simple collect(df) took more than 30 secs. I've created a JIRA ticket https://issues.apache.org/jira/browse/SPARK-14037 for it. – sag Mar 22 '16 at 05:26

0 Answers