
I am trying to parse multiple Word (.doc) files in Apache Spark. When I run the script via spark-submit (say, a word count as an example), it fails with the following error: `UnicodeEncodeError: 'ascii' codec can't encode character u'\ufffd': ordinal not in range(128)`.

Can we parse Microsoft Word documents in Spark? If not, is there any workaround for this?

Thanks.

ADev
    This is not related to Spark, you should check: http://stackoverflow.com/questions/9942594/unicodeencodeerror-ascii-codec-cant-encode-character-u-xa0-in-position-20. – Vince.Bdn Jan 19 '16 at 07:57
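As the linked question explains, the error is not Spark-specific: in Python 2, printing or `str()`-ing a unicode string implicitly encodes it with the ASCII codec, which raises `UnicodeEncodeError` for characters like `u'\ufffd'`. A minimal sketch of the fix (the sample text is hypothetical):

```python
# In Python 2, `print text` or `str(text)` on a unicode string implicitly
# encodes with the ASCII codec and fails on non-ASCII characters such as
# u'\ufffd'. Encoding explicitly as UTF-8 avoids the error.
text = u"r\u00e9sum\u00e9 \ufffd"        # hypothetical text extracted from a document
encoded = text.encode("utf-8")           # bytes, safe to print or write to a file
assert encoded.decode("utf-8") == text   # round-trips back to the same unicode
```

Any place the script emits extracted text (printing, writing to HDFS, etc.) should encode explicitly rather than rely on the default codec.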

1 Answer


Besides what @Vince suggested: as a general rule, Spark needs something to parse binary documents like these into text. You might look at Apache Tika (https://tika.apache.org/) as a library for parsing Word (or PDF, etc.) documents into text. You would call it from a transformation step in your program. I haven't tried this, but perhaps someone else on the Interwebs has, like this project: https://github.com/scotthaleen/spark-hdfs-tika.
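To make the idea concrete, here is a hedged, untested sketch of calling Tika's Python bindings (`pip install tika`) from a Spark transformation and then word-counting the extracted text. The file paths and app name are hypothetical, and the `tika` package must be available on every executor:

```python
# Sketch: extract text from Word documents with Apache Tika inside a Spark
# transformation, then run a word count. Submit the driver with spark-submit.

def extract_text(path):
    # Imported inside the function so the dependency resolves on each executor.
    from tika import parser
    parsed = parser.from_file(path)     # Tika auto-detects .doc/.docx/PDF, etc.
    return parsed.get("content") or ""  # "content" can be None for empty docs

def tokenize(text):
    # Minimal whitespace tokenizer for the word count.
    return text.lower().split()

def main():
    from operator import add
    from pyspark import SparkContext

    sc = SparkContext(appName="docx-wordcount")
    paths = sc.parallelize([
        "/data/docs/report1.docx",      # hypothetical input files
        "/data/docs/report2.docx",
    ])
    counts = (paths.map(extract_text)
                   .flatMap(tokenize)
                   .map(lambda word: (word, 1))
                   .reduceByKey(add))
    counts.saveAsTextFile("/data/out/wordcounts")
```

Note the Tika import happens inside `extract_text` so that each executor resolves it locally; with the plain-text output in hand, the encoding caveat from the comments still applies when you print or write results.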

Dean Wampler