I have 100 Excel (*.xlsx) files stored in HDFS. The 100 *.xlsx files are organized into 10 directories, as shown below:
/user/cloudera/raw_data/dataPoint1/dataPoint.xlsx
/user/cloudera/raw_data/dataPoint2/dataPoint.xlsx
...
/user/cloudera/raw_data/dataPoint10/dataPoint.xlsx
Reading one of the *.xlsx files above with
rawData = sc.textFile("/user/cloudera/raw_data/dataPoint1/dataPoint.xlsx")
returned gibberish!
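As far as I understand, the gibberish appears because an *.xlsx file is a ZIP archive of XML parts rather than line-oriented text, so a text reader ends up splitting compressed bytes on newline characters. Below is a minimal sketch of that point using only the Python standard library. The helper names and the in-memory archive are hypothetical stand-ins for illustration, not one of my actual files:

```python
import io
import zipfile


def make_tiny_archive() -> bytes:
    """Build a hypothetical stand-in for an .xlsx file: a ZIP with one XML part."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("xl/worksheets/sheet1.xml",
                    "<worksheet><sheetData/></worksheet>")
    return buf.getvalue()


def looks_like_zip(data: bytes) -> bool:
    """A non-empty ZIP archive starts with the local file header signature."""
    return data[:4] == b"PK\x03\x04"


def list_parts(data: bytes) -> list:
    """Return the names of the parts stored inside the archive."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return zf.namelist()


raw = make_tiny_archive()
print(looks_like_zip(raw))               # binary container, not plain text
print(list_parts(raw))                   # the XML parts inside the archive
print(looks_like_zip(b"a,b\n1,2\n"))     # a CSV payload has no ZIP signature
```

So whatever reads the file has to treat it as a whole binary archive instead of a sequence of text lines.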
One obvious suggestion I received was to use the Gnumeric spreadsheet application's command-line utility called ssconvert:
$ ssconvert dataPoint.xlsx dataPoint.csv
and then put the result into HDFS, so that I can read the *.csv file directly. But that is not what I am trying to do, nor is it the requirement.
Solutions in Python (preferable) and Java would be appreciated. I am a rookie, so a detailed walkthrough would be really helpful.
Thanks in advance.