
Currently I am using com.crealytics.spark.excel to read an Excel file, but with that library I can't write the Dataset back out to an Excel file.
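For context, the read path with com.crealytics.spark.excel typically looks like the sketch below. Treat the option names ("useHeader", "inferSchema") as assumptions tied to the 0.9.x-era releases of that library; they have changed in later versions.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CrealyticsRead {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("CrealyticsRead")
                .master("local[*]")
                .getOrCreate();

        // Read an Excel sheet; option names are from the 0.9.x-era API.
        Dataset<Row> df = spark.read()
                .format("com.crealytics.spark.excel")
                .option("useHeader", "true")   // treat the first row as column names
                .option("inferSchema", "true") // guess column types from the data
                .load("c:\\temp\\test1.xlsx");

        df.show();
    }
}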

This link says that, using the HadoopOffice library (org.zuinnote.spark.office.excel), we can both read and write Excel files.

Please help me write a Dataset object to an Excel file in Spark with Java.

Shashi Kumar
  • What's wrong with CSV? – mtoto Jun 28 '17 at 10:35
  • CSV doesn't handle some content well (e.g. if a single cell contains a comma or other special characters, it gets split into multiple cells). Hence I need to write to an Excel file. Are there any approaches to do the same using Apache Spark? – Shashi Kumar Jun 28 '17 at 10:52
  • Using existing tools, you could save the data to Hive, and then use Hue to download/generate an Excel file. You can also check what logic in Hue actually provides this Excel file. – Rick Moritz Jun 28 '17 at 11:23
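For what it's worth, the Hive detour from the last comment might look like the following minimal sketch. It assumes a Hive-enabled Spark build; the table name and toy data are placeholders.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class HiveExport {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("HiveExport")
                .enableHiveSupport() // requires Hive libraries on the classpath
                .getOrCreate();

        // Placeholder data standing in for the real Dataset.
        Dataset<Row> df = spark.sql("SELECT 1 AS id, 'a' AS name");

        // Persist as a Hive table; Hue can then download the result as Excel.
        df.write()
          .mode(SaveMode.Overwrite)
          .saveAsTable("mydb.excel_export"); // placeholder database.table name
    }
}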

1 Answer


You can use org.zuinnote.spark.office.excel for both reading and writing Excel files with a Dataset. Examples are given at https://github.com/ZuInnoTe/spark-hadoopoffice-ds/. However, there is one issue if you read an Excel file into a Dataset and try to write it to another Excel file. Please see the issue and a Scala workaround at https://github.com/ZuInnoTe/hadoopoffice/issues/12.

I have written a sample program in Java using org.zuinnote.spark.office.excel and the workaround given at that link. Please see if this helps you.

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class SparkExcel {
    public static void main(String[] args) {
        //spark session
        SparkSession spark = SparkSession
                .builder()
                .appName("SparkExcel")
                .master("local[*]")
                .getOrCreate();

        //Read
        Dataset<Row> df = spark
                .read()
                .format("org.zuinnote.spark.office.excel")
                .option("read.locale.bcp47", "de")
                .load("c:\\temp\\test1.xlsx");

        //Print
        df.show();
        df.printSchema();

        // Flatten the nested structure produced by the data source: each
        // Dataset row holds an array of cell structs, and each struct carries
        // five string fields that are copied into a String[] so the writer
        // can consume them (assumes all five fields are non-null).
        FlatMapFunction<Row, String[]> flatMapFunc = new FlatMapFunction<Row, String[]>() {
            @Override
            public Iterator<String[]> call(Row row) throws Exception {
                ArrayList<String[]> rowList = new ArrayList<String[]>();
                List<Row> spreadSheetRows = row.getList(0);
                for (Row srow : spreadSheetRows) {
                    ArrayList<String> arr = new ArrayList<String>();
                    arr.add(srow.getString(0));
                    arr.add(srow.getString(1));
                    arr.add(srow.getString(2));
                    arr.add(srow.getString(3));
                    arr.add(srow.getString(4));
                    rowList.add(arr.toArray(new String[] {}));
                }
                return rowList.iterator();
            }
        };

        //Apply flatMap function
        Dataset<String[]> df2 = df.flatMap(flatMapFunc, spark.implicits().newStringArrayEncoder());

        //Write
        df2.write()
           .mode(SaveMode.Overwrite)
           .format("org.zuinnote.spark.office.excel")
           .option("write.locale.bcp47", "de")
           .save("c:\\temp\\test2.xlsx");

    }
}

I have tested this code with Java 8 and Spark 2.1.0. I am using Maven and added the dependency for org.zuinnote.spark.office.excel from https://mvnrepository.com/artifact/com.github.zuinnote/spark-hadoopoffice-ds_2.11/1.0.3
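For reference, the coordinates in that link translate to the following pom.xml entry:

<dependency>
    <groupId>com.github.zuinnote</groupId>
    <artifactId>spark-hadoopoffice-ds_2.11</artifactId>
    <version>1.0.3</version>
</dependency>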

abaghel
  • We are not able to add header information to the Excel file we are writing. Are there any options to write headers to the output Excel file? Please help me do the same. – Shashi Kumar Jun 30 '17 at 10:25
  • How are you creating the Dataset – by reading an existing file, or as a new Dataset using a schema? If reading an existing file, then the first row could be the header. Can you please post a separate question with details and sample code? – abaghel Jun 30 '17 at 12:53
  • I just want to notify you that the current HadoopOffice library 1.0.4 supports reading/writing of headers, as well as reading a file into a dataframe with simple datatypes (cf. https://github.com/ZuInnoTe/spark-hadoopoffice-ds). Other notable things in 1.0.4 are support for templates when writing (to include diagrams etc.) and reading/writing in low-footprint mode (with a low memory/CPU footprint) based on the Event and Streaming APIs of Apache POI. – Jörn Franke Aug 01 '17 at 17:30