Parquet file not writing correctly

Question

I've recently upgraded Parquet.Net to version 4.6.0, which necessitated changing a lot of the method calls to their asynchronous versions.

This code creates a file without any thrown errors:

        string file = @"c:\temp\test.parquet";
        var dataFields = new DataField[2];
        dataFields[0] = new DataField("dtUTC", DataType.Int64);
        dataFields[1] = new DataField("val", DataType.Double);
        var schema = new ParquetSchema(dataFields);
        var dtUTC = new long[3];
        var val = new double[3];

        using (Stream fileStream = System.IO.File.OpenWrite(file))
        {
            using (var parquetWriterTask = ParquetWriter.CreateAsync(schema, fileStream))
            {
                parquetWriterTask.Wait();
                var parquetWriter = parquetWriterTask.Result;
                using (ParquetRowGroupWriter groupWriter = parquetWriter.CreateRowGroup())
                {
                    var col0 = Array.CreateInstance(typeof(long), dtUTC.Length);
                    for(int i=0;i< dtUTC.Length;i++) col0.SetValue(dtUTC[i], i);
                    var writeT = groupWriter.WriteColumnAsync(new Parquet.Data.DataColumn(dataFields[0], col0));
                    writeT.Wait();
                    var col1 = Array.CreateInstance(typeof(double), val.Length);
                    for (int i = 0; i < val.Length; i++) col1.SetValue(val[i], i);
                    var writeT2 = groupWriter.WriteColumnAsync(new Parquet.Data.DataColumn(dataFields[1], col1));
                    writeT2.Wait();
                }
            }
        }

However, when I try to read that file, either using a 3rd party application, or my own code shown below, I get errors along the lines of

Destination is too short

or

Unable to read beyond the end of the stream

(Depending on which application I am using to read the file)

The code I use to read is:

        using (Stream file = System.IO.File.OpenRead(fileName))
        {
            using (var readerTask = ParquetReader.CreateAsync(file))
            {
                readerTask.Wait();
                var reader = readerTask.Result;
                if (reader.RowGroupCount != 1) throw new Exception("reader.RowGroupCount = " + reader.RowGroupCount);
                var dataFields = reader.Schema.GetDataFields();
                using (var gr = reader.OpenRowGroupReader(0))
                {
                    Parquet.Data.DataColumn[] columns = parquet.getCols(gr, dataFields);// dataFields.Select(gr.ReadColumn).ToArray();
                    output($"\tGroup 0: columns.Length={columns.Length}");
                }
            }
        }

What am I doing wrong when writing this parquet file?

Can you try await instead of using .Wait() ? I meant await groupWriter.WriteColumnAsync(.........); — kaarthick raman, Apr 05 '23 at 07:48
If for whatever reason you can't use `await` (i.e. make your code `async`, which is by far the best approach), use `.GetAwaiter().GetResult()` instead and not `.Wait()`. Your current `using` statements are disposing the `Task`s returned by the methods instead of the actual objects that are the results of those tasks, which means effectively nothing is getting flushed as part of the dispose, which explains the incomplete files. — Jeroen Mostert, Apr 05 '23 at 08:58
@JeroenMostert thank you very much, this is the problem. Fixed with that change. — mcmillab, Apr 06 '23 at 01:15

Parquet file not writing correctly

0 Answers0