This project has a C# API in which I need to build a simple program that reads a Parquet file and returns its contents as JSON. I normally use Python, where reading a Parquet file takes one line, but here I'm stuck with C# (and I'm a beginner). Below is a snippet from the overall program, which takes an S3 URL and downloads the Parquet file to a temp file; the code shown picks up from that point.

The code fails at this line (marked ///ERROR in the code below): DataColumn column = await groupReader.ReadColumnAsync(dataFields[c]);

I'm not sure what the error message means. Is the data too big? Does it refer to a specific column, a data type mismatch, or even a column name that is too long? I'm trying to work out what the error is, why it occurs, and how to deal with it. Reading the same Parquet file in Python (pd.read_parquet(filename)) shows that all columns are float64; there are 90k rows and 30 columns.

System.ArgumentException
  HResult=0x80070057
  Message=Destination is too short. (Parameter 'destination')
  Source=System.Private.CoreLib
  StackTrace:
   at System.ThrowHelper.ThrowArgumentException_DestinationTooShort()
   at Parquet.Encodings.ParquetPlainEncoder.Decode(Span`1 source, Span`1 data)
   at Parquet.Encodings.ParquetPlainEncoder.Decode(Array data, Int32 offset, Int32 count, SchemaElement tse, Span`1 source, Int32& elementsRead)
   at Parquet.File.DataColumnReader.ReadColumn(Span`1 src, Encoding encoding, Int64 totalValuesInChunk, Int32 totalValuesInPage, PackedColumn pc)
   at Parquet.File.DataColumnReader.<ReadDataPageV1Async>d__13.MoveNext()
   at Parquet.File.DataColumnReader.<ReadAsync>d__8.MoveNext()
   at ConvertController.<ConvertToJSON>d__2.MoveNext() in C:\Users\myuser\Desktop\repos\frontend\project\Controllers\WebAPI_ParquetController.cs:line 78

  This exception was originally thrown at this call stack:
    [External Code]
    ConvertController.ConvertToJSON(string) in WebAPI_ParquetController.cs
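
For reference, the top frame of the trace (ThrowArgumentException_DestinationTooShort) is, as far as I can tell, the helper .NET uses when a span is copied into a destination span that is shorter than the source. The sketch below reproduces that situation outside of Parquet entirely; SpanCopyDemo and TryCopy are hypothetical names made up for illustration, and this only demonstrates where the message text can come from, not what Parquet.Net does internally. If that reading is right, the error would be about a buffer size mismatch during decoding rather than about column name length, but I can't confirm that from the trace alone.

```csharp
using System;

public static class SpanCopyDemo
{
    // Hypothetical helper: copies a source buffer into a destination buffer
    // of the given sizes and reports what happened.
    public static string TryCopy(int sourceLength, int destinationLength)
    {
        double[] source = new double[sourceLength];
        double[] destination = new double[destinationLength];
        try
        {
            // Span<T>.CopyTo throws ArgumentException when the destination
            // span is shorter than the source span.
            source.AsSpan().CopyTo(destination);
            return "copy succeeded";
        }
        catch (ArgumentException ex)
        {
            // Return the runtime's message so it can be compared with the trace.
            return ex.Message;
        }
    }

    public static void Main()
    {
        // Destination large enough: the copy goes through.
        Console.WriteLine(TryCopy(2, 3));
        // Destination too small: the copy is rejected with an ArgumentException.
        Console.WriteLine(TryCopy(3, 2));
    }
}
```

Running this on a recent .NET version and comparing the second line of output with the Message field in the trace above would show whether the two errors have the same origin.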

Code from the point where the file has been downloaded to a temporary file:

        // Open the parquet file stream
        using (Stream fileStream = System.IO.File.OpenRead(tempFilePath))
        {
            // Open parquet file reader
            using (ParquetReader parquetReader = await ParquetReader.CreateAsync(fileStream))
            {
                // Get file schema
                DataField[] dataFields = parquetReader.Schema.GetDataFields();

                var result = new List<Dictionary<string, object>>();

                // Enumerate through row groups in this file
                for (int i = 0; i < parquetReader.RowGroupCount; i++)
                {
                    // Create row group reader
                    using (ParquetRowGroupReader groupReader = parquetReader.OpenRowGroupReader(i))
                    {
                        var rowGroupResult = new Dictionary<string, object>();

                        // Read all columns inside each row group
                        for (int c = 0; c < dataFields.Length; c++)
                        {
                            DataColumn column = await groupReader.ReadColumnAsync(dataFields[c]); ///ERROR

                            // Copy the column values into an object array for serialization
                            var columnData = column.Data;
                            var decodedData = new object[columnData.Length];
                            for (int idx = 0; idx < columnData.Length; idx++)
                            {
                                decodedData[idx] = columnData.GetValue(idx);
                            }
                            string columnName = dataFields[c].Name;

                            rowGroupResult[columnName] = decodedData;
                        }

                        result.Add(rowGroupResult);
                    }
                }

                // Convert the result to JSON
                var jsonResult = JsonConvert.SerializeObject(result);

                return Ok(jsonResult);
            }
        }
    } 
pyeR_biz
  • You might have better luck opening an issue on the repo: https://github.com/aloneguid/parquet-dotnet/issues – Sal Aug 31 '23 at 05:06
