I'm working on a C# API where I need to build a simple endpoint that reads a Parquet file and returns its contents as JSON. Normally I use Python, where reading a Parquet file is a one-liner, but here I'm stuck with C# (I'm a beginner). The snippet below is from a larger program that takes an S3 URL and downloads the Parquet file to a temp file; the code shown picks up from there.
The code is failing at this line -
DataColumn column = await groupReader.ReadColumnAsync(dataFields[c]); ///ERROR
I am not entirely sure what the error message means: is the data too big? Is it referring to a specific column, a data type mismatch, or even a column name being too long? I am trying to figure out what the error is, why it happens, and how to deal with it. Reading the same Parquet file in Python (pd.read_parquet(filename)) shows all columns are float64, with 90k rows and 30 columns.
System.ArgumentException
HResult=0x80070057
Message=Destination is too short. (Parameter 'destination')
Source=System.Private.CoreLib
StackTrace:
at System.ThrowHelper.ThrowArgumentException_DestinationTooShort()
at Parquet.Encodings.ParquetPlainEncoder.Decode(Span`1 source, Span`1 data)
at Parquet.Encodings.ParquetPlainEncoder.Decode(Array data, Int32 offset, Int32 count, SchemaElement tse, Span`1 source, Int32& elementsRead)
at Parquet.File.DataColumnReader.ReadColumn(Span`1 src, Encoding encoding, Int64 totalValuesInChunk, Int32 totalValuesInPage, PackedColumn pc)
at Parquet.File.DataColumnReader.<ReadDataPageV1Async>d__13.MoveNext()
at Parquet.File.DataColumnReader.<ReadAsync>d__8.MoveNext()
at ConvertController.<ConvertToJSON>d__2.MoveNext() in C:\Users\myuser\Desktop\repos\frontend\project\Controllers\WebAPI_ParquetController.cs:line 78
This exception was originally thrown at this call stack:
[External Code]
ConvertController.ConvertToJSON(string) in WebAPI_ParquetController.cs
Here is the code from the point where the file has been downloaded to a temporary file:
// Open the parquet file stream
using (Stream fileStream = System.IO.File.OpenRead(tempFilePath))
{
    // Open parquet file reader
    using (ParquetReader parquetReader = await ParquetReader.CreateAsync(fileStream))
    {
        // Get file schema
        DataField[] dataFields = parquetReader.Schema.GetDataFields();
        var result = new List<Dictionary<string, object>>();

        // Enumerate through row groups in this file
        for (int i = 0; i < parquetReader.RowGroupCount; i++)
        {
            // Create row group reader
            using (ParquetRowGroupReader groupReader = parquetReader.OpenRowGroupReader(i))
            {
                var rowGroupResult = new Dictionary<string, object>();

                // Read all columns inside each row group
                for (int c = 0; c < dataFields.Length; c++)
                {
                    DataColumn column = await groupReader.ReadColumnAsync(dataFields[c]); ///ERROR

                    // Copy the column values into an object array
                    var columnData = column.Data;
                    var decodedData = new object[columnData.Length];
                    for (int idx = 0; idx < columnData.Length; idx++)
                    {
                        decodedData[idx] = column.Data.GetValue(idx);
                    }

                    string columnName = dataFields[c].Name;
                    rowGroupResult[columnName] = decodedData;
                }
                result.Add(rowGroupResult);
            }
        }

        // Convert the result to JSON
        var jsonResult = JsonConvert.SerializeObject(result);
        return Ok(jsonResult);
    }
}
}