I am not experienced with C#. I need to read a parquet file and then use LINQ to query the data read from the file. I don't know if I need to deserialise.

The following is the data in the parquet file

[Image: screenshot of the data in the parquet file]

The data is being read into the 'records' variable. But when I use LINQ on it, I get the error, "Unable to cast object of type 'Parquet.Data.DataColumn' to type 'LinqAndParquet.DataFrame'." at the LINQ query.

public class Program
{
    public static DataColumn[] allData;
    public static DataColumn[] ReadParquetFile()
    {
        using (Stream fileStream = File.OpenRead(@"F:\AutomationRunStation\11_12.parquet"))
        {
            // open parquet file reader
            using (var parquetReader = new Parquet.ParquetReader(fileStream))
            {
                // get file schema (available straight after opening parquet reader)
                // however, get only data fields as only they contain data values
                DataField[] dataFields = parquetReader.Schema.GetDataFields();

                // enumerate through row groups in this file
                for (int i = 0; i < parquetReader.RowGroupCount; i++)
                {
                    // create row group reader
                    using (ParquetRowGroupReader groupReader = parquetReader.OpenRowGroupReader(i))
                    {
                        // read all columns inside each row group (you have the option to read
                        // only the required columns if you need to)
                        allData = dataFields.Select(groupReader.ReadColumn).ToArray();
                    }
                }

                return allData;
            }
        }
    }

    static void Main(string[] args)
    {
        var records = ReadParquetFile();
        
        var queryResult = from DataFrame data in records
                          where data.EventId == 280000001
                          select data.Loss;

        Console.WriteLine(queryResult);
        Console.ReadKey();
    }
}
Victor
  • Which line of code exactly throws the exception? – Kazys Jun 02 '21 at 05:29
  • Hi. The line of code that threw the exception is the LINQ query, beginning with "var queryResult = from DataFrame data in records". Thanks. – Victor Jun 02 '21 at 06:56

2 Answers

public static DataColumn[] ReadParquetFile()

This method returns an array of DataColumn (DataColumn[]). So in the following code:

    var records = ReadParquetFile();
    
    var queryResult = from DataFrame data in records
                      where data.EventId == 280000001
                      select data.Loss;

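records is an array of DataColumn, but the LINQ query declares the range variable data as DataFrame. Writing "from DataFrame data in records" makes LINQ cast every element to DataFrame (it compiles to a call to records.Cast<DataFrame>()). A DataColumn is not a DataFrame, so the cast is invalid and fails at run time with the exception you see.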
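Removing the cast is not enough on its own, because each DataColumn holds one whole column of values rather than one row. One way to get objects with EventId and Loss properties is to pair up the values of the two columns by position. The following is only a minimal sketch under assumptions: LossRecord and ColumnPivot are hypothetical names, the element types long and double are guesses (the actual types depend on the file's schema), and the arrays are stand-in data. In the real program they would come from the Data property of the matching DataColumn objects, cast to the proper array type.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical row type; the property names mirror the query in the question.
public class LossRecord
{
    public long EventId { get; set; }
    public double Loss { get; set; }
}

public static class ColumnPivot
{
    // Pairs the i-th value of each column to build the i-th row.
    // Zip stops at the shorter array, so both columns should have equal length.
    public static List<LossRecord> ToRows(long[] eventIds, double[] losses)
    {
        return eventIds
            .Zip(losses, (id, loss) => new LossRecord { EventId = id, Loss = loss })
            .ToList();
    }

    public static void Main()
    {
        // Stand-in data (assumption); replace with the arrays read from the file.
        long[] eventIds = { 280000001, 280000002, 280000001 };
        double[] losses = { 10.5, 20.0, 3.25 };

        List<LossRecord> rows = ToRows(eventIds, losses);

        // The query from the question, now over row objects, with no cast needed.
        var queryResult = from row in rows
                          where row.EventId == 280000001
                          select row.Loss;

        // Enumerate the results; printing queryResult itself would only show its type name.
        foreach (double loss in queryResult)
        {
            Console.WriteLine(loss);
        }
    }
}
```

Note also that ReadParquetFile in the question overwrites allData on every pass through the row-group loop, so for a file with more than one row group only the last group's columns are returned.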

Kazys
  • Thanks. But when I replace 'DataFrame' with 'DataColumn[]', I get an error, so how do I query with LINQ using the object format (data.EventId etc.)? I am looking for a way to convert DataColumn[] into objects that LINQ can query, or for another method of running this query. Many regards. – Victor Jun 03 '21 at 04:38
  • Is your question how to query 'records' or how to convert DataColumn to DataFrame? – Kazys Jun 03 '21 at 06:00
  • Yes, querying 'records' is the problem. I believe I need to convert DataColumn to DataFrame to do this the best way. Thanks very much. – Victor Jun 03 '21 at 07:27

With Cinchoo ETL, an open source library, you can parse a parquet file and use LINQ to query its records.

using (var r = new ChoParquetReader("*** YOUR PARQUET FILE PATH ***"))
{
    foreach (var rec in r)
        Console.WriteLine(rec.Dump());
}

Disclaimer: I'm the author of this library.

Cinchoo