
I'm working with some Parquet files: I read a file, do some processing, and then write the original columns together with some new columns to a new Parquet file.

using Stream fileStream = File.OpenRead(sourceFile);
using var parquetReader = new ParquetReader(fileStream);
Schema schema = parquetReader.Schema;
DataField[] dataFields = schema.GetDataFields();
var firstColumns = new List<Parquet.Data.DataColumn>();

In the for loop that iterates over parquetReader.RowGroupCount (with index i) I have the following:

using ParquetRowGroupReader groupReader = parquetReader.OpenRowGroupReader(i);

Parquet.Data.DataColumn[] rowGroupColumns = dataFields.Select(groupReader.ReadColumn).ToArray();
firstColumns.Add(rowGroupColumns.First(x => x.Field.Name == Settings.FirstColumnName));

After reading and processing, I want to write the original data along with the new data to a new Parquet file. For testing purposes I tried to write just the first column by doing the following:

var FirstColumn = new Parquet.Data.DataColumn(
                new Parquet.Data.DataField<string>(Settings.FirstColumnName),
                firstColumns.Select(x=>x.Data).ToArray());
Schema newSchema = new Schema(FirstColumn.Field);

using Stream writeFileStream = File.Create(outputFile);
using var parquetWriter = new ParquetWriter(newSchema, writeFileStream);
using ParquetRowGroupWriter groupWriter = parquetWriter.CreateRowGroup();
groupWriter.WriteColumn(FirstColumn);

FirstColumn is basically a column that consists of integers. If I do:

Log.Info(firstColumns.Select(x=>x.Data).ToArray());

I get the following:

Array[] {Nullable`1[] {304058000, 2955550, 268882222, 2442509222}}

After running this I get the following error:

error: System.InvalidCastException: Unable to cast object of type 'System.Array[]' to type 'System.String[]'.

So what I think I need is something like this (just an assumption):

String[] {Nullable`1[] {"304058000", "2955550", "268882222", "2442509222"}}

I've tried converting to string by doing the following:

var firstColumn = new Parquet.Data.DataColumn(
                new Parquet.Data.DataField<string>(Settings.FirstColumnName),
                firstColumns.Select(x=>x.Data.ToString()).ToArray());

I've also tried using Array.ConvertAll, with no success. If I print it with:

Log.Info(firstColumns.Select(x=>x.Data.ToString()).ToArray());

I get the following, not the expected output written above:

String[] {System.Nullable`1[System.Int64][]}
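To illustrate what seems to be going on with the shapes here, below is a minimal sketch that does not use the Parquet library at all. It assumes each row group's .Data is a long?[] exposed as System.Array (which matches the GetType() output further down); the names rowGroupData, nested, flat and asStrings are hypothetical stand-ins. Select gives one element per row group (an array of arrays), and calling ToString() on an array only returns its type name, whereas SelectMany with a cast flattens the values.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for the per-row-group DataColumn.Data values:
// each one is really a long?[] but is typed as System.Array.
var rowGroupData = new List<Array>
{
    new long?[] { 304058000L, 2955550L },
    new long?[] { 268882222L, 2442509222L },
};

// One element per row group, i.e. an array of arrays (System.Array[]).
Array[] nested = rowGroupData.ToArray();

// ToString() on an array returns the type name, not the contents.
string typeName = rowGroupData[0].ToString();

// Flatten: cast each Array to its real element type, then concatenate.
long?[] flat = rowGroupData.SelectMany(x => (long?[])x).ToArray();

// Convert the individual values (not the array object) to strings.
string[] asStrings = flat.Select(v => v?.ToString()).ToArray();

Console.WriteLine(nested.GetType());
Console.WriteLine(typeName);
Console.WriteLine(string.Join(", ", asStrings));
```

If this reading is right, the cast error would come from passing the nested array where a flat array of the field's element type is expected.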

I might be doing this the hard way (and/or the wrong way). All I want to do is copy the original data into the new Parquet file along with the added columns. There is probably an easier way to reuse the original columns and their data directly.
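For reference, a rough sketch of what "reusing the original columns directly" might look like. It assumes the same synchronous Parquet.Net 3.x calls used above (ParquetReader, OpenRowGroupReader, ReadColumn, ParquetWriter, CreateRowGroup, WriteColumn); newer versions of the package may differ. CopyColumns is a hypothetical helper name, and the sketch only copies the existing columns: any new columns would also need their fields in the schema passed to the writer.

```csharp
using System.IO;
using Parquet;
using Parquet.Data;

// Hypothetical helper: copies every column of every row group unchanged.
static void CopyColumns(string sourceFile, string outputFile)
{
    using Stream input = File.OpenRead(sourceFile);
    using var reader = new ParquetReader(input);
    DataField[] fields = reader.Schema.GetDataFields();

    using Stream output = File.Create(outputFile);
    // Reuse the source schema so each field keeps its original type (e.g. Int64?).
    using var writer = new ParquetWriter(reader.Schema, output);

    for (int i = 0; i < reader.RowGroupCount; i++)
    {
        using ParquetRowGroupReader groupReader = reader.OpenRowGroupReader(i);
        using ParquetRowGroupWriter groupWriter = writer.CreateRowGroup();

        foreach (DataField field in fields)
        {
            // Pass the DataColumn through as read, without rebuilding it.
            groupWriter.WriteColumn(groupReader.ReadColumn(field));
        }
    }
}
```

The idea is that the DataColumn returned by ReadColumn already carries a matching Field and Data, so nothing has to be converted to string.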

For ref. I'm using the following package: https://github.com/aloneguid/parquet-dotnet

Edit: I've tried writing only the new columns to the file, and that worked without problems. It's just the original data from the source file I'm having issues with.

The type of .Data is:

Parquet.Data.DataColumn firstColumn = rowGroupColumns.First(c => c.Field.Name == Settings.FirstColumnName);  
Log.Info(firstColumn.Data.GetType());

outputs:

System.Nullable`1[System.Int64][]
Laende
  • C# is a language of types - what is the actual type of `.Data`? – NetMage Jan 11 '22 at 18:03
  • `Parquet.Data.DataColumn firstColumn = rowGroupColumns.First(c => c.Field.Name == Settings.FirstColumnName); Log.Info(firstColumn.Data.GetType()); ` outputs: `System.Nullable1[System.Int64][]`, @NetMage – Laende Jan 11 '22 at 18:13

0 Answers