
I have these huge Parquet files stored in a blob, each with more than 600k rows, and I'd like to retrieve only the first 100 so I can send them to my client app. This is the code I currently use:

private async Task<Table> getParquetAsTable(BlobClient blob) {
  var table = new Table();
  using (var stream = await blob.OpenReadAsync()) {
    using (var memory = new MemoryStream()) {
      // Downloads the entire blob into memory.
      await stream.CopyToAsync(memory);
      var parquetReader = new ParquetReader(memory);

      // Materializes all 600k+ rows before any can be taken.
      table = parquetReader.ReadAsTable();
    }
  }
  var first100 = table.Take(100); // only these rows are sent to the client
  return table;
}

However, this process is quite slow: `await stream.CopyToAsync(memory);` takes 20 seconds and `table = parquetReader.ReadAsTable();` takes 15 more, so in total I have to wait 35 seconds.

Is there a way to limit this stream and get the first 100 rows directly, without having to download all of the rows, materialize them with `ReadAsTable`, and only then take the first 100?

  • Read rows one by one up to 100. – jdweng Oct 04 '21 at 11:10
  • How can I achieve that? I googled reading Parquet files row by row in C# but couldn't find any examples. – anthino12 Oct 04 '21 at 11:25
  • The "[reading files](https://github.com/elastacloud/parquet-dotnet#reading-files)" example demonstrates opening a file as a `Stream` type then feeding that directly to a `ParquetReader` constructor. Therefore, you should be able to save 20 seconds by skipping the copy to a `MemoryStream`. – Daniel Dearlove Oct 04 '21 at 12:52

1 Answer


With Cinchoo ETL, an open-source library, you can stream the Parquet file as below (it uses Parquet.Net under the hood).

Install the NuGet package:

Install-Package ChoETL.Parquet

Sample code

using ChoETL;

using (var r = new ChoParquetReader(@"*** Your Parquet file ***")
    .ParquetOptions(o => o.TreatByteArrayAsString = true)
    )
{
    // Take(100) enumerates lazily, so only the first 100 rows are read.
    var dt = r.Take(100).AsDataTable();
}
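
Since the file lives in blob storage rather than on disk, the reader could presumably consume the blob stream directly. A minimal sketch, assuming `ChoParquetReader` has a `Stream`-accepting constructor overload (worth verifying against the ChoETL.Parquet version you install):

using ChoETL;

// Assumption: ChoParquetReader accepts a Stream (check your ChoETL.Parquet version).
// blob.OpenReadAsync() returns a seekable stream, which Parquet readers require.
using (var stream = await blob.OpenReadAsync())
using (var r = new ChoParquetReader(stream)
    .ParquetOptions(o => o.TreatByteArrayAsString = true))
{
    var dt = r.Take(100).AsDataTable();
}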

For more information, please visit the CodeProject article.
