We are using parquet.net to write parquet files. I've set up a simple schema containing 3 columns, and 2 rows:

        // Set up the file structure
        var UserKey = new Parquet.Data.DataColumn(
            new DataField<Int32>("UserKey"),
            new Int32[] { 1234, 12345}
        );

        var AADID = new Parquet.Data.DataColumn(
            new DataField<string>("AADID"),
            new string[] { Guid.NewGuid().ToString(), Guid.NewGuid().ToString() }
        );

        var UserLocale = new Parquet.Data.DataColumn(
            new DataField<string>("UserLocale"),
            new string[] { "en-US", "en-US" }
        );

        var schema = new Schema(UserKey.Field, AADID.Field, UserLocale.Field
        );

When I use a FileStream to write to a local file, the file is created, and once the code finishes I can see both rows in it (the file is 1 KB afterwards):

            using (Stream fileStream = System.IO.File.OpenWrite("C:\\Temp\\Users.parquet")) {
                using (var parquetWriter = new ParquetWriter(schema, fileStream)) {
                    // Create a new row group in the file
                    using (ParquetRowGroupWriter groupWriter = parquetWriter.CreateRowGroup()) {
                        groupWriter.WriteColumn(UserKey);
                        groupWriter.WriteColumn(AADID);
                        groupWriter.WriteColumn(UserLocale);
                    }
                }
            }

Yet when I use the same approach to write to our blob storage, I only get an empty file; the data is missing:

// Open reference to Blob Container
CloudAppendBlob blob = OpenBlobFile(blobEndPoint, fileName);

using (MemoryStream stream = new MemoryStream()) {

    blob.CreateOrReplaceAsync();

    using (var parquetWriter = new ParquetWriter(schema, stream)) {
        // Create a new row group in the file
        using (ParquetRowGroupWriter groupWriter = parquetWriter.CreateRowGroup()) {
            groupWriter.WriteColumn(UserKey);
            groupWriter.WriteColumn(AADID);
            groupWriter.WriteColumn(UserLocale);
        }
    }
    // Set stream position to 0
    stream.Position = 0;
    blob.AppendBlockAsync(stream);
    return true;
}

...

public static CloudAppendBlob OpenBlobFile (string blobEndPoint, string fileName) {
    CloudBlobContainer container = new CloudBlobContainer(new System.Uri(blobEndPoint));
    CloudAppendBlob blob = container.GetAppendBlobReference(fileName);

    return blob;
}

Reading the documentation, I would think my use of blob.AppendBlockAsync should do the trick, yet I end up with an empty file. Does anyone have suggestions as to why this happens, and how I can resolve it so that the data actually ends up in the file?

Thanks in advance.

SchmitzIT
  • I might have just accidentally figured it out. While debugging, I noticed that I did end up with a file with contents in my blob storage. So I concluded that the code actually works, and that there must be some sort of timing issue. I turned the method containing the above code into an async method and added `await` before `blob.AppendBlockAsync(stream);`, and that seems to have solved the problem of ending up with empty files. – SchmitzIT Aug 07 '20 at 09:31
  • Calling `async` methods without `await` returns control to the next line without waiting for that async work to finish. Since your next line is `return`, the program would likely start the write and then be killed because execution ended (e.g. the HTTP request completed, or a normal CLI exit). – zaitsman Aug 09 '20 at 23:12
  • @zaitsman - Thanks, that's pretty much what I figured. It never occurred to me until I got sidetracked that the process was actually writing the data; it just never had time to finish, so it looked as if all it did was create the file and then call it a day. – SchmitzIT Aug 10 '20 at 06:37
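
The timing issue described in these comments can be reproduced without Azure or Parquet. The following is a minimal sketch, not the actual storage SDK: `SlowAppendAsync` is a hypothetical stand-in for `blob.AppendBlockAsync`, and the 200 ms delay is an assumed substitute for network latency. It only illustrates that a caller which does not await sees the destination still empty.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

public static class AwaitDemo
{
    // Hypothetical stand-in for blob.AppendBlockAsync: after a simulated
    // network delay, copies the source stream into the destination.
    public static async Task<long> SlowAppendAsync(Stream source, MemoryStream destination)
    {
        await Task.Delay(200);
        await source.CopyToAsync(destination);
        return destination.Length;
    }

    // Mirrors the question: the task is started but never awaited, so the
    // method returns before any data has been copied.
    public static long FireAndForget(byte[] payload, MemoryStream destination)
    {
        var source = new MemoryStream(payload);
        _ = SlowAppendAsync(source, destination); // started, not awaited
        return destination.Length;                // read immediately
    }

    // Mirrors the fix: the caller waits until the copy has completed.
    public static async Task<long> Awaited(byte[] payload, MemoryStream destination)
    {
        using (var source = new MemoryStream(payload))
        {
            await SlowAppendAsync(source, destination);
        }
        return destination.Length;
    }

    public static async Task Main()
    {
        byte[] payload = { 1, 2, 3, 4 };

        var first = new MemoryStream();
        Console.WriteLine(FireAndForget(payload, first));  // prints 0

        var second = new MemoryStream();
        Console.WriteLine(await Awaited(payload, second)); // prints 4
    }
}
```

In the question's code the un-awaited call is additionally followed by `return` inside a `using` block, so the `MemoryStream` may be disposed while the upload is still pending.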

1 Answer

The explanation for the file ending up empty is the line:

blob.AppendBlockAsync(stream);

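Note that the method called has the Async suffix. It returns a task representing work that may not have finished yet, and the caller is expected to await it; without the await, my code returned before the upload had completed. I turned the function containing the code into an async one, and Visual Studio suggested the following change to the line: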

_ = await blob.AppendBlockAsync(stream);

I'm not entirely certain what _ represents, and hovering my mouse over it doesn't reveal much beyond its type being long, but the code now works as intended.

SchmitzIT
  • Old, but: normally you would write `var myValue = await something();`, which assigns the result of `something()` to `myValue`. If you don't want to use the return value (`myValue`), you can assign it to nothing, and that is what the underscore does; it basically throws away the result. – RoelA Aug 23 '21 at 13:14
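
To make the comment above concrete, here is a small sketch of the C# discard (`_`). `AppendAsync` is a hypothetical method invented for illustration; it returns a `long` only because the answer reports that type for the awaited result. The point is that `_ = await ...` still waits for the task to finish and merely ignores the value.

```csharp
using System;
using System.Threading.Tasks;

public static class DiscardDemo
{
    // Hypothetical async method that produces a long result.
    public static async Task<long> AppendAsync()
    {
        await Task.Yield();
        return 42L;
    }

    // Keeps the awaited result in a named variable.
    public static async Task<long> KeepResultAsync()
    {
        long offset = await AppendAsync();
        return offset;
    }

    // Awaits the same call but discards the result with `_`.
    public static async Task IgnoreResultAsync()
    {
        _ = await AppendAsync();
    }

    public static async Task Main()
    {
        Console.WriteLine(await KeepResultAsync()); // prints 42
        await IgnoreResultAsync();

        // The discard also works for out parameters that are not needed.
        bool ok = int.TryParse("123", out _);
        Console.WriteLine(ok); // prints True
    }
}
```

Writing `await AppendAsync();` without the `_ =` behaves the same way; the discard only makes it explicit that the result is ignored on purpose.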