I have a file (about 1 GB) containing a JSON array with a large number of items. These items should be read in by a .NET client (using Microsoft.Azure.Cosmos SDK v3.16.0) and created in a Cosmos DB collection using bulk execution.

Up to now, I have used CreateItemAsync to create the items, but that requires first deserializing the file into a list of objects. Would it be faster to use the CreateItemStreamAsync method instead? If so, how do I do that with a stream that contains an array of items? The following fails with status code RequestTimeout, probably because the method expects the stream to contain just a single item:

await using FileStream fs = File.OpenRead(path);
ResponseMessage response = await container.CreateItemStreamAsync(fs, partitionKey);
response.EnsureSuccessStatusCode(); // fails

I suppose I have to create an individual stream for each item from the single file stream, but how without deserializing the JSON array?
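
For illustration, here is a minimal sketch of one way to get a separate stream per array element. It assumes .NET 6 or later and System.Text.Json; the names JsonArraySplitter and SplitAsync are hypothetical and not part of the Cosmos SDK. It still tokenizes the JSON, so it does not avoid parsing entirely, but it reads the array incrementally and never maps the items to typed objects:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text.Json;

public static class JsonArraySplitter
{
    // Hypothetical helper: lazily yields one MemoryStream per element
    // of a top-level JSON array read from a UTF-8 stream.
    public static async IAsyncEnumerable<MemoryStream> SplitAsync(Stream utf8JsonArray)
    {
        // DeserializeAsyncEnumerable reads the root-level array incrementally
        // rather than buffering the whole file; each element is a JsonElement.
        await foreach (JsonElement item in
            JsonSerializer.DeserializeAsyncEnumerable<JsonElement>(utf8JsonArray))
        {
            // Re-serialize the single element to UTF-8 bytes and wrap them
            // in a stream positioned at the start.
            byte[] bytes = JsonSerializer.SerializeToUtf8Bytes(item);
            yield return new MemoryStream(bytes);
        }
    }
}
```

Each yielded stream could then be passed to CreateItemStreamAsync together with that item's partition key, which would have to be read from the item itself (for example from the JsonElement before it is re-serialized). Whether this is noticeably faster than CreateItemAsync is an open question if the database throughput is the bottleneck.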

Mo B.
  • How long does it take to deserialize the file into memory? If you know for sure the file is already valid JSON with no transformation needed, you could try the Data Migration Tool or Azure Data Factory to do the import. – Noah Stahl Jan 29 '21 at 21:42
  • @NoahStahl I tried the Data Migration Tool, including tweaking various parameters such as the number of parallel tasks. Its performance is not great because it is based on SDK v2, which doesn't handle 429 backoffs very well with concurrent tasks. My own tool based on SDK v3 is significantly faster even with the (unnecessary) JSON deserialization. I just thought I could get even more performance with the stream API. But perhaps there won't be any significant performance increase since the bottleneck is still the database (at least at 20 kRU/s). – Mo B. Jan 29 '21 at 22:37
  • Yeah, the migration tool isn't the fastest. I see why the intermediate parsing seems needless, though I can't imagine that it would take very much time compared to the Cosmos requests. That's why I was curious about the observed timings. – Noah Stahl Jan 29 '21 at 22:42
  • @NoahStahl You are right. The 30 s or so for (de)serialization is insignificant compared to the 20 min to upload the data and store it in the DB. – Mo B. Jan 30 '21 at 11:18

0 Answers