I have been looking at TPL Dataflow and tried to use it on my current problem. I'm just not sure how to do it (or whether I am barking up the wrong tree and there is something better suited).
I have a huge GeoJSON file with a lot of features that I would like to do some calculations on. The first part is pretty simple: each feature has a property called time, and I need to save the data into separate files by hour, i.e. a feature from 10:26 would go into a file called data10.geojson. The data is already sorted by time.
I would like to avoid loading the entire file into memory at once, so my idea was to load a single feature into memory, find out which hour it belongs to, and save it to the appropriate hour file. I wanted to use TPL Dataflow to avoid having to wait for one feature to finish before starting on the next (assuming I understood this correctly).
GeoJSON sample:
{
    "type": "FeatureCollection",
    "features": [
        {
            "Type": "Feature",
            "Geometry":
            {
                "Type": "Point",
                "Coordinates": [9, 55]
            },
            "Properties":
            {
                "time":"2019-08-08 10:39"
            }
        },
        Many more like this
    ]
}
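To make the classification step concrete, here is a minimal sketch of how a feature's time value could be mapped to its output file name. It assumes the time string always has the format yyyy-MM-dd HH:mm, as in the sample above; the names HourFiles and FileNameFor are made up for illustration.

```csharp
using System;
using System.Globalization;

public static class HourFiles
{
    // Hypothetical helper: maps a time string such as "2019-08-08 10:26"
    // to the hour file it belongs in ("data10.geojson").
    // Assumes the format is always "yyyy-MM-dd HH:mm".
    public static string FileNameFor(string time)
    {
        DateTime parsed = DateTime.ParseExact(
            time, "yyyy-MM-dd HH:mm", CultureInfo.InvariantCulture);
        return "data" + parsed.ToString("HH", CultureInfo.InvariantCulture) + ".geojson";
    }

    public static void Main()
    {
        Console.WriteLine(FileNameFor("2019-08-08 10:26")); // prints "data10.geojson"
    }
}
```

Because only the hour is used, features from the same hour on different days would map to the same file; that is fine for my data, which I assume covers a single day.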
After a lot of googling I found a way to read the JSON features directly from the stream, one by one (to avoid loading everything into memory):
using System;
using System.IO;
using System.Threading.Tasks.Dataflow;
using Newtonsoft.Json;

// Runs inside an async method; GeojsonFeature is a class (not shown) that matches the sample above.
ActionBlock<GeojsonFeature> writeOutFeature = new ActionBlock<GeojsonFeature>(
    feature => Console.WriteLine(JsonConvert.SerializeObject(feature)));
bool parsingArray = false;
using (FileStream fileStream = File.Open(inputPath, FileMode.Open))
using (StreamReader streamReader = new StreamReader(fileStream))
using (JsonReader jsonReader = new JsonTextReader(streamReader))
{
    var serializer = new JsonSerializer();
    while (jsonReader.Read())
    {
        // checking if the features array has been reached yet
        if (!parsingArray && jsonReader.TokenType == JsonToken.StartArray)
        {
            parsingArray = true;
            continue; // next token will be the start of a feature
        }
        else if (jsonReader.TokenType == JsonToken.EndArray)
        {
            parsingArray = false;
            continue;
        }
        // loading one feature; Deserialize consumes the whole feature object
        if (parsingArray)
        {
            GeojsonFeature feature = serializer.Deserialize<GeojsonFeature>(jsonReader);
            await writeOutFeature.SendAsync(feature);
        }
    }
    // signal that no more features are coming, then wait for the block to drain
    writeOutFeature.Complete();
    await writeOutFeature.Completion;
}
This merely writes out each feature in an ActionBlock, but I need to know how to save it to the file corresponding to the hour given in the feature's properties. (The features are ordered, so when the hour changes the old file should be closed and a new one should be opened and initialized.)
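The rollover I have in mind could be sketched roughly as follows. This is only a sketch under assumptions: HourlyGeojsonWriter and its members are hypothetical names, the features arrive as already-serialized JSON strings together with their parsed time, the data covers a single day (a repeated hour would overwrite the earlier file), and each output file is initialized as its own FeatureCollection. The ActionBlock keeps its default MaxDegreeOfParallelism of 1, so features are written in arrival order, and BoundedCapacity makes SendAsync wait when the writer falls behind, which keeps memory use bounded.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

// Hypothetical helper: appends features to data{HH}.geojson and rolls over
// to a new file whenever the hour changes. Relies on the input being sorted.
public sealed class HourlyGeojsonWriter : IDisposable
{
    private readonly string _outputDir;
    private StreamWriter _writer;      // null until the first feature arrives
    private int _currentHour = -1;
    private bool _firstInFile;

    public HourlyGeojsonWriter(string outputDir)
    {
        _outputDir = outputDir;
    }

    public void Write(DateTime time, string featureJson)
    {
        if (_writer == null || time.Hour != _currentHour)
        {
            CloseCurrent();            // finish the previous hour's file, if any
            _currentHour = time.Hour;
            string name = "data" + time.ToString("HH") + ".geojson";
            _writer = new StreamWriter(Path.Combine(_outputDir, name));
            _writer.Write("{\"type\":\"FeatureCollection\",\"features\":[");
            _firstInFile = true;
        }
        if (!_firstInFile) _writer.Write(",");
        _writer.Write(featureJson);
        _firstInFile = false;
    }

    // Writes the closing brackets and releases the file handle.
    private void CloseCurrent()
    {
        if (_writer == null) return;
        _writer.Write("]}");
        _writer.Dispose();
        _writer = null;
    }

    public void Dispose()
    {
        CloseCurrent();
    }
}

public static class Program
{
    public static async Task Main()
    {
        // Assumption: output goes to the current working directory.
        string outputDir = Directory.GetCurrentDirectory();
        using (var sink = new HourlyGeojsonWriter(outputDir))
        {
            var writeBlock = new ActionBlock<(DateTime Time, string Json)>(
                item => sink.Write(item.Time, item.Json),
                new ExecutionDataflowBlockOptions { BoundedCapacity = 1000 });

            // Stand-in for the streaming reader loop; real features would come from the JsonReader.
            await writeBlock.SendAsync((Time: new DateTime(2019, 8, 8, 10, 39, 0), Json: "{\"Type\":\"Feature\"}"));
            await writeBlock.SendAsync((Time: new DateTime(2019, 8, 8, 11, 2, 0), Json: "{\"Type\":\"Feature\"}"));

            writeBlock.Complete();
            await writeBlock.Completion; // all writes are done before the sink is disposed
        }
    }
}
```

Since the block processes one item at a time, the file writing overlaps with parsing but not with itself, so I am not sure how much speed-up to expect from this arrangement.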
In short: how do I load a feature, classify it, and save it to the appropriate output file, quickly and without loading everything into memory?