
I am considering using the NEAR Lake framework to extract raw data such as blocks, transactions, and receipts. Is any data cleaning or transformation applied to the raw data before it is saved to the S3 bucket?

Here's an example of data cleaning performed before the data is written to S3.

Are there any additional processing steps like this, and if so, where can I find a comprehensive list of the data cleaning and transformation types that are applied?

gulshngill

1 Answer


The data stored on S3 corresponds to the structure known as StreamerMessage, which reflects the data from the nearcore node but is not equivalent to it.

I started documenting the StreamerMessage some time ago, but I haven't had the opportunity to finish it yet. You might find it interesting: StreamerMessage Documentation

Put simply, the StreamerMessage represents the node's data in a form tailored to indexer developers. To make their work easier, we introduced several additional structures, such as:

  • IndexerExecutionOutcomeWithReceipt
  • IndexerExecutionOutcomeWithOptionalReceipt
  • IndexerChunk
  • and so on.

These structures do not have an exact counterpart in the nearcore primitives.

The answer to your question depends on how you define "raw data."

Regarding blocks, transactions, and receipts, these entities can still be extracted from the StreamerMessage. However, if you require a comprehensive list of the "transformations," you would need to examine the source code of the Indexer Framework, which can be found here.
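As a rough illustration of that extraction, here is a minimal Python sketch that walks an already-parsed StreamerMessage and collects the block, transactions, and receipts. The field names used (`block`, `shards`, `chunk`, `transactions`, `transaction`, `receipts`) are my assumptions about the JSON layout, and `extract_entities` is a hypothetical helper, not part of any NEAR library, so check them against the real data before relying on this.

```python
from typing import Any, Dict, List


def extract_entities(message: Dict[str, Any]) -> Dict[str, Any]:
    """Collect the block, transactions and receipts from a parsed message.

    Assumed layout (verify against real data):
      message["block"]                          -> the block
      message["shards"][i]["chunk"]             -> chunk object, or None
      chunk["transactions"][j]["transaction"]   -> the transaction itself
      chunk["receipts"]                         -> list of receipts
    """
    transactions: List[Any] = []
    receipts: List[Any] = []

    for shard in message.get("shards", []):
        chunk = shard.get("chunk")
        if chunk is None:
            # Assumption: a shard may carry no chunk for this block.
            continue
        # Each entry is assumed to wrap the transaction with its outcome.
        for entry in chunk.get("transactions", []):
            transactions.append(entry["transaction"])
        receipts.extend(chunk.get("receipts", []))

    return {
        "block": message["block"],
        "transactions": transactions,
        "receipts": receipts,
    }
```

Skipping shards whose `chunk` is `None` keeps the sketch from failing on blocks where a shard produced no chunk, which I am assuming can happen.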

The data is stored in S3 by the Lake Indexer, with the only transformation being the splitting of the entire StreamerMessage into files named block.json and shard_N.json.
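To show what undoing that split might look like, here is a minimal Python sketch that puts the files for one block back together into a single StreamerMessage-shaped dict. It assumes the file contents have already been downloaded into a mapping of file name to JSON text, and that the combined message has top-level `block` and `shards` keys; `assemble_streamer_message` is a hypothetical helper, and fetching from S3 is deliberately left out.

```python
import json
import re
from typing import Any, Dict, List, Mapping, Tuple

# Matches the shard file names mentioned above: shard_0.json, shard_1.json, ...
SHARD_FILE = re.compile(r"^shard_(\d+)\.json$")


def assemble_streamer_message(files: Mapping[str, str]) -> Dict[str, Any]:
    """Rebuild one message from {"block.json": "...", "shard_N.json": "..."}.

    Assumption: the combined structure is {"block": ..., "shards": [...]},
    with shards ordered by their numeric index N.
    """
    block = json.loads(files["block.json"])

    indexed: List[Tuple[int, Any]] = []
    for name, body in files.items():
        match = SHARD_FILE.match(name)
        if match:
            indexed.append((int(match.group(1)), json.loads(body)))

    # Sort numerically so that shard_10.json comes after shard_2.json.
    indexed.sort(key=lambda pair: pair[0])

    return {"block": block, "shards": [shard for _, shard in indexed]}
```

Sorting on the parsed integer rather than the file name avoids the lexicographic ordering problem once there are ten or more shards.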

P.S. It seems you're diving into an interesting topic that could be useful to other developers. Feel free to open new questions so that more information about it ends up on SO.

khorolets