Understanding fold() and its impact on Gremlin query cost in Azure Cosmos DB

Question

I am trying to understand query costs in Azure Cosmos DB.

I cannot figure out how the following examples differ and why using fold() lowers the cost:

g.V().hasLabel('item').project('itemId', 'id').by('itemId').by('id')

which produces the following output:

[
  {
    "itemId": 14,
    "id": "186de1fb-eaaf-4cc2-b32b-de8d7be289bb"
  },
  {
    "itemId": 5,
    "id": "361753f5-7d18-4a43-bb1d-cea21c489f2e"
  },
  {
    "itemId": 6,
    "id": "1c0840ee-07eb-4a1e-86f3-abba28998cd1"
  },
...
  {
    "itemId": 5088,
    "id": "2ed1871d-c0e1-4b38-b5e0-78087a5a75fc"
  }
]

The cost is 15642 RUs x 0.00008 $/RU = 1.25$

g.V().hasLabel('item').project('itemId', 'id').by('itemId').by('id').fold()

which produces the following output:

[
  [
    {
      "itemId": 14,
      "id": "186de1fb-eaaf-4cc2-b32b-de8d7be289bb"
    },
    {
      "itemId": 5,
      "id": "361753f5-7d18-4a43-bb1d-cea21c489f2e"
    },
    {
      "itemId": 6,
      "id": "1c0840ee-07eb-4a1e-86f3-abba28998cd1"
    },
...
    {
      "itemId": 5088,
      "id": "2ed1871d-c0e1-4b38-b5e0-78087a5a75fc"
    }
  ]
]

The cost is 787 RUs x 0.00008 $/RU = 0.06$

g.V().hasLabel('item').values('id', 'itemId')

with the following output:

[
  "186de1fb-eaaf-4cc2-b32b-de8d7be289bb",
  14,
  "361753f5-7d18-4a43-bb1d-cea21c489f2e",
  5,
  "1c0840ee-07eb-4a1e-86f3-abba28998cd1",
  6,
...
  "2ed1871d-c0e1-4b38-b5e0-78087a5a75fc",
  5088
]

The cost is 10639 RUs x 0.00008 $/RU = 0.85$

g.V().hasLabel('item').values('id', 'itemId').fold()

with the following output:

[
  [
    "186de1fb-eaaf-4cc2-b32b-de8d7be289bb",
    14,
    "361753f5-7d18-4a43-bb1d-cea21c489f2e",
    5,
    "1c0840ee-07eb-4a1e-86f3-abba28998cd1",
    6,
...
    "2ed1871d-c0e1-4b38-b5e0-78087a5a75fc",
    5088
  ]
]

The cost is 724.27 RUs x 0.00008 $/RU = 0.058$

As you can see, the impact on the cost is tremendous, and this is for only approx. 3200 nodes with few properties.

I would like to understand why adding fold() changes the cost so much.
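
The arithmetic behind these figures can be checked with a short Python sketch. It uses only the RU charges and the 0.00008 $/RU price quoted above; treating that price as a flat per-RU rate is an assumption, and the dictionary keys are just illustrative labels.

```python
# Price per RU as quoted in the question (assumed to be a flat rate).
PRICE_PER_RU = 0.00008

# RU charges reported in the question for each traversal.
charges = {
    "project": 15642,
    "project + fold": 787,
    "values": 10639,
    "values + fold": 724.27,
}


def cost_usd(rus, price=PRICE_PER_RU):
    """Convert a request charge in RUs to dollars."""
    return rus * price


for name, rus in charges.items():
    print(f"{name:15s} {rus:>9} RU  ${cost_usd(rus):.3f}")

# How many times more expensive the unfolded traversal is.
project_ratio = charges["project"] / charges["project + fold"]
values_ratio = charges["values"] / charges["values + fold"]
print(f"project vs project + fold: {project_ratio:.1f}x")
print(f"values vs values + fold: {values_ratio:.1f}x")
```

In other words, the unfolded traversals cost roughly 20 and 15 times as much as their folded counterparts.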

score 1 · Answer 1 · answered Jul 07 '19 at 16:05

I tried to reproduce your example, but unfortunately I got the opposite result (500 vertices in Cosmos DB):

g.V().hasLabel('test').values('id')

or

g.V().hasLabel('test').project('id').by('id')

These gave 86.08 and 91.44 RU respectively, while the same queries followed by a fold() step cost 585.06 and 590.43 RU.

The result I got seems reasonable, because according to the TinkerPop documentation:

There are situations when the traversal stream needs a "barrier" to aggregate all the objects and emit a computation that is a function of the aggregate. The fold()-step (map) is one particular instance of this.

Since Cosmos DB charges RUs both for the number of objects accessed and for the computations performed on those objects (fold() in this particular case), a higher cost for fold() is expected.
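
To make the "barrier" idea concrete, here is a minimal plain-Python analogy. It is not TinkerPop or Cosmos DB code and says nothing about how RUs are actually charged; the `stream` and `fold` helpers and the 500 generated vertices are hypothetical stand-ins for the traversal.

```python
def stream(vertices):
    """Emit one result per vertex, like a traversal without fold()."""
    for v in vertices:
        yield v["id"]


def fold(traverser):
    """Barrier: drain the whole incoming stream, then emit a single list."""
    yield list(traverser)


# Hypothetical data standing in for 500 vertices.
vertices = [{"id": f"v{i}"} for i in range(500)]

streamed = list(stream(vertices))
folded = list(fold(stream(vertices)))

print(len(streamed))   # number of results without the barrier
print(len(folded))     # number of results with the barrier
print(len(folded[0]))  # size of the single aggregated result
```

The folded version cannot emit anything until it has consumed every upstream result, which is the extra aggregation work the quote describes.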

You can append the executionProfile() step to your traversal, which can help you investigate your case. When I tried:

g.V().hasLabel('test').values('id').executionProfile()

With fold() I got 2 additional steps (the identical parts of the output are omitted for brevity), and ProjectAggregation is where the result set was mapped from 500 results to 1:

 ...
      {
        "name": "ProjectAggregation",
        "time": 165,
        "annotations": {
          "percentTime": 8.2
        },
        "counts": {
          "resultCount": 1
        }
      },
      {
        "name": "QueryDerivedTableOperator",
        "time": 1,
        "annotations": {
          "percentTime": 0.05
        },
        "counts": {
          "resultCount": 1
        }
      }
...
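
If you want to compare operators across profiles, a small Python sketch like the following can help. It only covers the two operators shown above; the rest of the profile is elided, so wrapping them in a plain list is an assumption about the surrounding structure, and `summarize` and `slowest` are hypothetical helper names.

```python
import json

# The two operators from the profile excerpt, wrapped in a list (assumed layout).
profile_fragment = json.loads("""
[
  {"name": "ProjectAggregation", "time": 165,
   "annotations": {"percentTime": 8.2}, "counts": {"resultCount": 1}},
  {"name": "QueryDerivedTableOperator", "time": 1,
   "annotations": {"percentTime": 0.05}, "counts": {"resultCount": 1}}
]
""")


def summarize(operators):
    """Return (name, time, percentTime, resultCount) for each operator."""
    return [
        (op["name"], op["time"],
         op["annotations"]["percentTime"], op["counts"]["resultCount"])
        for op in operators
    ]


def slowest(operators):
    """Return the operator with the largest reported time."""
    return max(operators, key=lambda op: op["time"])


for row in summarize(profile_fragment):
    print(row)
print("slowest:", slowest(profile_fragment)["name"])
```

Running the same summary over the full profiles with and without fold() makes it easier to see which operators are added and where the result count collapses.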
