1

Context: We are importing csv files in Neo4J database.
We have encountered performance issue with the batch

CALL apoc.periodic.iterate('
CALL apoc.load.csv(\'file:///myfile.csv\') yield map as row return row
','
WITH
    row.Source AS sourceName,
    row.Quote AS quoteName,
    row.Type AS type,
    row.CurveIndex AS curveIndexName,
    row.Entity AS entityName,
    row.FreeParam AS freeParamName,
    row.Location AS locationName,
    row.Maturity AS maturityName,
    row.DepMaturity AS depMaturityName,
    row.Seniority AS seniorityName,
    row.Tray AS trayName,
    row.VolatilityModel AS volatilityModelName
MATCH (marketData:MarketData {source:sourceName, quoteName:quoteName, type:type})
OPTIONAL MATCH  (curveIndex:CurveIndex {name:curveIndexName})
OPTIONAL MATCH  (entity:Entity {name:entityName})
OPTIONAL MATCH  (maturity:Maturity {name:maturityName})
OPTIONAL MATCH  (depMaturity:Maturity {name:depMaturityName})
OPTIONAL MATCH  (seniority:Seniority {name:seniorityName})
OPTIONAL MATCH  (tray:Tray {name:trayName})
OPTIONAL MATCH  (volatilityModel:VolatilityModel {name:volatilityModelName})

FOREACH (f1 IN CASE WHEN curveIndex IS NULL THEN [] ELSE [curveIndex] END |
    MERGE (marketData)-[:CURVE_INDEX]->(curveIndex))
FOREACH (f1 IN CASE WHEN entity IS NULL THEN [] ELSE [entity] END |
    MERGE (marketData)-[:ENTITY]->(entity))
FOREACH (f1 IN CASE WHEN maturity IS NULL THEN [] ELSE [maturity] END |
    MERGE (marketData)-[:MATURITY]->(maturity))
FOREACH (f1 IN CASE WHEN depMaturity IS NULL THEN [] ELSE [depMaturity] END |
    MERGE (marketData)-[:DEP_MATURITY]->(depMaturity))
FOREACH (f1 IN CASE WHEN seniority IS NULL THEN [] ELSE [seniority] END |
    MERGE (marketData)-[:SENIORITY]->(seniority))
FOREACH (f1 IN CASE WHEN tray IS NULL THEN [] ELSE [tray] END |
    MERGE (marketData)-[:TRAY]->(tray))
FOREACH (f1 IN CASE WHEN volatilityModel IS NULL THEN [] ELSE [volatilityModel] END |
    MERGE (marketData)-[:VOLATILITY_MODEL]->(volatilityModel))
', {batchSize:1000, iterateList:true, parallel:true, concurrency:4});

This command uses a lot of memory (many stop-the-world gc, some >10s) and take hours to be executed on a 900k lines file.

After several tries we found this command is much more efficient:

CALL apoc.periodic.iterate('
CALL apoc.load.csv(\'file:///myfile.csv\') yield map as row return row
','
WITH
    row.Source AS sourceName,
    row.Quote AS quoteName,
    row.Type AS type,
    row.CurveIndex AS curveIndexName,
    row.Entity AS entityName,
    row.FreeParam AS freeParamName,
    row.Location AS locationName,
    row.Maturity AS maturityName,
    row.DepMaturity AS depMaturityName,
    row.Seniority AS seniorityName,
    row.Tray AS trayName,
    row.VolatilityModel AS volatilityModelName
MATCH (marketData:MarketData {source:sourceName, quoteName:quoteName, type:type})
OPTIONAL MATCH  (curveIndex:CurveIndex {name:curveIndexName}),
                (entity:Entity {name:entityName}),
                (maturity:Maturity {name:maturityName}),
                (depMaturity:Maturity {name:depMaturityName}),
                (seniority:Seniority {name:seniorityName}),
                (tray:Tray {name:trayName}),
                (volatilityModel:VolatilityModel {name:volatilityModelName})

FOREACH (f1 IN CASE WHEN curveIndex IS NULL THEN [] ELSE [curveIndex] END |
    MERGE (marketData)-[:CURVE_INDEX]->(curveIndex))
FOREACH (f1 IN CASE WHEN entity IS NULL THEN [] ELSE [entity] END |
    MERGE (marketData)-[:ENTITY]->(entity))
FOREACH (f1 IN CASE WHEN maturity IS NULL THEN [] ELSE [maturity] END |
    MERGE (marketData)-[:MATURITY]->(maturity))
FOREACH (f1 IN CASE WHEN depMaturity IS NULL THEN [] ELSE [depMaturity] END |
    MERGE (marketData)-[:DEP_MATURITY]->(depMaturity))
FOREACH (f1 IN CASE WHEN seniority IS NULL THEN [] ELSE [seniority] END |
    MERGE (marketData)-[:SENIORITY]->(seniority))
FOREACH (f1 IN CASE WHEN tray IS NULL THEN [] ELSE [tray] END |
    MERGE (marketData)-[:TRAY]->(tray))
FOREACH (f1 IN CASE WHEN volatilityModel IS NULL THEN [] ELSE [volatilityModel] END |
    MERGE (marketData)-[:VOLATILITY_MODEL]->(volatilityModel))
', {batchSize:1000, iterateList:true, parallel:true, concurrency:4});

It use less memory and is executed in around 50s.

I am not sure to really understand the difference between the 2 commands. Does anybody can explain it ?

  • Can you summarize the difference between those queries? – cybersam Dec 02 '19 at 01:33
  • The batch build relationship between :Marketdata and optional properties. All nodes are already created. The first query do an OPTIONAL MATCH for each optional property, the second query do only one OPTIONAL MATCH with all the properties (comma separated clauses) – PY de Daran Dec 02 '19 at 03:47

2 Answers2

0

The two queries aren't doing the same thing, though with your specific data you may be getting the same result (so long as everything you're matching on in your OPTIONAL MATCHes always returns something).

OPTIONAL MATCH is basically saying 'If the pattern I'm OPTIONAL MATCHing doesn't exist, just put nulls in place of anything derived from the nodes and relationships in the OPTIONAL MATCH.

Comma-separated chunks after a MATCH or OPTIONAL MATCH are all just one pattern to be found - we're not implicitly doing another MATCH for each comma we find, we instead find something that obeys the whole pattern (and return it), or don't (and return nulls for anything mentioned).

In your case:

  • The first query will find a MarketData node that matches the CSV row, then optionally match each other node type in turn and if it's found create a relationship between that node and the MarketData node.
    • Because you're OPTIONAL MATCHing each node label at a time, if any label is missing you'll simply get a null for that node variable and you'll skip creating the relationship because of your FOREACH trick
  • The second query will find a MarketData node that matches the CSV row same as the first, but then it'll either fulfil the whole pattern (find a CurveIndex that matches the row, and a Maturity that matches, and a Tray that matches) returning those values, or it'll fail to find all of them and return nulls for everything

If you're familiar with SQL, it's as though you're OUTER JOINing your CSV row to a subselect of inner-joined data when you're using commas, rather than query 1 which just OUTER JOINs each 'table' in sequence. In your second example, if anything doesn't match, your INNER JOIN will return no rows so you'll be left only with the CSV row and nothing else.

My feeling is that you're probably creating fewer nodes and relationships in the second query, because if anything from entity down to volatilityModel doesn't match the OPTIONAL MATCH won't return anything.

Let's do a simplified example with some dummy data, just containing MarketData, CurveIndex and Entity nodes:

MERGE (m: MarketData { quoteName: 'MSFT' })
MERGE (curveIndex: CurveIndex { name: 'SomeCurveIndex' })
MERGE (entity: Entity { name: 'MSFT Entity' })

And we'll fake up a CSV in memory, and avoid all the batching bits and bobs for now. First let's just return the matching MarketData, CurveIndex and Entity nodes when everything in the CSV matches, using your first query to prove it works:

WITH { Quote: 'MSFT', CurveIndex: 'SomeCurveIndex', Entity: 'MSFT Entity' } as row
WITH
    row.Quote AS quoteName,
    row.CurveIndex AS curveIndexName,
    row.Entity AS entityName
MATCH (marketData: MarketData { quoteName: quoteName })
OPTIONAL MATCH (curveIndex: CurveIndex { name: curveIndexName })
OPTIONAL MATCH (entity: Entity { name: entityName })
RETURN marketData, curveIndex, entity

enter image description here

Because all the data exists in both the CSV and the reference data already in the database, we'll get back three nodes. Now let's try changing the Entity name and seeing what happens:

WITH { Quote: 'MSFT', CurveIndex: 'SomeCurveIndex', Entity: 'Some Other Entity' } as row
WITH
    row.Quote AS quoteName,
    row.CurveIndex AS curveIndexName,
    row.Entity AS entityName
MATCH (marketData: MarketData { quoteName: quoteName })
OPTIONAL MATCH (curveIndex: CurveIndex { name: curveIndexName })
OPTIONAL MATCH (entity: Entity { name: entityName })
RETURN marketData, curveIndex, entity

enter image description here

Perfect - we get back the bits we could match, and get nulls for the bits we couldn't.

Now let's try concatenating those two OPTIONAL MATCH clauses and see the effect:

WITH { Quote: 'MSFT', CurveIndex: 'SomeCurveIndex', Entity: 'Some Other Entity' } as row
WITH
    row.Quote AS quoteName,
    row.CurveIndex AS curveIndexName,
    row.Entity AS entityName
MATCH (marketData: MarketData { quoteName: quoteName })
OPTIONAL MATCH (curveIndex: CurveIndex { name: curveIndexName }), (entity: Entity { name: entityName })
RETURN marketData, curveIndex, entity

enter image description here

Ruh roh - even though we found a matching CurveIndex, we didn't return it because we couldn't at the same time also find a matching Entity.

Pablissimo
  • 2,846
  • 2
  • 19
  • 21
0

@Pablissimo is correct about why the 2 queries act differently.

  1. One main reason for the slowness is that all the MATCH clauses are creating a cartesian product. One way to avoid a cartesian product is to minimize the number of MATCH clauses, but as you saw, combining them into a single clause does not work in your use case. But I have a fairly clever way to avoid using MATCH clauses entirely.

  2. Also, you probably have not created the appropriate indexes.

For simplicity, I will limit my discussion to just this snippet from your first query (you can expand this to the other optional variables):

OPTIONAL MATCH (curveIndex:CurveIndex {name:curveIndexName})
FOREACH (f1 IN CASE WHEN curveIndex IS NULL THEN [] ELSE [curveIndex] END |
  MERGE (marketData)-[:CURVE_INDEX]->(curveIndex))
  1. The above snippet can be replaced by the following, which uses pattern comprehension and a variable-length relationship to avoid using multiple OPTIONAL MATCH clauses (and the cartesian products they produce), and which also creates a single list of all the CurveIndex nodes so that they can be processed by a single FOREACH operation. By the way, a pattern comprehension requires a pattern with a relationship, which is why we use a 0-length relationship rather than just a bare node.

    FOREACH (ci IN [(c:CurveIndex {name:curveIndexName})-[*0..0]-()|c] |
      MERGE (marketData)-[:CURVE_INDEX]->(ci))
    
  2. You should also create indexes on:

    :MarketData(sourceName, quoteName, type)
    

    and

    :CurveIndex(name)
    
cybersam
  • 63,203
  • 6
  • 53
  • 76
  • Thanks a lot for your answer. Indeed, use of the pattern comprehension with 0-length path is much more efficient than multiple OPTIONAL MATCH: 175s on the same file, no memory issue. (Indexes was already created) – PY de Daran Dec 03 '19 at 01:43