Context: We are importing csv files in Neo4J database.
We have encountered performance issue with the batch
CALL apoc.periodic.iterate('
CALL apoc.load.csv(\'file:///myfile.csv\') yield map as row return row
','
WITH
row.Source AS sourceName,
row.Quote AS quoteName,
row.Type AS type,
row.CurveIndex AS curveIndexName,
row.Entity AS entityName,
row.FreeParam AS freeParamName,
row.Location AS locationName,
row.Maturity AS maturityName,
row.DepMaturity AS depMaturityName,
row.Seniority AS seniorityName,
row.Tray AS trayName,
row.VolatilityModel AS volatilityModelName
MATCH (marketData:MarketData {source:sourceName, quoteName:quoteName, type:type})
OPTIONAL MATCH (curveIndex:CurveIndex {name:curveIndexName})
OPTIONAL MATCH (entity:Entity {name:entityName})
OPTIONAL MATCH (maturity:Maturity {name:maturityName})
OPTIONAL MATCH (depMaturity:Maturity {name:depMaturityName})
OPTIONAL MATCH (seniority:Seniority {name:seniorityName})
OPTIONAL MATCH (tray:Tray {name:trayName})
OPTIONAL MATCH (volatilityModel:VolatilityModel {name:volatilityModelName})
FOREACH (f1 IN CASE WHEN curveIndex IS NULL THEN [] ELSE [curveIndex] END |
MERGE (marketData)-[:CURVE_INDEX]->(curveIndex))
FOREACH (f1 IN CASE WHEN entity IS NULL THEN [] ELSE [entity] END |
MERGE (marketData)-[:ENTITY]->(entity))
FOREACH (f1 IN CASE WHEN maturity IS NULL THEN [] ELSE [maturity] END |
MERGE (marketData)-[:MATURITY]->(maturity))
FOREACH (f1 IN CASE WHEN depMaturity IS NULL THEN [] ELSE [depMaturity] END |
MERGE (marketData)-[:DEP_MATURITY]->(depMaturity))
FOREACH (f1 IN CASE WHEN seniority IS NULL THEN [] ELSE [seniority] END |
MERGE (marketData)-[:SENIORITY]->(seniority))
FOREACH (f1 IN CASE WHEN tray IS NULL THEN [] ELSE [tray] END |
MERGE (marketData)-[:TRAY]->(tray))
FOREACH (f1 IN CASE WHEN volatilityModel IS NULL THEN [] ELSE [volatilityModel] END |
MERGE (marketData)-[:VOLATILITY_MODEL]->(volatilityModel))
', {batchSize:1000, iterateList:true, parallel:true, concurrency:4});
This command uses a lot of memory (many stop-the-world gc, some >10s) and take hours to be executed on a 900k lines file.
After several tries we found this command is much more efficient:
CALL apoc.periodic.iterate('
CALL apoc.load.csv(\'file:///myfile.csv\') yield map as row return row
','
WITH
row.Source AS sourceName,
row.Quote AS quoteName,
row.Type AS type,
row.CurveIndex AS curveIndexName,
row.Entity AS entityName,
row.FreeParam AS freeParamName,
row.Location AS locationName,
row.Maturity AS maturityName,
row.DepMaturity AS depMaturityName,
row.Seniority AS seniorityName,
row.Tray AS trayName,
row.VolatilityModel AS volatilityModelName
MATCH (marketData:MarketData {source:sourceName, quoteName:quoteName, type:type})
OPTIONAL MATCH (curveIndex:CurveIndex {name:curveIndexName}),
(entity:Entity {name:entityName}),
(maturity:Maturity {name:maturityName}),
(depMaturity:Maturity {name:depMaturityName}),
(seniority:Seniority {name:seniorityName}),
(tray:Tray {name:trayName}),
(volatilityModel:VolatilityModel {name:volatilityModelName})
FOREACH (f1 IN CASE WHEN curveIndex IS NULL THEN [] ELSE [curveIndex] END |
MERGE (marketData)-[:CURVE_INDEX]->(curveIndex))
FOREACH (f1 IN CASE WHEN entity IS NULL THEN [] ELSE [entity] END |
MERGE (marketData)-[:ENTITY]->(entity))
FOREACH (f1 IN CASE WHEN maturity IS NULL THEN [] ELSE [maturity] END |
MERGE (marketData)-[:MATURITY]->(maturity))
FOREACH (f1 IN CASE WHEN depMaturity IS NULL THEN [] ELSE [depMaturity] END |
MERGE (marketData)-[:DEP_MATURITY]->(depMaturity))
FOREACH (f1 IN CASE WHEN seniority IS NULL THEN [] ELSE [seniority] END |
MERGE (marketData)-[:SENIORITY]->(seniority))
FOREACH (f1 IN CASE WHEN tray IS NULL THEN [] ELSE [tray] END |
MERGE (marketData)-[:TRAY]->(tray))
FOREACH (f1 IN CASE WHEN volatilityModel IS NULL THEN [] ELSE [volatilityModel] END |
MERGE (marketData)-[:VOLATILITY_MODEL]->(volatilityModel))
', {batchSize:1000, iterateList:true, parallel:true, concurrency:4});
It use less memory and is executed in around 50s.
I am not sure to really understand the difference between the 2 commands. Does anybody can explain it ?