Clover ETL performance issue with processing large csv files with joins

Question

I am using CloverETL tools to read through a large csv files with over 100,000 rows, translate some lookup values from a DB before inserting them to a DB table.

Basically I used:

A flat file reader to read in the file
Followed by a number of 'Lookup Join' to translate some values from a DB
Then a 'ExtHashJoin' to link up all the data (since the source is not sorted)
Then Insert the record to a DB.

Pretty simple process and works fine for a small number of records. But when I try to import the whole file, it simply ran out of memory (increase memory allocation won't help).

I found the ExtHashJoin is trying to wait until it joins all records before starting to insert them and that seems to be the problem, since it has to store all records in memory. I don't really need that behaviour. Those records are all independent and can be processed in batches e.g. every 1000 rows a time but I cannot figure out a way to make it do this way.

I also tried to set different phases values but it still tries to join all values up before starting inserting the first record.

How can I tell the flat file reader to break the records down and process them in batches?

Thanks.

Eric

score 1 · Accepted Answer · answered Jul 31 '17 at 13:22

there is no easy way to force the FlatFileReader component to break records down and process them in batches. However, I can suggest a few tips that might help you resolve the memory overflow issue caused by the ExtHashJoin component:

As you rightly indicated, the ExtHashJoin component waits until all records (from the slave port) flow in before the joining starts. Generally, we recommend using the ExtHashJoin component if the number of slave port records to be joined is considerably low. If this was not the case, I would recommend using the ExtMergeJoin component. There is no caching (unlike ExtHashJoin) so the processing can be significantly faster. You can apply the same Master/Slave key definition and mapping for the ExtMergeJoin component as you did for the ExtHashJoin component. However, you would need to place a new ExtSort component before the SimpleCopy component in your graph in order to send the data sorted. The sort key should coincide with the master key defined in the ExtMergeJoin component.
If the sorting order of your data does not get changed on the way from the FlatFileReader to the ExtHashJoin component whatsoever, there might be even a simpler approach to decrease the memory consumption. Try replacing the ExtHashJoin component with a new Combine component and apply the same mapping as you did for the ExtHashJoin component. The Combine component avoids caching the records as well and there is no need for joining by keys so the performance should be fairly good.

More details concerning this topic can be found on the CloverETL Forum: http://forum.cloveretl.com/viewtopic.php?p=12926#p12926

Thank you very much Valdi for answering. I tried to use the ExtMergeJoin after placing a FastSort and it seems to be working ok now. — Eric Chen, Aug 01 '17 at 23:26

Clover ETL performance issue with processing large csv files with joins

1 Answers1