
When transforming an XML file to JSON, the Data Fusion pipeline, configured in Autoscaling mode with up to 84 cores, stops with an error.

Can anybody help me to make it work?

The 100-page raw log file seems to indicate that the possible errors were:

  • +ExitOnOutOfMemoryError
  • Container exited with a non-zero exit code 3. Error file: prelaunch.err

It happened with the following configuration:

The weird thing is that the very same pipeline, with an XML file ten times smaller (only 141 MB), worked correctly:

Can anybody help me understand why the Cloud Data Fusion pipeline, set in Autoscaling mode with up to 84 cores, succeeds with the 141 MB XML file but fails with the 1.4 GB XML file?

For clarity, here are all the detailed steps:

1 Answer


Parsing a 1 GB XML file requires a significant amount of memory in your workers.

Looking at your pipeline JSON, it is currently configured to allocate 2 GB of RAM per executor.

"config": {
    "resources": {
        "memoryMB": 2048,
        "virtualCores": 1
    },
    "driverResources": {
        "memoryMB": 2048,
        "virtualCores": 1
    },
    ...
}

This is likely insufficient to hold the entire parsed ~1.1 GB JSON payload.

Try increasing the amount of executor memory in the Config -> Resources -> Executor section. I would suggest trying 8 GB of RAM for your example.

Resources Config Example
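
For reference, if you edit the exported pipeline JSON directly instead of using the UI, the equivalent change would look roughly like this (a sketch based on the snippet above, raising only the executor memory to the suggested 8 GB and leaving the driver unchanged):

"config": {
    "resources": {
        "memoryMB": 8192,
        "virtualCores": 1
    },
    "driverResources": {
        "memoryMB": 2048,
        "virtualCores": 1
    },
    ...
}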

EDIT: When using the Default or Autoscaling compute profile, CDF will create workers with 2 vCPU cores and 8 GB of RAM. You will need to increase the worker size using the following runtime arguments:

system.profile.properties.workerCPUs = 4 
system.profile.properties.workerMemoryMB = 22528 

Runtime arguments to increase worker CPU and Memory allocation

This will increase the worker size to 4 vCPUs and 22 GB of RAM, which will be large enough to fit the requested executor in the worker.
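
As a rough sanity check (assuming Spark's default memory overhead of max(384 MB, 10% of executor memory)): an 8192 MB executor translates into a YARN container request of roughly 8192 + 819 ≈ 9011 MB. The default 8 GB worker only exposes about 6554 MB to YARN (the yarn.nodemanager.resource.memory-mb limit quoted in the error below), so the request is rejected, whereas a 22528 MB worker leaves plenty of headroom.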

  • Hi Fernando, many thanks for your reply. First of all, can you explain why the autoscaling does not automatically scale the workers memory? Anyway, I followed your advice and increased the worker memory. But, I got an error: "Spark program 'phase-1' failed with error: Required executor memory (8192 MB), offHeap memory (0) MB, overhead (819 MB), and PySpark memory (0 MB) is above the max threshold (6554 MB) of this cluster! Please check the values of 'yarn.scheduler.maximum-allocation-mb' and/or 'yarn.nodemanager.resource.memory-mb'.. Please check the system logs for more details." – Mauro Di Pasquale Mar 23 '23 at 05:05
  • **NOTE: I have edited the post to explain how to increase the worker node memory allocation.** In Spark, autoscaling refers to the number of executors, not the size of the executors themselves. This wrangler step will run over the partition containing the entirety of the XML file, so the worker itself needs to be able to fit the entirety of the parsed XML in memory. – Fernando Velasquez Mar 23 '23 at 16:24
  • I also looked at your github post. If you want to scale up processing, it would be best if you create multiple small (no greater than 128MB) XML files instead of a huge XML file containing all orders. Multiple files can be read in parallel by multiple workers, and this will allow your pipeline to scale in processing capacity. Another thing to consider is that XML is not a splittable input format for Spark pipelines. Consider using a binary format like Avro/Parquet if you want to be able to scale up processing more efficiently, as even large files can be split across multiple workers. – Fernando Velasquez Mar 23 '23 at 20:52
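
To illustrate the splitting approach from the last comment (a minimal sketch only, not part of the original thread): it assumes the orders file contains repeated <order> elements, and the record tag, chunk size, and output naming are placeholders that would need to match the real schema.

# Sketch: stream a large XML file of <order> records and write smaller
# files of roughly max_bytes each, so no single file has to be parsed
# by one worker. Tag names and paths are hypothetical.
import xml.etree.ElementTree as ET

def split_orders(src_path, out_prefix, record_tag="order", max_bytes=100 * 1024 * 1024):
    chunk, chunk_size, file_index = [], 0, 0
    for _, elem in ET.iterparse(src_path, events=("end",)):
        if elem.tag != record_tag:   # namespaced files would need "{namespace}order"
            continue
        record = ET.tostring(elem, encoding="unicode")
        chunk.append(record)
        chunk_size += len(record)    # approximate size in characters
        elem.clear()                 # release the parsed subtree to keep memory flat
        if chunk_size >= max_bytes:
            write_chunk(out_prefix, file_index, record_tag, chunk)
            chunk, chunk_size, file_index = [], 0, file_index + 1
    if chunk:
        write_chunk(out_prefix, file_index, record_tag, chunk)

def write_chunk(out_prefix, index, record_tag, records):
    # Wrap the records in a root element so each output file is valid XML on its own.
    with open(f"{out_prefix}_{index:04d}.xml", "w", encoding="utf-8") as f:
        f.write(f"<{record_tag}s>\n")
        f.writelines(records)
        f.write(f"</{record_tag}s>\n")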