
MarkLogic version: 9.0-6.2, mlcp version: 9.0.6

I am trying to import an XML file into MarkLogic with MLCP using the code below.

#!/bin/bash
mlcp.sh import -ssl \
-host localhost \
-port 8010 \
-username uname \
-password pword \
-mode local \
-input_file_path /data/testsource/*.XML \
-input_file_type documents \
-aggregate_record_namespace "http://new.webservice.namespace" \
-output_collections testcol \
-output_uri_prefix /testuri/ \
-transform_module /ext/ingesttransform.sjs

The script runs successfully with a small file, but fails with a 'Java heap space' error when run against a large file (450 MB).

ERROR contentpump.MultithreadedMapper: Error closing writer: Java heap space

How could we resolve this error?

Bhanu

3 Answers

2

You can pass Java heap settings through to MLCP using the standard `JVM_OPTS` environment variable. Run `java -X` to see the full list of available options. I typically use these:

    -Xms<size>        set initial Java heap size
    -Xmx<size>        set maximum Java heap size
    -Xss<size>        set java thread stack size

You could invoke your script or MLCP like this:

JVM_OPTS="-Xmx1g" mlcp.sh ...
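Applied to the invocation from the question, that could look like this (the 2g value is only a starting point; size `-Xmx` to the memory actually available on the machine running MLCP):

```shell
#!/bin/bash
# Same call as in the question, but with an explicit 2 GB max heap
# for the MLCP client JVM (adjust -Xmx to what your machine has).
JVM_OPTS="-Xmx2g" mlcp.sh import -ssl \
-host localhost \
-port 8010 \
-username uname \
-password pword \
-mode local \
-input_file_path /data/testsource/*.XML \
-input_file_type documents \
-aggregate_record_namespace "http://new.webservice.namespace" \
-output_collections testcol \
-output_uri_prefix /testuri/ \
-transform_module /ext/ingesttransform.sjs
```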

HTH!

grtjn
  • Thanks a lot for the response. I incrementally increased the value all the way up to "-Xmx1024g", but it is still failing within 2 seconds with the Java heap space error. The script runs fine with small files. Do we need any server-level changes? I am running the script from a different Linux server (not where MarkLogic is installed). Anything else I can try? – Bhanu Feb 14 '19 at 18:49
  • Try adding `-thread_count 1 -transaction_size 1 -batch_size 1` to force MLCP to process the files one by one. It may be trying to process a few big files in parallel. If that works, gently increase thread_count back to 10 to get better performance, and batch_size as well if you still have memory left.. – grtjn Feb 14 '19 at 19:11
  • Also, I don't think you will have 1024g memory available, so Java just stops when it depleted all.. – grtjn Feb 14 '19 at 19:11
  • Thanks again for the quick response. With the suggested changes, I am still getting 'Exception in thread "main" java.lang.OutOfMemoryError: Java heap space' error. – Bhanu Feb 14 '19 at 23:32
  • It sounds a bit like it is not picking up the memory setting, or still doing too much at once. You might wanna log the actual java call that is being executed by mlcp.sh, and check how many incoming requests you see in Admin ui Status tab, or Monitoring ui.. – grtjn Feb 15 '19 at 08:12
1

The mlcp job was sending the whole input file as one single document (`-input_file_type documents`) of around 500 MB into the transform module. The transform module has logic to split out a uri and value (content.uri and content.value) for each aggregate element. This resulted in the Java heap space error, even though around 3.4 GB of heap was available on the server.

I tried two different designs that are working.

  1. Add aggregation in mlcp (`-input_file_type aggregates`, `-aggregate_record_element CustId`) to split the input into multiple documents. This creates multiple documents in the staging DB.
  2. Keep `-input_file_type documents` and remove `-transform_module`, so the file is loaded as one single document into staging.

Both approaches work, but the second may create documents close to 500 MB in size (I believe the size limit is 512 MB). So I opted for the first approach (I also need a better uri than the default created by mlcp).
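As a concrete sketch, the first approach amounts to swapping the documents/transform options for aggregate splitting in the original invocation (host, port, and credentials are the placeholders from the question):

```shell
#!/bin/bash
# Let mlcp split the aggregate XML into one document per CustId
# record, instead of transforming one huge document in memory.
mlcp.sh import -ssl \
-host localhost \
-port 8010 \
-username uname \
-password pword \
-mode local \
-input_file_path /data/testsource/*.XML \
-input_file_type aggregates \
-aggregate_record_element CustId \
-aggregate_record_namespace "http://new.webservice.namespace" \
-output_collections testcol \
-output_uri_prefix /testuri/
```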

Bhanu
  • The transform approach could still work, but again, you probably want to pace things down. Next to the `-thread_count 1` option, you also have `-batch_size 1` and `-transaction_size 1` that you could try. – grtjn Aug 31 '19 at 12:59
0

To clarify loading a single large document vs. many documents: it depends on your input. If your input file is one large document, it will be loaded without splitting unless you specify an XML or JSON element/property to split on. For instance, a phoneBook.xml with 100,000 entries, or a big phone: [ ] JSON array, should be split up.

However, if your input is already split into many records (typically CSV or other line-oriented text formats), then you don't need to specify how to split it: the format uses newlines to separate records, and mlcp knows this.
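For the phoneBook.xml case, a minimal sketch of splitting on a record element (host and credentials are placeholders, and the element name `entry` is hypothetical; use whatever element actually wraps each record in your file):

```shell
# Split one large XML file into one database document per record.
# "entry" is a made-up record element name, for illustration only.
mlcp.sh import \
-host localhost \
-port 8010 \
-username uname \
-password pword \
-mode local \
-input_file_path /data/phoneBook.xml \
-input_file_type aggregates \
-aggregate_record_element entry
```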