I'm running a UIMA application on Apache Spark. Millions of pages arrive in batches to be processed by UIMA Ruta. Sometimes I get an out-of-memory exception. The failure is not consistent: one run successfully processes 2000 pages, while another fails at 500 pages.

Application Log

Caused by: java.lang.OutOfMemoryError: Java heap space
        at org.apache.uima.internal.util.IntArrayUtils.expand_size(IntArrayUtils.java:57)
        at org.apache.uima.internal.util.IntArrayUtils.ensure_size(IntArrayUtils.java:39)
        at org.apache.uima.cas.impl.Heap.grow(Heap.java:187)
        at org.apache.uima.cas.impl.Heap.add(Heap.java:241)
        at org.apache.uima.cas.impl.CASImpl.ll_createFS(CASImpl.java:2844)
        at org.apache.uima.cas.impl.CASImpl.createFS(CASImpl.java:489)
        at org.apache.uima.cas.impl.CASImpl.createAnnotation(CASImpl.java:3837)
        at org.apache.uima.ruta.rule.RuleMatch.getMatchedAnnotations(RuleMatch.java:172)
        at org.apache.uima.ruta.rule.RuleMatch.getMatchedAnnotationsOf(RuleMatch.java:68)
        at org.apache.uima.ruta.rule.RuleMatch.getLastMatchedAnnotation(RuleMatch.java:73)
        at org.apache.uima.ruta.rule.ComposedRuleElement.mergeDisjunctiveRuleMatches(ComposedRuleElement.java:330)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:213)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362)
        at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:225)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362)
        at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:225)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362)
        at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:225)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362)
        at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:225)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362)
        at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:225)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362)
        at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:225)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362)
        at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459)

UIMA RUTA SCRIPT

WORDLIST EnglishStopWordList = 'stopWords.txt';
WORDLIST FiltersList = 'AnchorFilters.txt';
DECLARE Filters, EnglishStopWords;
DECLARE Anchors, SpanStart,SpanClose;

DocumentAnnotation{-> ADDRETAINTYPE(MARKUP)};

DocumentAnnotation{-> MARKFAST(Filters, FiltersList)};

STRING MixCharacterRegex = "[0-9]+[a-zA-Z]+";

DocumentAnnotation{-> MARKFAST(EnglishStopWords, EnglishStopWordList,true)};
(SW | CW | CAP ) { -> MARK(Anchors, 1, 2)};
Anchors{CONTAINS(EnglishStopWords) -> UNMARK(Anchors)};

(SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM) (SW | CW | CAP ) (SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM) EnglishStopWords? { -> MARK(Anchors, 1, 4)};
(SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM)? (SW | CW | CAP ) (SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM) EnglishStopWords? { -> MARK(Anchors, 1, 4)};
(SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM) (SW | CW | CAP ) (SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM)? EnglishStopWords? { -> MARK(Anchors, 1, 4)};
(SW | CW | CAP ) (SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM) EnglishStopWords? { -> MARK(Anchors, 1, 3)};

Anchors{CONTAINS(MARKUP) -> UNMARK(Anchors)};
MixCharacterRegex -> Anchors;

"<Value>"  -> SpanStart;
"</Value>" -> SpanClose;

Anchors{-> CREATE(ExtractedData, "type" = "ANCHOR", "value" = Anchors)};

SpanStart Filters? SPACE? ExtractedData SPACE? Filters? SpanClose{-> GATHER(Data, 2, 6, "ExtractedData" = 4)};
Gaurav

1 Answer

Normally, the reasons for high memory usage in UIMA Ruta can be found in RutaBasic (many annotations, coverage information) or in RuleMatch (inefficient rules, many rule element matches).

In your example, the problem seems to originate somewhere else. The stack trace indicates that the memory is used up by a disjunctive rule element, which requires creating new annotations to store the match information.

It seems that your version of UIMA Ruta is rather old, since the line numbers do not match the source I am looking at at all.

There are seven (!!!) calls of continueOwnMatch in the stack trace. I was looking for a rule that could cause something like this but found none. This could be an old flaw that has been fixed in newer versions, or some preprocessing added additional CW/SW/CAP annotations.

As a first step, I would suggest two things:

  1. Update to UIMA Ruta 2.6.0
  2. Get rid of all disjunctive rule elements

The disjunctive rule elements are not really needed in your script. In general, they should not be used at all unless really required. I do not use them at all in production rules.

Instead of (SW | CW | CAP ) you can simply write W.

Instead of (SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM) you can write ANY{OR(REGEXP("['\"-=()\\[\\]]"),IS(PM))}.
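One thing worth checking before swapping SPECIAL for ANY is what the character class in that pattern actually accepts. The sketch below assumes that Ruta's REGEXP condition applies a Java regular expression to the whole covered text of the token; the class name CharClassCheck and the helper matchesWhole are hypothetical names used only for this illustration. In Java regex syntax, the sequence "-= inside square brackets is read as a range from the double quote to the equals sign, not as three separate characters.

```java
import java.util.regex.Pattern;

public class CharClassCheck {
    // Same pattern text as in the rules: ['"-=()\[\]]
    // Inside the brackets, "-= is a range (U+0022 to U+003D), which covers
    // punctuation such as # $ % & * + , . / : ; < and also the digits 0-9.
    static final Pattern CLASS = Pattern.compile("['\"-=()\\[\\]]");

    // Assumption: a whole-token match, mirroring how REGEXP is assumed to
    // be evaluated against the covered text of a single token.
    static boolean matchesWhole(String token) {
        return CLASS.matcher(token).matches();
    }

    public static void main(String[] args) {
        // Print which single-character tokens the class accepts.
        for (String token : new String[] {"-", "=", "(", "]", "5", "+", "a"}) {
            System.out.println(token + " -> " + matchesWhole(token));
        }
    }
}
```

With SPECIAL as the matching condition, digits can never be candidates, so the wide range is harmless. With ANY, a single-digit token such as 5 would satisfy the REGEXP under the assumption above. If only the literal characters are intended, moving the hyphen to the end of the class (for example ['"=()\[\]-]) avoids the range.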

Using ANY as a matching condition can reduce runtime performance. In this example, two rules instead of the rule element rewrite might be better, e.g., something like

SPECIAL{REGEXP("['\"-=()\\[\\]]")} W ANY?{OR(REGEXP("['\"-=()\\[\\]]"),IS(PM))} EnglishStopWords? { -> MARK(Anchors, 1, 4)};
PM W ANY?{OR(REGEXP("['\"-=()\\[\\]]"),IS(PM))} EnglishStopWords? { -> MARK(Anchors, 1, 4)};

(Optional rule elements at the start of a rule without any anchors in the rule are not optional.)

By the way, there is a lot of room for optimization in your rules. If I had to guess, I'd say you can get rid of at least half the rules and 90% of all created annotations, which would also considerably reduce the memory usage.

DISCLAIMER: I am a developer of UIMA Ruta

Peter Kluegl
  • I tried changing the rules as per your suggestion, but performance degraded by 10-15% – Gaurav Jun 12 '17 at 10:57
  • Ok, that's strange. Did you have some overlapping Anchors before? How do you evaluate the performance (=accuracy?)? The rewrite should not change the result. – Peter Kluegl Jun 12 '17 at 14:33
  • The rewritten rules give me exactly the same results. By performance I mean the time taken to calculate the anchors. I'm using Ruta in Spark to get anchors from batches of pages, and previously it took less time to get the anchors. No doubt the rewrite may use less memory, but I don't have a benchmark for that right now. – Gaurav Jun 12 '17 at 15:56
  • One more thing: after increasing the executor memory I no longer get the out-of-memory exception, but since my hardware is limited I'm looking for improvements on the Ruta side. Right now I don't have the bandwidth to upgrade the Ruta version, as it may give me different results or issues, but I also think rule rewriting and a version upgrade will boost performance. – Gaurav Jun 12 '17 at 15:58
  • Yes, there is much room for optimizing the rules. I'd guess it could be ten times faster. I'll adapt the answer for avoiding the performance overhead. – Peter Kluegl Jun 14 '17 at 08:24
  • Yes, I agree. I'm not facing the memory issue anymore. But how can I achieve 10x performance apart from rule rewriting? – Gaurav Jun 14 '17 at 13:12
  • The comments of this question are not suitable to discuss the speed optimizations. Ask this question on the uima user mailing list and provide an exemplary document of representative size. I'll help you to optimize it but I am quite occupied the next 2 weeks. – Peter Kluegl Jun 23 '17 at 05:23
  • Essentially, you need to reduce the matches and the usage of UIMA iterators. Do you need the Anchors annotations at all? Use an mtwl instead of MARKFAST. Merge some rules, and move some checks to the mtwl, since you already need to check the complete document there. What is the output of your script? Data? Then you can make all rules dependent on the anchor of the last one and avoid a lot of matches. – Peter Kluegl Jun 23 '17 at 05:27
  • Let me create some sample data, as the original data is sensitive. I will post it to the mailing list. – Gaurav Jun 23 '17 at 16:20