Indexing d:content property with content > 32 KB

Question

I have an Alfresco model type with an additional property of type d:content. This property causes Solr exceptions when I try to store content larger than 32 KB in it. The current definition of this property is:

<property name="acme:secondContent">
  <type>d:content</type>
  <mandatory>false</mandatory>
  <index enabled="true">
    <atomic>true</atomic>
    <stored>true</stored>
    <tokenised>both</tokenised>
  </index>
</property>

If I put content larger than 32 KB into this property, Solr throws this exception when it tries to index it:

java.lang.IllegalArgumentException: Document contains at least one immense term in field="content@s____@{http://acme.com/model/custom/1.0}secondContent" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.

Changing the index configuration does not help. The error is thrown with every variant of index and its sub-elements that I've tried.

An answer to another question states:

The maximum size for a single term in the underlying Lucene index is 32766 bytes, which is, I believe, hard-coded.

How do I configure the index of a d:content property so that I can save and index content larger than 32 KB?

Edit:

In contentModel.xml, cm:content is configured like this:

<index enabled="true">
  <atomic>true</atomic>
  <stored>false</stored>
  <tokenised>true</tokenised>
</index>

Adding a simple text/plain file with content larger than 32 KB works without problems.

The same index configuration for my custom property still fails.

Update:

Under Alfresco 4.2f CE, the problem does not occur. So this is a bug in Alfresco 5.0c together with Solr 4.9.1.

Update 2:

I've filed a bug in the Alfresco JIRA.

Setting `` to true should help. What is the content of that field? Would you lose anything if you have it in tokenised form only? Having it in string form would allow sorting and faceting. Is that required for that field? — cheffe, Apr 08 '15 at 11:13
No, sorting and faceting are not required. I'll try some more combinations. — , Apr 08 '15 at 11:36
Is there any reason you can't extend cm:content which includes a d:content property? — crownjewel82, Apr 08 '15 at 13:11
Yes. The type extends `cm:content` to store the whole content. I have custom code that takes a part of the content and stores it in the property I've described above. Think a multipart email body that I want to store whole while allowing access to the parts, too. — , Apr 08 '15 at 13:19
@Tichodroma You've not mentioned what version/edition of Alfresco you are using. Solr differs greatly in the latest version. — Mardoz, Apr 13 '15 at 13:21
The default Solr version that comes with Alfresco 5.0.c: 4.9.1 — , Apr 13 '15 at 13:24
Why are you using Alfresco 5? Have you tried with Alfresco 4.2.f? I suggest waiting for Alfresco 5.1.x — Piergiorgio Lucidi, Apr 13 '15 at 20:39
Because Alfresco 5 is already at 5.0d and keep asking when it will be ready for production use. I'm happy with 4.2 but forced to evaluate whether existing code can be ported to 5.0. Waiting until 5.1 comes out is not a real option from a business point of view. BTW, the same problem happens in 4.2f. — , Apr 14 '15 at 04:04
I'm not a Solr expert, but it sounds to me like the issue is related to a change in the maximum number of bytes a document may have to be stored as a whole in the index (which is used for debugging). Please check that you define false in your model to store only the real index values but not the whole text on top. — Heiko Robert, Apr 22 '15 at 20:06

score 5 · Answer 1 · answered Apr 13 '15 at 21:11

Hypothesis 1

If your content contains very long terms (single words about 32 KB long), you have to implement your own Lucene analyzer to support that specific type of text. In that case the problem lies in the default Lucene implementation, because the limit is hard-coded there.

Hypothesis 2

Otherwise, if your content is not structured like that, it sounds strange to me and could be a bug. If tokenised=true does not solve it, a potential workaround is to change the content model: add an association between the parent node and a dedicated type of node that holds the text in question in the default cm:content property. In other words, associations should solve it ;)
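
To make the workaround concrete, here is a minimal sketch of what such a model fragment could look like. The names acme:mail, acme:parts and acme:part are hypothetical and only illustrate the idea; each part becomes its own node that keeps its text in the standard cm:content property inherited from its parent type, so no custom d:content property is needed.

```xml
<!-- Sketch only: acme:mail, acme:parts and acme:part are assumed names, not taken from the question -->
<types>
  <!-- Parent node: stores the whole content in the inherited cm:content property -->
  <type name="acme:mail">
    <parent>cm:content</parent>
    <associations>
      <!-- Each extracted part is a child node of the parent -->
      <child-association name="acme:parts">
        <source>
          <mandatory>false</mandatory>
          <many>false</many>
        </source>
        <target>
          <class>acme:part</class>
          <mandatory>false</mandatory>
          <many>true</many>
        </target>
      </child-association>
    </associations>
  </type>

  <!-- Part node: its text lives in the default cm:content property, indexed like any other document -->
  <type name="acme:part">
    <parent>cm:content</parent>
  </type>
</types>
```

The custom code that currently writes a part into the extra property would instead create a child node of this type and write the part into its content.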

Hope this helps.

Hypothesis 1 can be ruled out. The error occurs with a normal lorem ipsum text. I don't have the option to change the content model as you propose in hypothesis 2. Existing Alfresco 4.2 installations must be migrated to 5.0. I'll file a bug at the Alfresco JIRA. — , Apr 14 '15 at 08:15

score 2 · Accepted Answer · answered Apr 23 '15 at 10:50

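The solution is not to store the full document or part in the index. Avoid stored=true and tokenised=both/false on large properties whose content exceeds 32 KB; the untokenised variant indexes the whole value as a single term, which is what runs into the 32766-byte limit named in the exception. Indexing should work if your model declaration looks like this: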

<property name="acme:secondContent">
  <type>d:content</type>
  <mandatory>false</mandatory>
  <index enabled="true">
    <atomic>true</atomic>
    <stored>false</stored>
    <tokenised>true</tokenised>
  </index>
</property>

Drawback: in my test I had to drop the whole index. It was not sufficient to delete the cached models in Solr.
