I'm trying to index documents (.doc, .ppt, .pdf, etc.) as attachments (storing the content field as Base64) and then run a search query and highlight the content field in the matching files. Why does the size of the files increase when I index them?

For example: the folder the documents are indexed from totals 30 MB, but the head plugin shows 127 MB for the same number of files (indexed from that same folder).

Here is my mapping:

// Mapping: a Title string plus an attachment field whose extracted content
// is analyzed with "english", stored, and indexed with full term vectors.
var response = client.CreateIndex(defaultIndex, c => c
    .Mappings(m => m
        .Map<Document>(mp => mp
            .Properties(ps => ps
                .String(s => s.Name(e => e.Title))
                .Attachment(s => s.Name(p => p.File)
                    .FileField(ff => ff.Name(f => f.File)
                        .TermVector(TermVectorOption.WithPositionsOffsetsPayloads)
                        .Analyzer("english")
                        .Store(true)))))));

Observation (I don't know if I'm right about this): when I indexed the documents with manually assigned IDs, the index size was around 36 MB. When I removed the Id field and indexed again, indexing took much longer, the index was larger, and search did not work properly. Does the size depend on how the files are indexed?

TIA

ASN

1 Answer

The size of the index depends on many different factors. The raw size of your folder is not a good estimate for how much your index will weigh.

It depends a lot on the mapping of your fields: whether you're indexing fields with large text content (which seems to be your case), whether you have custom analyzers with ngram tokenizers and/or token filters, and so on. Lucene stores many different kinds of files, and the same token (with additional positions and offsets) might appear in several of them, all adding up to the size of your index.

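Also, if you reindex your folder a few times over and over during your testing, then the index size will grow because you'll have a large amount of deleted documents. Deleted documents are only marked as deleted and keep occupying disk space until Lucene merges the segments that contain them.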

Finally, BASE64 is known to inflate your content by about one third.
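
To put a number on the Base64 part, here is a minimal sketch in plain C# (no Elasticsearch or NEST involved). The class name `Base64Overhead`, the helper `EncodedLength`, and the 3 MB zero-filled buffer are made up for illustration; only `Convert.ToBase64String` is a real framework call. Base64 emits 4 characters for every 3 input bytes, which is where the "one third" comes from.

```csharp
using System;

public static class Base64Overhead
{
    // Hypothetical helper: number of Base64 characters produced for rawBytes
    // input bytes. Every group of 3 bytes (the last group padded) becomes 4 chars.
    public static long EncodedLength(long rawBytes) => 4 * ((rawBytes + 2) / 3);

    public static void Main()
    {
        // Stand-in payload for a document on disk (contents do not matter for size).
        byte[] raw = new byte[3 * 1024 * 1024];
        string encoded = Convert.ToBase64String(raw);

        Console.WriteLine($"raw bytes:    {raw.Length}");
        Console.WriteLine($"Base64 chars: {encoded.Length}");
        Console.WriteLine($"ratio:        {(double)encoded.Length / raw.Length:F3}");

        // The 30 MB folder from the question, if every file were Base64-encoded.
        long folderBytes = 30L * 1024 * 1024;
        double encodedMb = EncodedLength(folderBytes) / (1024.0 * 1024.0);
        Console.WriteLine($"30 MB folder as Base64: {encodedMb:F1} MB");
    }
}
```

By this arithmetic the 30 MB folder becomes about 40 MB of Base64 text, so the encoding alone would explain roughly 10 MB of the growth and not the jump to 127 MB. The rest has to come from the other factors above.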

Val
  • _if you reindex your folder a few times over and over during your testing, then the index size will grow because you'll have a large amount of deleted documents._ But I'm deleting the existing index and creating a new one every time, so does this still have an effect? I'm using only the "english" analyzer and nothing else. Since I'm indexing documents, the amount of text content is comparatively very high. – ASN May 30 '16 at 04:11
  • Try to index your documents without an analyzer, without positions and offsets, and without storing the content, i.e. with the simplest settings possible. How much do you get? Then add storage and compare. Then add the analyzer and compare. Then add the positions/offsets and compare, etc. You'll see that each "additional setting" adds to your index size. It's perfectly normal. – Val May 30 '16 at 04:14
  • OK. But what surprised me is the effect of changing how the IDs are generated while indexing. When I indexed the docs using auto-generated IDs, the index took more space than when I indexed them with custom IDs (maybe I missed something while using auto IDs). So I wanted to know if the size depends on that as well; that's why I posted. – ASN May 30 '16 at 04:17
  • The autogenerated IDs are [Flake IDs](http://blog.mikemccandless.com/2014/05/choosing-fast-unique-identifier-uuid.html) so if your IDs are smaller in size (like 1, 2, 3, 4, etc) then they will take less space for sure. – Val May 30 '16 at 04:33
  • Hi Val. For now I'm going with the auto-generated IDs, as using custom IDs would cause some problems in my application. When using auto IDs, is there any chance of IDs repeating? For example: today I run the application, it creates some 10 IDs, and I stop it. If I run the application again tomorrow, is there any chance that some of the new IDs repeat today's IDs? – ASN Jun 03 '16 at 01:18
  • Shall we consider this question closed then? – Val Jun 03 '16 at 04:08
  • So basically the size depends on many factors at indexing time. Is there any way to reduce the size without losing my requirement? (Can something be changed in the mapping above to reduce the size?) – ASN Jun 03 '16 at 05:29