
I have 13,000 webpages with their body text indexed. The goal is to get the top 200 phrase frequencies for one-word, two-word, three-word, and so on up to eight-word phrases.

There are a total of over 150 million words from these webpages that need to be tokenized.

The problem is that the query takes about 15 minutes, after which it runs out of heap space, failing to complete.

I'm testing this on an Ubuntu server with 4 CPU cores, 8 GB of RAM, and an SSD. 6 GB of RAM is assigned to the heap, and swap is disabled.

I can work around this by splitting the data into eight different indices: the query/settings/mapping combination works when I aggregate a single phrase length at a time. That is, if I run it on one-word phrases, two-word phrases, etc. alone, I get the results I expect (though each run still takes about 5 minutes). I was wondering whether there is a way to tune the full aggregation to run against a single index, with a single query, on my hardware.

Settings and mappings:

{
   "settings":{
      "index":{
         "number_of_shards" : 1,
         "number_of_replicas" : 0,
         "analysis":{
            "analyzer":{
               "analyzer_shingle_2":{
                  "tokenizer":"standard",
                  "filter":["standard", "lowercase", "filter_shingle_2"]
               },
               "analyzer_shingle_3":{
                  "tokenizer":"standard",
                  "filter":["standard", "lowercase", "filter_shingle_3"]
               },
               "analyzer_shingle_4":{
                  "tokenizer":"standard",
                  "filter":["standard", "lowercase", "filter_shingle_4"]
               },
               "analyzer_shingle_5":{
                  "tokenizer":"standard",
                  "filter":["standard", "lowercase", "filter_shingle_5"]
               },
               "analyzer_shingle_6":{
                  "tokenizer":"standard",
                  "filter":["standard", "lowercase", "filter_shingle_6"]
               },
               "analyzer_shingle_7":{
                  "tokenizer":"standard",
                  "filter":["standard", "lowercase", "filter_shingle_7"]
               },
               "analyzer_shingle_8":{
                  "tokenizer":"standard",
                  "filter":["standard", "lowercase", "filter_shingle_8"]
               }
            },
            "filter":{
               "filter_shingle_2":{
                  "type":"shingle",
                  "max_shingle_size":2,
                  "min_shingle_size":2,
                  "output_unigrams":"false"
               },
               "filter_shingle_3":{
                  "type":"shingle",
                  "max_shingle_size":3,
                  "min_shingle_size":3,
                  "output_unigrams":"false"
               },
               "filter_shingle_4":{
                  "type":"shingle",
                  "max_shingle_size":4,
                  "min_shingle_size":4,
                  "output_unigrams":"false"
               },
               "filter_shingle_5":{
                  "type":"shingle",
                  "max_shingle_size":5,
                  "min_shingle_size":5,
                  "output_unigrams":"false"
               },
               "filter_shingle_6":{
                  "type":"shingle",
                  "max_shingle_size":6,
                  "min_shingle_size":6,
                  "output_unigrams":"false"
               },
               "filter_shingle_7":{
                  "type":"shingle",
                  "max_shingle_size":7,
                  "min_shingle_size":7,
                  "output_unigrams":"false"
               },
               "filter_shingle_8":{
                  "type":"shingle",
                  "max_shingle_size":8,
                  "min_shingle_size":8,
                  "output_unigrams":"false"
               }
            }
         }
      }
   },
   "mappings":{
      "items":{
         "properties":{
            "body":{
               "type": "multi_field",
               "fields": {
                  "two-word-phrases": {
                     "analyzer":"analyzer_shingle_2",
                     "type":"string"
                  },
                  "three-word-phrases": {
                     "analyzer":"analyzer_shingle_3",
                     "type":"string"
                  },
                  "four-word-phrases": {
                     "analyzer":"analyzer_shingle_4",
                     "type":"string"
                  },
                  "five-word-phrases": {
                     "analyzer":"analyzer_shingle_5",
                     "type":"string"
                  },
                  "six-word-phrases": {
                     "analyzer":"analyzer_shingle_6",
                     "type":"string"
                  },
                  "seven-word-phrases": {
                     "analyzer":"analyzer_shingle_7",
                     "type":"string"
                  },
                  "eight-word-phrases": {
                     "analyzer":"analyzer_shingle_8",
                     "type":"string"
                  }
               }
            }
         }
      }
   }
}
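
For reference, each sub-field's analyzer can be sanity-checked with the _analyze API to confirm it emits only fixed-size shingles. A minimal sketch in Python; the index name pages and the local endpoint are assumptions, not part of the setup above:

import json
import urllib.parse
import urllib.request

# Hypothetical index name and local endpoint; adjust to match your cluster.
# The query-string form of _analyze matches the 1.x-era API implied by the
# multi_field mapping above.
params = urllib.parse.urlencode({
    "analyzer": "analyzer_shingle_3",
    "text": "the quick brown fox jumps over the lazy dog",
})
url = "http://localhost:9200/pages/_analyze?" + params

with urllib.request.urlopen(url) as resp:
    result = json.load(resp)

# Expect only three-word shingles: "the quick brown", "quick brown fox", ...
for token in result["tokens"]:
    print(token["token"])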

Query:

{
  "size" : 0,
  "aggs" : {
    "one-word-phrases" : {
      "terms" : {
        "field" : "body",
        "size"  : 200
      }
    },
    "two-word-phrases" : {
      "terms" : {
        "field" : "body.two-word-phrases",
        "size"  : 200
      }
    },
    "three-word-phrases" : {
      "terms" : {
        "field" : "body.three-word-phrases",
        "size"  : 200
      }
    },
    "four-word-phrases" : {
      "terms" : {
        "field" : "body.four-word-phrases",
        "size"  : 200
      }
    },
    "five-word-phrases" : {
      "terms" : {
        "field" : "body.five-word-phrases",
        "size"  : 200
      }
    },
    "six-word-phrases" : {
      "terms" : {
        "field" : "body.six-word-phrases",
        "size"  : 200
      }
    },
    "seven-word-phrases" : {
      "terms" : {
        "field" : "body.seven-word-phrases",
        "size"  : 200
      }
    },
    "eight-word-phrases" : {
      "terms" : {
        "field" : "body.eight-word-phrases",
        "size"  : 200
      }
    }
  }
}

  • I don't think so. Shrink the `size` or run individual aggregations. Or don't run this on your laptop but on something with more RAM. And even then maybe it won't be enough. – Andrei Stefan Sep 09 '16 at 23:35

1 Answer


Do you really need your entire collection in memory? Your analysis could be rewritten as a batch pipeline with a fraction of the resource requirements:

  1. Parse each crawled site and output its shingles to a series of flat files, one per phrase length (the classic n-grams-in-Python exercise; see the sketch below)
  2. Sort the shingle output files
  3. Parse the shingle output files and output shingle count files
  4. Parse all shingle count files and output a master aggregate shingle count file
  5. Sort by descending count

(This sort of thing is often done in a UNIX pipeline and parallelized.)
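
A minimal sketch of step 1 in Python, under assumptions that go beyond the original setup: page bodies are available as one plain-text file per page in a local pages/ directory, and a simple regex stands in for the standard tokenizer plus lowercase filter. Steps 2-5 then reduce to a sort/uniq pass over each output file:

import os
import re

PAGES_DIR = "pages"       # hypothetical input: one plain-text file per crawled page
OUT_DIR = "shingles"      # output: 1gram.txt ... 8gram.txt, one shingle per line
MAX_SHINGLE = 8           # one- to eight-word phrases, as in the question

TOKEN_RE = re.compile(r"\w+")

def tokens(text):
    """Rough stand-in for the standard tokenizer + lowercase filter."""
    return [t.lower() for t in TOKEN_RE.findall(text)]

def shingles(toks, n):
    """Yield n-word shingles, mirroring the shingle filters (no unigrams mixed in)."""
    for i in range(len(toks) - n + 1):
        yield " ".join(toks[i:i + n])

def main():
    os.makedirs(OUT_DIR, exist_ok=True)
    outputs = {
        n: open(os.path.join(OUT_DIR, f"{n}gram.txt"), "w", encoding="utf-8")
        for n in range(1, MAX_SHINGLE + 1)
    }
    try:
        # Stream one page at a time so memory use stays flat regardless of corpus size.
        for name in os.listdir(PAGES_DIR):
            with open(os.path.join(PAGES_DIR, name), encoding="utf-8") as f:
                toks = tokens(f.read())
            for n, out in outputs.items():
                for shingle in shingles(toks, n):
                    out.write(shingle + "\n")
    finally:
        for out in outputs.values():
            out.close()

    # Steps 2-5 then become a classic UNIX pipeline per output file, e.g.
    #   sort 3gram.txt | uniq -c | sort -rn | head -200
    # which sorts on disk, counts duplicates, and keeps the top 200 phrases.

if __name__ == "__main__":
    main()

Because counting is deferred to an on-disk sort rather than held in a hash in memory, the footprint stays small no matter how many distinct shingles the corpus produces.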


Or you could run it with more memory.

– Peter Dixon-Moses