
We need to denormalize 2 million records from our MySQL database to ElasticSearch. Our devops guy set up ElasticSearch on AWS. I wrote a Clojure app that grabbed the data out of MySQL, aggregated it into the format we wanted, and then did a put to ElasticSearch. I set this up on our EC2 instance, the devops guy set the AWS roles correctly, and then I started the app running. After 10 minutes I did this:

curl --verbose -d '{"query": { "match_all": {} }}' -H 'Content-Type: application/json' -X GET "https://search-samedayes01-ntt7r7b7sfhy3wu.us-east-1.es.amazonaws.com/facts-over-time/_search"

and I saw:

{"took":1,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":14952,"max_score":1.0,"hits": [...

Awesome! It's working! I looked at some of the documents and they looked good.

Another 15 minutes goes by and I run the same query as above. Sad to say, I get the same result:

{"took":1,"timed_out":false,"_shards":    {"total":5,"successful":5,"failed":0},"hits":    {"total":14952,"max_score":1.0,"hits": [...

I was like, what? Why would it accept 14,952 records and then stop?

My Clojure function is set to throw an error if there are any problems:

(defn push-item-to-persistence
  [item db]
  (let [denormalized-id (get-in item [:denormalized-id] :no-id)
        item (assoc item
                    :updated-at (temporal/current-time-as-datetime)
                    :permanent-holding-id-for-item-instances (java.util.UUID/randomUUID)
                    :instance-id-for-this-one-item (java.util.UUID/randomUUID)
                    :item-type :deduplication)]
    (if (= denormalized-id :no-id)
      ;; refuse to write anything that lacks a denormalized id
      (slingshot/throw+ {:type ::no-denormalized-id-in-push-item-into-database
                         :item item})
      (slingshot/try+
       ;; blocking put of a single document, keyed by the denormalized id
       (esd/put db "facts-over-time" "deduplicaton" (str denormalized-id) item)
       (println "We just put a document in ES.")
       (catch Object o
         ;; wrap and rethrow so the app dies on the first failure
         (slingshot/throw+ {:type ::push-item-to-persistence
                            :error o
                            :item item
                            :db db}))))))

If I look at the logs, there are no errors, and I keep seeing this line printed out:

We just put a document in ES.    

Now it's been over an hour, and it seems we are still stuck at 14,952 documents.

What might have gone wrong? And why don't I see an error?

I'm using Elastisch as the library to connect Clojure to AWS ES.
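
One way to get more signal than a fixed success string would be to print the response map that esd/put returns. A minimal sketch, reusing the db, denormalized-id, and item bindings from the function above:

(let [response (esd/put db "facts-over-time" "deduplicaton"
                        (str denormalized-id) item)]
  ;; :result and :created show whether the document was actually indexed;
  ;; a :_version climbing above 1 would mean repeated puts to the same _id
  (println "ES response:" response))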

Update

Okay, now at least I see these Exceptions. I'm not clear on where they are being caught. Everywhere in my code I rethrow Exceptions because I want the app to die on the first Exception. These are being caught somewhere, possibly in the Elastisch library that I'm using? Or maybe I accidentally catch and log them somewhere.

But that is a somewhat trivial question. More important:

The next question would be why I'm getting these Exceptions in the first place. Where do I adjust AWS ElasticSearch so that it accepts our writes at a reasonable speed?

Oct 04, 2017 6:53:44 PM org.apache.http.impl.client.DefaultHttpClient tryConnect
INFO: I/O exception (java.net.SocketException) caught when connecting to {s}->https://search-samedayes01-ntsdht7sfhy3wu.us-east-1.es.amazonaws.com:443: Broken pipe (Write failed)

Oct 04, 2017 7:09:06 PM org.apache.http.impl.client.DefaultHttpClient tryConnect
INFO: Retrying connect to {s}->https://search-samedayes01-ntsdht7sfhy3wu.us-east-1.es.amazonaws.com:443

Oct 04, 2017 6:54:13 PM org.apache.http.impl.client.DefaultHttpClient tryConnect
INFO: I/O exception (java.net.SocketException) caught when connecting to {s}->https://search-samedayes01-ntsdht7sfhy3wu.us-east-1.es.amazonaws.com:443: Broken pipe (Write failed)

Oct 04, 2017 7:09:09 PM org.apache.http.impl.client.DefaultHttpClient tryConnect
INFO: Retrying connect to {s}->https://search-samedayes01-ntsdht7sfhy3wu.us-east-1.es.amazonaws.com:443

Update 2

I started over again. About 920 documents were put to ElasticSearch successfully. And then I got:

:hostname "UnknownHost"
:type java.io.EOFException
:message "SSL peer shut down incorrectly"

What?

Also, the writes seem crazy slow: perhaps 10 operations per second. There must be something in AWS that I can adjust to make our ElasticSearch nodes accept more writes. I'd like to get at least 1,000 writes a second.
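
One index-level setting I've read about for bulk loads (not AWS-specific, and only a hedged sketch, assuming Elastisch's rest.index namespace and the same db connection as above) is turning off the refresh interval while loading and restoring it afterwards:

(require '[clojurewerkz.elastisch.rest.index :as esi])

;; stop refreshing the index on every write while the bulk load runs
(esi/update-settings db "facts-over-time" {:index {:refresh_interval "-1"}})

;; ... load the 2 million documents ...

;; restore the default once the load is done
(esi/update-settings db "facts-over-time" {:index {:refresh_interval "1s"}})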

Update 3

So now I got it to the point where this app mostly works, but it works in the oddest way I can imagine.

I was getting a "broken pipe" message, which led me here:

SSL peer shut down incorrectly in Java

Following that advice I did this:

(System/setProperty "https.protocols" "TLSv1.1")

Which seemed to have no effect.
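
A variant I've seen suggested for this error is listing every protocol version, so the handshake can negotiate whichever one the server wants (just a guess on my part that this applies here):

(System/setProperty "https.protocols" "TLSv1,TLSv1.1,TLSv1.2")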

But now my app does this:

  1. Moves at a glacial speed, making perhaps 1 write to ElasticSearch per second.
  2. Throws the "broken pipe" Exception.
  3. Takes off like a rocket and starts writing about 15,000 requests to ElasticSearch per minute.

I'm glad it's finally working, but I'm uncomfortable with the fact that I have no idea why it is working.

Also, 15,000 requests per minute is not actually that fast. At that rate, moving 2 million documents takes more than 2 hours, which is terrible. However, Amazon only supports the REST interface to ElasticSearch. I've read the native protocol would be about 8 times faster. That sounds like what we need.
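
In the meantime, batching over REST should close much of that gap, since one request can carry hundreds of documents. A minimal sketch, assuming Elastisch's bulk namespace and that items is the seq of documents to write; the batch size of 500 is a guess to tune against the cluster's request-size limit:

(require '[clojurewerkz.elastisch.rest.bulk :as esb])

;; one bulk request per 500 documents instead of one put per document;
;; bulk-index picks metadata such as :_id off each document map
(doseq [batch (partition-all 500 items)]
  (esb/bulk-with-index-and-type db "facts-over-time" "deduplicaton"
                                (esb/bulk-index batch)))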

  • Instead of printing "We just put a document in ES.", can you print the response from your put action? Generally if elasticsearch stops indexing it's because the cluster is out of space or you are writing documents faster than the cluster can handle. – zachdb86 Oct 04 '17 at 18:26
  • Sure, good thought. I'll print out the return value of the put. Regarding the two failure modes that you mention, wouldn't they trigger Exceptions? The 'put' is a blocking operation with something like a 2 minute timeout, with an Exception at the end of the timeout. If ElasticSearch wasn't accepting writes (I mean puts) for any reason, I'd expect to see an Exception. If there is no Exception, how else would I find the problem? – LRK9 Oct 04 '17 at 18:43
  • I would think they would trigger an exception. Printing the response will show us if exception handling is working the way it should be. – zachdb86 Oct 04 '17 at 18:56
  • I restarted, output the result of 'put'. At least so far, I don't see any errors, just success: {:_index facts-over-time, :_type deduplicaton, :_id company-164253, :_version 1, :result created, :_shards {:total 2, :successful 1, :failed 0}, :created true} – LRK9 Oct 04 '17 at 18:56
  • I updated the post with additional information. – LRK9 Oct 04 '17 at 19:16
  • How big are your requests? Depending on the instance size AWS has a max request size of either 10mb or 100mb. – zachdb86 Oct 04 '17 at 19:57
  • The documents are generally 1 MB, sometimes 2 MB. I don't think I'm hitting that limit. If I did, I believe AWS is supposed to send back a 413 error response, which would trigger an Exception from the Apache Commons HTTP library. – LRK9 Oct 04 '17 at 20:05
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/155975/discussion-between-zachdb86-and-lrk9). – zachdb86 Oct 04 '17 at 21:47
