
There is a scheduled task (every ten minutes) that synchronizes changed data from Oracle to Elasticsearch, but occasionally it fails with a SocketTimeoutException:

java.net.SocketTimeoutException: 30,000 milliseconds timeout on connection http-outgoing-113 [ACTIVE]
    at org.springframework.dao.support.PersistenceExceptionTranslationInterceptor.invoke(PersistenceExceptionTranslationInterceptor.java:139)
Caused by: java.net.SocketTimeoutException: 30,000 milliseconds timeout on connection http-outgoing-113 [ACTIVE]
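
The 30,000 ms in the exception matches the default socket timeout of the low-level REST client. A workaround I'm considering is raising it when building the client; a minimal sketch, assuming a single local node (host, port and timeout values are placeholders, not my real config):

import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;

// Sketch only: raise the socket timeout above the 30s default so a slow bulk
// request does not abort the whole sync run. Values are placeholders.
RestHighLevelClient restHighLevelClient = new RestHighLevelClient(
        RestClient.builder(new HttpHost("localhost", 9200, "http"))
                .setRequestConfigCallback(requestConfigBuilder -> requestConfigBuilder
                        .setConnectTimeout(5_000)       // connect timeout in ms
                        .setSocketTimeout(120_000)));   // socket (read) timeout in ms, default 30_000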

The indexing slow log shows that some documents took a long time to index:

[2020-03-07T13:30:48,768][WARN ][i.i.s.index              ] [node-1] [person_company_post/7YNrrUTmRw-oEBTXIm9NWg] took[19.8s], took_millis[19857], type[_doc], id[30131539074-201942321], routing[], source[{...}]
[2020-03-07T14:42:37,713][WARN ][i.i.s.index              ] [node-1] [company/KTngnM6ASD-_KdU0FFAWRA] took[59.9s], took_millis[59942], type[_doc], id[20019054258], routing[], source[{...}]
[2020-03-07T15:12:19,197][WARN ][i.i.s.index              ] [node-1] [person_company_post/7YNrrUTmRw-oEBTXIm9NWg] took[44.4s], took_millis[44435], type[_doc], id[30134190855-20023574217], routing[], source[{"createtime":1562918929747,"post":"shareholder","personId":30134190855,"comId":20023574217}]

But these are bulk updates (100 docs per batch), so why is only a single doc in a batch slow?

BulkResponse bulkResponse = restHighLevelClient.bulk(bulkRequest, RequestOptions.DEFAULT);
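
For what it's worth, the per-item results can be inspected right after that call; a sketch of the kind of logging I could add around it (the bulkRequest construction with the 100 IndexRequests is elided):

import org.elasticsearch.action.bulk.BulkItemResponse;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.RequestOptions;

// Sketch: log the overall bulk time and any per-item failures
// (e.g. rejected executions or version conflicts) to narrow the problem down.
BulkResponse bulkResponse = restHighLevelClient.bulk(bulkRequest, RequestOptions.DEFAULT);
System.out.println("bulk of " + bulkRequest.numberOfActions()
        + " docs took " + bulkResponse.getTook().getMillis() + " ms");
if (bulkResponse.hasFailures()) {
    for (BulkItemResponse item : bulkResponse.getItems()) {
        if (item.isFailed()) {
            System.out.println(item.getIndex() + "/" + item.getId()
                    + " failed: " + item.getFailureMessage());
        }
    }
}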

And what could cause indexing a single small doc to take so long?

PS.

$ curl 'http://localhost:9200/_cat/indices?v'
health status index               uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   company             KTngnM6ASD-_KdU0FFAWRA   1   0   75070188     17704899     18.9gb         18.9gb
green  open   person_company_post 7YNrrUTmRw-oEBTXIm9NWg   2   0  148998126      9928098     14.3gb         14.3gb
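
The company index also has roughly 17.7 million deleted docs. One possible cleanup step before the next sync run would be a force merge with only_expunge_deletes; a rough sketch I have not run yet, using the same high-level client (to be done during low traffic):

import org.elasticsearch.action.admin.indices.forcemerge.ForceMergeRequest;
import org.elasticsearch.client.RequestOptions;

// Sketch (untested here): expunge deleted docs from the indices listed in
// _cat/indices above so segment merges during the sync have less work to do.
ForceMergeRequest forceMergeRequest = new ForceMergeRequest("company", "person_company_post");
forceMergeRequest.onlyExpungeDeletes(true);
restHighLevelClient.indices().forcemerge(forceMergeRequest, RequestOptions.DEFAULT);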

GC log around 15:12:19:

[2020-03-05T15:12:14.801+0000][22539][safepoint      ] Total time for which application threads were stopped: 0.0003225 seconds, Stopping threads took: 0.0000807 seconds
[2020-03-05T15:12:15.802+0000][22539][safepoint      ] Application time: 1.0001466 seconds
[2020-03-05T15:12:15.802+0000][22539][safepoint      ] Entering safepoint region: Cleanup
[2020-03-05T15:12:15.802+0000][22539][safepoint      ] Leaving safepoint region
[2020-03-05T15:12:15.802+0000][22539][safepoint      ] Total time for which application threads were stopped: 0.0006109 seconds, Stopping threads took: 0.0002480 seconds
[2020-03-05T15:12:15.880+0000][22539][safepoint      ] Application time: 0.0776929 seconds
[2020-03-05T15:12:15.880+0000][22539][safepoint      ] Entering safepoint region: GenCollectForAllocation
[2020-03-05T15:12:15.881+0000][22539][gc,start       ] GC(345928) Pause Young (Allocation Failure)
[2020-03-05T15:12:15.881+0000][22539][gc,task        ] GC(345928) Using 8 workers of 8 for evacuation
[2020-03-05T15:12:15.928+0000][22539][gc,age         ] GC(345928) Desired survivor size 34865152 bytes, new threshold 3 (max threshold 6)
[2020-03-05T15:12:15.928+0000][22539][gc,age         ] GC(345928) Age table with threshold 3 (max threshold 6)
[2020-03-05T15:12:15.928+0000][22539][gc,age         ] GC(345928) - age   1:   24454792 bytes,   24454792 total
[2020-03-05T15:12:15.928+0000][22539][gc,age         ] GC(345928) - age   2:    3261472 bytes,   27716264 total
[2020-03-05T15:12:15.928+0000][22539][gc,age         ] GC(345928) - age   3:    8625680 bytes,   36341944 total
[2020-03-05T15:12:15.928+0000][22539][gc,age         ] GC(345928) - age   4:     394608 bytes,   36736552 total
[2020-03-05T15:12:15.928+0000][22539][gc,age         ] GC(345928) - age   5:    2261016 bytes,   38997568 total
[2020-03-05T15:12:15.928+0000][22539][gc,heap        ] GC(345928) ParNew: 579714K->44446K(613440K)
[2020-03-05T15:12:15.928+0000][22539][gc,heap        ] GC(345928) CMS: 7812027K->7812027K(16095680K)
[2020-03-05T15:12:15.928+0000][22539][gc,metaspace   ] GC(345928) Metaspace: 103567K->103567K(1144832K)
[2020-03-05T15:12:15.928+0000][22539][gc             ] GC(345928) Pause Young (Allocation Failure) 8195M->7672M(16317M) 47.638ms
[2020-03-05T15:12:15.928+0000][22539][gc,cpu         ] GC(345928) User=0.29s Sys=0.00s Real=0.04s
[2020-03-05T15:12:15.928+0000][22539][safepoint      ] Leaving safepoint region
[2020-03-05T15:12:15.928+0000][22539][safepoint      ] Total time for which application threads were stopped: 0.0482442 seconds, Stopping threads took: 0.0001708 seconds
[2020-03-05T15:12:16.928+0000][22539][safepoint      ] Application time: 1.0001231 seconds
[2020-03-05T15:12:16.929+0000][22539][safepoint      ] Entering safepoint region: Cleanup
[2020-03-05T15:12:16.929+0000][22539][safepoint      ] Leaving safepoint region
[2020-03-05T15:12:16.929+0000][22539][safepoint      ] Total time for which application threads were stopped: 0.0004899 seconds, Stopping threads took: 0.0002200 seconds
[2020-03-05T15:12:17.929+0000][22539][safepoint      ] Application time: 1.0001154 seconds
[2020-03-05T15:12:17.929+0000][22539][safepoint      ] Entering safepoint region: Cleanup
[2020-03-05T15:12:17.929+0000][22539][safepoint      ] Leaving safepoint region
[2020-03-05T15:12:17.930+0000][22539][safepoint      ] Total time for which application threads were stopped: 0.0003597 seconds, Stopping threads took: 0.0000686 seconds
[2020-03-05T15:12:18.930+0000][22539][safepoint      ] Application time: 1.0001200 seconds
[2020-03-05T15:12:18.930+0000][22539][safepoint      ] Entering safepoint region: Cleanup
[2020-03-05T15:12:18.930+0000][22539][safepoint      ] Leaving safepoint region
[2020-03-05T15:12:18.930+0000][22539][safepoint      ] Total time for which application threads were stopped: 0.0003355 seconds, Stopping threads took: 0.0001001 seconds
[2020-03-05T15:12:19.930+0000][22539][safepoint      ] Application time: 1.0001240 seconds
[2020-03-05T15:12:19.930+0000][22539][safepoint      ] Entering safepoint region: Cleanup
[2020-03-05T15:12:19.931+0000][22539][safepoint      ] Leaving safepoint region
[2020-03-05T15:12:19.931+0000][22539][safepoint      ] Total time for which application threads were stopped: 0.0003592 seconds, Stopping threads took: 0.0000931 seconds
[2020-03-05T15:12:20.931+0000][22539][safepoint      ] Application time: 1.0001112 seconds

CPU usage 13:00 ~ 15:30 (screenshot)

  • Did you check the GC logs during these warnings? And what is your Elasticsearch node configuration, load, number of requests per batch, number of replicas, and size of a single doc? All this information is needed to troubleshoot the issue – Amit Mar 07 '20 at 11:00
  • What general reasons could cause indexing a single doc to take so long? – zhuguowei Mar 07 '20 at 13:52
  • Does the same thing happen if you attempt to reindex the same document again? i.e. is the timing issue consistent? – MatsLindh Mar 07 '20 at 13:53
  • No, it only occurs during large data synchronization – zhuguowei Mar 07 '20 at 13:59
  • @zhuguowei, whoa, you have a huge number of docs in just 1 or 2 primary shards. The GC logs look fine (no long pauses or old-generation collections), but there is a huge spike in CPU, and you also have a huge number of deleted docs in your index – Amit Mar 07 '20 at 14:04
  • @zhuguowei, can you follow https://stackoverflow.com/questions/60204556/optimize-api-for-reducing-the-segments-and-eliminating-es-deleted-docs-not-worki to optimize your index and then run your sync job again? Also, how many such long requests do you see during a large data sync? – Amit Mar 07 '20 at 14:05
  • @zhuguowei, any luck, let me know if you have further questions – Amit Mar 09 '20 at 09:30
