
We're running a production system on Crystal/Kemal. The calling service quite often sees a Connection refused error. How can I get more insight into, or metrics from, a running instance of HTTP::Server/Kemal? I mean things like the number of fibers running/waiting (out of the maximum number allowed), how large the backlog of connections is, how many connections have been refused, and so on.

linkyndy

1 Answer


Built-in tools (run crystal tool -h for the full list):

    context                  show context for given location
    expand                   show macro expansion for given location
    format                   format project, directories and/or files
    hierarchy                show type hierarchy
    implementations          show implementations for given call in location
    types                    show type of main variables

Common tools:

  1. lsof -p $(pidof <process_name>): lists the open files, connections and sockets of the process.
  2. ss -ier: displays internal socket statistics.
  3. strace -p $(pidof <process_name>) -s 300 -yyfq: traces the system calls of a running process, useful for process introspection.
  4. tcpdump & wireshark: dump and explore network packets.
  5. ngrep: like grep, but for network packets.
  6. LLDB: native debugger for LLVM-based apps (tutorial).
  7. CodeLLDB: native VSCode debugger based on LLDB.

And don't forget to build with debug information: crystal build ./app.cr --debug
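
For the backlog and refused-connection part of the question, the socket tools above can be combined into a rough check. This is only a sketch, assuming Linux with iproute2 installed; `backlog_report` is a hypothetical helper name, and port 3000 (Kemal's default) is an assumption about the deployment. According to the ss man page, for listening sockets Recv-Q holds the current backlog and Send-Q its maximum.

```shell
# backlog_report [PORT]
# Hypothetical helper: reads `ss -ltn` output on stdin and prints, for each
# listening TCP socket (optionally filtered by port), the current queue
# length (Recv-Q) and the configured limit (Send-Q).
backlog_report() {
  awk -v port="${1:-}" '
    $1 == "LISTEN" {
      # Local Address:Port is the 4th column; the port is the last
      # colon-separated part (this also handles [::]:3000 and *:3000).
      n = split($4, parts, ":")
      if (port == "" || parts[n] == port)
        printf "%s queued=%s limit=%s\n", $4, $2, $3
    }'
}

# Usage (port 3000 is an assumption):
#   ss -ltn | backlog_report 3000
#
# Kernel-wide counters for accept-queue overflows and dropped connection
# attempts on listening sockets can be read with nstat:
#   nstat -az TcpExtListenOverflows TcpExtListenDrops
```

If `queued` stays close to `limit` while the caller sees errors, the server is probably not accepting connections fast enough. If the counters stay at zero, the refusals more likely have another cause, such as the process not listening at that moment.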

Sergey Fedorov
  • Thank you! These are mostly "generic" tools, which are really valuable! But I was looking for something more specific, similar to the metrics Puma provides for Ruby, for example. – linkyndy Aug 06 '20 at 07:54
  • Could you provide a list of the metrics you need? I may be wrong, but I think you just don't know how and where to look for the problem and are hoping to spot some anomaly. Can you provide more details about the steps that led to the connection being refused? If you can show the source, that would be perfect. – Sergey Fedorov Aug 06 '20 at 11:11
  • The only thing I'd add is profiling (first section here: https://crystal-lang.org/reference/guides/performance.html) in case it's CPU bound. You can also see the backlog for a particular port, for example: https://www.quora.com/How-can-I-check-TCP-backlog-queue-for-a-specific-process-on-Linux – rogerdpack Aug 06 '20 at 14:24
  • I am referring to something similar to https://github.com/harmjanblok/puma-metrics. Regarding the exact metrics, I've mentioned them in my original post: the number of fibers running/waiting (out of the maximum number allowed), how large is the backlog of connections, how many have been refused. Regarding the connection refused, we have a service that tries to connect to this service and the connection is refused...the source is just an HTTP call. – linkyndy Aug 10 '20 at 09:19
  • For more than 4 years I have run about 2-5 services (HTTP + TCP & MessagePack) under load (70-150 rps), plus some others for data processing, scraping and communication between hosts. As far as I remember, I have never used general metrics, because my load profile is different. First I tried to solve problems with stuck sockets and CLOSE_WAIT (strace, lsof/ss), then with null bytes that randomly appeared in the data (wireshark), and after that I tried to increase overall efficiency to reduce cost (main logic rewritten with fibers). I also have a Rails app nearby with common metrics and popular gems. – Sergey Fedorov Aug 10 '20 at 19:12
  • From my point of view, to debug a Crystal app you almost always need to know the state of the environment, whereas for Ruby that is not necessary as long as CPU, memory and disk space are sufficient. Maybe that's why I can't recommend any shards for collecting metrics. What I mean is that Ruby is a thing in itself, while Crystal is a native element of the system and can be examined with system tools. In any case, if you have time, could you tell us later how you solved the problem? – Sergey Fedorov Aug 10 '20 at 19:12
  • TIL about `ngrep`: it's like grep, but for network packets: https://github.com/jpr5/ngrep/blob/master/EXAMPLES.md – Sergey Fedorov Aug 23 '20 at 02:30