I'm looking for a way to read a fast-growing logfile on a remote Unix host.
The logfile occasionally goes through a logswitch (i.e. it starts from 0 bytes again). The reason I can't process the logfile directly on the remote host is that the processor puts too much load on the host, which must not happen. So I need to have the processing and the reading on two different hosts.

Since I'm not at home in the Java world, I'd like to ask for advice on how this can best be achieved.

My thoughts so far:
Have the local logfile processor (localhost) scp a logfilereader (a Java binary) to the remote host and start it (via an SSH connection opened by the local logfile processor). The logfilereader then starts reading/tailing the logfile and serves it as a TCP stream (which can then be read by the local logfile processor).
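To make it concrete, here is a rough sketch of what I have in mind for the logfilereader (single client, hypothetical port, naive polling; a logswitch would show up as the file shrinking):

    import java.io.File;
    import java.io.PrintWriter;
    import java.io.RandomAccessFile;
    import java.net.ServerSocket;
    import java.net.Socket;

    // Sketch only: tails one logfile and streams new lines to a single
    // TCP client (the local logfile processor). Port 9999 is arbitrary.
    public class LogFileReader {
        public static void main(String[] args) throws Exception {
            File log = new File(args[0]);
            try (ServerSocket server = new ServerSocket(9999);
                 Socket client = server.accept();
                 PrintWriter out = new PrintWriter(client.getOutputStream(), true);
                 RandomAccessFile in = new RandomAccessFile(log, "r")) {
                while (true) {
                    // logswitch: the file shrank, so start over from the top
                    if (log.length() < in.getFilePointer()) {
                        in.seek(0);
                    }
                    String line;
                    while ((line = in.readLine()) != null) {
                        out.println(line);
                    }
                    Thread.sleep(200); // poll for new content
                }
            }
        }
    }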

I'm pretty sure there are more elegant Java-style approaches. Thanks for any hints.

pitseeker
  • Why not send the log via syslog (better rsyslog / syslog-ng over TCP to avoid packet drop)? – Ken Cheung Oct 16 '12 at 09:33
  • The application that writes the logfile is subject to a performance test. It will be started and stopped very often. I doubt that configuring syslog for it will be a comfortable approach. – pitseeker Oct 16 '12 at 09:35
  • If reading and writing are done by two different processes, you are going to hit a concurrency problem that is hard to deal with. – gigadot Oct 16 '12 at 09:36
  • Log4j + Syslog appender: on your host, configure syslog to send the log to another machine which is powerful enough for your log analysis. It saves disk I/O (writing the log to file) and CPU (scp, i.e. encryption, which needs CPU as well) at the cost of network bandwidth (well... if your log can use up 1Gbps of bandwidth, writing it to disk was impossible anyway). – Ken Cheung Oct 16 '12 at 09:37
  • @gigadot: Currently my processor is able to read from a local file and process it. Reading and processing are separated internally (in several threads). So I don't see a problem there. – pitseeker Oct 16 '12 at 09:38
  • @Ken Cheung: the process that writes the logfile (several actually) can be started by several people (actually even several times) and the files are at different locations. I really doubt that syslog is feasible for that (though I'm not familiar with it). – pitseeker Oct 16 '12 at 09:39
  • @pitseeker So we are talking about one process which contains several threads here? I initially thought that you have a separate process that does the outputting and another process which reads the file. – gigadot Oct 16 '12 at 09:41
  • Your remote host is generating a large number of log lines, currently written to a log file. Your remote host cannot process the log file itself, so it needs to be transferred to your 'local host' for processing. You use 'scp' to connect from your 'local host' to your 'remote host' to access the log files, so you apparently don't mind using up CPU on your remote host to encrypt the packets. What I'm suggesting here is to set up syslog to stream the logs from your 'remote host' to your 'local host' and process them directly, without using up any disk I/O or CPU on your 'remote host'. – Ken Cheung Oct 16 '12 at 09:45
  • @gigadot: yes - I simplified the question actually. The processor currently has one thread reading a local file that puts the content onto a BlockingQueue read by n threads which process the content. I somehow need to read a remote file now. – pitseeker Oct 16 '12 at 09:45
  • @pitseeker Can you be more specific with your question? How many processes are we talking about, several or a single one? Is it only one logfile that gets written by several processes? Do you have access to modify all the code? Your simplification made the question very unclear. The solution will depend on the details of the problem. – gigadot Oct 16 '12 at 09:46
  • I have a system processing 8000 log lines per second, digesting the log content and submitting the aggregated result into an MQ (JBoss / HornetQ) consumed by MDBs for further analysis. This is what I have already done in the way stated above; just sharing. That's all. :) – Ken Cheung Oct 16 '12 at 09:48
  • @Ken Cheung: ok, I'll have a look at syslog. Thanks for your hint. – pitseeker Oct 16 '12 at 09:48
  • @gigadot: I cannot modify the code that writes the logfile. But I need to process the logfile. For that I wrote a program that reads and processes it. It consists of one thread that does the reading (currently from a local file) and several threads that process the individual lines. – pitseeker Oct 16 '12 at 09:50
  • @pitseeker can the same logfile get modified while your program is trying to process it? – gigadot Oct 16 '12 at 09:53
  • @Ken Cheung: don't I need admin rights to run syslogd and change the syslog-config? I need a solution that works for normal privileged users. – pitseeker Oct 16 '12 at 09:53
  • @gigadot: yes - the logfile is written very fast and while it's being written I want to read it. – pitseeker Oct 16 '12 at 09:58
  • @pitseeker Back to my first comment: you are having a concurrency problem due to two different processes that try to access/modify the same data (the log file). There is no way to solve it if you cannot force both processes to take turns when accessing and writing the file. – gigadot Oct 16 '12 at 10:04

2 Answers

If you can run ssh on your remote host, then you could use

ssh <remote host> "tail -f <remote log file name>" > <local log file name>

This will redirect anything written to the remote log file to the local file. If the remote file gets erased, you get a message saying that the remote file was truncated.

  • That's a simple solution and it's not bad. However, I won't get the "truncation message" since the file will not be deleted. It will just get empty and start again. "tail -f" will then run forever and not notice the problem. – pitseeker Oct 16 '12 at 09:56
  • I got this message when I reset the file to empty, without deleting the file. If the file gets erased, the logging stops. – Vincent Nivoliers Oct 16 '12 at 10:33

If you need to read the log file online (i.e. as the messages come in), I suggest to examine ways to offer the messages via TCP instead (or in addition) to writing them into a file.

If the remote app uses a logging framework, then this is usually just a few lines in the configuration.
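For example, with Log4j 1.x it can even be a programmatic one-liner in the app's startup code (sketch only; the host name is a placeholder, and 4560 is the default port of Log4j's SimpleSocketServer):

    import org.apache.log4j.Logger;
    import org.apache.log4j.net.SocketAppender;

    public class RemoteLogging {
        public static void install() {
            // Placeholder host; pair this with a SimpleSocketServer
            // (or your own listener) on the processing machine.
            Logger.getRootLogger().addAppender(new SocketAppender("processing-host", 4560));
        }
    }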

This will also reduce load on the remote host since it doesn't have to write any data to disk anymore. But that's usually only a problem when the remote process accesses the disk a lot to do its work. If the remote process talks a lot with a database, this can be counterproductive since the log messages will compete with the DB queries for network resources.

On the positive side, this makes it easier to be sure you process each log message at most once (you might lose some if your local listener is restarted).

If that's not possible, run tail -f <logfile> via ssh (as Vincent suggested in the other answer). See this question for SSH libraries for Java if you don't want to use ProcessBuilder.
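If ProcessBuilder is enough for you, the whole thing is just a few lines (a sketch; host and file name are placeholders, and tail -F, where your tail supports it, re-opens the file after a logswitch):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    public class RemoteTail {
        public static void main(String[] args) throws Exception {
            // Run tail on the remote host and read its output locally.
            Process p = new ProcessBuilder("ssh", "remotehost", "tail -F /path/to/logfile")
                    .redirectErrorStream(true)
                    .start();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(p.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    // e.g. hand the line to the BlockingQueue read by your worker threads
                    System.out.println(line);
                }
            }
        }
    }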

When you read the files, the hard task is to make sure that you process each log message exactly once (i.e. you don't miss any and you don't process any twice). Depending on how the log rotation works and how your remote process creates log files, you might lose a couple of messages when they are switched.

If you don't need online processing (i.e. seeing yesterday's messages is enough), try rsync to copy the remote folder. rsync is very good at avoiding duplicate transfers and it works over ssh. That will give you a local copy of all log files which you can process. Of course, rsync is too expensive to handle the active log file, so that's the one file you can't examine this way (hence the limitation that this only works if you don't need online processing).
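The basic invocation could look like this (untested; paths are placeholders):

    rsync -az <remote host>:<remote log dir>/ <local log dir>/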

One final tip: Try to avoid transmitting useless log messages. It's often possible to cut the load by a large factor by filtering the log with a very simple script before you transfer it.
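For example, if only the error lines matter, something along these lines (building on the command from the other answer) keeps most of the noise off the network:

    ssh <remote host> "tail -f <remote log file name> | grep ERROR" > <local log file name>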

Aaron Digulla