1

Assume a sample log file like below that's written to asynchronously by several threads from a Java program.

Using a tool like grep/sed/awk/perl -- is there an easy way to parse out all info related to only, say, [ServerThread-4] lines, and any of the XML data that may come after related to that thread.

But I don't want any of the [ServerThread-10] (or other thread) lines, or any the XML data that may come after those lines for those unrelated threads.

I tried using sed and awk like this, but these aren't working, presumably since my start and end markers are the same thing:

sed -n "/\[ServerThread-4\]/,/\[ServerThread-4\]/p" file.log > file-sed.log
awk "/\[ServerThread-4\]/{flag=1;next}/\[ServerThread-4\]/{flag=0}flag" file.log > file-awk.log

Sample File:

2020-09-22 18:06:24,333 [ServerThread-10] DEBUG com.company.abc.Time - CALCULATE_DATA,ServerThread-10,Request,7,linecount=2
2020-09-22 18:06:24,334 [ServerThread-10] DEBUG com.company.abc.webservice.wsc - Web service transfer time out set to: 130000 (ms)
2020-09-22 18:06:24,563 [ServerThread-4] DEBUG com.company.abc.function - XYZ Function Handler: class com.company.abc.function.XYZCalcDocHandler
2020-09-22 18:06:24,563 [ServerThread-4] DEBUG com.company.abc.function - num row 1
2020-09-22 18:06:24,563 [ServerThread-4] DEBUG com.company.abc.function - TAX_PER_ITEM
2020-09-22 18:06:24,564 [ServerThread-4] DEBUG com.company.abc.function - Request XYZ XML
<SOME_XML><ITEM>THREAD-4-DATA</ITEM></SOME_XML>
2020-09-22 18:06:24,564 [ServerThread-4] DEBUG com.company.abc.function - Using transform: quote.xsl
2020-09-22 18:06:24,569 [ServerThread-4] DEBUG com.company.abc.function - transformXml() = 5 (ms)
2020-09-22 18:06:24,569 [ServerThread-4] DEBUG com.company.abc.function - Request XML
<?xml version="1.0" encoding="UTF-8"?>
<DataEnvelope xmlns="urn:inc:blah:tps:7:0">
<OtherXml>
   <Element>Thread-4-Data</Element>
</OtherXml>
</DataEnvelope>
2020-09-22 18:06:24,569 [ServerThread-4] DEBUG com.company.abc.Time - CALCULATE_DATA,ServerThread-4,Request,6,linecount=1
2020-09-22 18:06:24,569 [ServerThread-4] DEBUG com.company.abc.webservice.wsc - Web service transfer time out set to: 130000 (ms)
2020-09-22 18:06:24,669 [ServerThread-10] DEBUG com.company.abc.Time - CALCULATE_DATA,ServerThread-10,Send,335,linecount=2
2020-09-22 18:06:24,669 [ServerThread-10] INFO  com.company.abc.function - Process response
2020-09-22 18:06:24,670 [ServerThread-10] DEBUG com.company.abc.function - Response XML
<DataEnvelope>
    <Login>Thread-10-User</Login>
</DataEnvelope>
2020-09-22 18:06:24,670 [ServerThread-10] DEBUG com.company.abc.function - Processing response line items
2020-09-22 18:06:24,670 [ServerThread-10] DEBUG com.company.abc.Time - CALCULATE_DATA,ServerThread-10,Response,1,linecount=2
2020-09-22 18:06:24,671 [ServerThread-10] DEBUG com.company.abc.function - Response XYZ XML
<CALCULATE_DATA><CLIENT>100</CLIENT><COMPANY>1000</COMPANY></CALCULATE_DATA>
2020-09-22 18:06:24,671 [ServerThread-10] DEBUG com.company.abc.Time - CALCULATE_DATA,ServerThread-10,Total,345,linecount=2
2020-09-22 18:06:24,923 [ServerThread-4] DEBUG com.company.abc.Time - CALCULATE_DATA,ServerThread-4,Send,354,linecount=1
2020-09-22 18:06:24,923 [ServerThread-4] INFO  com.company.abc.function - Process response
2020-09-22 18:06:24,923 [ServerThread-4] DEBUG com.company.abc.function - Normalizing CDATA elements
2020-09-22 18:06:24,923 [ServerThread-4] DEBUG com.company.abc.function - Response XML
<DataEnvelope>
    <Login>Thread-4-User</Login>
</DataEnvelope>
2020-09-22 18:06:24,923 [ServerThread-4] DEBUG com.company.abc.function - Processing response line items
2020-09-22 18:06:24,924 [ServerThread-4] DEBUG com.company.abc.Time - CALCULATE_DATA,ServerThread-4,Response,1,linecount=1
2020-09-22 18:06:24,924 [ServerThread-4] DEBUG com.company.abc.function - Response XYZ XML
<CALCULATE_DATA><CLIENT>200</CLIENT><COMPANY>2000</COMPANY></CALCULATE_DATA>
2020-09-22 18:06:24,924 [ServerThread-4] DEBUG com.company.abc.Time - CALCULATE_DATA,ServerThread-4,Total,361,linecount=1
rziegler72
  • 23
  • 4
  • 1
    Special thanks for showing your efforts in your question, keep it up. But your question is not fully clear, could you please do add sample of output which should make question more clearer thank you. – RavinderSingh13 Oct 29 '20 at 00:45

2 Answers2

4

What you request is only possible if we assume the data for thread 4 is not intermingled with the data of another thread. There's no reason to believe that assumption will always hold, but there's nothing we can do about that.

my $target_id = 4;

my $print = 0;
while (<>) {
   if ( my ($id) = /^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},d{3} \[ServerThread-(\d+)\]/a) {
      $print = $id == $target_id;
   }

   print if $print;
}

As a "one-liner":

perl -ne'
   $print = $1 == 4 if /^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},d{3} \[ServerThread-(\d+)\]/a;
   print if $print;
'

As a shortened "one-liner":

perl -ne'$p = $1 == 4 if /^.{24}\[ServerThread-(\d+)\]/; print if $p'

If we allow the number to be customizable:

perl -sne'$p = $1 == $targ if /^.{24}\[ServerThread-(\d+)\]/; print if $p' -- -targ=4

The line breaks can be removed or left in.

Specifying file to process to Perl one-liner

ikegami
  • 367,544
  • 15
  • 269
  • 518
4

There are things that I'm unsure of in the question, including the form of the desired output.

If you want all lines starting with and after the one with [ServerThread-4] in it, up to any line with [ServerThread-N] where N is not a 4, then you can use Perl's range operator

perl -wne'print if /\[ServerThread-4\]/ .. /\[ServerThread-[^4]+\]/ 
                and not /\[ServerThread-( [^4][0-9]* | 4[0-9]+ )\]/x' file 

This produces the output that I assume is wanted, shown below, but see the comment .

This approach makes some assumptions so please check how it fares with what's needed. Above all, it assumes that sections for each ServerThread-N are fully segregated, separate from each other (not intermingling).

The statement that the file is "written to asynchronously by several threads" is a bit scary though; do those threads assemble their output and then write those blocks using some thread-based "locks" (or lock for each print), or send those blocks of output to the main thread which writes them in good order, or send individual lines marked somehow ...?

If threads simply dump lines to disk as they come then one cannot anymore reliably separate output sections from various threads. Even individual lines can overlap and get mangled since different threads of execution are blindly going to the same resource.

The output of the above one-liner, on the provided file

2020-09-22 18:06:24,563 [ServerThread-4] DEBUG com.company.abc.function - XYZ Function Handler: class com.company.abc.function.XYZCalcDocHandler
2020-09-22 18:06:24,563 [ServerThread-4] DEBUG com.company.abc.function - num row 1
2020-09-22 18:06:24,563 [ServerThread-4] DEBUG com.company.abc.function - TAX_PER_ITEM
2020-09-22 18:06:24,564 [ServerThread-4] DEBUG com.company.abc.function - Request XYZ XML
<SOME_XML><ITEM>THREAD-4-DATA</ITEM></SOME_XML>
2020-09-22 18:06:24,564 [ServerThread-4] DEBUG com.company.abc.function - Using transform: quote.xsl
2020-09-22 18:06:24,569 [ServerThread-4] DEBUG com.company.abc.function - transformXml() = 5 (ms)
2020-09-22 18:06:24,569 [ServerThread-4] DEBUG com.company.abc.function - Request XML
<?xml version="1.0" encoding="UTF-8"?>
<DataEnvelope xmlns="urn:inc:blah:tps:7:0">
<OtherXml>
   <Element>Thread-4-Data</Element>
</OtherXml>
</DataEnvelope>
2020-09-22 18:06:24,569 [ServerThread-4] DEBUG com.company.abc.Time - CALCULATE_DATA,ServerThread-4,Request,6,linecount=1
2020-09-22 18:06:24,569 [ServerThread-4] DEBUG com.company.abc.webservice.wsc - Web service transfer time out set to: 130000 (ms)
2020-09-22 18:06:24,923 [ServerThread-4] DEBUG com.company.abc.Time - CALCULATE_DATA,ServerThread-4,Send,354,linecount=1
2020-09-22 18:06:24,923 [ServerThread-4] INFO  com.company.abc.function - Process response
2020-09-22 18:06:24,923 [ServerThread-4] DEBUG com.company.abc.function - Normalizing CDATA elements
2020-09-22 18:06:24,923 [ServerThread-4] DEBUG com.company.abc.function - Response XML
<DataEnvelope>
    <Login>Thread-4-User</Login>
</DataEnvelope>
2020-09-22 18:06:24,923 [ServerThread-4] DEBUG com.company.abc.function - Processing response line items
2020-09-22 18:06:24,924 [ServerThread-4] DEBUG com.company.abc.Time - CALCULATE_DATA,ServerThread-4,Response,1,linecount=1
2020-09-22 18:06:24,924 [ServerThread-4] DEBUG com.company.abc.function - Response XYZ XML
<CALCULATE_DATA><CLIENT>200</CLIENT><COMPANY>2000</COMPANY></CALCULATE_DATA>
2020-09-22 18:06:24,924 [ServerThread-4] DEBUG com.company.abc.Time - CALCULATE_DATA,ServerThread-4,Total,361,linecount=1

(I've added lines with [ServerThread-14] and ...-40 to the test file)


Note the two extra conditions (in the alternation in the regex), other than the ones establishing the range. The first one, explicitly excluding the line with the very phrase that turns off the range operator, is there because

the range operator stays true until the right operand is true, AFTER which the range operator becomes false

(the emphasis -- slanted block capitals -- is from the docs)   This bites when the line with the end-marker comes right after the line that starts the range, so it is explicitly tested for.

Instead of this, we can use the fact that the range operator returns the sequence number in a range (or empty string when false), which at the last line has E0 appended to it. This is precisely so that one can check the last line of a range (when the end-marker evaluated to true). Then we can do

perl -wne' print if 
    $r = /\[ServerThread-4\]/ .. /\[ServerThread-([^4][0-9]*|4[0-9]+)\]/ 
    and not $r =~ /E0/
' file

The second extra condition is now moved into the end-marker's regex as alternation. It is needed in order to to exclude a server-thread number that starts with 4 and then has more digits (like ServerThread-40), what still makes this a bit clumsy. But if that condition isn't needed (quite possible) then this simplifies further, and using E0 does give us a nicer expression.

zdim
  • 64,580
  • 5
  • 52
  • 81
  • @ikegami oh .. thank you. Fixed that, will now try to focus and check this ... been a little sloppy here :((. I've only now noticed your edit -- thank you. (Btw, that `-E` is a puzzle for me: It's really not a "good practice," as you've repeatedly shown, but it's a big shortcut over (possibly repeated) `CORE::say` ...) – zdim Oct 29 '20 at 06:15
  • Wow, thank you zdim and @ikegami for your responses. Some clarifications -- log4j is what's being used to write to the log files, so I presume the writes aren't done completely haphazardly. But certain "chunks" of data from different threads are intermingled in the log. What I'm trying to achieve is to be able to get rid of a lot of the "noise" in the original file from thread data I don't care about, and only focus on the data from a given thread that I do care about. – rziegler72 Oct 29 '20 at 14:05
  • That said, both of these one-liners seem to generate identical output log files, and produce the output I'm looking for: `(perl -wne'print if /\[ServerThread-4\]/ .. /\[ServerThread-[^4]+\]/ and not /\[ServerThread-( [^4][0-9]* | 4[0-9]+ )\]/x' file.log) > file-Thread-4-perl1.log` or `(perl -wne' print if $r = /\[ServerThread-4\]/ .. /\[ServerThread-([^4][0-9]*|4[0-9]+)\]/ and not $r =~ /E0/ ' file.log) > file-Thread-4-perl2.log` Thanks, consider this answered! – rziegler72 Oct 29 '20 at 14:06
  • @rziegler72 Great that it works the way you need it :) (The two one-liners are given as just different ways to do it, they are supposed to do the same). – zdim Oct 29 '20 at 22:53
  • @rziegler72, Re "*I presume the writes aren't done completely haphazardly*", Noone said haphazardly. But it's extremely likely that it doesn't use locking, so the problem we identified remains. This is a bad log format. The point is simply this: We can't write a perfect solution with this format. All solutions will fail for some log files. – ikegami Oct 30 '20 at 02:08
  • @rziegler72 I'd also expect `log4j` to behave reasonably and be sophisticated, but I'm still not sure how it fares under threaded code that produces log of the shown format. Btw, in the meanwhile I've added (and then edited) some comments which you may find useful. – zdim Nov 08 '20 at 06:57