Extract unpredictable data that has its own timestamp from a log file using a shell script

Question

log.txt looks like the sample below: ID records, each with its own timestamp (detection_time), that are continuously appended to the file. The ID is an unpredictable number anywhere from 0000 to 9999, and the same ID can appear in log.txt again.

My goal is to use a shell script to pick out any ID that appears again in log.txt within 15 seconds of its first appearance. Can anyone help me with this?

ID = 4231
detection_time = 1595556730 
ID = 3661
detection_time = 1595556731
ID = 2654
detection_time = 1595556732
ID = 3661
detection_time = 1595556733

To be clearer: in the log.txt above, ID 3661 first appears at time 1595556731 and appears again at 1595556733, just 2 seconds after the first appearance. That matches my condition, which is an ID that appears again within 15 seconds, so I would like my shell script to pick out ID 3661.

The output after running the shell script should be ID = 3661.

My problem is that I don't know how to develop the algorithm in a shell script.

Here's what I tried, using ID_new and ID_previous variables, but ID_previous=$(ID_new) and detection_previous=$(detection_new) are not working:

input="/tmp/log.txt"
ID_previous=""
detection_previous=""
while IFS= read -r line
do
    ID_new=$(echo "$line" | grep "ID =" | awk -F " " '{print $3}')
    echo $ID_new
    detection_new=$(echo "$line" | grep "detection_time =" | awk -F " " '{print $3}')
    echo $detection_new
    ID_previous=$(ID_new)
    detection_previous=$(detection_new)
done < "$input"

EDIT: the data in log.txt actually comes in sets containing ID, detection_time, Age and Height. Sorry for not mentioning this in the first place.

ID = 4231
detection_time = 1595556730 
Age = 25
Height = 182
ID = 3661
detection_time = 1595556731
Age = 24
Height = 182
ID = 2654
detection_time = 1595556732
Age = 22
Height = 184    
ID = 3661
detection_time = 1595556733
Age = 27
Height = 175
ID = 3852
detection_time = 1595556734
Age = 26
Height = 156
ID = 4231
detection_time = 1595556735 
Age = 24
Height = 184

I've tried the Awk solution. The result is 4231 3661 2654 3852 4231, which is every ID in log.txt. The correct output should be 4231 3661.

From this, I think the Age and Height data might affect the Awk solution, because they are interleaved with the data of interest, ID and detection_time.

This looks like a pretty general programming problem. From your question it is not clear to me whether you have problems developing an algorithm, or implementing the algorithm in the language of your choice. – user1934428, Aug 03 '20 at 10:50
Your recent cosmetic edits fail to address the glaring omission of any research effort. What did you search for, and what did you find? What did you try, and how did it fail? Where are you stuck? Will you understand and be satisfied with e.g. a trivial Awk solution? – tripleee, Aug 04 '20 at 08:57
Sorry for that. I tried my best to explain the input and output. I'm new to the Linux shell and, furthermore, no one around me can help, so I can only ask on this website. This is the output I want: my condition is an ID that appears again within 15 seconds. I would like ID 3661 to be picked out by my shell script. – vgags, Aug 05 '20 at 06:57

tripleee · Accepted Answer · 2020-08-10T06:05:12.523

1

Assuming the time stamps in the log file are increasing monotonically, you only need a single pass with Awk. For each id, keep track of the latest time it was reported (use an associative array t where the key is the id and the value is the latest timestamp). If you see the same id again and the difference between the time stamps is less than 15, report it.

For good measure, keep a second array p of the ones we have already reported so we don't report them twice.

awk '/^ID = / { id=$3; next }
    # Skip if this line is neither ID nor detection_time
    !/^detection_time = / { next }
    (id in t) && (t[id] >= $3-15) && !(p[id]) { print id; ++p[id]; next }
    { t[id] = $3 }' /tmp/log.txt
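
As a quick check, the same script can be run against the six sample records from the question's edit, written to a temporary file instead of /tmp/log.txt. This is only a sketch for experimenting; the variable names `sample` and `result` are assumptions for illustration and not part of the original answer.

```shell
# Write the sample records from the question to a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
ID = 4231
detection_time = 1595556730
Age = 25
Height = 182
ID = 3661
detection_time = 1595556731
Age = 24
Height = 182
ID = 2654
detection_time = 1595556732
Age = 22
Height = 184
ID = 3661
detection_time = 1595556733
Age = 27
Height = 175
ID = 3852
detection_time = 1595556734
Age = 26
Height = 156
ID = 4231
detection_time = 1595556735
Age = 24
Height = 184
EOF

# Same Awk program as above, reading the temporary file.
result=$(awk '/^ID = / { id=$3; next }
    !/^detection_time = / { next }
    (id in t) && (t[id] >= $3-15) && !(p[id]) { print id; ++p[id]; next }
    { t[id] = $3 }' "$sample")

printf '%s\n' "$result"   # prints 3661, then 4231
rm -f "$sample"
```

The IDs come out in the order in which their repeat is detected, so 3661 (repeated at 1595556733) is printed before 4231 (repeated at 1595556735).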

If you really insist on doing this natively in Bash, I would refactor your attempt to

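# Associative arrays: last detection time per ID, and IDs already reported
declare -A dtime printed
while read -r field _ value
do
    case $field in
        ID) id=$value ;;
        detection_time)
            # Fall back to 0 for an ID that has not been seen before
            if [[ ${dtime["$id"]:-0} -ge $((value - 15)) ]]; then
                # Report each ID only once
                [[ -v printed["$id"] ]] || echo "$id"
                printed["$id"]=1
            fi
            dtime["$id"]=$value ;;
    esac
done < /tmp/log.txt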

Notice how read -r can split a line on whitespace just as easily as Awk can, as long as you know how many fields to expect. However, a while read -r loop is typically an order of magnitude slower than Awk, and you'll have to agree that the Awk attempt is more succinct and elegant, as well as portable to older systems.

(Associative arrays were introduced in Bash 4.)

Tangentially, anything that looks like grep 'x' | awk '{ y }' can be refactored to awk '/x/ { y }'; see also useless use of grep.
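
A minimal sketch of that refactoring, applied to the extraction from the question's attempt. The variable names `line`, `old`, `new` and `other` are assumptions for illustration.

```shell
line='ID = 3661'

# Pattern from the question: grep selects the line, Awk extracts the field.
old=$(echo "$line" | grep "ID =" | awk '{print $3}')

# Equivalent without grep: the Awk pattern does the selecting.
new=$(echo "$line" | awk '/ID =/ {print $3}')

# A line that does not match the pattern produces no output at all.
other=$(echo 'Age = 25' | awk '/ID =/ {print $3}')

printf 'old=%s new=%s other=%s\n' "$old" "$new" "$other"
```

Both forms extract 3661 from the matching line; the second simply spawns one process fewer per line.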

Also, notice that $(foo) attempts to run foo as a command. To simply refer to the value of the variable foo, the syntax is $foo (or, optionally, ${foo}, but the braces add no value here). Usually you will want to double-quote the expansion "$foo"; see also When to wrap quotes around a shell variable.
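
The difference can be seen in a small sketch that reuses the variable names from the question. `ID_wrong` and `status` are hypothetical names added here for illustration, and the example assumes no command called ID_new exists on the system.

```shell
ID_new=3661

# Plain parameter expansion copies the value; no command is run.
ID_previous=$ID_new
echo "$ID_previous"

# $(ID_new) is command substitution: the shell looks for a command named
# ID_new, does not find one, and the assignment ends up empty.
ID_wrong=$(ID_new 2>/dev/null) || status=$?
echo "ID_wrong='$ID_wrong' status=$status"
```

Exit status 127 is the shell's conventional "command not found" result, which is why the original attempt left ID_previous empty.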

Your script would only remember a single earlier event; the associative array allows us to remember all the ID values we have seen previously (until we run out of memory).

Nothing prevents us from using human-readable variable names in Awk either; feel free to substitute printed for p and dtime for t to have complete parity with the Bash alternative.
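
A sketch of what that might look like, wrapped in a shell function for convenience. The function name `find_repeats` and the optional `window` argument are assumptions added for illustration; the original script hard-codes 15 seconds.

```shell
# find_repeats FILE [WINDOW]
# Hypothetical helper: prints each ID that reappears within WINDOW seconds
# (default 15) of the time stored for it, once per ID.
find_repeats() {
    awk -v window="${2:-15}" '
        /^ID = / { id = $3; next }
        # Skip lines that are neither ID nor detection_time
        !/^detection_time = / { next }
        (id in dtime) && (dtime[id] >= $3 - window) && !(printed[id]) {
            print id; printed[id] = 1; next
        }
        { dtime[id] = $3 }
    ' "$1"
}
```

It would be called as `find_repeats /tmp/log.txt`, or as `find_repeats /tmp/log.txt 5` to try a narrower window.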

edited Aug 10 '20 at 06:05

answered Aug 05 '20 at 07:04

tripleee


Thank you for the answer. This might sound funny, but where should I put your Awk code in my script? – vgags Aug 05 '20 at 07:41
This replaces the entire script. – tripleee Aug 05 '20 at 08:40
Thank you for the explanation, that helps me a lot. I get the errors `line 10: syntax error near 15))'` and `line 10: if [[ dtime["$id"] >= $((value - 15)) ]]; then'`. I tried changing the syntax but it still doesn't work. – vgags Aug 06 '20 at 06:41
Thanks for the feedback; I fixed a syntax error. But really use the Awk script instead; I added the Bash version just to point out errors in your attempt and show you roughly how much more complex it would be than the trivial Awk script. – tripleee Aug 06 '20 at 06:45
I tried your Awk solution. All the IDs in `log.txt` are printed out; maybe the condition that selects IDs whose time stamps differ by less than 15 sec doesn't work? – vgags Aug 09 '20 at 05:43
It works with the test data you provided; https://ideone.com/M3Cxmg - does your real log file perhaps have DOS line endings? See https://stackoverflow.com/questions/39527571/are-shell-scripts-sensitive-to-encoding-and-line-endings – tripleee Aug 09 '20 at 07:01
For debugging, you could put in a more elaborate `print` statement which also reveals what exactly it compared when it decided to print. You could add another print to have it log the times it picks up, or add a third associative array where you store the line number (`NR`) when an item is added to `dtime` so that you can report that too in the debug print. – tripleee Aug 09 '20 at 07:08
Thank you very much for helping me, and for the tips too. I edited the question; please take a look. I might have found out why I got all the IDs in log.txt as output instead of only the IDs that match my condition. – vgags Aug 10 '20 at 05:57
@vgags `awk` is the "Swiss-Army-Knife" of text processing and will be much faster for large files. `awk` applies each rule written in the script to each record (line) of input in the order they are written. `/^ID = / { id=$3; next }` is the first rule; it matches lines that begin with `"ID ="`, sets the `id` variable to the value of the 3rd field in the line (the ID number) and skips to the `next` record. The 2nd (last) rule matches lines beginning with `"detection_time ="` and uses the `t[]` array to hold the last time for the ID and the `p[]` array to track which IDs are output. – David C. Rankin Aug 10 '20 at 06:10
The [GNU Awk User's Guide](https://www.gnu.org/software/gawk/manual/html_node/index.html#SEC_Contents) is a good place to start. Learning `awk` is time well spent. – David C. Rankin Aug 10 '20 at 06:11
Thank you very much @tripleee. But if I also want to print the "Age" and "Height" of the ID that matches my condition along with the ID, how can I do that? I've already tried with the Awk solution but I'm still struggling; if you could suggest how to do it in the Bash solution, that would be a big help. – vgags Aug 13 '20 at 18:17
It's not hard in either script to add behavior for those fields. Probably post a new question if you can't figure it out. One compact way would be to use an associative array for the keys too. `awk '{ a[$1]=$3 } $1 == "detection_time" { if ((a["ID"] in t) && (t[a["ID"]] >= $3-15) && !(p[a["ID"]])) { print a["ID"], a["Age"], a["Height"]; ++p[a["ID"]] } t[a["ID"]] = $3 }'` – tripleee Aug 14 '20 at 03:30
I posted a new question: https://stackoverflow.com/questions/63648737/print-the-data-that-matches-to-the-condition-in-log-file . Please take a look, @tripleee. – vgags Aug 29 '20 at 15:33
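
For the follow-up about printing Age and Height, one thing to watch is that in the sample data those lines come after detection_time within each record, so they are not yet known when the time comparison is made. The following is a rough sketch, not the method from the linked question: it decides at the detection_time line and prints once the record's Height line arrives. The function name `report_repeats` and the `hit` flag are hypothetical, and it assumes every record ends with a Height line, as in the sample.

```shell
# report_repeats FILE
# Hypothetical helper: prints "ID Age Height" for each ID that reappears
# within 15 seconds, once per ID.
report_repeats() {
    awk '
        $1 == "ID" { id = $3; hit = 0; next }
        $1 == "detection_time" {
            if ((id in t) && (t[id] >= $3 - 15) && !(p[id])) {
                hit = 1; p[id] = 1
            } else {
                t[id] = $3
            }
            next
        }
        $1 == "Age" { age = $3; next }
        # Assumes Height is the last line of each record, as in the sample
        $1 == "Height" && hit { print id, age, $3 }
    ' "$1"
}
```

With the sample from the question's edit, `report_repeats /tmp/log.txt` would report the Age and Height taken from the repeated record itself, that is, from the second occurrence of each ID.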
