Please help me optimize this bash script; it takes too long to execute.
Requirements:
The log file I am working with has some rows that start with a date and some rows that do not.
I need to insert the date from the previous row when a row does not start with one.
I work in MinGW64 under Windows 10.
The date is in this format: 2022-06-09 17:47:08,371
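For reference, lines that already start with such a timestamp can be matched with an anchored extended regex (this is the same pattern my script below relies on):

grep -E '^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}' "$file"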
Given file:
date1 string1
string2 date (a date inside the log message, not a date at the beginning of the row)
, string3
date2 string4
string5
]string6
date3 string7
date4 string8
date5 string9
Example of given file:
2022-06-09 10:00:01,000 string1
string2 2022-06-09 10:00:01,000 string2 2022 string2
, string3 string3 string3
2022-06-09 10:00:02,000 string4
string5
]string6 string6 string6
}
2022-06-09 10:00:03,000 string7 string7
2022-06-09 10:00:04,000 string8 string8
2022-06-09 10:00:05,000 string9
Expected file:
date1 string1
date1 string2 date
date1 , string3
date2 string4
date2 string5
date2 ]string6
date3 string7
date4 string8
date5 string9
Example of expected file:
2022-06-09 10:00:01,000 string1
2022-06-09 10:00:01,000 string2 2022-06-09 10:00:01,000 string2 2022 string2
2022-06-09 10:00:01,000 , string3 string3 string3
2022-06-09 10:00:02,000 string4
2022-06-09 10:00:02,000 string5
2022-06-09 10:00:02,000 ]string6 string6 string6
2022-06-09 10:00:02,000 }
2022-06-09 10:00:03,000 string7 string7
2022-06-09 10:00:04,000 string8 string8
2022-06-09 10:00:05,000 string9
Here is my script that needs optimization.
I did it with a loop, and it is very slow:
# Collect the numbers of all lines that do NOT start with a timestamp
nn_lines_to_replace=$(grep -Evn "^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}" "$file" | cut -d ":" -f1)
for nn_line in $nn_lines_to_replace ; do
    # Take the date and time (fields 1-2) from the line above
    replace=$(sed -n "$((nn_line - 1))p" "$file" | cut -d " " -f1-2)
    # Prepend them (plus a space) to the current line; this rewrites the whole file on every iteration
    sed -i "${nn_line} s/^/${replace} /" "$file"
done
Maybe it could be done with sed or awk.
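For example, something like this single-pass awk sketch is what I have in mind (assuming the gawk shipped with MinGW64 understands interval expressions such as {4}, and that the first line of the file always starts with a timestamp):

awk '
    /^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}/ {
        last = $1 " " $2   # remember this line's timestamp
        print
        next
    }
    { print last, $0 }     # dateless line: prepend the remembered timestamp
' "$file" > "$file.tmp" && mv "$file.tmp" "$file"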
If you have ideas on how to optimize it, or a better approach, please share; I would really appreciate any help.
Update: I have made the conditions of this issue more complicated: link