How to edit the lines in text file in Linux - format the date to YYYY-MM-DD and then grep the line by time period

Question

Can anyone help me convert the dates in this text file from YYYYMMDD to YYYY-MM-DD using a Bash script or the Linux command line? I am not sure how to start editing 23 million lines.

I have a text file with dates in YYYYMMDD format:

3515034013|50008|20140601|20240730

and I want to convert it to YYYY-MM-DD format (only the 3rd and 4th fields need to change, across 23 million lines):

3515034013|50008|2014-06-01|2024-07-30

I want to convert the file from YYYYMMDD to YYYY-MM-DD format and then, as the end goal, extract specific lines from it based on a time period.

The end goal is to format the 3rd and 4th fields as YYYY-MM-DD and then select lines by date from the formatted file. The 3rd field is the start date and the 4th field is the end date. For example, I need: (1) lines whose end date (4th field) is before today, i.e. 2022-08-06 (all the old lines); (2) lines whose end date (4th field) falls within the next two years, i.e. between 2022-08-06 and 2024-08-06.

Please note: there are more than 23 million lines to edit and analyze by date.

How should I approach this? Which method is the most time-efficient: awk, sed, or Bash line-by-line editing?

`I Want to convert from YYYYMMDD formatted text file to the YYYY-DD-MM format and I want to get specific lines from the text file based on the time period after this file manipulation which is the end goal.` Why can't you do this with the original file? The transformation just adds `-`, nothing else. — Lars Fischer, Aug 06 '22 at 11:23
Regarding `and I want to get specific lines from the text file based on the time period` - only ask 1 question at a time. In this case your example shows you wanting to change the date format so stick with that for this question and then ask a new question if you need help with this other item (but there are plenty of examples on this forum of how to do that). — Ed Morton, Aug 06 '22 at 11:36
`How to approach this problem statement? which method is time efficient awk or sed or Bash line-by-line editing?` Depending on the size of the resulting filtered data set, I would suggest doing a *crude filtering* first. Maybe you could use sed or awk on the year of one of the date fields to get the number of records down to a *much smaller* number, and then do the transformation to the date format of your end application on that smaller data set. — Lars Fischer, Aug 06 '22 at 11:37
Regarding which is faster - sed or awk. See [why-is-using-a-shell-loop-to-process-text-considered-bad-practice](https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice) for why bash line by line editing should be avoided. — Ed Morton, Aug 06 '22 at 11:38
Hi. The end goal is to format the 3rd and 4th fields as YYYY-MM-DD and then grep lines by date from the formatted file. The 3rd field is the start date and the 4th field is the end date. For example, I need: (1) lines whose end date (4th field) is before today, i.e. 2022-08-06; (2) lines whose end date (4th field) is within 2 years from now, i.e. on or before 2024-08-06. — AzaanD, Aug 06 '22 at 14:34
The date command on Linux might be handy with your date filtering, like `date --iso-8601 --date "yesterday"` and `date --iso-8601 --date "yesterday +2 years"` — Lars Fischer, Aug 07 '22 at 12:48

score 2 · Accepted Answer · answered Aug 06 '22 at 11:28

$ awk '
    BEGIN { FS=OFS="|" }
    {
        for ( i=3; i<=4; i++ ) {
            $i = substr($i,1,4) "-" substr($i,5,2) "-" substr($i,7)
        }
        print
    } 
' file
3515034013|50008|2014-06-01|2024-07-30
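
The loop above assumes every value in fields 3 and 4 is exactly eight digits. A minimal sketch of a more defensive variant follows, assuming some of the 23 million rows might carry a blank or already-formatted date (the question does not say either way). The function name `reformat_dates` is hypothetical and only used for illustration.

```shell
# Hypothetical helper: reformat fields 3 and 4 only when they are exactly 8 digits.
# Anything else (empty field, already formatted, garbage) is passed through untouched.
reformat_dates() {
    awk '
        BEGIN { FS = OFS = "|" }
        {
            for (i = 3; i <= 4; i++) {
                # length + digit test avoids regex intervals, which some awks lack
                if (length($i) == 8 && $i ~ /^[0-9]+$/) {
                    $i = substr($i, 1, 4) "-" substr($i, 5, 2) "-" substr($i, 7, 2)
                }
            }
            print
        }
    ' "$@"
}

# Two sample rows: the one from the question, and one with an empty start date.
printf '%s\n' '3515034013|50008|20140601|20240730' \
              '3515034014|50009||2024-07-30' | reformat_dates
```

The first row comes out as `3515034013|50008|2014-06-01|2024-07-30`; the second is printed unchanged. With a real file, `reformat_dates file > file.new` would write the result to a new file.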

Ed Morton

steffen · Answer 2 · 2022-08-06T12:05:08.573

Here's one way to change the format with awk:

awk '{$3=substr($3,1,4) "-" substr($3,5,2) "-" substr($3,7,2); $4=substr($4,1,4) "-" substr($4,5,2) "-" substr($4,7,2); print}' FS='|' OFS='|' file

It should work given that:

- `|` is only used for field separation
- all dates have the same format

You can redirect the transformed lines to a new file or replace the original afterwards. Of course you can do the same with sed or ed. I'd go for awk because you can extract your specific lines to an extra file in the same run.
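
A minimal sketch of the "same run" idea, under stated assumptions: the function name `split_by_end_date` and the output file names `expired.txt` and `expiring.txt` are hypothetical, and the two bounds are passed in as YYYYMMDD strings. It relies on the fact that fixed-width YYYYMMDD values sort chronologically when compared as strings, so no date parsing is needed.

```shell
# Hypothetical helper: convert the dates and, in the same pass, copy matching
# rows to two extra files.
#   $1 = today (YYYYMMDD), $2 = upper limit (YYYYMMDD), $3 = output directory,
#   remaining arguments = input files (stdin if none)
split_by_end_date() {
    today=$1; limit=$2; outdir=$3; shift 3
    awk -v today="$today" -v limit="$limit" -v outdir="$outdir" '
        BEGIN { FS = OFS = "|" }
        {
            end = $4    # remember the raw YYYYMMDD end date before reformatting
            for (i = 3; i <= 4; i++)
                $i = substr($i, 1, 4) "-" substr($i, 5, 2) "-" substr($i, 7, 2)

            # Appending "" forces a string comparison; equal-length
            # YYYYMMDD strings compare in chronological order.
            if ((end "") < (today ""))
                print > (outdir "/expired.txt")       # end date before today
            else if ((end "") <= (limit ""))
                print > (outdir "/expiring.txt")      # end date from today up to the limit
            print                                     # every reformatted row goes to stdout
        }
    ' "$@"
}
```

A possible invocation, assuming GNU date for the `--date` option: `split_by_end_date "$(date +%Y%m%d)" "$(date --date '+2 years' +%Y%m%d)" . file > formatted.txt`.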

edited Aug 06 '22 at 12:05

answered Aug 06 '22 at 10:37

steffen


Hi, This was useful and easy to read. Do you have any idea to grep the line by time period? The 3rd field is the start date and the 4th field is the end date. Let's say for example I need, (01). The end date(04th field) before today i.e 2022-08-06 (02). The end date(04th field) is from 2 years from now i.e 2024-08-06? – AzaanD Aug 06 '22 at 14:06
@AzaanD That's another question :) Check https://www.gnu.org/software/gawk/manual/html_node/Time-Functions.html. With `mktime()` you're able to convert datetime strings to a timestamp (seconds since the epoch). With that, it's easy to select lines within a time range. Example: `$ awk 'BEGIN{print mktime("2022 08 06 12 00 00");}' ''` – steffen Aug 06 '22 at 14:39

Lars Fischer · Answer 3 · 2022-08-06T11:55:05.370

Here is a way to do it with sed. It has the same restrictions as steffen's answer: `|` is the field separator, and all dates have the same format, i.e. leading zeros in the month and day parts.

sed -E 's/^(.*[|])([0-9]{4})([0-9]{2})([0-9]{2})[|]([0-9]{4})([0-9]{2})([0-9]{2})$/\1\2-\3-\4|\5-\6-\7/g'

Here is what the regular expression does:

- `^(.*[|])` captures everything from the start of the line (`^`) up to a `|` into `\1`. This ends up being the first two columns, because the rest of the expression has to match the remainder of the line up to its end.
- `([0-9]{4})([0-9]{2})([0-9]{2})[|]` captures the year, month, and day of the first date field into `\2` to `\4`; note the trailing `[|]`.
- `([0-9]{4})([0-9]{2})([0-9]{2})$` does the same for the second date column, anchored at the end of the line, and captures the parts into `\5` to `\7`; note the `$`.
- The replacement `\1\2-\3-\4|\5-\6-\7` inserts `-` between the captured parts.
- The capturing into `\n` happens because of the `(...)` parentheses in the regular expression.
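
To see the capture groups at work, here is a small sketch that runs the expression above on the sample row from the question, plus one deliberately malformed row. The wrapper name `sed_dates` and the malformed row are made up for illustration.

```shell
# Hypothetical wrapper around the sed expression explained above.
sed_dates() {
    sed -E 's/^(.*[|])([0-9]{4})([0-9]{2})([0-9]{2})[|]([0-9]{4})([0-9]{2})([0-9]{2})$/\1\2-\3-\4|\5-\6-\7/g' "$@"
}

# Row 1 is the sample from the question.
# Row 2 has only 7 digits in field 3, so the pattern does not match it.
printf '%s\n' '3515034013|50008|20140601|20240730' \
              '3515034013|50008|2014061|20240730' | sed_dates
```

The first row becomes `3515034013|50008|2014-06-01|2024-07-30`. The second is printed unchanged, because sed leaves lines that do not match the pattern as they are. On a 23-million-line file it may therefore be worth checking afterwards that no unformatted rows remain.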

If you run this and the awk-based answer on your file, I would be interested in which one is faster. — Lars Fischer, Aug 06 '22 at 11:20

score 0 · Answer 4 · answered Aug 06 '22 at 13:38

This might work for you (GNU sed):

sed -E 's/^([^|]*\|[^|]*\|....)(..)(..\|....)(..)/\1-\2-\3-\4-/' file

Pattern match and insert `-` where desired.

Or if the file is only 4 columns:

sed -E 's/(..)(..\|....)(..)(..)$/-\1-\2-\3-\4/' file

potong
