
I am not able to use sed to print the content of the file from the beginning up to a matching pattern, because as soon as sed finds the first occurrence of the pattern it stops and does not print the remaining lines that match the pattern.

The file size is greater than 25 GB; however, below is a small example of the problem.

E.g., the content of the file is:

2010T10:11:12 some data.
2012T10:11:12 some data.
2013T10:11:12 They all are different data
2014T10:11:12 Logs basically
2014T10:11:12 Error Logs
2014T10:11:12 Any Data
2014T10:11:12 Data
2015T10:11:12 Some fields
2016T10:11:12 etc

Basically, when I give the range 2010T10:11:12 - 2014T10:11:12, it should print up to the 7th line of the file.

The command I am using for printing is:

sed -n '1,/2014T10:11:12/p' File-1.txt

Output:

2010T10:11:12 some data.
2012T10:11:12 some data.
2013T10:11:12 They all are different data
2014T10:11:12 Logs basically

Expected Output:

2010T10:11:12 some data.
2012T10:11:12 some data.
2013T10:11:12 They all are different data
2014T10:11:12 Logs basically
2014T10:11:12 Error Logs
2014T10:11:12 Any Data
2014T10:11:12 Data

This command duplicates the first line that matches the pattern:

sed -n '1,/2014T10:11:12/p;/2014T10:11:12/p' File-1.txt

Output:

2010T10:11:12 some data.
2012T10:11:12 some data.
2013T10:11:12 They all are different data
2014T10:11:12 Logs basically <- Duplicate line. Need to
2014T10:11:12 Logs basically <- remove any one of them
2014T10:11:12 Error Logs
2014T10:11:12 Any Data
2014T10:11:12 Data

Another issue is that the content of the file changes every second, so we cannot give a fixed line range like 1-7 or 5-7. It has to be based on the pattern, like 2010T10:11:12 - 2014T10:11:12 or 2015T10:11:12 - 2016T10:11:12.

Akash
  • `1,/2014/` means that the last line that should be printed is the first one that matches the pattern. – Barmar Aug 11 '20 at 07:44
  • Sorry for the wrong formatting of the file content. I didn't know how to post it as plain white text. When I had simply put the file content it displayed in a single line. – Akash Aug 11 '20 at 07:46
  • Yes. I need to remove any one of the duplicate lines – Akash Aug 11 '20 at 07:49
  • If the lines are monotonically ordered and the first field is always numeric, try `awk '$1 <= 2014' File-1.txt` – tripleee Aug 11 '20 at 07:51
  • The content of the file changes dynamically so it has to be on the basis of the pattern. Eg. Suppose some 2014 logs are appended and I need to search till (say) 2020. – Akash Aug 11 '20 at 07:57
  • `I give the range from 2010T10:11:12 - 2014T10:11:12` There are so many questions asking about filtering lines using a date range, like https://stackoverflow.com/questions/28275880/how-to-filter-data-between-2-dates-with-awk-in-a-bash-script `sed` is not the tool for this; still, you can write what you want to do with `sed`, but it's better with `awk`. Please research Stack Overflow and read the other questions. – KamilCuk Aug 11 '20 at 08:24

4 Answers


An alternative version of awk would be:

awk '($1 > "2014T10:11:12"){exit}1' file

This is useful when processing big files as it will stop reading the file when the first field is lexicographically greater than "2014T10:11:12".

If you want to print a range, you can do:

awk '($1 > "2014T10:11:12"){exit}($1 >= "2013T12:12:12")' file

And if you want to over-optimise it:

awk '($1 >= "2013T12:12:12") { if($1 > "2014T10:11:12"){exit}; print}' file
kvantour
  • Thanks, @kvantour. This optimized version is really useful. – Akash Aug 11 '20 at 11:06
  • @Akash The first command will print everything from the start of the file until the end condition is reached. The second will print everything from a start date till end date. The third does exactly the same as the second but does it with fewer conditionals to check. If you have N lines to print and M lines to process before the first condition is met, you will do 2(M+N)+1 condition checks in the second and only M+2N condition checks in the third. All three cases stop processing the file when the last condition is met and hence, it is assumed all files are sorted. – kvantour Aug 11 '20 at 11:23
  • Is there a way where we can parallelize it into different cores using some kind of tools/packages? – Akash Aug 11 '20 at 12:02
  • @Akash, this is really a different question. If you ask a new question and formulize it well we can assist you there. – kvantour Aug 11 '20 at 12:07
  • @Akash google "GNU parallel" and, as kvantour says, ask a new question if that's not all the advice you need. – Ed Morton Aug 11 '20 at 15:03
  • @Akash and in case you're wondering how to pass the beginning and ending values into the script - `awk -v beg='2010T10:11:12' -v end='2014T10:11:12' '$1>=beg{if ($1>end) exit; print}' file`. You might actually find it runs a bit faster if you test `$0` instead of `$1` since awk only does field splitting when a field is mentioned in the script so not specifically mentioning any field might save you a few cycles. – Ed Morton Aug 11 '20 at 15:39
  • @EdMorton So that is how you can avoid field splitting :o – kvantour Aug 11 '20 at 16:21
  • Right, awk only does field splitting if you need fields (i.e. mention one of them or NF in your script). – Ed Morton Aug 11 '20 at 17:33
  • @EdMorton Is this particular to GNU awk or POSIX? The POSIX standard just states _Before the first reference to a field in the record is evaluated, the record shall be split into fields, according to the rules in Regular Expressions, using the value of FS that was current at the time the record was read._ This does not imply that it is only done before the first reference. ... Or am I missing something here? – kvantour Aug 13 '20 at 18:35
  • AFAIK it's not gawk-specific and I don't see it mentioned in the gawk manual as I'd expect it to be if it was gawk-only but I don't remember where I got that info from. Let me take a look at the POSIX spec just to see if maybe you missed something..... No, I don't see it mentioned in POSIX either. If you post a question on usenet at comp.lang.awk I expect one of the gawk maintainers who hang out there would be able to provide the answer. – Ed Morton Aug 13 '20 at 19:23
  • I went ahead and posted the question on comp.lang.awk, here's a link where you can see its progress in google groups in case you don't have a news reader or usenet account - see https://groups.google.com/g/comp.lang.awk/c/Rg9RlUjiRZ8/m/rdiAVDj8BQAJ. – Ed Morton Aug 13 '20 at 19:44
  • I heard back from the gawk providers and this functionality is not defined by POSIX but, though it's not documented, it does exist in gawk. Whether or not other awk versions behave this way is unknown. Also, it's not quite the all-or-nothing that I suggested - gawk will do field splitting up to the max field number used in your script (or all fields if NF is mentioned). So if your input has 20 fields but you only mention $1 then field splitting is only done up to the point that the first field is identified. – Ed Morton Aug 14 '20 at 18:35
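
Following up on the $0-versus-$1 discussion in the comments above: comparing the whole line directly against the end value would stop too early, because a line such as "2014T10:11:12 Error Logs" sorts after the bare timestamp "2014T10:11:12". A sketch (not taken from any of the answers) that still avoids referencing a field is to compare only the leading timestamp:

awk -v beg='2010T10:11:12' -v end='2014T10:11:12' '{ ts = substr($0, 1, length(end)); if (ts > end) exit; if (ts >= beg) print }' file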

Try this:

awk '($1 >= "2010T10:11:12") && ($1 <= "2014T10:11:12")' File-1.txt
August Karlstrom
  • Just `awk -FT`..... Still, just `'$1 >= "2010T10:11:12" && $1 <= "2014T10:11:12"'` should also work. – KamilCuk Aug 11 '20 at 08:22
  • Is awk faster than sed for huge file? As I was using sed earlier. – Akash Aug 11 '20 at 08:25
  • @Akash `sed` is a tool for selecting based on strings. awk is a tool for selecting based on anything else. You are interested in anything else. The reason for this is simple. Just imagine one of the log-files that does not contain an entry for `2014T10:11:12`, if you would use `sed`, you would print the full file. With the above, it would only print the section satisfying the real condition you are interested in. – kvantour Aug 11 '20 at 09:35
  • Just a small comment to this code. There is no need to use redirection of the input. This just slows things down here as awk can handle filenames as input arguments. – kvantour Aug 11 '20 at 09:37
  • @kvantour Thank you for the explanation and the suggestion. That was so helpful!! I will remove the redirection operator as I had included earlier. – Akash Aug 11 '20 at 10:50
  • @kvantour I agree, it is more natural to pass the file as a command argument; I have corrected the code now. However, I don't see why reading from the standard input would be slower? – August Karlstrom Aug 11 '20 at 14:19
  • Redirecting from a file means the shell is opening the file and so if that failed it'd be handled at the shell level (so e.g. if you were also doing output redirection then the output file wouldn't be created/zapped) BUT it means awk doesn't have access to the file name in it's FILENAME variable and you can't use that approach if you have multiple input files nor if you're doing "inplace" editing. Personally, I'd never use input redirection as I don't find the "pros" to be in areas I ever care about while I do often care about the "cons". – Ed Morton Aug 11 '20 at 14:51
  • @Akash awk won't necessarily be faster than sed for this (depending on how each script is written wrt exiting after the last line is found) but it will be some combination of clearer, simpler, more portable, more robust, easier to enhance, etc. so it is a better choice of tool. Consider if either of the timestamps you provide doesn't exist in the input - sed will simply never start or stop printing since sed is doing a regexp comparison which never matches whereas awk will find the closest string to it since it's doing an alphabetic string greater/lesser comparison which is much more robust. – Ed Morton Aug 11 '20 at 14:53

This works.

sed -n '1,/2014T10:11:12/{p;d}; /2014T10:11:12/{p;d}; q' File-1.txt

Read about the d and q commands of sed here: https://www.gnu.org/software/sed/manual/sed.html#Common-Commands
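
A commented version of the same script may make the role of each command clearer (a sketch; the comment syntax assumes GNU sed, which is what the linked manual describes):

sed -n '
    # lines 1 through the first line matching the pattern: print them;
    # d then ends the cycle so the commands below are skipped
    1,/2014T10:11:12/{p;d}
    # after that range has ended, keep printing lines that still match
    /2014T10:11:12/{p;d}
    # the first later line that does not match: quit without reading the rest
    q
' File-1.txt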

pii_ke

You need an address range in sed:

begin='^2010T10:11:12'
end='^2014T10:11:12'

sed -n "
    /$begin/,/$end/{ p; d; }
    /$end/p
" file

This assumes the input file is sorted by the first field (the date and time).
The second command (/$end/p) is required because you want to print all lines matching $end. The range address (/$begin/,/$end/) selects lines starting from the first line matching $begin and continuing up to and including the first line matching $end.

The version below may be more efficient since it stops reading the input after the block of consecutive lines matching $end (the input must be sorted for this to work).

begin='^2010T10:11:12'
end='^2014T10:11:12'

sed -n "
    /$begin/,/$end/{
        /$end/{
            :a
            p
            n
            /$end/!q
            ba
        }
        p
    }
" file
M. Nejat Aydin