I have a file that contains a list of URLs. It looks like this:

file1:

http://www.google.com
http://www.bing.com
http://www.yahoo.com
http://www.baidu.com
http://www.yandex.com
....

I want to get all the records after http://www.yahoo.com, so that the result looks like this:

file2:

http://www.baidu.com
http://www.yandex.com
....

I know that I could use grep to find the line number where yahoo.com appears:

grep -n 'http://www.yahoo.com' file1

3:http://www.yahoo.com

But I don't know how to print the rest of the file after line 3. I also know that grep has the -A flag, which prints the lines after a match; however, you need to specify how many lines you want. Is there a way around that? Something like:

Pseudocode:

grep -n 'http://www.yahoo.com' -A all file1 > file2

I know we could combine the line number from grep with wc -l to compute the number of lines after yahoo.com, but that feels pretty clumsy.

Peter Mortensen
B.Mr.W.

5 Answers

AWK

If you don't mind using AWK:

awk '/yahoo/{y=1;next}y' data.txt

This script has two parts:

/yahoo/ { y = 1; next }
y

The first part states that if we encounter a line containing yahoo, we set the variable y to 1 and then skip that line (the next command jumps to the next input line, skipping any further processing of the current one). Without the next command, the yahoo line itself would be printed as well.

The second part is shorthand for:

y != 0 { print }

This means that, for each line, if the variable y is non-zero, we print that line. In AWK, a variable is created the first time you refer to it, and its initial value is either zero or the empty string, depending on context. Before yahoo is encountered, y is 0, so the script does not print anything. After yahoo is encountered, y is 1, so every subsequent line is printed.
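
The two-part script can be tried on a small sample file. This is only a minimal sketch: the temporary file and the variable names sample and after_yahoo are made up for illustration, and the AWK program is simply the long form described above.

```shell
# Build a throwaway sample file with the URLs from the question.
sample=$(mktemp)
printf '%s\n' \
  'http://www.google.com' \
  'http://www.bing.com' \
  'http://www.yahoo.com' \
  'http://www.baidu.com' \
  'http://www.yandex.com' > "$sample"

# Long form of '/yahoo/{y=1;next}y': set the flag on the matching line and
# skip that line, then print every line for which the flag is non-zero.
after_yahoo=$(awk '/yahoo/ { y = 1; next } y != 0 { print }' "$sample")

printf '%s\n' "$after_yahoo"
rm -f "$sample"
```

With this sample, only the baidu and yandex lines should be printed.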

Sed

Or, using sed, the following will delete everything up to and including the line with yahoo:

sed '1,/yahoo/d' data.txt
Peter Mortensen
Hai Vu
  • could you please explain the awk syntax a bit? My understanding: /yahoo/ search for the line using regular expression, then from that line on, create a variable called y and then set its value to 1, then if the line should be printed depends on the value of y. Then every line would be printed after yahoo. I am not quite sure of the "next" command – B.Mr.W. Aug 10 '13 at 22:52
  • My bad, I forgot to explain. Please see my update. – Hai Vu Aug 10 '13 at 23:09
  • If I am understanding it correctly, it should read like this: `y=0; for line in file: if (/yahoo/): y=1, go to next line; if (y!=0): print line` – B.Mr.W. Aug 11 '13 at 00:28
  • If you want to include the part `http://www.yahoo.com` as well, you can use `awk '/yahoo/{y=1}y' data.txt` – Kohei Nozaki May 20 '16 at 04:10
  • @KoheiNozaki Or simply: `awk '/yahoo/,0' data.txt`. And if you know that the search string is towards the end of the file, printing the remainder using `sed` would be done using `sed -n -e '/yahoo/,$p' data.txt` – ikaerom Feb 04 '19 at 21:25
  • very good explanation, thanks – Mesut GUNES Jan 28 '20 at 09:30

This is much easier done with sed than grep. sed can apply any of its one-letter commands to an inclusive range of lines; the general syntax for this is

START , STOP COMMAND

except without any spaces. START and STOP can each be a number (meaning "line number N", starting from 1), a dollar sign (meaning "the end of the file"), or a regexp enclosed in slashes (meaning "the first line that matches this regexp"). (The exact rules are slightly more complicated; the GNU sed manual has more detail.)
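
The three kinds of address can be tried side by side. This is a rough sketch on a made-up five-line file; the file contents and the variable names middle, tail_part and from_three are assumptions for illustration only.

```shell
# Hypothetical five-line input file.
sample=$(mktemp)
printf '%s\n' one two three four five > "$sample"

# Numeric START and STOP: print lines 2 through 4.
middle=$(sed -n '2,4p' "$sample")

# Numeric START, "$" as STOP: print from line 4 to the end of the file.
tail_part=$(sed -n '4,$p' "$sample")

# Regexp START, "$" as STOP: print from the first line matching /three/
# to the end of the file.
from_three=$(sed -n '/three/,$p' "$sample")

printf '%s\n' "$middle" "$tail_part" "$from_three"
rm -f "$sample"
```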

So, you can do what you want like so:

sed -n -e '/http:\/\/www\.yahoo\.com/,$p' file1 > file2

The -n means "don't print anything unless specifically told to", and -e introduces the script itself, which means "from the first appearance of a line that matches the regexp /http:\/\/www\.yahoo\.com/ to the end of the file, print."

This will include the line with http://www.yahoo.com/ on it in the output. If you want everything after that point but not that line itself, the easiest way to do that is to invert the operation:

sed -e '1,/http:\/\/www\.yahoo\.com/d' file1 > file2

which means "for line 1 through the first line matching the regexp /http:\/\/www\.yahoo\.com/, delete the line" (and then, implicitly, print everything else; note that -n is not used this time).
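
If the escaped slashes are hard to read, sed also accepts a different regexp delimiter in an address when it is introduced with a backslash. The sketch below assumes "|" as the delimiter and a throwaway sample file; the variable names are made up for illustration.

```shell
# Throwaway sample file with the URLs from the question.
sample=$(mktemp)
printf '%s\n' \
  'http://www.google.com' \
  'http://www.bing.com' \
  'http://www.yahoo.com' \
  'http://www.baidu.com' \
  'http://www.yandex.com' > "$sample"

# \|...| uses "|" as the regexp delimiter, so the slashes in the URL need
# no escaping; only the dots are escaped so that they match literally.
after_match=$(sed -e '1,\|http://www\.yahoo\.com|d' "$sample")

# The inclusive form with the same delimiter.
from_match=$(sed -n -e '\|http://www\.yahoo\.com|,$p' "$sample")

printf '%s\n' "$after_match"
rm -f "$sample"
```

Note that with 1,/regexp/ the regexp is only checked starting from line 2, so a match on line 1 itself does not end the range. GNU sed offers the 0,/regexp/ form for that case.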

zwol
  • What is $p? Okay, it's the STOP. When does it stop? A google search reveals nothing. The sed tutorials I've looked at do not mention it. – 7stud Dec 25 '14 at 22:15
  • @7stud In the terms I used, the STOP is just the dollar sign; the 'p' is the COMMAND. '`/.../,$`' means "do something starting with the first line matching the regular expression and continuing until the end of the file", and 'p' means 'print'. http://www.gnu.org/software/sed/manual/html_node/Addresses.html might be helpful. – zwol Dec 25 '14 at 22:43
  • *the 'p' is the COMMAND* -- Ahh. Why not write it as : `/../,$ p?` for clarity, with the format being `START,STOP COMMAND`? – 7stud Dec 25 '14 at 23:31
  • @7stud It won't work if you do that. Well, I suppose modern implementations might have relaxed the syntax, but in the *traditional* Unix Version 7 implementation, no whitespace allowed between the address and the command. – zwol Dec 26 '14 at 02:24
awk '/yahoo/ ? c++ : c' file1

Or golfed

awk '/yahoo/?c++:c' file1

Result

http://www.baidu.com
http://www.yandex.com
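
One way to read the ternary: on a line matching yahoo the pattern is c++, whose value is c before the increment (0 on the first match, so that line is not printed); on every other line the pattern is simply c. The sketch below spells this out with a helper variable show, which is not part of the original answer, and adds a made-up second yahoo URL to the sample to show that later matches are printed.

```shell
# Sample from the question plus a hypothetical second "yahoo" line.
sample=$(mktemp)
printf '%s\n' \
  'http://www.google.com' \
  'http://www.bing.com' \
  'http://www.yahoo.com' \
  'http://www.baidu.com' \
  'http://www.yandex.com' \
  'http://mail.yahoo.com' > "$sample"

# The one-liner from the answer.
short=$(awk '/yahoo/ ? c++ : c' "$sample")

# Long form: decide whether to print from the value of c *before* any
# increment, then count the match, then print if the decision was "yes".
long=$(awk '{ show = (c > 0) } /yahoo/ { c++ } show' "$sample")

printf '%s\n' "$long"
rm -f "$sample"
```

Unlike a variant that skips every matching line with next, both forms here print the later mail.yahoo.com line, because only the first match sees c equal to 0.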
Zombo

This is most easily done in Perl:

perl -ne 'print unless 1 .. m(http://www\.yahoo\.com)' file

In other words, print all lines that aren’t between line 1 and the first occurrence of that pattern.

tchrist
  • It works, too. I have never used Perl before. What does the 1 .. m(search) mean? The syntax looks different from other programming languages and is not quite straightforward. – B.Mr.W. Aug 11 '13 at 00:31
  • @user84771 It means from when the current line number is 1 through a line that matches that search. Normally the search is written as `/search/`, but I didn’t want to have to escape the slashes. For example, you could say `print if 1 .. /^$/` to print up through and including a blank line. – tchrist Aug 11 '13 at 00:40
  • For those who still find this one-liner cryptic, the key is the range operator (the double-dot). In scalar context, the range operator acts as a flip-flop that maintains its own boolean state. Also, when one of its operands is a constant (like "1" above), it matches against the current line number of the input being evaluated. Details here: http://perldoc.perl.org/perlop.html#Range-Operators – billyw Apr 06 '16 at 15:00

Using this script:

# Get the line number of the first line containing "yahoo"
index=$(grep -n "yahoo" filepath | head -n 1 | cut -d':' -f1)

# Get the total number of lines in the file
totallines=$(wc -l < filepath)

# Number of lines that follow the match
result=$((totallines - index))

# Print the matching line and everything after it
grep -A $result "yahoo" filepath
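
Once the line number is known, the remainder of the file can also be printed directly with tail, which avoids the subtraction and leaves out the matching line itself. A minimal sketch, assuming a throwaway sample file; the names sample, n and after are made up for illustration.

```shell
# Throwaway sample file with the URLs from the question.
sample=$(mktemp)
printf '%s\n' \
  'http://www.google.com' \
  'http://www.bing.com' \
  'http://www.yahoo.com' \
  'http://www.baidu.com' \
  'http://www.yandex.com' > "$sample"

# Line number of the first match (3 for this sample).
n=$(grep -n 'http://www\.yahoo\.com' "$sample" | head -n 1 | cut -d: -f1)

# tail -n +K starts printing at line K, so K = n + 1 skips the match itself.
after=$(tail -n +"$((n + 1))" "$sample")

printf '%s\n' "$after"
rm -f "$sample"
```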
Peter Mortensen
user1502952
  • Why are you reinventing what is a basic `sed` one-liner? – tripleee Aug 13 '13 at 05:21
  • was just trying to reply grep question with grep answer. – user1502952 Aug 13 '13 at 05:24
  • that is very helpful user1502952.. thanks a lot! but seems like next time I have an ad-hoc query, I would go with sed or awk :) – B.Mr.W. Aug 13 '13 at 14:22
  • Probably a pure GNU grep answer would be `grep -Pzo '.*yahoo(.*\n)*' data.txt` or in the spirit of the script, however in one line: `grep -A$(wc -l < data.txt) yahoo data.txt`. – ikaerom Feb 04 '19 at 21:38