“sed” command to remove a line that matches an exact string on first word

Question

I've found an answer to my question here: "sed" command to remove a line that match an exact string on first word

...but only partially, because that solution only works if I run it almost exactly as it was written in the answer.

They answered:

 sed -i "/^maria\b/Id" file.txt

...to delete only lines that start with the word "maria", and not lines where "maria" is anything other than the first word.

I want to remove a specific URL from a file, for example "cnn.com". The file also contains a bunch of localhost addresses (127.0.0.1 and 0.0.0.0), and some of both have a single space in front. I also don't want to remove subdomains like ads.cnn.com, so that code "should" work, but it doesn't when I chain more commands with the -e option. My code below seems to clean things up well, except that I can't get it to remove cnn.com! My file is called raw.txt.

 sed -r -e 's/^127.0.0.1//' -e 's/^ 127.0.0.1//' -e 's/^0.0.0.0//' -e 's/^ 0.0.0.0//' -e '/#/d' -e '/^cnn.com\b/d' -e '/::/d' raw.txt | sort | tr -d "[:blank:]" | awk '!seen[$0]++' | grep cnn.com

When I grep for cnn.com I see all the cnn's INCLUDING the one I don't want which is actually "cnn.com".

 ads.cnn.com
 cl.cnn.com
 cnn.com <-- the one I don't want
 cnn.dyn.cnn.com
 customad.cnn.com
 gdyn.cnn.com
 jfcnn.com
 kermit.macnn.com
 metrics.cnn.com
 projectcnn.com
 smetrics.cnn.com
 tiads.sportsillustrated.cnn.com
 trumpincnn.com
 victory.cnn.com
 xcnn.com

If I just use that one piece of code with the cnn.com chop out it seems to work.

 sed -r '/^cnn.com\b/d' raw.txt | grep cnn.com
 * I'm not using the "-e" option

Result:

 ads.cnn.com
 cl.cnn.com
 cnn.dyn.cnn.com
 customad.cnn.com
 gdyn.cnn.com
 jfcnn.com
 kermit.macnn.com
 metrics.cnn.com
 projectcnn.com
 smetrics.cnn.com
 tiads.sportsillustrated.cnn.com
 trumpincnn.com
 victory.cnn.com
 xcnn.com

Nothing I do seems to work when I string commands together with the "-e" option. I need some help on getting my multiple option command kicking with SED.

Any advice?

Ubuntu 12 LTS & 16 LTS.
sed (GNU sed) 4.2.2

Sorry for the frustrations here at SO. I've found it helpful in questions if you are specific to what you need. For this type of question a sample list of values and then your desired results would be very helpful. That being said, there is plenty of information here to work out a pretty thorough answer. — JNevill, May 18 '18 at 18:01
This is working with your sample `sed -re '{/#/d; /^cnn\.com\b/d; /::/d; s/^( ?(127\.0\.0\.1|0\.0\.0\.0))// }'` — LMC, May 18 '18 at 21:59
When you find yourself using pipes of multiple seds and trs and greps and awks and.... just rewrite it as the one simple awk script that's all you need. — Ed Morton, May 19 '18 at 00:27

JNevill · Accepted Answer · 2018-05-18T17:54:21.850

The . is a metacharacter in regex that means "match any one character". So you accidentally created a regex that will also catch cnnPcom or cnn com or cnn\com. While it probably works for your needs, it would be better to be more explicit:

  sed -r '/^cnn\.com\b/d' raw.txt

The difference here is the \ backslash before the . period. That escapes the period metacharacter so it's treated as a literal period.
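
To make the difference concrete, here is a small sketch run against made-up sample lines (they are an assumption for illustration, not taken from the asker's raw.txt). It assumes GNU sed, since -r and \b are GNU extensions:

```shell
# Hypothetical sample input: only the first line is the literal host cnn.com.
sample=$(printf '%s\n' 'cnn.com' 'cnnXcom' 'cnn com' 'ads.cnn.com')

# Unescaped dot: "." matches any character, so the first three lines are deleted.
unescaped=$(printf '%s\n' "$sample" | sed -r '/^cnn.com\b/d')

# Escaped dot: only the line starting with the literal text "cnn.com" is deleted.
escaped=$(printf '%s\n' "$sample" | sed -r '/^cnn\.com\b/d')

printf 'unescaped:\n%s\n\nescaped:\n%s\n' "$unescaped" "$escaped"
```

With the unescaped pattern only ads.cnn.com survives; with the escaped pattern cnnXcom and "cnn com" are kept as well.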

As for your lines that start with a space, you can catch those in a single regex (Again escaping the period metacharacter):

  sed -r '/(^[ ]*|^)127\.0\.0\.1\b/d' raw.txt

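The group (^[ ]*|^) matches either the start of the line followed by any number of spaces (^[ ]*) or just the start of the line (^); in both cases it must then be followed by your match for 127.0.0.1. Since [ ]* also matches zero spaces, ^[ ]* on its own would be enough.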

To string these together, you can use the | OR operator inside parentheses to catch all of your matches:

  sed -r '/(^[ ]*|^)(127\.0\.0\.1|cnn\.com|0\.0\.0\.0)\b/d' raw.txt

Alternatively you can use a ; semicolon to separate out the different regexes:

  sed -r '/(^[ ]*|^)127\.0\.0\.1\b/d; /(^[ ]*|^)cnn\.com\b/d; /(^[ ]*|^)0\.0\.0\.0\b/d;' raw.txt
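
One possible reason the chained -e version in the question keeps cnn.com is ordering. This is a guess that assumes raw.txt is a hosts-style file with lines such as "0.0.0.0 cnn.com", which the question does not actually show. Under that assumption, stripping the address leaves " cnn.com" with a leading space, so a later /^cnn.com\b/d no longer matches. A minimal sketch with hypothetical sample data, again assuming GNU sed:

```shell
# Hypothetical hosts-style input; the real raw.txt is not shown in the question.
sample=$(printf '%s\n' '0.0.0.0 cnn.com' ' 127.0.0.1 ads.cnn.com' \
                       '0.0.0.0 xcnn.com' '# comment' '::1 localhost')

# The question's ordering, reduced to two steps: the address is removed but the
# space after it stays, so the line no longer starts with "cnn".
broken=$(printf '%s\n' '0.0.0.0 cnn.com' |
  sed -r -e 's/^0.0.0.0//' -e '/^cnn.com\b/d')

# Sketch of a fix: strip the optional leading space, the address, and the blanks
# after it in one substitution, then delete the line that is exactly cnn.com.
cleaned=$(printf '%s\n' "$sample" |
  sed -r -e '/#/d' -e '/::/d' \
         -e 's/^ ?(127\.0\.0\.1|0\.0\.0\.0)[[:blank:]]*//' \
         -e '/^cnn\.com[[:blank:]]*$/d')

printf 'broken: [%s]\ncleaned:\n%s\n' "$broken" "$cleaned"
```

The last expression anchors on the end of the line instead of using \b, because \b would also match a longer name such as cnn.com.au.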

Ed Morton · Answer 2 · 2018-05-19T00:25:15.963

sed doesn't understand matching on strings, only regular expressions, and it's ridiculously difficult to get sed to act as if it does; see "Is it possible to escape regex metacharacters reliably with sed". To remove a line whose first space-separated word is "foo", all you need is:

awk '$1 != "foo"' file

To remove lines whose first word is either "foo" or "bar":

awk '($1 != "foo") && ($1 != "bar")' file
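
Applied to the host name from the question, the string comparison might look like the sketch below. The sample lines are hypothetical. Because != compares strings here, the dot needs no escaping, and awk's default field splitting ignores leading blanks:

```shell
# Hypothetical sample input, including a line with leading spaces.
sample=$(printf '%s\n' 'cnn.com' '  cnn.com' 'cnnXcom' 'ads.cnn.com' 'cnn.com.au')

# Keep every line whose first field is not exactly the string "cnn.com".
kept=$(printf '%s\n' "$sample" | awk '$1 != "cnn.com"')

printf '%s\n' "$kept"
```

Both cnn.com lines are dropped, while cnnXcom, ads.cnn.com and cnn.com.au are kept.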

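If you have more than just a couple of words, then the approach is to list them all, create a hash table indexed by them, and test whether the first word of each line is an index of that table. Note that split() stores the words as array values under numeric indices, so they have to be copied into the indices of a second array:

awk 'BEGIN{split("foo bar other word",tmp); for (i in tmp) badWords[tmp[i]]} !($1 in badWords)' file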
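
As a hedged illustration of the lookup-table idea, the sketch below uses made-up sample lines and hypothetical array names (list, bad). It copies the values produced by split() into array keys, since awk's in operator tests array subscripts rather than stored values:

```shell
# Hypothetical sample input; the line "1" checks that split()'s numeric
# indices are not mistaken for unwanted words.
sample=$(printf '%s\n' 'cnn.com' 'ads.cnn.com' '0.0.0.0' 'example.org' '1')

kept=$(printf '%s\n' "$sample" | awk '
  BEGIN {
    # split() fills list[1..n] with the words; copy them into the keys of bad.
    n = split("cnn.com 0.0.0.0 example.org", list)
    for (i = 1; i <= n; i++) bad[list[i]] = 1
  }
  # Print only lines whose first field is not a key of bad.
  !($1 in bad)
')

printf '%s\n' "$kept"
```

Only ads.cnn.com and 1 remain in the output.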

If that's not what you want then edit your question to clarify your requirements and include concise, testable sample input and the expected output given that input.
