14

How can I remove the beginning of a word using grep? For example, I have a file that contains this:

www.abc.com

I only need this part:

abc.com

Sorry for the basic question, but I have no experience with Linux.

Matthias Braun
Jury A

6 Answers

15

You don't edit strings with grep in a Unix shell. grep is normally used to select or filter out lines of text. Use sed instead:

$ echo www.example.com | sed 's/^[^\.]\+\.//'
example.com

You'll need to learn regular expressions to use it effectively.

sed can also edit a file in place (modifying the file itself) if you pass the -i option, but be careful: a wrong sed command combined with -i can easily destroy data.
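One way to reduce that risk is to let sed keep a backup copy. This is only a minimal sketch, assuming a sed that accepts a suffix attached directly to -i (GNU and BSD sed both do); the file name domains.txt and its contents are made up for illustration:

```shell
# Hypothetical sample file, created only for this demonstration.
tmpdir=$(mktemp -d)
printf '%s\n' 'www.example.com' 'example.org' > "$tmpdir/domains.txt"

# -i.bak edits domains.txt in place and keeps the original as domains.txt.bak.
sed -i.bak 's/^www\.//' "$tmpdir/domains.txt"

cat "$tmpdir/domains.txt"      # the edited file
cat "$tmpdir/domains.txt.bak"  # the untouched backup
```

If the result looks wrong, the .bak file can simply be copied back over the edited one.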

An example

From your comments, I guess you have a TeX document and you want to remove the first part of all .com domain names. Suppose this is your document test.tex:

\documentclass{article}
\begin{document}
www.example.com
example.com www.another.domain.com
\end{document}

then you can transform it with this sed command (redirect output to file or edit in-place with -i):

$ sed 's/\([a-z0-9-]\+\.\)\(\([a-z0-9-]\+\.\)\+com\)/\2/gi' test.tex 
\documentclass{article}
\begin{document}
example.com
example.com another.domain.com
\end{document}

Please note that:

  • A common sequence of allowed symbols followed by a dot is matched by [a-z0-9-]\+\.
  • I used groups in the regular expression (parts of it within \( and \)) to indicate the first and the second part of the URL, and I replace the entire match with its second group (\2 in the substitution pattern)
  • The domain must be at least a third-level .com domain (every \+ repetition requires at least one match)
  • The search is case-insensitive (i flag at the end)
  • It can replace more than one match per line (g flag at the end)
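The same substitution may be easier to read with extended regular expressions, where groups and + need no backslashes. A sketch, assuming GNU sed (the case-insensitive i flag is not part of POSIX); the function name strip_first_label is hypothetical:

```shell
# Same regex as above, written for sed -E (extended regular expressions).
# Group 1 is the first label plus its dot; group 2 is the rest of the domain.
# Assumption: GNU sed, because of the non-POSIX "i" flag.
strip_first_label() {
  sed -E 's/([a-z0-9-]+\.)(([a-z0-9-]+\.)+com)/\2/gi'
}

printf '%s\n' 'www.example.com' 'example.com WWW.Another.Domain.COM' | strip_first_label
```

The second-level name example.com is left alone because the regex needs at least three labels to match.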
sastanin
  • The URLs are saved in a file. So my command will be: grep'\.com$' source.text >dest.tex | sed 's/^[^\.]\+\.//' ?? It gives me error ?? – Jury A Jul 26 '12 at 17:32
  • I also need to write the names (they are many lines not one) in another text file after removing www. – Jury A Jul 26 '12 at 17:37
  • I tried to guess what's your task and wrote an example of a `sed` regex to edit domain names in the document, without touching the rest of the lines. If your problem is different you may need a different regex, but overall the idea is the same. – sastanin Jul 27 '12 at 12:33
  • Normally you either redirect to file (`> dest.tex`), or just use pipe (`| sed ...`), but not both. You don't need `grep` if you want to change some lines but keep the rest. A carefully written regex and `sed` is probably all you need. – sastanin Jul 27 '12 at 12:34
  • On macOS, the `sed` command does not work the same as the Linux version. But you could use this simpler version on the Mac, without regular expressions: `echo www.example.com | sed "s/www.//"` -- It will replace `"www."` with empty string `""`. – Mr-IDE Oct 16 '19 at 10:38
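The pipe-then-redirect advice from the comments can be sketched like this. The file names source.txt and dest.txt and the sample lines are assumptions for illustration only:

```shell
# Hypothetical input file standing in for the list of URLs.
workdir=$(mktemp -d)
printf '%s\n' 'www.abc.com' 'www.example.org' 'shop.example.com' > "$workdir/source.txt"

# grep keeps only the lines ending in .com, sed strips a leading "www.",
# and the single redirection at the very end writes the result to a new file.
grep '\.com$' "$workdir/source.txt" | sed 's/^www\.//' > "$workdir/dest.txt"

cat "$workdir/dest.txt"
```

The redirection belongs after the last command in the pipeline; redirecting grep's output to a file leaves nothing for sed to read.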
9

As the others have noted, grep is not well suited for this task. sed is a good option, but if the text is regularly structured, a simple cut might be easier to type:

echo www.abc.com | cut -d. -f2-
  • -d. tells cut to use . as a delimiter.
  • -f2- tells cut to return field 2 to infinity.
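Note that cut removes the first field whatever it contains, which is why the input needs to be regular. A small sketch with made-up input lines; the function name drop_first_field is hypothetical:

```shell
# cut drops the first dot-separated field, whether or not it is "www".
drop_first_field() {
  cut -d. -f2-
}

# Prints "abc.com", then "com", then "localhost":
# a line without the delimiter is passed through unchanged.
printf '%s\n' 'www.abc.com' 'abc.com' 'localhost' | drop_first_field
```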
Thor
7

You can do this using grep easily:

$ echo www.google.com | grep -o '[^.]*\.com'
google.com

Instead of using echo, give grep your file:

$ grep -o '[^.]*\.com$' < file

Here I used the regular expression '[^.]*\.com'. It means: find a word with no . in it ([^.]*) that is followed by .com (\.com in the regex). The -o option tells grep to print only the part of the line that matched.
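To illustrate what this pattern keeps on several lines at once, here is a sketch with made-up input, assuming a grep that supports -o (GNU and BSD grep do); the function name last_label_and_com is hypothetical:

```shell
# Print only the last label before a trailing ".com", plus the ".com" itself.
last_label_and_com() {
  grep -o '[^.]*\.com$'
}

# Prints "abc.com" and "domain.com"; the .org line does not match at all.
printf '%s\n' 'www.abc.com' 'www.another.domain.com' 'example.org' | last_label_and_com
```

For deeper names such as www.another.domain.com this keeps only domain.com, so more than the leading www. is removed.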

Igor Chubin
6

with grep's --only-matching and \K

You can do this with grep's --only-matching option:

echo 'www.abc.com' | grep --perl-regexp --only-matching 'www\.\K.*'

which can be shortened to

echo 'www.abc.com' | grep -Po 'www\.\K.*'

Both commands produce

abc.com

with grep (GNU grep) 3.3.

Instead of echo, I'll use a here string to shorten the command further:

grep -Po 'www\.\K.*' <<< 'www.abc.com'

\K resets the starting point of the match, essentially forgetting the matched "www.". See this for more on \K.

with grep's positive lookbehind

You can also do this with a positive lookbehind:

grep -Po '(?<=www\.).*' <<< 'www.abc.com'
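On multi-line input the \K and lookbehind variants should give the same result. A sketch, assuming GNU grep built with PCRE support (the -P option); the input lines are made up:

```shell
# Hypothetical multi-line input.
input=$(printf '%s\n' 'www.abc.com' 'example.org' 'www.example.net')

# Assumption: GNU grep with -P available.
with_k=$(printf '%s\n' "$input" | grep -Po 'www\.\K.*')
with_lookbehind=$(printf '%s\n' "$input" | grep -Po '(?<=www\.).*')

printf '%s\n' "$with_k"
# Lines without "www." (example.org here) produce no output in either variant.
[ "$with_k" = "$with_lookbehind" ] && echo 'both variants agree'
```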

with awk's field separator -F

awk -F 'www\\.' '$2{print $2}' <<< 'www.abc.com'

This prints

abc.com

The $2{print $2} part will print the second field if it's defined. This is necessary in case of multi-line input to avoid outputting blank lines for input lines that don't contain the field separator.
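If lines without "www." should be kept rather than dropped, awk's sub() function is an alternative to the field-separator approach. This is a different technique from the command above, shown only as a sketch; the function name strip_www_keep_rest is hypothetical:

```shell
# Remove a leading "www." where present and print every line either way.
strip_www_keep_rest() {
  awk '{ sub(/^www\./, ""); print }'
}

# Prints "abc.com" and "example.org".
printf '%s\n' 'www.abc.com' 'example.org' | strip_www_keep_rest
```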

with sed

sed --regexp-extended --quiet 's/www\.(.*)/\1/p' <<< 'www.abc.com'

The parentheses form a group which will capture everything after "www.". Using \1 we reference that group and /p prints it.

The options --regexp-extended and --quiet have the shorter equivalents -E and -n:

sed -E -n 's/www\.(.*)/\1/p' <<< 'www.abc.com'

As noted by Vladimir Nesterenco in a deleted answer, it's advisable to escape the dot with a backslash in all these regexes, to avoid matching strings that start with "www" followed by an arbitrary character, not only a dot. Otherwise, you'd extract "abc.com" from "wwwXabc.com", for example.

Depending on your input text, you might want to change the regex to make sure to only match occurrences of "www." at the beginning of a line:

^www\.
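The difference the anchor makes can be seen with a made-up host name that contains "www." in the middle. A minimal sketch using plain sed substitution:

```shell
line='ftp.www.example.com'

# Unanchored: the inner "www." is removed.
unanchored=$(printf '%s\n' "$line" | sed 's/www\.//')

# Anchored with ^: nothing matches, so the line is left unchanged.
anchored=$(printf '%s\n' "$line" | sed 's/^www\.//')

printf '%s\n' "$unanchored" "$anchored"
```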

with Bash's parameter expansion

If your input consists only of a single line, Bash's built-in parameter expansion might be useful:

input="www.abc.com"; after=${input#"www."}; echo "$after"

If the input string doesn't start with "www.", this will print the whole string.
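Wrapped in a small function, the expansion could be reused like this. A sketch only; the function name strip_www is hypothetical:

```shell
# Print the argument with a leading "www." removed, if there is one.
strip_www() {
  printf '%s\n' "${1#www.}"
}

strip_www 'www.abc.com'   # prints abc.com
strip_www 'abc.com'       # prints abc.com (nothing to strip)
```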

Matthias Braun
2

grep is not used to manipulate or change text, only to search for text or patterns within text.

You should look into something like sed or awk or cut if you want a command line tool to do it. Or write a script in Python/Perl/Ruby/whatever.

Daniel DiPaolo
1

You can actually do this without invoking other programs, by using a builtin parameter expansion in bash:

while IFS= read -r line; do printf '%s\n' "${line#*.}"; done < file

Here #*. tells the shell to remove the shortest prefix that matches zero or more characters followed by a dot (.).
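A single # removes the shortest matching prefix, while ## removes the longest one. A short sketch with a made-up host name shows the difference:

```shell
host='www.another.domain.com'

shortest=${host#*.}    # removes "www."                -> another.domain.com
longest=${host##*.}    # removes "www.another.domain." -> com

printf '%s\n' "$shortest" "$longest"
```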

You can view a cheatsheet with the different parameter expansions for bash here:

https://devhints.io/bash

clemens
Fahd Ahmed