0

I am trying to print each matched non-numeric string on its own line with sed.

So, if I have the string abc def 123 (ghi), I want the output to be:

(abc)
(def)
(ghi)

This is what I have tried:

echo "abc def 123   (ghi)" | sed -r 's/([a-z]+)/(\1)\n/g'

But this outputs the following:

(abc)
 (def)
 123   ((ghi)
)  

I am quite confused and have several questions: Why is there a leading space on lines 2 and 3? Why is there a double bracket before ghi? Why is 123 not eliminated? Why does the closing bracket appear on its own on the last line?

Update

Actually, I wanted to extract URLs from a specific domain. So, using the suggestions in the comments and answers, I tried the following:

in="https://www.example.com/user1 ddsf none  http://www.example.com/user2 kbu7f7yy"
echo $in | sed 's/http[s]*:\/\/www.example.com\/[^ ]*/&\n/g'

This printed the following:

https://www.example.com/user1
 ddsf none http://www.example.com/user2
 kbu7f7yy

So, I tried this (as suggested in one )

echo $in | sed 's/.*\(http[s]*:\/\/www.example.com\/[^ ]*\).*/\1\n/g'

But I ended up getting:

http://www.example.com/user2
Rnj
  • `sed` doesn't remove the things that it didn't match. Your expression says to wrap consecutive letters in parentheses and append a line break. The spaces are there because the "words" are separated by spaces. – Felix Kling Sep 04 '20 at 12:56
  • Try `sed -r 's/[^a-z]*([a-z]+)[^a-z]*/(\1)\n/g'`, or `sed -E 's/[^[:alpha:]]*([[:alpha:]]+)[^[:alpha:]]*/(\1)\n/g'` – Wiktor Stribiżew Sep 04 '20 at 12:56
  • @FelixKling was referring to [this](https://stackoverflow.com/questions/16675179/how-to-use-sed-to-extract-substring) – Rnj Sep 04 '20 at 12:58
  • @WiktorStribiżew so we have to match full string? – Rnj Sep 04 '20 at 13:26
  • Why am I not able to replicate it on the updated example at the end of the original question? – Rnj Sep 04 '20 at 13:57

3 Answers

2

Replace every run of non-letters, as well as the beginning and the end of the line, with ) (, then remove the surplus parentheses:

sed -r 's/[^a-z]+|^|$/) (/g;s/^\) | \($//g'

But I find the following Perl solution more readable:

perl -lne 'print "($1)" while /([a-z]+)/g'
  • -n reads the input line by line and runs the code for each line
  • -l removes newlines from input and adds them to output
choroba
1

This might work for you (GNU sed):

sed -E '/\n/!s/\<[[:alpha:]]+\>/\n(&)\n/g;/^\([[:alpha:]]+\)/P;D' file

This wraps each alphabetic string in parentheses and surrounds it with newlines, then prints only those lines that begin with an opening parenthesis, alphabetic characters, and a closing parenthesis.

For urls, maybe:

sed -E '/\n/!s/https?\S+/\n&\n/g;/^https?/P;D' file

Use the -E command line option so as to use extended regexps:

  • /\n/!s/https?\S+/\n&\n/g if the pattern space does not contain any newlines, globally replace each string that begins with http and an optional s with that same string surrounded by newlines.
  • /^https?/P if the pattern space begins with http and an optional s, print up to and including the first newline.
  • D delete up to and including the first newline and, if the pattern space is not empty, restart the sed cycle without fetching the next line from the file.

Thus, the first time through, the substitution takes place, and thereafter the printing and deleting occur. The pattern space shrinks each time it is processed until it is empty, and then the next line is read into the pattern space.
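
To make the cycle concrete, here is a small sketch that runs the URL command on the input from the question's update. It assumes GNU sed (for `\S` and for `\n` in the replacement); the function name `extract_urls` and the variable `input` are hypothetical names chosen for this example.

```shell
# Hypothetical helper: the URL command from this answer, reading stdin.
# Requires GNU sed: \S and \n in the replacement are GNU extensions.
extract_urls() {
  sed -E '/\n/!s/https?\S+/\n&\n/g;/^https?/P;D'
}

# Input string taken from the question's update.
input="https://www.example.com/user1 ddsf none  http://www.example.com/user2 kbu7f7yy"

printf '%s\n' "$input" | extract_urls
# https://www.example.com/user1
# http://www.example.com/user2
```

After the single substitution, each pass through the cycle looks at one newline-delimited chunk: chunks that start with http are printed by P, and every chunk, printed or not, is then discarded by D.
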

potong
  • `echo $in | sed '/\n/!s/https?\S+/\n&\n/g;/^https?/P;D'` gives nothing in output. – Rnj Sep 04 '20 at 14:35
  • @Rnj maybe try with the `-E` option? or insert a back slash before each `?` and `+` – potong Sep 04 '20 at 14:39
  • with `-r`, it worked, but am struggling to understand its meaning – Rnj Sep 04 '20 at 14:47
  • Please explain. Am struggling to get it all. Quite new to `sed`. – Rnj Sep 04 '20 at 15:08
  • in the first bullet, I understand `/g` is global flag, `\n&\n` is "substitute matched strings with newline surrounding", `s/` is substitute, `https?\S+` is for matching link. But what is that prefix `/\n/!`? – Rnj Sep 16 '20 at 08:10
  • @Rnj when sed places a line in the pattern space it removes the newline. The only time newlines will be in the pattern space is if they are put there by commands. The `D` command removes up to and including the first newline and then, if the pattern space still contains data, it restarts the sed cycle without replacing the pattern space with the next line. Thus `/\n/!s/../../` prevents the pattern space from being substituted again if it already has been. Like a one-time-only switch. – potong Sep 16 '20 at 08:50
0

The sed command can be simple: sed 's/[()0-9]//g; s/[a-z]\+/(&)\n/g; s/ //g;'

  • Remove all parens and digits
  • Surround all words in (&)\n, where & is sed shorthand for the matched word
  • Remove all spaces
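
The three steps above can be sketched as a small function and run on the sample string from the question. This assumes GNU sed (for `\+` and for `\n` in the replacement); the name `strip_and_wrap` is a hypothetical label used only here.

```shell
# Hypothetical helper: the three substitutions from this answer, in order.
# 1. drop parentheses and digits
# 2. wrap each run of lowercase letters in (...) followed by a newline
# 3. drop the remaining spaces
strip_and_wrap() {
  sed 's/[()0-9]//g; s/[a-z]\+/(&)\n/g; s/ //g;'
}

# Sample string taken from the question.
printf '%s\n' 'abc def 123   (ghi)' | strip_and_wrap
```

This prints (abc), (def), and (ghi) on separate lines. Since a newline is appended after every word, including the last one, the output ends with one extra empty line; piping through `sed '/^$/d'` is one way to drop it if that matters.
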

This could also be done this way: grep -Pow '[a-z]+' | sed 's/.*/(&)/'

For the url example, grep is a lot easier for extracting words than sed: grep -Pow 'http\S+'

  • -P for perl matching to allow \S+ to mean 'non-space'
  • -o for only matching
  • -w for word matching (roughly equivalent to \bhttp\S+\b)

If, for some reason, you still want to add parens: grep -Pow 'http\S+' | sed 's/.*/(&)/' (the sed script must be quoted, otherwise the shell interprets the parentheses and the &)
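
For reference, a short sketch of the grep approach on the input from the question's update. It assumes a GNU grep built with PCRE support (needed for -P); the function names `extract_urls_grep` and `wrap_parens` and the variable `input` are hypothetical.

```shell
# Hypothetical helpers wrapping the commands from this answer.
extract_urls_grep() {
  grep -Pow 'http\S+'      # -o prints each match on its own line
}
wrap_parens() {
  sed 's/.*/(&)/'          # quoted so the shell leaves ( ) and & alone
}

# Input string taken from the question's update.
input="https://www.example.com/user1 ddsf none  http://www.example.com/user2 kbu7f7yy"

printf '%s\n' "$input" | extract_urls_grep
# https://www.example.com/user1
# http://www.example.com/user2

printf '%s\n' "$input" | extract_urls_grep | wrap_parens
# (https://www.example.com/user1)
# (http://www.example.com/user2)
```

Note that 'http\S+' matches any URL, not only those on www.example.com; tightening the pattern to the specific domain is left as an adjustment to the regex.
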

stevesliva