trim whitespace inside angle brackets in sed

Question

I actually solved this while composing the question but I think it could be neater than the way I did it.

I wanted to trim whitespace and most punctuation, except characters that are legal in a URL, from the text that appears inside <>s (RDF/N3 entities).

An example of the source text would be:
<this is a problem> <this_is_fine> "this is ok too" . <http://WeDontNeedToTouchThis.> <http:ThisContains"Quotes'ThatWillBreakThings> "This should be 'left alone'." .

The output needs to convert spaces to underscores and strip quotes and anything else that isn't legal in a URL/IRI.

<http://This is a "problem"> => <http://This_is_a_problem>

These didn't work:
sed -e 's/\(<[^ ]*\) \(.*>\)/\1_\2/g' badDoc.n3 | head
sed '/</,/>/{s/ /_/g}' badDoc.n3 | head

My eventual solution, that seems to work, is:
sed -e ':a;s/\(<[^> ]*\) \(.*>\)/\1_\2/g;ta' badDoc.n3 | sed -e ':b;s/\(<[:/%_a-zA-Z0-9.\-]*\)[^><:/%_a-zA-Z0-9.\-]\(.*>\)/\1\2/g;tb' > goodDoc.n3
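For reference, the two passes can probably be run in a single sed invocation. This is only a sketch: the function name clean_iris is made up for illustration, the whitelist of allowed characters is copied from the command above (minus the backslash), and it assumes that < and > only ever appear as IRI delimiters, never inside a quoted literal. Using [^<>]*> instead of .*> keeps each match inside one pair of brackets.

```shell
# Hypothetical helper: reads N3 lines on stdin, writes cleaned lines to stdout.
clean_iris() {
    # Loop a: turn the first remaining space inside a <...> pair into "_".
    # Loop b: delete the first remaining character inside a <...> pair
    #         that is not in the whitelist (assumed from the original command).
    sed -e ':a' \
        -e 's/\(<[^<> ]*\) \([^<>]*>\)/\1_\2/' \
        -e 'ta' \
        -e ':b' \
        -e 's/\(<[a-zA-Z0-9:/%_.-]*\)[^<>a-zA-Z0-9:/%_.-]\([^<>]*>\)/\1\2/' \
        -e 'tb'
}
```

It would be used as: clean_iris < badDoc.n3 > goodDoc.n3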

Is there a better way?

I don't get what you want to do. What would be the output for your source text? — Kent, Mar 13 '13 at 09:57
I hope you realise that you can't change the characters within angle brackets without changing the meaning of the file. Moreover, "'" is a reserved character in n3, and anything generating such files is broken and should be fixed. — Recurse, Mar 14 '13 at 02:44
I understand that. We're generating the n3, and although it was fixed in our import process, I was dealing with a batch of n3 that included unescaped strings in the IRI (mostly filenames including quotes). These needed to be cleaned before we could process that batch. — user1616353, Mar 14 '13 at 20:55

score 1 · Answer 1 · answered Mar 14 '13 at 21:40

First of all, I would say that this is an interesting problem. It looks like a simple substitution problem, but once you go into it, it is not as easy as I thought. While I was looking for a solution, I really missed vim!!!... :)

I don't know if sed is a must for this question. I would do it with awk:

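awk '{t=$0; out=""
        # walk the line left to right; text outside <...> is copied unchanged
        while (match(t,/<[^>]*>/)>0){
                n=substr(t,RSTART,RLENGTH)
                gsub(/[\042\047]/,"",n)   # drop double and single quotes
                gsub(/ /,"_",n)           # spaces to underscores
                out=out substr(t,1,RSTART-1) n
                t=substr(t,RSTART+RLENGTH)
        }
        print out t}' file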

A quick test with your example:

kent$  cat file
<this is a problem> <this_is_fine> "this is ok too" . <http://WeDontNeedToTouchThis.> <http:ThisContains"Quotes'ThatWillBreakThings> "This should be 'left alone'." .

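kent$  awk '{t=$0; out=""
        # walk the line left to right; text outside <...> is copied unchanged
        while (match(t,/<[^>]*>/)>0){
                n=substr(t,RSTART,RLENGTH)
                gsub(/[\042\047]/,"",n)   # drop double and single quotes
                gsub(/ /,"_",n)           # spaces to underscores
                out=out substr(t,1,RSTART-1) n
                t=substr(t,RSTART+RLENGTH)
        }
        print out t}' file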
<this_is_a_problem> <this_is_fine> "this is ok too" . <http://WeDontNeedToTouchThis.> <http:ThisContainsQuotesThatWillBreakThings> "This should be 'left alone'." .

Well, it is not really a one-liner; perhaps others will come up with a shorter solution.
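The question also asks to drop anything that isn't legal in a URL/IRI, not only quotes. A rough sketch of how the awk approach might be extended to do that is below. The function name clean_iris_awk is made up for illustration, and the whitelist of allowed characters is an assumption taken from the sed command in the question; adjust it to whatever your IRIs may legally contain.

```shell
# Hypothetical helper: reads N3 lines on stdin, writes cleaned lines to stdout.
clean_iris_awk() {
    awk '
    # Anything NOT in this set is removed inside <...> (assumed whitelist).
    BEGIN { illegal = "[^a-zA-Z0-9:/%_.-]" }
    {
        out = ""; rest = $0
        while (match(rest, /<[^>]*>/)) {
            iri = substr(rest, RSTART + 1, RLENGTH - 2)  # text between < and >
            gsub(/ /, "_", iri)                          # spaces to underscores
            gsub(illegal, "", iri)                       # strip everything else
            out = out substr(rest, 1, RSTART - 1) "<" iri ">"
            rest = substr(rest, RSTART + RLENGTH)
        }
        print out rest   # append whatever follows the last <...>
    }'
}
```

Because the line is rebuilt piece by piece, text outside the angle brackets (such as quoted literals) is passed through unchanged.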
