I actually solved this while composing the question, but I think it could be neater than the way I did it.
I wanted to strip whitespace and most punctuation (everything except URL-legal characters) from the RDF/N3 entities that appear inside <>s.
An example of the source text would be:
<this is a problem> <this_is_fine> "this is ok too" .
<http://WeDontNeedToTouchThis.> <http:ThisContains"Quotes'ThatWillBreakThings> "This should be 'left alone'." .
The output needs spaces converted to underscores, and quotes and anything else that isn't legal in a URL/IRI removed, but only inside the <>s. For example:
<http://This is a "problem">
=> <http://This_is_a_problem>
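These didn't work. The first replaces only one space per line, and the second treats /</,/>/ as a range of lines, so it replaces every space on every line in that range, inside or outside the <>s: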
sed -e 's/\(<[^ ]*\) \(.*>\)/\1_\2/g' badDoc.n3 | head
sed '/</,/>/{s/ /_/g}' badDoc.n3 | head
My eventual solution, which seems to work, is:
sed -e ':a;s/\(<[^> ]*\) \(.*>\)/\1_\2/g;ta' badDoc.n3 | sed -e ':b;s/\(<[:/%_a-zA-Z0-9.\-]*\)[^><:/%_a-zA-Z0-9.\-]\(.*>\)/\1\2/g;tb' > goodDoc.n3
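One tidier variant I have been considering folds both passes into a single sed call. This is only a sketch under some assumptions: clean_n3 is a name I made up, and it assumes that < and > never appear inside the quoted literals. Each label and command gets its own -e, since as far as I know some sed implementations read everything after ":a" up to the end of the line as the label. I also replaced the trailing .*> with [^<>]*> so that a match cannot run past the closing > of the entity being cleaned, and dropped the backslash from the bracket expressions, where it stood for a literal backslash rather than escaping the hyphen.

```shell
#!/bin/sh
# clean_n3 is a hypothetical helper name, not an existing tool.
# Assumption: '<' and '>' occur only as entity delimiters, never inside literals.
clean_n3() {
    # LC_ALL=C keeps the a-z / A-Z ranges limited to ASCII letters.
    # '|' is the s-command delimiter so '/' can sit in the bracket expressions.
    # Loop a: turn one space inside a <...> into '_' until none are left.
    # Loop b: delete one non-whitelisted character inside a <...> until none are left.
    LC_ALL=C sed \
        -e ':a' \
        -e 's|\(<[^> ]*\) \([^<>]*>\)|\1_\2|' \
        -e 'ta' \
        -e ':b' \
        -e 's|\(<[a-zA-Z0-9:/%_.-]*\)[^<>a-zA-Z0-9:/%_.-]\([^<>]*>\)|\1\2|' \
        -e 'tb'
}
```

I would then run it as: clean_n3 < badDoc.n3 > goodDoc.n3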
Is there a better way?