gsub: remove up to the first occurrence instead of the last occurrence of a given character in a line

Question

I have an HTML file from which I am trying to remove the first occurrence of <...> on each line using sub/gsub.

I used the awk regex operators . * + to match anything between < and >. However, the first occurrence of > is being escaped (?). I don't know if there is a workaround.

sample input file.txt (a trailing x is added so that the output lines are not empty):

<div>fruit</div></td>x
<span>banana</span>x
<br/>apple</td>x

code:

awk '{gsub(/^<.*>/,""); print}' file.txt

current output:

x
x
x

desired output:

fruit</div></td>x
banana</span>x
apple</td>x

RavinderSingh13 · Accepted Answer · 2021-09-02T08:08:26.030

3

With your shown samples, please try the following awk code. Brief explanation: it uses awk's sub function to replace the text from the leading < through the first > with an empty string. The [^>]* part matches any characters other than >, so the match stops at the first >. The trailing 1 then prints the line, whether or not it was edited.

awk '{sub(/^<[^>]*>/,"")} 1' Input_file
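As a quick check, the sketch below runs the same sub() call over the three sample lines from the question. The wrapper name strip_first_tag is only an illustrative assumption and not part of the original answer.

```shell
# strip_first_tag is a hypothetical helper name used only for this illustration.
# It removes a leading <...> tag, stopping at the first '>' on each line.
strip_first_tag() {
    awk '{ sub(/^<[^>]*>/, "") } 1'
}

printf '%s\n' '<div>fruit</div></td>x' '<span>banana</span>x' '<br/>apple</td>x' |
    strip_first_tag
# prints:
# fruit</div></td>x
# banana</span>x
# apple</td>x
```

Because the pattern is anchored with ^, it can match at most once per line, so sub is enough here and gsub would give the same result.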

2nd solution: use awk's match function to match from the leading < to the first occurrence of >, then print the rest of the line.

awk 'match($0,/^<[^>]*>/){print substr($0,RSTART+RLENGTH)}' Input_file

OR, in case you have lines that do not start with < and you want to print them as well, use the following:

awk 'match($0,/^<[^>]*>/){print substr($0,RSTART+RLENGTH);next} 1' Input_file
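To make the match-based variants easier to follow, here is a small sketch that prints what match() records for one sample line from the question. RSTART and RLENGTH are awk built-ins; the wrapper name show_match is a hypothetical helper used only for this illustration.

```shell
# show_match is a hypothetical helper name used only for this illustration.
# For each line starting with a tag, it prints the values set by match()
# and the remainder of the line after the matched tag.
show_match() {
    awk 'match($0, /^<[^>]*>/) {
        # RSTART is the 1-based start of the match, RLENGTH is its length.
        printf "RSTART=%d RLENGTH=%d rest=%s\n", RSTART, RLENGTH, substr($0, RSTART + RLENGTH)
    }'
}

printf '%s\n' '<span>banana</span>x' | show_match
# prints: RSTART=1 RLENGTH=6 rest=banana</span>x
```

Since the regex is anchored at the start of the line, RSTART is always 1 on a match, and substr($0, RSTART + RLENGTH) begins right after the first >.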

edited Sep 02 '21 at 08:08

answered Sep 02 '21 at 08:03

RavinderSingh13


1

Thank you for the explanation, it works. I will accept the answer. Also, do you have any links or references to suggest for further study? – Ahmet Said Akbulut Sep 02 '21 at 08:09
2

@AhmetSaidAkbulut, you're welcome. See https://stackoverflow.com/tags/awk/info for detailed information, including links to books and references. You can also follow the daily Q&A under the `awk` tag to understand it better. Cheers and happy learning. – RavinderSingh13 Sep 02 '21 at 08:10

score 1 · Answer 2 · answered Sep 02 '21 at 10:29

However first occurrence of > is being escaped (?).

No, you got that result because, as the GNU AWK manual says,

awk(...)regular expressions always match the leftmost, longest sequence of input characters that can match

This is called greedy matching in other languages' regular expression usage, so for

<div>fruit</div></td>x

/^<.*>/ matches

<div>fruit</div></td>

so you end up with x. In languages that support so-called non-greedy matching, you can use it in such a case, for example in ECMAScript:

let str = "<div>fruit</div></td>x";
let out_str = str.replace(/^<.*?>/, "");
console.log(out_str);

output

fruit</div></td>x

As the GNU AWK manual says, in GNU AWK the match is always the longest one (greedy), so you have to use [^>], i.e. any character except >, to prevent the match from spanning from the first < to the last >, which would include > characters inside.
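The difference can be observed directly in awk by comparing how many characters each pattern matches on the first sample line. This is only a sketch: match_len is a hypothetical helper name, and the regex is passed in as a dynamic regex string through -v.

```shell
# match_len is a hypothetical helper name used only for this illustration.
# It prints RLENGTH, the number of characters matched by the regex given
# as the first argument, for each input line.
match_len() {
    awk -v re="$1" '{ match($0, re); print RLENGTH }'
}

printf '%s\n' '<div>fruit</div></td>x' | match_len '^<.*>'     # prints 21
printf '%s\n' '<div>fruit</div></td>x' | match_len '^<[^>]*>'  # prints 5
```

The first pattern consumes everything up to the last > (21 of the 22 characters, leaving only x), while the second stops at the first > and matches just <div>.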
