2

I have an HTML file from which I am trying to remove the first occurrence of <...> on each line, using awk's sub/gsub functions.

I used the awk regex operators . * + to match anything between < and >. However, the first occurrence of > is being escaped (?). I don't know if there is a workaround.

Sample input, file.txt (the trailing x is added so that the output lines are not empty):

<div>fruit</div></td>x
<span>banana</span>x
<br/>apple</td>x

code:

awk '{gsub(/^<.*>/,""); print}' file.txt

current output:

x
x
x

desired output:

fruit</div></td>x
banana</span>x
apple</td>x

2 Answers

3

With your shown samples, please try the following awk code. Brief explanation: it uses awk's sub function to replace everything from the leading < up to and including the first > with an empty string ([^>]* matches any run of characters other than >, so the match stops at the first >). The trailing 1 then prints every line, whether edited or not.

awk '{sub(/^<[^>]*>/,"")} 1' Input_file


2nd solution: Use awk's match function to match from the leading < up to the first occurrence of >, then print the rest of the line.

awk 'match($0,/^<[^>]*>/){print substr($0,RSTART+RLENGTH)}' Input_file

OR, in case you have lines that do not start with < and you want to print them as well, use the following:

awk 'match($0,/^<[^>]*>/){print substr($0,RSTART+RLENGTH);next} 1' Input_file
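To see how the three variants differ, here is a small sketch that runs them on the question's sample plus one extra line that does not begin with a tag. The `plain text` line and the shell variable names (`sample`, `out_sub`, `out_match`, `out_keep`) are assumptions added for illustration only; they are not part of the original samples.

```shell
# Sample lines from the question, plus one assumed line without a leading tag.
sample='<div>fruit</div></td>x
<span>banana</span>x
<br/>apple</td>x
plain text'

# Variant 1: sub() edits matching lines; the trailing 1 prints every line.
out_sub=$(printf '%s\n' "$sample" | awk '{sub(/^<[^>]*>/,"")} 1')

# Variant 2: match() is the pattern, so lines that do not match are skipped.
out_match=$(printf '%s\n' "$sample" | awk 'match($0,/^<[^>]*>/){print substr($0,RSTART+RLENGTH)}')

# Variant 3: like variant 2, but "next" plus a trailing 1 keeps non-matching lines.
out_keep=$(printf '%s\n' "$sample" | awk 'match($0,/^<[^>]*>/){print substr($0,RSTART+RLENGTH);next} 1')

printf '%s\n---\n%s\n---\n%s\n' "$out_sub" "$out_match" "$out_keep"
```

Variants 1 and 3 should keep the `plain text` line unchanged, while variant 2 drops it, since its action only runs on lines where match() succeeds.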
RavinderSingh13
  • 1
    Thank you for the explanation, it works. I will accept the answer. Also, do you have any links or references you would suggest for study? – Ahmet Said Akbulut Sep 02 '21 at 08:09
  • 2
    @AhmetSaidAkbulut, You're welcome. See https://stackoverflow.com/tags/awk/info for detailed information on books, links, and references; you can also follow the daily Q&A under the `awk` tag to understand it better. Cheers and happy learning. – RavinderSingh13 Sep 02 '21 at 08:10
1

However, the first occurrence of > is being escaped (?).

No, nothing is being escaped. You get this result because, as the GNU AWK manual says,

awk (...) regular expressions always match the leftmost, longest sequence of input characters that can match

In other languages' regular expressions this behaviour is called greedy matching. So for the line

<div>fruit</div></td>x

/^<.*>/ matches

<div>fruit</div></td>

so only x is left. In languages that support so-called non-greedy matching you can use it in such a case, for example in ECMAScript:

let str = "<div>fruit</div></td>x";
let out_str = str.replace(/^<.*?>/, "");
console.log(out_str);

output

fruit</div></td>x

As the GNU AWK manual says, matching in GNU AWK is always longest (greedy), so you have to use [^>], i.e. any character except >, to prevent the match from spanning from the first < to the last >, which would put a > inside the match.
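The difference can be checked directly by asking awk how long each match is. The following is a minimal sketch using match() and RLENGTH on the first sample line; the shell variable names (`line`, `greedy_len`, `first_len`) are assumptions chosen for illustration.

```shell
line='<div>fruit</div></td>x'

# Greedy pattern: leftmost-longest, so the match runs to the last ">" on the line.
greedy_len=$(printf '%s\n' "$line" | awk '{ match($0, /^<.*>/); print RLENGTH }')

# Negated bracket expression: the match cannot cross a ">", so it stops at the first one.
first_len=$(printf '%s\n' "$line" | awk '{ match($0, /^<[^>]*>/); print RLENGTH }')

echo "greedy: $greedy_len, first tag only: $first_len"
# prints "greedy: 21, first tag only: 5"
```

A length of 21 covers everything except the final x, which is why the original gsub call left only x, while a length of 5 covers just `<div>`.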

Daweo