For more advanced parsing, you can use an in-place editor such as ex/vi, which lets you jump between matching HTML tags, select or delete inner/outer tags, and edit the content in place.
Here is the command:
$ ex +"%s/^[^>].*>\([^<]\+\)<.*/\1/g" +"g/[a-zA-Z]/d" +%p -scq! file.html
54
1
0 (0/0)
0
This is how the command works:
Use the ex in-place editor to substitute on all lines (%) with: ex +"%s/pattern/replace/g".
The substitution pattern consists of 3 parts:
- Select from the beginning of the line up to > (^[^>].*>) for removal, right before the 2nd part.
- Select our main part up to < (\([^<]\+\)).
- Select everything else after < for removal (<.*).
- We replace the whole matching line with \1, which refers to the pattern captured inside the escaped parentheses, \( and \).
After the substitution, we remove any lines that still contain letters by using global: g/[a-zA-Z]/d.
- Finally, print the current buffer on the screen with +%p.
- Then silently (-s) quit without saving (-c "q!"), or save into the file (-c "wq").
Once tested, to edit the file in place, change -scq! to -scwq.
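If the one-liner is hard to follow, the same two steps (substitute, then delete lines containing letters) can be sketched with Python's re module. This is only an illustration, not a replacement for the ex command: the function name extract_values and the sample lines are hypothetical, and Python's regex syntax writes the capture group as ([^<]+) instead of \([^<]\+\).

```python
import re

# Same three-part pattern as the ex command, in Python regex syntax:
#   ^[^>].*>   everything from the start of the line up to a ">"
#   ([^<]+)    the captured text, up to the next "<"
#   <.*        the rest of the line
PATTERN = re.compile(r"^[^>].*>([^<]+)<.*")


def extract_values(lines):
    """Mimic the :s substitution followed by g/[a-zA-Z]/d on a list of lines."""
    result = []
    for line in lines:
        # Like :s, only matching lines are rewritten; others are kept as-is.
        line = PATTERN.sub(r"\1", line, count=1)
        # Like g/[a-zA-Z]/d, drop every line that contains a letter.
        if re.search(r"[a-zA-Z]", line):
            continue
        result.append(line)
    return result


# Hypothetical sample input, not taken from the original file.html.
sample = [
    "<table>",
    "<tr><td>Total</td><td>54</td></tr>",
    "<td>0 (0/0)</td>",
]
print("\n".join(extract_values(sample)))
```

Because .* is greedy, the pattern backtracks to the last > that is followed by plain text, which is why the second sample line yields 54 rather than Total.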
Here is another simple example, which removes the style tag from the header and prints the parsed output:
$ curl -s http://example.com/ | ex -s +'/<style.*/norm nvatd' +%p -cq! /dev/stdin
However, using regular expressions to parse HTML is not advised, so for a long-term approach you should use an appropriate language (such as Python, Perl, or PHP DOM).
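As a rough illustration of the parser-based approach, here is a minimal sketch using html.parser from Python's standard library. It collects the text nodes of a document while skipping anything inside style elements, so it is similar in spirit to the second ex example, although it extracts text rather than printing the remaining markup. The names TextExtractor and extract_text and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside <style> elements."""

    def __init__(self):
        super().__init__()
        self.in_style = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "style":
            self.in_style = True

    def handle_endtag(self, tag):
        if tag == "style":
            self.in_style = False

    def handle_data(self, data):
        # Keep non-empty text that is not part of a stylesheet.
        text = data.strip()
        if text and not self.in_style:
            self.chunks.append(text)


def extract_text(html):
    """Return the list of visible text fragments found in the markup."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Hypothetical sample markup for demonstration only.
sample = (
    "<html><head><style>body { margin: 0 }</style></head>"
    "<body><h1>Example Domain</h1><p>54</p></body></html>"
)
print(extract_text(sample))
```

Unlike the regex, this does not depend on how the markup is split across lines, which is the main reason a parser tends to hold up better over time.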
See also: