How to extract all <a> tag links and names from an HTML file

Question

Here is a test file that contains links and names within <a></a> tags.

/tmp/test_html.txt

<tr>
<td><a href="http://www.example.com/link1">example link 1</a></td>
</tr>
<tr>
<td><a href="http://www.example.com/link2">example link 2</a></td>
</tr>
<tr>
<td><a href="http://www.example.com/link3">example link 3</a></td>
</tr>
<tr>
<td><a href="https://www.example.com/4/0/1/40116601-1FDC-real-world-link/bar" target="_blank" class="real-world-class">Real World Link</a>&nbsp;</td>
</tr>

The following command (see "How to strip out all of the links of an HTML file in Bash or grep or batch and store them in a text file") finds all the links in the file, but it can't print each link together with its name:

# sed -n 's/.*href="\([^"]*\).*/\1/p' /tmp/test_html.txt

I want the command to print all the links line by line, with the name first, followed by the href.

Here is the expected output:

# sed <...command....> /tmp/test_html.txt

example link 1 | http://www.example.com/link1
example link 2 | http://www.example.com/link2
example link 3 | http://www.example.com/link3
Real World Link | https://www.example.com/4/0/1/40116601-1FDC-real-world-link/bar

How should I write the sed command?

Please [Don't Parse XML/HTML With Regex.](https://stackoverflow.com/a/1732454/3776858) I suggest using an XML/HTML parser (xmlstarlet, xmllint ...). – Cyrus, Jan 31 '23 at 11:23

potong · Accepted Answer · 2023-01-31T13:23:35.753

1

This might work for you (GNU sed):

sed -En 's/.*href="([^"]*)"[^>]*>([^<]*)<.*/\2 | \1/p' file

Suppress automatic printing with the -n option and simplify the regexp with the -E option (extended regular expressions).

Match lines containing an href followed by the link's inner text, then format the output as required using back references.
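
As a minimal sketch, the command can be tried against two of the question's sample rows. The temporary file and the `extract_links` function name are illustrative and not part of the answer; a sed that supports -E (such as GNU sed) is assumed.

```shell
# Hypothetical helper wrapping the answer's command (assumes a sed with -E support).
extract_links() {
  # \1 = the href value, \2 = the text between the end of the opening tag and the next '<'
  sed -En 's/.*href="([^"]*)"[^>]*>([^<]*)<.*/\2 | \1/p' "$1"
}

# Sample rows copied from the question, written to a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<tr>
<td><a href="http://www.example.com/link1">example link 1</a></td>
</tr>
<tr>
<td><a href="https://www.example.com/4/0/1/40116601-1FDC-real-world-link/bar" target="_blank" class="real-world-class">Real World Link</a>&nbsp;</td>
</tr>
EOF

result=$(extract_links "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

Because the leading `.*` is greedy and sed works line by line, this approach reports only the last link on a line that holds several, and it misses a link whose tag and text are split across lines.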

edited Jan 31 '23 at 13:23

answered Jan 31 '23 at 12:14


Thank you so much! Your command works on the original simple example links. But after I copied your command and ran it on my real HTML file, I found that most of the links don't show at all. I have added a real-world link to the post. Please see the `Real World Link` in the post. This command won't print that link. – stackbiz Jan 31 '23 at 13:16
Thank you! It works now on both the example and my real-world HTML files. Can you explain the usage of "/p" in the sed command? I only know sed can be used for search and replace, but I have never used the /p syntax. – stackbiz Jan 31 '23 at 13:30
@stackbiz the `s/match/replace/p` command prints the current line after replacing what the regex on the left-hand side matches with what is on the right-hand side. The `p` is a flag which is explained [here](https://www.gnu.org/software/sed/manual/sed.html#The-_0022s_0022-Command). – potong Jan 31 '23 at 13:51
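
To illustrate the comment above, here is a small sketch of how the -n option and the `p` flag interact. The two input lines and the variable names are made up for the demonstration.

```shell
# Two made-up input lines: only the first one contains "href=".
input='href=one
plain text'

# Without -n, sed prints every line, whether or not the substitution applied.
all_lines=$(printf '%s\n' "$input" | sed 's/href=/link: /')

# With -n, automatic printing is off; the p flag prints only the lines
# on which the substitution succeeded.
only_matches=$(printf '%s\n' "$input" | sed -n 's/href=/link: /p')

printf '%s\n---\n%s\n' "$all_lines" "$only_matches"
```

The first result keeps the untouched `plain text` line, while the second contains only the substituted line. This is why the -n and `p` combination works as a filter in the answer's command.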

score 0 · Answer 2 · answered Jan 31 '23 at 11:41


This solution seems to work; please mark as correct or post a comment to explain why it is not correct; thanks!

cat input3 | sed -n 's/^.*<a href="\(.*\)">example link\( [0-9][0-9]*\)<\/a><\/td>$/example link\2 | \1/p'


Andrew


Because of [UUOC](https://mywiki.wooledge.org/BashFAQ/119) – Jetchisel Jan 31 '23 at 13:11
