I'm trying to read a file line by line and pull out the URL of every anchor tag using capture groups.
So far, I have:
regex="(<a href=\")([A-Za-z0-9:/._-]+)\".*(<\/a>)"
while IFS= read -r line; do
    if [[ $line =~ $regex ]]; then
        #echo ${BASH_REMATCH}
        href=${BASH_REMATCH[2]}
        echo "$href"
    fi
done < file.txt
This almost works, in that the URL is captured as required. The problem is that when a line contains two or more anchor (<a>) tags, only the first one is captured.
I assume there must be a way to capture every repeated match, but I don't know what it is.
Example text would be:
This paragraph has only one anchor tag, <a href="http://google.com" target="_blank">google</a>, lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Some paragraph with a lot of anchor tags, <a href="http://en.wikipedia.org/wiki/Regular_expression" target="_blank">regular expression</a>, lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. <a href="http://en.wikipedia.org/wiki/Bash_(Unix_shell)" target="_blank">Bash</a>. Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. <a href="http://stackoverflow.com/questions/ask" target="_blank">asking</a>, lorem ipsum dolor sit amet <a href="http://en.wikipedia.org" target="_blank">wikipedia</a>
Running my bash script on the above text, saved as file.txt, prints:
http://google.com
http://en.wikipedia.org/wiki/Regular_expression
...and if you uncomment the line #echo ${BASH_REMATCH}, you'll see that the whole paragraph is matched, with only the first anchor captured.
How can I continue to capture all anchor patterns in the paragraph?
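One possible direction, as a rough sketch under assumptions rather than a definitive fix: [[ =~ ]] has no "global" mode and BASH_REMATCH only ever holds one match, so the idea is to match a single anchor at a time and then re-run the match on whatever follows it. The function name extract_hrefs is hypothetical, and the simpler pattern <a href="([^"]+)" is an assumption of this sketch, not the regex above.

```shell
#!/usr/bin/env bash
# Sketch only: print every href found on each input line.
# extract_hrefs is a hypothetical helper name used for illustration.
extract_hrefs() {
    local line rest
    # Assumed pattern: match just the start of one anchor; [^"]+ stops at
    # the closing quote, so the match cannot run on to a later anchor.
    local regex='<a href="([^"]+)"'
    while IFS= read -r line || [[ -n $line ]]; do
        rest=$line
        while [[ $rest =~ $regex ]]; do
            printf '%s\n' "${BASH_REMATCH[1]}"
            # Drop everything up to and including this match, then retry
            # on the remainder of the line.
            rest=${rest#*"${BASH_REMATCH[0]}"}
        done
    done
}

# Small demonstration on a line with two anchors:
extract_hrefs <<< 'one <a href="http://google.com" target="_blank">google</a> two <a href="http://en.wikipedia.org" target="_blank">wikipedia</a>'
# prints:
# http://google.com
# http://en.wikipedia.org
```

With the real input this would be called as extract_hrefs < file.txt. Note that [^"]+ also accepts characters such as the parentheses in the Bash_(Unix_shell) URL, which the original character class does not, and that this only handles anchors written exactly as <a href="...">.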
Thanks for your time!