Bash RegEx: Read file line by line to pull out each href in captured groups

Question

I'm trying to read a file line by line and pull out the href of every anchor tag using capture groups.

So far, I have:

regex="(<a href=\")([A-Za-z0-9:/._-]+)\".*(<\/a>)"
while read line; do    
    if [[ $line =~ $regex ]]; then
        #echo ${BASH_REMATCH}
        href=${BASH_REMATCH[2]}
        echo $href
    fi
done < file.txt

This almost works: the URL is captured as required. The problem is that when a line contains two or more anchor <a> tags, only the first one is captured.

I assume there is a way to capture every repetition of a group, but I don't know it.

Example text would be:

This paragraph has only one anchor tag, <a href="http://google.com" target="_blank">google</a>, lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. 

Some paragraph with a lot of anchor tags, <a href="http://en.wikipedia.org/wiki/Regular_expression" target="_blank">regular expression</a>, lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. <a href="http://en.wikipedia.org/wiki/Bash_(Unix_shell)" target="_blank">Bash</a>. Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. <a href="http://stackoverflow.com/questions/ask" target="_blank">asking</a>, lorem ipsum dolor sit amet <a href="http://en.wikipedia.org" target="_blank">wikipedia</a>

Running my bash script on the above text, saved as file.txt, gives:

http://google.com
http://en.wikipedia.org/wiki/Regular_expression

...and if you uncomment the line #echo ${BASH_REMATCH}, you'll see the whole paragraph is matched, with only the first anchor captured.

How can I continue to capture all anchor patterns in the paragraph?

Thanks for your time!

score 2 · Accepted Answer · answered Jun 28 '14 at 20:17

You can use an inner while loop to capture all matches. Each pass extracts one href, then continues with the rest of the line after that anchor:

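regex="<a href=\"([A-Za-z0-9:/._-]+)\"[^<]*<\/a>(.*$)"
# IFS= and -r stop read from trimming whitespace or mangling backslashes
while IFS= read -r line; do
    # keep matching until no anchor is left on the line
    while [[ $line =~ $regex ]]; do
        href=${BASH_REMATCH[1]}    # the captured URL
        line=${BASH_REMATCH[2]}    # the text after this anchor, scanned next
        echo "$href"
    done
done < file.txt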

prints

http://google.com
http://en.wikipedia.org/wiki/Regular_expression
http://stackoverflow.com/questions/ask
http://en.wikipedia.org
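
Note that the Bash_(Unix_shell) link is missing from this output: its URL contains parentheses, which the character class [A-Za-z0-9:/._-] does not allow, so that anchor is skipped. Below is a minimal sketch of one way to loosen the pattern. It assumes every href value is double-quoted and comes directly after "<a ", and the function name extract_hrefs is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every double-quoted href read from stdin,
# one per line.
extract_hrefs() {
    # [^"]+ accepts any character except the closing quote, so URLs
    # containing parentheses are matched as well.
    local regex='<a href="([^"]+)"[^<]*</a>(.*)$'
    local line
    # IFS= and -r keep whitespace and backslashes intact; the || clause
    # also processes a final line that lacks a trailing newline.
    while IFS= read -r line || [[ -n $line ]]; do
        while [[ $line =~ $regex ]]; do
            printf '%s\n' "${BASH_REMATCH[1]}"
            line=${BASH_REMATCH[2]}   # keep scanning the text after this anchor
        done
    done
}

# Demo on a shortened version of the sample paragraph.
extract_hrefs <<'EOF'
<a href="http://en.wikipedia.org/wiki/Bash_(Unix_shell)" target="_blank">Bash</a>, and <a href="http://en.wikipedia.org" target="_blank">wikipedia</a>
EOF
```

To run it on the original input, call extract_hrefs < file.txt. The same caveat as before applies: anchors whose link text contains nested tags are still not matched, because of the [^<]* part.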

fejese · Answer 2 · 2014-06-28T20:04:34.383

1

Did you try grep -o? That would print the matches only.

grep -Po '(?<=<a href=\")([A-Za-z0-9:/._-]+)(?=\".*?<\/a>)' file.txt

-P turns on Perl-compatible regular expressions (PCRE)
-o prints only the matched text, not whole lines
(?<=...) positive lookbehind: matches a position that is preceded by this pattern
(?=...) positive lookahead: matches a position that is followed by this pattern
.*? non-greedy matching: so you won't end up with a match from the first opening <a> tag to the last closing </a> tag

With lookahead and lookbehind, the surrounding pattern is not part of the match; it is only required to be present. This makes grep -o output exactly what you need.

Just a note: this approach is very fragile; comments and other HTML constructs are not understood. If you need this tool for something important, use an XML/HTML parser instead.
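
Not every grep build supports -P (it relies on PCRE), and the comments below suggest the stock grep on OS X is one that rejects this command. Here is a rough sketch of a fallback that needs only extended regular expressions: grep -oE matches the start of each tag, and sed then strips the fixed prefix and the closing quote. The function name hrefs_portable is made up for illustration, and the pattern assumes href is the first attribute and is double-quoted.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list href values without PCRE lookarounds.
# Reads the files given as arguments, or stdin when there are none.
hrefs_portable() {
    # Step 1: -o prints each match on its own line, e.g. <a href="http://x"
    #         (-h suppresses file name prefixes when several files are given).
    # Step 2: sed removes the leading <a href=" and the trailing quote.
    grep -ohE '<a href="[^"]+"' "$@" | sed 's/^<a href="//; s/"$//'
}

# Demo on a shortened version of the sample text.
hrefs_portable <<'EOF'
One <a href="http://google.com" target="_blank">google</a> and <a href="http://en.wikipedia.org/wiki/Bash_(Unix_shell)">Bash</a>
EOF
```

Unlike the lookahead version, this sketch does not check for a closing </a>, so it is slightly more permissive. For the original input it would be called as hrefs_portable file.txt.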

edited Jun 28 '14 at 20:04

answered Jun 28 '14 at 19:57

fejese


+1 for pointing out that regex isn't the right tool for the job. – Barton Chittenden Jun 28 '14 at 20:00
Running your solution produces grep help text: usage: grep [-abcDEFGHhIiJLlmnOoPqRSsUVvwxZ] [-A num] [-B num] [-C[num]] ...etc – asking Jun 28 '14 at 20:04
@BartonChittenden Why do you say regex is not the right tool for the job, please expand. – asking Jun 28 '14 at 20:28
@asking, what platform are you on? For me it works just fine. Regex is not the best tool for this because you can't anticipate all the different syntaxes. Tags can span multiple lines, and they can be commented out completely or only partly, e.g. ` ... some context text` – fejese Jun 28 '14 at 20:59
@fejese Thanks, I'm on OS X. Understood, re, regex is not the best tool for the job. Still, the format of the above grep doesn't work for me. – asking Jun 28 '14 at 21:40
@fejese instead of using lookbehind you can use ``\K`` like this `` – Aleks-Daniel Jakimenko-A. Jun 29 '14 at 03:33
@asking: see http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Barton Chittenden Jun 29 '14 at 15:55
@Aleks-DanielJakimenko: thanks for the tip, did not know about this sequence. Interestingly though I can't find an equivalent alternative for the lookahead. – fejese Jun 29 '14 at 16:46
@fejese exactly, there is no such alternative for lookahead. Also please note that lookbehind assertions do not support expressions with variable length. You can do this ``grep -Po 'foo.*\Kworld' <<< 'foohelloworldbar'`` but you cannot do this ``grep -Po '(?<=foo.*)world' <<< 'foohelloworldbar'``. Have a nice day ;) – Aleks-Daniel Jakimenko-A. Jun 30 '14 at 02:28
