I am using grep to detect links such as <a href="xxxx"> something here </a>
This does not work when the link is split across two lines in the input. I want grep to keep reading until it reaches the closing </a>, but right now it only examines the input up to the next newline.

So if input is like <a href="xxxx"> something here </a> it works, but if input is like

<a href="xxxx">

something here /a>    

, then it doesn't. Any solutions?

user unknown
Zer0
  • Did you consider using some other tool, like XSLT? – Basile Starynkevitch Feb 07 '12 at 18:38
  • This is why [regexes](http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html) are the wrong way to parse XML (and hence [HTML](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags)). You could use Perl to read paragraphs as a 'line' and then look for anchors spread over lines. If anchors can be spread still further, then you could slurp the entire file. You might find [`ack`](http://betterthangrep.com/) of use, though you're still in danger of entering a world of pain. – Jonathan Leffler Feb 07 '12 at 18:44

6 Answers

I'd use awk rather than grep. This should work:

awk '/a href="xxxx">/,/\/a>/' filename
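To see what the range pattern does, here is a small sketch run against a made-up sample file (the file contents and the variable names are my own assumptions, not part of the question). It prints every line from the one that opens the anchor through the one that closes it:

```shell
# Hypothetical sample input: one anchor split across two lines, plus unrelated lines.
sample=$(mktemp)
printf '%s\n' '<p>intro</p>' '<a href="xxxx">' 'something here </a>' '<p>outro</p>' > "$sample"

# Range pattern: start printing at the opening tag, stop after the line with the closing tag.
awk_result=$(awk '/<a href="xxxx">/,/<\/a>/' "$sample")
printf '%s\n' "$awk_result"
rm -f "$sample"
```

Keep in mind that a range prints whole lines, so any other text sharing a line with the tags comes along, and if the closing tag never appears the range runs to the end of the file.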

thekbb
  • of course - I'd much rather use an xml parser or xslt to manipulate xml. xml isn't regular so you're always fighting a losing battle trying to use a regex. – thekbb Dec 27 '12 at 21:08
  • Works great. I had file of 50GB to parse with custom delimiter. Grep never ended (but worked for small files), while awk did the work in several minutes. – Andrey Dec 06 '16 at 15:36

I think you would have much less trouble using some XSLT tool, but you can do it with sed, awk, or pcregrep, an extended version of grep that supports multiline patterns (-M).

Mithrandir

I'd suggest folding the input so that the opening and closing tags are on the same line, then checking the line against the pattern. An idiomatic approach using sed(1):

sed '/<[Aa][^A-Za-z]/{ :A
     /<\/[Aa]>/ bD
     N
     bA
     :D
     s/\n/ /g
}
# now try your pattern (lines that do not match it are deleted)
/<[Aa] href="xxxx"[^>]*>[^<]*something here[^<]*<\/[Aa]>/ !d'
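The folding idea can be tried on a small sample. This is only a sketch: the sample file, the `join` label and the use of `-n` with `p` (instead of `!d`) are my own choices, and the pattern is simplified to an `<a` followed by a space.

```shell
# Hypothetical sample input: one anchor split across two lines.
sample=$(mktemp)
printf '%s\n' '<p>intro</p>' '<a href="xxxx">' 'something here </a>' '<p>outro</p>' > "$sample"

sed_result=$(sed -n '
/<[Aa] /{
  :join
  # keep appending the next line until the closing tag is in the pattern space
  /<\/[Aa]>/!{
    N
    b join
  }
  # fold the collected lines into one
  s/\n/ /g
  # print only if the folded line matches the anchor we are looking for
  /<[Aa] href="xxxx"[^>]*>[^<]*something here[^<]*<\/[Aa]>/p
}' "$sample")
printf '%s\n' "$sed_result"
rm -f "$sample"
```

On this sample the folded anchor comes out as a single line, with the newline replaced by a space.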
yrk
perl -e '$_=join("", <>); print "$&\n" if m#<a.*?>.*?<.*?/a>#s;'

So the trick here is that the entire input is read into $_. Then a standard /.../ regex is run. I used the alternate syntax m#...# so that I do not have to backslash the "/"s used in XML. Finally, the "s" modifier makes multiline matches work by making "." also match newlines (note that there is also an "m" modifier, which changes the meaning of ^ and $). "$&" is the matched string; it is the result you are looking for. If you want just the inner text, you can put parentheses around that part and print $1.

I am assuming that you meant </a> rather than /a> as an xml closing delimiter.

Note the .*? is a non-greedy version of .* so for <a>1</a><a>2</a>, it only matches <a>1</a>.

Note that nested nodes may cause problems, e.g. <a><a></a></a>. This is the same as when trying to match nested brackets "(", ")" or "{", "}". This is a more interesting problem. Regexes are normally stateless, so by themselves they cannot track an unlimited bracket-nesting depth. When programming parsers, you normally use regexes for low-level string matching and something else for higher-level parsing of tokens, e.g. bison. There are bison grammars for many languages and probably for XML. XSLT might even be better, but I am not familiar with it. But for a very simple use case, you can also handle nested blocks like this in Perl:

Nested bracket-handling code: (this could be easily adapted to handle nested xml blocks)

$_ = "a{b{c}e}f";

my $level = 0; # current nesting depth
s/.*?([{}])/$1/; # throw away everything before the first brace
while (/[{}]/g) {
   if ($& eq "{") {
      ++$level;
   } elsif ($& eq "}") {
      --$level;
      if ($level == 0) { # the outermost block has just closed
         print "Result: ".$`.$&."\n";
         $_ = $'; # reset search space to after the match
         last;
      }
   }
}

Result: {b{c}e}

Curtis Yallop

Consider egrep -3 '(<a|</a>)'

"-3" prints up to 3 surrounding lines around each regex match (3 lines before and 3 lines after the match). You can use -1 or -2 as well if that works better.

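As a rough illustration of the context option, the sketch below uses a made-up five-line file and `grep -E -C 1`, which GNU grep documents as equivalent to the `-1` shorthand. The file contents and variable names are assumptions of mine:

```shell
# Hypothetical sample input with the anchor split over lines 2 and 3.
sample=$(mktemp)
printf '%s\n' 'line 1' '<a href="xxxx">' 'something here </a>' 'line 4' 'line 5' > "$sample"

# Match either half of the anchor and show one line of context on each side.
context_result=$(grep -E -C 1 '<a |</a>' "$sample")
printf '%s\n' "$context_result"
rm -f "$sample"
```

This shows both halves of the split anchor together with their neighbouring lines, but it does not check that the opening and closing tags belong to the same link.
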
Curtis Yallop

This is probably a repeat question: Grep search strings with line breaks

You can try it with the tr '\n' ' ' command, as explained in one of the answers there, if all you need is to find the files and not the line numbers.
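A minimal sketch of that approach, assuming a made-up sample file and using grep's -o option (available in GNU and BSD grep) to show just the matched text:

```shell
# Hypothetical sample input: one anchor split across two lines.
sample=$(mktemp)
printf '%s\n' '<p>intro</p>' '<a href="xxxx">' 'something here </a>' '<p>outro</p>' > "$sample"

# Turn every newline into a space so grep sees the whole anchor on one line.
tr_result=$(tr '\n' ' ' < "$sample" | grep -o '<a href="xxxx">[^<]*</a>')
printf '%s\n' "$tr_result"
rm -f "$sample"
```

Because the whole file becomes a single line, line numbers are lost; to only test whether a file contains the link, replace `grep -o` with `grep -q` and check the exit status.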

Community
Phani