Change delimiter of grep command

Question

I am using grep to detect <a href="xxxx"> something here </a>.
This does not work when the link is split across two lines in the input. I want grep to keep matching until it finds a </a>, but right now it only reads input up to the next newline.

So if the input is like <a href="xxxx"> something here </a> it works, but if the input is like

<a href="xxxx">

something here /a>

then it doesn't. Any solutions?

This is why [regexes](http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html) are the wrong way to parse XML (and hence [HTML](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags)). You could use Perl to read paragraphs as a 'line' and then look for anchors spread over lines. If anchors can be spread still further, then you could slurp the entire file. You might find [`ack`](http://betterthangrep.com/) of use, though you're still in danger of entering a world of pain. — Jonathan Leffler, Feb 07 '12 at 18:44

score 3 · Accepted Answer · answered Feb 07 '12 at 18:39

I'd use awk rather than grep. This should work:

awk '/a href="xxxx">/,/\/a>/' filename
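
For illustration, here is a minimal sketch of the range pattern at work. awk's `/start/,/end/` form prints every line from one matching the first regex through the next line matching the second, so an anchor split across lines comes out whole. The sample input and the function name `print_anchor_lines` are made up for this example, and the patterns assume a well-formed `</a>` closing tag.

```shell
#!/bin/sh
# print_anchor_lines FILE (hypothetical helper): print every line from the
# opening tag through the next line containing </a>, using an awk range pattern.
print_anchor_lines() {
    awk '/<a href="xxxx">/,/<\/a>/' "$1"
}

# Made-up sample input: one anchor split across two lines.
sample=$(mktemp)
printf '%s\n' '<p>before</p>' '<a href="xxxx">' 'something here </a>' '<p>after</p>' > "$sample"

# Prints only the two lines that make up the anchor.
print_anchor_lines "$sample"
rm -f "$sample"
```

Because a range pattern works on whole lines, any other text sharing a line with the opening or closing tag is printed too.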

thekbb

Of course, I'd much rather use an XML parser or XSLT to manipulate XML. XML isn't regular, so you're always fighting a losing battle trying to use a regex. – thekbb Dec 27 '12 at 21:08
Works great. I had a 50GB file to parse with a custom delimiter. Grep never finished (though it worked for small files), while awk did the job in several minutes. – Andrey Dec 06 '16 at 15:36

score 1 · Answer 2 · answered Feb 07 '12 at 18:45

I think you would have much less trouble using some XSLT tool, but you can do it with sed, awk, or pcregrep, an extended version of grep that supports multiline patterns (-M).
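
Where pcregrep is not installed, GNU grep can be pushed in a similar direction. Note that this is a different tool from the one named above: `-z` makes grep split records on NUL bytes instead of newlines (so an ordinary text file becomes a single record), and `-P` enables Perl-compatible regexes. The following is a minimal sketch that assumes GNU grep built with PCRE support; the sample input and the helper name `grep_anchor` are hypothetical.

```shell
#!/bin/sh
# grep_anchor FILE (hypothetical helper): print each <a href="xxxx">...</a>
# element, even when it spans several lines.
# Assumption: GNU grep with -P (PCRE) support.
grep_anchor() {
    # -z: records end at NUL, so the whole file is treated as one record
    # -P: Perl-compatible regex; (?s) lets "." match newlines; .*? is non-greedy
    # -o: print only the matched text (NUL-terminated under -z, hence the tr)
    grep -Pzo '(?s)<a href="xxxx">.*?</a>' "$1" | tr '\0' '\n'
}

# Made-up sample input: one anchor split across two lines.
sample=$(mktemp)
printf '%s\n' '<p>before</p>' '<a href="xxxx">' 'something here </a>' '<p>after</p>' > "$sample"
grep_anchor "$sample"
rm -f "$sample"
```

Since `-z` makes grep hold the whole file as one record, this may be a poor fit for very large inputs.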

Mithrandir

eg `pcregrep -Mio "(.|\n)*?<.*?/a>"` (i = case insensitive, o = print only matching text) – Curtis Yallop Mar 13 '18 at 01:50
Also see https://stackoverflow.com/questions/152708/how-can-i-search-for-a-multiline-pattern-in-a-file – Curtis Yallop Mar 13 '18 at 01:50

score 1 · Answer 3 · answered Feb 07 '12 at 21:09

I'd suggest folding the input so that opening and closing tags are on the same line, then checking the line against the pattern. An idiomatic approach using sed(1):

sed '/<[Aa][^A-Za-z]/{ :A
     /<\/[Aa]>/ bD
     N
     bA
     :D
     /\n/ s// /g
}
# now try your pattern
/<[Aa] href="xxxx"[^>]*>[^<]*something here[^<]*<\/[Aa]>/ !d'

Curtis Yallop · Answer 4 · 2018-03-13T15:27:50.487

perl -e '$_=join("", <>); m#<a.*?>.*?<.*?/a>#s; print "$&\n";'

The trick here is that the entire input is read into $_, and then a standard /.../ regex is run against it. I used the alternate syntax m#...# so that I do not have to backslash the "/" characters used in XML. Finally, the "s" modifier makes multiline matches work by making "." also match newlines (see also the "m" modifier, which makes ^ and $ match at line boundaries). "$&" is the matched string, which is the result you are looking for. If you want just the inner text, put parentheses around that part of the pattern and print $1.

I am assuming that you meant </a> rather than /a> as an xml closing delimiter.

Note the .*? is a non-greedy version of .* so for <a>1</a><a>2</a>, it only matches <a>1</a>.

Note that nested nodes may cause problems, e.g. <a><a></a></a>. This is the same issue as matching nested brackets such as "(", ")" or "{", "}", and it is a more interesting problem. Regexes are normally stateless, so by themselves they cannot track an unlimited bracket-nesting depth. When programming parsers, you normally use regexes for low-level string matching and something else, e.g. bison, for higher-level parsing of tokens. There are bison grammars for many languages and probably for XML. XSLT might be even better, but I am not familiar with it. For a very simple use case, though, you can also handle nested blocks like this in Perl:

Nested bracket-handling code: (this could be easily adapted to handle nested xml blocks)

$_ = "a{b{c}e}f";

my($level)=(1);
s/.*?({|})/$1/; # throw away everything before first match
while(/{|}/g) {
   if($& eq "{") {
      ++$level;
   } elsif($& eq "}") {
      --$level;
      if($level == 1) {
         print "Result: ".$`.$&."\n";
         $_=$'; # reset searchspace to after the match
         last;
      }
   }
}

Result: {b{c}e}

score 0 · Answer 5 · answered Mar 13 '18 at 01:29

Consider egrep -3 '(<a|</a>)'

"-3" prints up to 3 surrounding lines around each regex match (3 lines before and 3 lines after the match). You can use -1 or -2 as well if that works better.

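A short sketch of the context option follows. `egrep` is the older spelling of `grep -E`, and `-3` is shorthand for `-C 3`; the example uses `-C 1` to keep the output small. The sample input and the helper name `context_around_anchor` are made up for illustration.

```shell
#!/bin/sh
# context_around_anchor FILE (hypothetical helper): print every line that
# contains an opening or closing anchor tag, plus one line of context on
# each side.
context_around_anchor() {
    grep -E -C 1 '(<a|</a>)' "$1"
}

# Made-up sample input: an anchor split across lines 3 and 4.
sample=$(mktemp)
printf '%s\n' 'line 1' 'line 2' '<a href="xxxx">' 'something here </a>' 'line 5' 'line 6' > "$sample"

# Prints "line 2" through "line 5"; "line 1" and "line 6" are outside the context.
context_around_anchor "$sample"
rm -f "$sample"
```

This shows the lines around each tag rather than matching the anchor as a unit, so it only helps when the anchor spans a known, small number of lines.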

Curtis Yallop

score 0 · Answer 6 · edited May 23 '17 at 12:13

This is probably a repeat question: Grep search strings with line breaks

You can try it with the tr '\n' ' ' command, as explained in one of the answers there, if all you need is to find the files and not the line numbers.
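
As a rough sketch of that idea: replace every newline with a space so the file becomes one long line, then let an ordinary grep match the anchor. The sample input and the helper name `fold_and_grep` are hypothetical, and `-o` (print only the matched text) is assumed to be available, as it is in GNU and BSD grep.

```shell
#!/bin/sh
# fold_and_grep FILE (hypothetical helper): join all lines with spaces, then
# print each <a href="xxxx">...</a> element that contains no nested tags.
fold_and_grep() {
    tr '\n' ' ' < "$1" | grep -o '<a href="xxxx">[^<]*</a>'
}

# Made-up sample input: one anchor split across two lines.
sample=$(mktemp)
printf '%s\n' '<p>before</p>' '<a href="xxxx">' 'something here </a>' '<p>after</p>' > "$sample"

# Prints: <a href="xxxx"> something here </a>
fold_and_grep "$sample"
rm -f "$sample"
```

Line numbers are lost in the fold, which is why this suits finding files or matches rather than locations.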

Community

answered Feb 22 '12 at 23:33

Phani
