
If I have a file like this:

abc defghaijkb,mnaobpqa
pbqaaa
qrs - a .. b ...
cde

How can I extract all the parts that start with a and end with b? (I chose these characters to simplify the example; they may be replaced with a more complex regex.) This is the desired output:

ab
aijkb
aob
a .. b

(Each item goes on a separate line.) Since there's no non-greedy matching (.*?) in (g)awk, I cannot find a way to solve this (e.g. using split).

Note 1: there will be no need to use multiline matching - that is, no newlines allowed between regex1 and regex2.

Note 2: I don't want to use sed; I want to know if this can be done with awk, or bash, or some other command-line tool that processes an input file line by line... AWK seems to be a nice solution, if only it supported non-greedy .*?

Note 3: I cannot use grep because I always get a "memory exhausted" error when I deal with huge files.

Note 4: here is an example of a more complex regex1 and regex2. What if they themselves contain a non-greedy .*?, e.g. <a>.*?<b>.*?</b>.*?</a>?

Update. More complex example:

[a]text1[a]text000[b]text2[/b]text11[/a]c defgh[a]text3[b]text33[/b]text333[/a]...[/a],mnaobpqa
...[b]aa[/b]bb[/a],,,
qa - [a][b][/b][/a] aabbcc ...
cde

Desired output:

[a]text000[b]text2[/b]text11[/a]
[a]text3[b]text33[/b]text333[/a]
[a][b][/b][/a]
lyrically wicked
  • Bash's built-in regex syntax is ERE, which doesn't support non-greedy matching. However, you can cheat. For `a.*?b`, cheating is actually *easy* (just take the matched text and truncate it at the first `b`), but we'd have to know your actual RE to say whether it's amenable to such tricks. – Charles Duffy Dec 22 '15 at 04:17
  • ...typically, I'd write that one `a[^b]*b`, which would make it even BRE-compliant. – Charles Duffy Dec 22 '15 at 04:19
  • `grep` shouldn't be exhausting your memory, unless your file contains huge runs without newlines. And if those huge runs are something like NULs, you could just strip them out first. – Charles Duffy Dec 22 '15 at 04:24
  • ...by "strip them out first", I mean something like: `tr '\0' '\n'` – Charles Duffy Dec 22 '15 at 04:26
  • If you don't solve the underlying problem that's making `grep` consume all your memory, any other line-oriented program (and that includes both bash and awk, unless you explicitly use a different delimiter) will have the exact same issue; just avoiding `grep` doesn't fix anything. [Well -- if it's really NULs, it might fix *something*, as bash's read will either ignore them or terminate strings on hitting them depending on the version, but it would be an exceedingly hacky fix]. – Charles Duffy Dec 22 '15 at 04:27
  • I don't know the exact reason, but when I use `grep -Pzo '(?s)regex1.*?regex2'`, and each line is no more than 200 KB in size, I get that error. Maybe I should change that `grep` code, but how? – lyrically wicked Dec 22 '15 at 04:30
  • Ahh! Now, *that's* a useful problem description. Hmm. It'd be interesting to have an actual reproducer (with data). – Charles Duffy Dec 22 '15 at 04:32
  • That's possibly the result of pathological matching behaviour. What are regex1 and regex2? – rici Dec 22 '15 at 04:33
  • @rici: They vary, but actually, they are simple, like `.*?` – lyrically wicked Dec 22 '15 at 04:36
  • `[/a]` might not be what you think it is. Can you put your expression's meaning into English? – Amadan Dec 22 '15 at 04:39
  • @Amadan: see updated comment, that was a typo, of course – lyrically wicked Dec 22 '15 at 04:43
  • I can't reproduce a problem. Using `grep -Pzo '(?s).*?'`, grep seems to execute in constant memory regardless of line length (although I didn't try really long matches, just really long lines). – rici Dec 22 '15 at 04:56
  • I'm voting to close this question as off-topic because it seems to be an [XY Problem](http://xyproblem.info/). There are good answers here but they don't solve the OP's *actual* problem, which is apparently doomed anyway ([parsing HTML with regular expressions](http://stackoverflow.com/questions/6751105/why-its-not-possible-to-use-regex-to-parse-html-xml-a-formal-explanation-in-la)). – tripleee Dec 22 '15 at 05:12
  • @tripleee: Please, wait, I must ask some additional questions here... And note that it's not really related to HTML - just replace `<>` with `[]` – lyrically wicked Dec 22 '15 at 05:17
  • It's not clear from your latest edit which regex is `a` and which is `b` and what you want to extract. But in fact it's looking more and more like I am right. If you are parsing a structured format then regex is the wrong tool. – tripleee Dec 22 '15 at 05:21
  • @tripleee: please wait, I am editing the question to show an example of what I want to extract – lyrically wicked Dec 22 '15 at 05:24
  • Avoid usage of option `-z`. – Cyrus Dec 22 '15 at 05:28
  • I'm done here -- I have cast my close vote and I stand by it. If you edit your question at this point, you need to make sure you don't change it so that the answers you have already received are rendered invalid. In fact, I would perhaps discuss deleting this question and posting a new one with your *actual* problem, but you need to understand why regular expressions are problematic for what you are apparently attempting; and look for duplicates, because this is a common problem. – tripleee Dec 22 '15 at 05:29
  • The memory exhaustion sounds like your actual regex is much too complex. If your regex is ambiguous, `grep` needs to backtrack billions of times before giving up. Each backtrack requires more memory. – tripleee Dec 22 '15 at 05:30
  • Which grep are you using and on which platform? Maybe there's a better grep available on your platform? – peak Dec 22 '15 at 05:30
  • Update done. Yes, it looks somewhat complex, but it's no problem when I open the input file in a text editor and copy such matches. I was just wondering if there was a command-line way for doing this... – lyrically wicked Dec 22 '15 at 05:40
  • With the clarifications made, I'm actually down with the close vote too. It's not the arrow-brackets that make regular expressions a bad match for SGML-derived languages, it's the context dependency and arbitrary nesting -- making it an **irregular** language, requiring more expressive power to parse than **regular** expressions can provide. – Charles Duffy Dec 22 '15 at 05:52
  • @CharlesDuffy: If not using regexes, then which way/tool should be chosen to solve such questions? – lyrically wicked Dec 22 '15 at 05:55
  • If the language is actually XML-conformant, then the obvious tool is an XPath implementation (such as `xmlstarlet` or `xmllint --xpath`). If it's not, then you end up using one of the many parser generation toolkits. – Charles Duffy Dec 22 '15 at 05:57
  • ...to extract all links, for instance, printing each to its own line: `xmlstarlet sel -t -m '//a' -c . -n` -- to print only the URLs, `xmlstarlet sel -t -m '//a' -v ./@href -n`. That gets a little more complex if it's HTML, or if namespaces are in use, but only a little. – Charles Duffy Dec 22 '15 at 05:58
  • lyrically wicked: I feel like the goalposts in this question are constantly moving. If your delimiter regexes themselves contain `.*` (whether or not it is greedy), then they are considerably more complex than a simple string match. Also, regardless of greediness, you can expect a regex search to always find the match starts first, so it is unrealistic to expect a regex to produce the first output in your updated (4) example. I'm with @CharlesDuffy here: if it is really xml, use xmlstarlet. If not, it will be more work but bison/flex would be worth looking at. – rici Dec 22 '15 at 15:44

3 Answers


Pure AWK hackery:

awk 'BEGIN{RS="a"}/b/&&NR!=1{sub(/b.*/,"");if($0!~"\n")print"a"$0"b"}'
  • Split the input on `a` and ignore the first segment (the text before the first `a`).
  • If there is no `b` in a segment, ignore it.
  • Cut off everything from the first `b` onward.
  • If there is a newline in the segment, ignore it.
  • Reattach the cut-off `a` and `b` and print.
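As a quick sanity check (a minimal sketch, run here with gawk), feeding the sample input from the question through the one-liner reproduces the desired output:

```shell
printf 'abc defghaijkb,mnaobpqa\npbqaaa\nqrs - a .. b ...\ncde\n' |
  awk 'BEGIN{RS="a"}/b/&&NR!=1{sub(/b.*/,"");if($0!~"\n")print"a"$0"b"}'
# ab
# aijkb
# aob
# a .. b
```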

I don't think you should ever, ever use this. Use perl - it is present on pretty much any system where awk is present, and makes this task a breeze:

perl -ne 'print map { "$_\n" } /a.*?b/g;'

This works even on systems whose grep does not support PCRE, as Perl by definition supports PCRE. (I don't know about the memory exhausted error - as rici says, it should not happen with non-pathological regexps.)

EDIT in response to additional questions by OP:

"capable tool" being anything that supports non-greedy operator and multiple matches per line - in this case, perl as being the best compromise between ubiquity, expressivity and speed.

The line as written is a filter: you supply input on standard input and receive output on standard output, exactly as you would use awk or sed.

The standard regexp syntax applies: square brackets and slashes need escaping.

perl -ne 'print map { "$_\n" } /\[a\].*?\[b\].*?\[\/b\].*?\[\/a\]/g;' <infile >outfile
Amadan
  • Exact same trick I would use in native bash, if the OP approves it (see first comment on the question talking about "cheating"). – Charles Duffy Dec 22 '15 at 04:20
  • This won't work if `a` can occur between an `a` and a `b`. You can fix that, but more seriously, it will fail when the `a` and `b` matches can overlap. For an obvious example, take `a` to be the string (not regex) `/*` and `b` to be `*/`, so that the goal is to produce the contents of all the C-style comments. Now try your algorithm on the line `/* ... /*/ --- /*/ ...*/` (Example is more elaborate than necessary.) – rici Dec 22 '15 at 04:31
  • @rici: That is a fair criticism. However, a) I was answering the question as posted; it provides the correct answer for the example in the OP, and b) I did insist that it is merely an intellectual exercise, and should never be used in production. – Amadan Dec 22 '15 at 04:36
  • amadan: It provides the correct answer for the example, but the question states that the regexes might be more complicated. Mine are not much more complicated, and actually in my example the open pattern does not really occur inside the enclosed a..b string (it only occurs if you match the open pattern before you try the close pattern). @charlesDuffy's proposal will work (if I understand it correctly), but it will only find one match at a time, so you need to edit the string as you go. – rici Dec 22 '15 at 04:45
  • @rici: "They vary, but actually, they are simple, like `.*?`" - my approach still works (especially if we assume the not-unreasonable conjecture that the content is well-formed HTML or XML). The point is moot though - I repeat, this should never be used, and instead a capable tool should be selected instead. – Amadan Dec 22 '15 at 04:55
  • Since "perl -ne" was mentioned, here's a similar one-liner using ruby: `ruby -ne '$_.scan( /a.*?b/ ) {|x| puts x}'` – peak Dec 22 '15 at 05:18
  • @peak: Golfed a bit, `ruby -ne 'puts $_.scan(/a.*?b/)'` is sufficient. But although I vastly prefer Ruby to Perl, it is slower to invoke, and it is not guaranteed to be present to the degree Perl is. – Amadan Dec 22 '15 at 05:20
  • @Amadan: Thank you. I have a few questions. 1. "a capable tool should be selected instead" - what tool? 2. Can you show how to specify the paths to the input and output file (I am not familiar with Perl syntax)? 3. I updated the question - see an example of a complex regex1/regex2, and replace `<>` chars with `[]` (as a string, not regex). Is it possible to modify your Perl solution with this? – lyrically wicked Dec 22 '15 at 05:21
  • @Amadan: Thank you again, I am starting to test this... sorry, I forgot to ask: what syntax would you use if a newline is not exactly `\n` - it may be `\n` **OR** `\r\n`? – lyrically wicked Dec 22 '15 at 05:45
  • The `print map { "$_\n" } /regexp/g` will find all matches of `/regexp/`, then transform them (using `map`) so that each match (`$_`) gets also a newline to it, then print out the whole mess. Line splitting of the input is done by the `-n` switch, and will work with `\r\n` just as well as with `\n`. Although, it seems `grep` with either `-P` or `-E` is also sufficient to the task. `awk` and `bash`, not really meant for this. – Amadan Dec 22 '15 at 05:48

This can be achieved using grep, with a modern BSD grep (such as that on Mac OS).

grep -E "a.*?b" -o file

`.*?` performs a non-greedy match.

On platforms with only GNU grep, `-P` may need to be used instead of `-E`; on baseline-POSIX platforms or SysV-derived Unixen, this may not work at all (POSIX ERE does not specify non-greedy matching, and the POSIX standard for grep does not define `-o`).
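Where non-greedy matching isn't available at all, the rewrite suggested in the comments, `a[^b]*b`, produces the same result for this simple case using plain ERE (it doesn't generalize to multi-character end delimiters, and `-o` is still an extension):

```shell
printf 'abc defghaijkb,mnaobpqa\npbqaaa\nqrs - a .. b ...\ncde\n' |
  grep -oE 'a[^b]*b'
# ab
# aijkb
# aob
# a .. b
```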

Charles Duffy
Reuben L.
  • Try it yourself: Output of `grep -E 'a.*?b' -o <<<"a yadda b yadda b"` should include only the first `yadda` if it were properly non-greedy, but actually contains both (in a single match). – Charles Duffy Dec 22 '15 at 04:22
  • The -E option of FreeBSD grep (e.g. on a Mac) enables support for non-greedy regex. (/usr/bin/grep --version => grep (BSD grep) 2.5.1-FreeBSD) – peak Dec 22 '15 at 05:08
  • Can confirm, `grep -oE` does the correct thing on Mac. – Amadan Dec 22 '15 at 05:32
  • I've updated the answer to specify the platforms on which it's applicable; with that clarified, it has my upvote. – Charles Duffy Dec 22 '15 at 16:24

The search itself can be written in Awk:

$ awk '{
    n = split($0, line, "")        # split the line into characters (a gawk extension)
    m = ""
    for (i = 1; i <= n; i++) {     # iterate in order; "for (i in line)" order is undefined
        if (line[i] == "a")
            m = line[i]            # (re)start the candidate match at the latest "a"
        else if (m)
            m = m line[i]
        if (m && line[i] == "b") { # reached "b": shortest match complete
            print m
            m = ""
        }
    }
}' file
ab
aijkb
aob
a .. b
John B