need a regular expression for extracting data and writing to file

Question

I have a file with the following content:

[A hi] [B hello]
[A how] [A why] [C some where]

I basically want to extract the text marked with 'A', that is:

hi
how
why

into a new file, one item per line. I tried using sed but could not work out the regular expression. Can someone suggest what I can use?

Try matching strings like '[A hi]' first. Then capture the text using a group. — harpun, Jan 26 '13 at 15:48

score 1 · Accepted Answer · edited May 23 '17 at 10:24

Try this using grep:

grep -oP '\[A\s+\K[^\]]+' file.txt > new_file.txt

or

grep -oP '\[A\s+\K[^\]]+' file.txt | tee new_file.txt

RESULT

hi
how
why

EXPLANATIONS

-o tells grep to print only the matching part of each line
-P tells grep to use Perl-compatible regular expressions (PCRE)
\K discards everything matched so far from the reported match, so it works like a variable-length look-behind; see Support of \K in regex
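
To see what -o and \K each contribute, the sketch below runs the same pattern with and without \K on the sample input from the question. It assumes GNU grep built with PCRE support (-P); the helper name sample is made up for illustration.

```shell
# sample (hypothetical helper): emit the two input lines from the question.
sample() {
  printf '%s\n' '[A hi] [B hello]' '[A how] [A why] [C some where]'
}

# Without \K, -o prints the whole match, including the "[A " prefix.
sample | grep -oP '\[A\s+[^\]]+'

# With \K, everything matched before it is dropped from the printed match.
sample | grep -oP '\[A\s+\K[^\]]+'
```

The first command prints "[A hi", "[A how" and "[A why"; the second prints only hi, how and why.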

The same regex in Perl, with comments:

use strict; use warnings;
use feature qw/say/;

while (<>) {
    say for 
        /           # starting regex
            \[A     # a literal "[" and "A"
            \s+     # at least one whitespace character (\n, \r, \t, \f, or " ")
            \K      # reset the start of the reported match
            [^\]]+  # at least one character that is not a literal "]"
        /gsx;       # end of the regex and the modifiers
}

LINKS

To learn regex, see

thanks sputnick. Could you please help me in understanding this exp. also can you point me to good link to study regular exp. — j10, Jan 26 '13 at 16:08

score 0 · Answer 2 · edited May 23 '17 at 11:55

I'm not sure how to do this with sed (not too familiar with it), but you could use GNU grep with Perl-compatible regular expressions (see this answer for another example).

Here's a quick regex I've put together for your test input (assuming your data is in a file named 'foo'):

grep -Po '(?<=\[A )[^\]]+' foo

This outputs:

hi
how
why

update - How this works:

The first portion of the regex, (?<=\[A ), is a positive lookbehind: it requires that the thing you are looking for be preceded by something (in this case "[A" followed by a space) without including that prefix in the match. This gives context to what you are looking for. The same can be accomplished with capture groups, but since I've not done this sort of thing before with grep, I wasn't sure how to use them here. The syntax for a positive lookbehind is (?<=THING_TO_PRECEDE_YOUR_MATCH_WITH).

The second chunk, [^\]]+, just says "find one or more characters that are not ]". Note that we have to escape the square bracket because it means something in regular expressions. [^CHARSET] matches any character except those in the given character set or character class. The + says to find one or more of what we just mentioned.
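
The capture-group route mentioned above can be sketched without -P: grep -o isolates each bracketed item, and sed keeps only the captured text. This is one possible approach rather than a tested part of this answer, and the function name extract_a is made up for illustration.

```shell
# extract_a (hypothetical helper): read lines on standard input and print
# the text of every "[A ...]" item, one per line.
extract_a() {
  # Step 1: grep -o prints each "[A ...]" item on its own line.
  # Step 2: sed captures the text between "[A " and "]" and keeps only that.
  grep -o '\[A [^]]*]' | sed 's/^\[A \(.*\)]$/\1/'
}

printf '%s\n' '[A hi] [B hello]' '[A how] [A why] [C some where]' | extract_a
```

With the data in a file named foo, this would be run as extract_a < foo > new_file.txt.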

Depending on your experience with regular expressions this may or may not have been helpful; let me know if there are any points I could explain better. I'm not sure of the best place to learn these. Having used Python a lot, I find its syntax reference quite handy. Also, Google tends to point to http://www.regular-expressions.info/ a lot, but I can't say from experience how useful it is.

Thanks Adam. Could you please help me in understanding this exp. also can you point me to good link to study regular exp. — j10, Jan 26 '13 at 16:00
@JitenShah See my update. Not sure it'll be entirely useful, but maybe I've shed a little light on what's going on. — Adam Wagner, Jan 26 '13 at 16:20
@AdamWagner : thanks for explanation but when I write it to a file how does it print one word on each line. I do not see any '\n' in our query — j10, Jan 26 '13 at 16:24
@JitenShah I think that's just how grep works with regular expressions. Is that not what you wanted? — Adam Wagner, Jan 26 '13 at 16:25
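
On the newline question: with -o, grep writes each match as its own output line, so the pattern needs no '\n'. A small check, assuming GNU grep with -P support, feeds one input line containing two matches and counts the output lines:

```shell
# One input line with two "[A ...]" items goes in; grep -o emits each
# match on a separate line, so wc -l reports two lines.
printf '%s\n' '[A how] [A why] [C some where]' |
  grep -Po '(?<=\[A )[^\]]+' |
  wc -l
```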

score 0 · Answer 3 · answered Jan 26 '13 at 22:15

This might work for you (GNU sed):

sed -r '/\[A\s+([^]]*)\]/{s//\n\1\n/;s/[^\n]*\n//;P};D' file
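
The one-liner is dense, so here is the same script spread over several lines with comments, as a reading aid. It assumes GNU sed (for \s, for \n in the replacement and in the bracket expression, and for comments inside the script); the wrapper name extract_a_sed is hypothetical.

```shell
# extract_a_sed (hypothetical wrapper): same sed script, one command per line.
extract_a_sed() {
  sed -r '
    # Only act while the pattern space still holds an "[A text]" item.
    /\[A\s+([^]]*)\]/ {
      # The empty regex reuses the address above: wrap the captured text in newlines.
      s//\n\1\n/
      # Drop everything up to and including the first newline.
      s/[^\n]*\n//
      # Print the captured text, which now ends at the remaining newline.
      P
    }
    # Delete through the first newline and rerun the script on what is left;
    # with no newline in the pattern space, discard it and read the next line.
    D
  ' "$@"
}

printf '%s\n' '[A hi] [B hello]' '[A how] [A why] [C some where]' | extract_a_sed
```

Because D restarts the script on the remainder of the line, a line holding several [A ...] items yields one output line per item.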

potong