1

I have a file with the following content:

[A hi] [B hello]
[A how] [A why] [C some where]

I want to extract the text of every entry marked 'A', that is:

hi
how
why

into a new file, one per line. I tried using sed but could not work out the regular expression. Can someone suggest what I could use?

Gilles Quénot
j10

3 Answers

1

Try doing this using GNU grep:

grep -oP '\[A\s+\K[^\]]+' file.txt > new_file.txt

or

grep -oP '\[A\s+\K[^\]]+' file.txt | tee new_file.txt

RESULT

hi
how
why

EXPLANATIONS

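  • -o tells grep to print only the matching part of each line, one match per output line
  • -P tells grep to interpret the pattern as a Perl-compatible regular expression (a GNU grep extension)
  • for the \K regex trick, see Support of \K in regex: \K discards everything matched so far from the reported match, so it works much like a variable-length look-behind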
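
To see what \K changes, here is a minimal sketch that runs the same pattern with and without it. It assumes GNU grep built with PCRE support (-P); the temporary file and the variable names are made up for illustration.

```shell
# Recreate the sample input from the question in a temporary file
# (hypothetical file, used only for this demonstration).
sample=$(mktemp)
printf '%s\n' '[A hi] [B hello]' '[A how] [A why] [C some where]' > "$sample"

# Without \K, -o prints the whole match, marker included.
with_marker=$(grep -oP '\[A\s+[^\]]+' "$sample")

# With \K, everything matched before it is dropped from the reported match.
without_marker=$(grep -oP '\[A\s+\K[^\]]+' "$sample")

printf 'without \\K:\n%s\n\nwith \\K:\n%s\n' "$with_marker" "$without_marker"
rm -f "$sample"
```

The first run should print the "[A " prefix in front of each word, and the second only the words themselves.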

The same regex in Perl, with comments:

use strict; use warnings;
use feature qw/say/;

while (<>) {
    say for 
        /           # starting regex
            \[A     # a literal "[" and "A"
            \s+     # at least one whitespace (\n, \r, \t, \f, and " ")
            \K      # restart the match
            [^\]]+  # at least one character that is not a literal "]"
        /gsx;       # end of the regex and the modifiers
}

LINKS

To learn regex, see

Gilles Quénot
  • Thanks sputnick. Could you please help me understand this expression? Also, can you point me to a good link for studying regular expressions? – j10 Jan 26 '13 at 16:08
0

I'm not sure how to do this with sed (not too familiar with it), but you could use GNU grep with Perl-compatible regular expressions (see this answer for another example).

Here's a quick regex I've put together for your test input (assuming your data is in a file named 'foo'):

cat foo | grep -Po '(?<=\[A )[^\]]+'

This outputs:

hi
how
why

update - How this works:

The first portion of the regex, (?<=\[A ), is a positive lookbehind: it requires that the thing you are looking for be preceded by something (in this case a literal "[A ", including the trailing space) without making that prefix part of the match. This gives context to what you are looking for. The same can be accomplished with capture groups, but since I've not done this sort of thing before with grep, I wasn't sure how to use them here. The general lookbehind syntax is (?<=THING_TO_PRECEDE_YOUR_MATCH_WITH).

The second chunk, [^\]]+, says "find one or more characters that are not ]". Note that we have to escape the square brackets we want to match literally, because unescaped brackets have a special meaning in regular expressions. [^CHARSET] matches any character except those in the given character set or character class. The + means one or more of what precedes it.
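
As a sketch of the capture-group route mentioned above: grep itself cannot print a capture group, but its -o output can be piped to sed, which can. This assumes a grep that supports -o (GNU and BSD grep do) and, like the lookbehind version, expects exactly one space after the marker; the temporary file and the variable name are illustrative only.

```shell
# Recreate the sample input from the question in a temporary file
# (hypothetical file, used only for this demonstration).
sample=$(mktemp)
printf '%s\n' '[A hi] [B hello]' '[A how] [A why] [C some where]' > "$sample"

# Step 1: grep -o prints each "[A ...]" token on its own line.
# Step 2: sed keeps only the captured text between "[A " and "]".
extracted=$(grep -o '\[A [^]]*]' "$sample" | sed 's/^\[A \(.*\)]$/\1/')

printf '%s\n' "$extracted"
rm -f "$sample"
```

Because -o emits every match on a separate line, no explicit '\n' is needed in the pattern to get one word per line.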

Depending on your experience with regular expressions, this may or may not have been helpful; let me know if there are any points I could explain better. I'm not sure of the best place to learn these. Having used Python a lot, I find its syntax reference quite handy. Also, Google tends to point to http://www.regular-expressions.info/ a lot, but I can't say from experience how useful it is.

Adam Wagner
  • Thanks Adam. Could you please help me understand this expression? Also, can you point me to a good link for studying regular expressions? – j10 Jan 26 '13 at 16:00
  • @sputnick It's a bad habit. – Adam Wagner Jan 26 '13 at 16:18
  • @JitenShah See my update. Not sure it'll be entirely useful, but maybe I've shed a little light on what's going on. – Adam Wagner Jan 26 '13 at 16:20
  • @AdamWagner: thanks for the explanation, but when I write it to a file, how does it print one word on each line? I do not see any '\n' in our query. – j10 Jan 26 '13 at 16:24
  • @JitenShah That's just how grep's -o option works: each match is printed on its own line. Is that not what you wanted? – Adam Wagner Jan 26 '13 at 16:25
0

This might work for you (GNU sed):

sed -r '/\[A\s+([^]]*)\]/{s//\n\1\n/;s/[^\n]*\n//;P};D' file
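
For readers who want to follow the one-liner, here is the same script spread over several lines with comments. It is a sketch that assumes GNU sed, as the answer does; the temporary file and the variable name are illustrative only.

```shell
# Recreate the sample input from the question in a temporary file
# (hypothetical file, used only for this demonstration).
sample=$(mktemp)
printf '%s\n' '[A hi] [B hello]' '[A how] [A why] [C some where]' > "$sample"

sed_result=$(sed -r '
  # Only act while the pattern space still contains an "[A ...]" token.
  /\[A\s+([^]]*)\]/{
    # The empty regex reuses the address above: replace the first token
    # with its captured text, wrapped in newlines.
    s//\n\1\n/
    # Drop everything up to and including the first newline.
    s/[^\n]*\n//
    # Print up to the first newline, i.e. the captured text.
    P
  }
  # Delete up to the first newline and rerun the script on the remainder;
  # with no newline left, discard the line and read the next one.
  D
' "$sample")

printf '%s\n' "$sed_result"
rm -f "$sample"
```

The loop formed by P and D is what lets a single input line yield several output lines, one per "[A ...]" token.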
potong