Is this text extraction scenario possible in linux bash shell?

Question

Let's say my text file is like this

Person1 : movie1
(space and tab) : movie 2
(space and tab) : movie 3
(space and tab) : movie 4

For a particular movie, I want to find the actor. Here is how I am going about it.

Run cat actors | grep 'movie3'

This gives me line 3, which is blank up to the point where movie3 appears. So if I can somehow get the nearest line before it that matches this pattern

grep '^[^ \t].' (does not start with a space or tab)

that has to be the line with the name of the actor in this movie. (I don't care about movie1 there.)

Is there any combination of sed/grep/awk which can help me do it in shell? I hope the question is clear.
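The approach described in the question (find the matching line, then walk back to the nearest non-indented line) can be sketched with plain grep, head and tail. This is only a rough sketch: the helper name find_actor, the temporary file and the sample data are assumptions made for illustration.

```shell
# Hypothetical sample data; the real file name and layout are assumptions.
actors=$(mktemp)
printf 'Person1 : movie1\n \t: movie 2\n \t: movie 3\n \t: movie 4\nPerson2 : movie 5\n \t: movie 6\n' > "$actors"

# find_actor MOVIE FILE (hypothetical helper)
find_actor() {
    # Line number of the first line that mentions the movie (fixed string match).
    n=$(grep -n -F -- "$1" "$2" | head -n 1 | cut -d: -f1)
    [ -n "$n" ] || return 1
    # Keep lines 1..n, keep only lines that do not start with whitespace,
    # take the last of those, then strip everything from the colon onwards.
    head -n "$n" "$2" | grep '^[^[:space:]]' | tail -n 1 | sed 's/[[:space:]]*:.*//'
}

find_actor 'movie 3' "$actors"
```

The pipeline reads the file twice, which is fine for small files; the awk and sed answers below do the same job in a single pass.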

zx81 · Accepted Answer · 2014-06-29T11:20:20.663

3

Bill Murray <- Groundhog Day <- grep with Perl mode Magic

It's a bit tricky, but you can use this:

grep -P "(?sm)^\S+[^:\r\n]*?(?=\s*:(?:(?!^\S).)*?Groundhog Day)" mymoviefile

See demo.

-P activates Perl mode
(?sm) turns on two mode modifiers:
s activates DOTALL mode, allowing the dot to match across lines
m turns on multi-line mode, allowing ^ and $ to match on each line
The ^ anchor asserts that we are at the beginning of the line
\S+ matches one or more non-space chars
[^:\r\n]*? lazily matches any non-colon, non-newline chars, up to ...
the point where the lookahead (?=\s*:(?:(?!^\S).)*?Groundhog Day) can assert, without consuming chars, that what follows is...
\s*: optional spaces and a colon
then (?:(?!^\S).)*? zero or more chars, none of which is a non-space char at the beginning of a line, lazily matching up to...
Groundhog Day the movie title!

edited Jun 29 '14 at 11:20

answered Jun 29 '14 at 10:59

zx81


I tried running it. It did not work. Here is the error message: grep: unrecognized character after (? or (?-. I am trying to debug it, but since it is very complex and I don't know half the things you have used here, I think I will need your further help. :^D – Max Jun 29 '14 at 11:03
Added tweak and tweak, have a look. :) – zx81 Jun 29 '14 at 11:04
Thanks for your help. But it is definitely not for the faint-hearted. – Max Jun 29 '14 at 11:10
Finished the explanation. ` it is definitely not for the faint hearted` You're right, it's far from obvious, but with the explanation I'm sure you'll be able to understand it. Is it working? – zx81 Jun 29 '14 at 11:13
After that explanation, I actually owe you 50-60 reputation at least! :) – Max Jun 29 '14 at 11:25
Nah, it was a real pleasure, you're most welcome! :) If you want to do me (or you) a favor, go learn some more cool regex! :) For instance there are a few interesting questions in the right pane of my profile, the [regex FAQ](http://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean/22944075#22944075) is also good, then answers by some of the regex gods here (click on top users of all time in the regex tag), or sites like regular-expressions.info and rexegg... Regex is cool, Dude! :) – zx81 Jun 29 '14 at 11:29

score 3 · Answer 2 · answered Jun 29 '14 at 11:22

I would do it with awk, if I understood the problem right:

awk -F: -v s="$search" '$1~/\S/{p=$1}$2~s{print p FS $2}' file

test with movie 3:

kent$ cat f
Person1 : movie1
          : movie 2
          : movie 3
          : movie 4

in above file, there are leading spaces/tabs

kent$  awk -F: -v s="movie 3" '$1~/\S/{p=$1}$2~s{print p FS $2}' f
Person1 : movie 3
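A commented variant of the same one-liner, as a minimal sketch. It uses [^ \t] in place of \S on the assumption that not every awk supports \S; the helper name actor_for and the sample file are hypothetical.

```shell
# Hypothetical sample file with two people; layout assumed from the question.
movies=$(mktemp)
printf 'Person1 : movie1\n \t: movie 2\n \t: movie 3\nPerson2 : movie 5\n \t: movie 6\n' > "$movies"

# actor_for PATTERN FILE (hypothetical helper); PATTERN is used as a regex.
actor_for() {
    awk -F: -v s="$1" '
        # First field contains a non-blank character: remember it as the person.
        $1 ~ /[^ \t]/ { p = $1 }
        # Second field matches the search pattern: print person, colon, movie.
        $2 ~ s        { print p FS $2 }
    ' "$2"
}

actor_for 'movie 6' "$movies"
```

Since s is passed with -v and used on the right-hand side of ~, it is treated as a regular expression, so a search for movie also matches movie1, movie 2 and so on.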

Kent


I created a file just like yours, no leading space in the line with person1: movie1. and I ran the exact command, you gave me. It gave just this, (start of line):movie 3. – Max Jun 29 '14 at 11:29
I am on linux. It is expected to work there, in case you ran it on mac? – Max Jun 29 '14 at 11:39
@Dude I only have linux. I guess because your gawk version is lower than mine, you could try: `awk ... '$1~/[^ \t]/{....}'` – Kent Jun 29 '14 at 12:12
Yup it worked. If you don't mind, could you please explain the regex briefly. – Max Jun 29 '14 at 12:18
@Dude the regex just matches a string ($1, the first column) if it contains any non-blank character. Problems like this are typical awk territory. grep is great, but here it is not the right tool for the job (in my opinion). – Kent Jun 29 '14 at 12:19

score 2 · Answer 3 · answered Jun 29 '14 at 16:01

This might work for you (GNU sed):

sed -n '/^\S/h;/movie 3/{H;x;s/:.*:/:/p}' file

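The -n switch suppresses automatic printing, giving grep-like behaviour. Save the person line in the hold space and append the matching movie line to it. Then swap the result into the pattern space, remove the unwanted text between the first and last colon, and print.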
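The same script spread over several lines with comments, as a sketch assuming GNU sed. It uses [^[:space:]] instead of \S; the sample file is hypothetical.

```shell
# Hypothetical sample file; layout assumed from the question.
file=$(mktemp)
printf 'Person1 : movie1\n \t: movie 2\n \t: movie 3\n \t: movie 4\nPerson2 : movie 5\n' > "$file"

result=$(sed -n '
# A line starting with a non-blank character names a person: copy it to the hold space.
/^[^[:space:]]/h
/movie 3/{
# Append the matching line to the saved person line, then swap the pair into the pattern space.
H
x
# Drop everything from the first colon to the last colon, leaving "person : movie", and print.
s/:.*:/:/p
}
' "$file")
echo "$result"
```

The swap leaves the movie line in the hold space, so a second match under the same person would lose the name. For a single match this does not matter.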

potong


score 0 · Answer 4 · answered Jun 29 '14 at 22:35

This is a bit obscure but gets the job done:

awk '/^[^ ]/{p=0} /Person1/{p=1} p'

Example:

Input file:

Person1 : movie1
    : movie 2
    : movie 3
    : movie 4
Person2 : movie 5
    : movie 6

Execution:

awk '/^[^ ]/{p=0} /Person1/{p=1} p' file
Person1 : movie1
    : movie 2
    : movie 3
    : movie 4

awk '/^[^ ]/{p=0} /Person2/{p=1} p' file
Person2 : movie 5
    : movie 6

Note: on the command line the output is indented.

Explanation:

If the line does not start with a space, set p=0.
If the line contains Person1, set p=1.
If p is non-zero, print the line (a bare pattern with no action prints the current line by default, which is the obscure part).
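The three rules can be written out with the name passed in as a variable. This is a sketch only: the helper name movies_of and the sample file are assumptions, and tabs are treated as indentation as well as spaces.

```shell
# Hypothetical sample file; layout assumed from the example above.
f=$(mktemp)
printf 'Person1 : movie1\n \t: movie 2\n \t: movie 3\n \t: movie 4\nPerson2 : movie 5\n \t: movie 6\n' > "$f"

# movies_of NAME FILE (hypothetical helper); NAME is used as a regex.
movies_of() {
    awk -v who="$1" '
        # A non-indented line starts a new block: stop printing.
        /^[^ \t]/ { p = 0 }
        # The line mentions the requested person: start printing.
        $0 ~ who  { p = 1 }
        # Bare pattern with no action: print the line while p is non-zero.
        p
    ' "$2"
}

movies_of Person2 "$f"
```

The order of the rules matters: the reset runs before the name check, so the person's own line switches printing back on in time to be printed.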

Can be done in perl too:

perl -ne '$p = 0 if /^\w/; $p = 1 if /Person1/; print if $p' file
