
Could someone help out with this? I want to print all lines between the search patterns (START and END) to different files (the new file name can be any incremental name).

But the search pattern repeats in the file, so each time it finds the pattern it should dump the lines between them into a different file.

The file is something like this

START --- ./body1/b1
##########################

123body1
abcbody1

##########################
END --- ./body1/b1

START --- ./body2/b2
##########################

123body2
defbody2

##########################
END --- ./body2/b2
James bond
  • Perhaps http://stackoverflow.com/questions/13023595/sed-awk-print-text-between-patterns-spanned-across-multiple-lines?rq=1 might help –  Aug 13 '13 at 05:00

7 Answers


Here is my awk solution:

# print_between_patterns.awk
/^START/ { filename = $NF ; next } # On START, use the last field as file name
/^END/ { next }                    # On END, skip
{ print > filename }               # For the rest of the lines, print to file

Assuming your data file is called data.txt, the following will do what you want:

awk -f print_between_patterns.awk data.txt

Discussion

  • After the script runs, you will have ./body1, ./body2, and so on.
  • If you don't want to skip the START and END lines, remove the next commands.
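A quick end-to-end check of the script above. The data is a trimmed, made-up version of the sample input (hello1/hello2 are stand-in contents); note the script writes to whatever path follows START, so that path's directory must already exist:

```shell
# Build a small sample input (hypothetical contents).
cat > data.txt <<'EOF'
START --- ./body1
hello1
END --- ./body1
START --- ./body2
hello2
END --- ./body2
EOF

# Same logic as print_between_patterns.awk, inlined.
awk '/^START/ { filename = $NF ; next }
     /^END/   { next }
     { print > filename }' data.txt

cat ./body1   # -> hello1
cat ./body2   # -> hello2
```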

Update

If you want to control the output filename in a sequential way:

/^START/ { filename = sprintf("out%04d.txt", ++count) ; next }
/^END/ { next }
{ print > filename }
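A sketch of the sequential-name variant; the input lines here are made-up stand-ins, and the generated names follow the out%04d.txt pattern above:

```shell
# Two blocks of hypothetical data.
printf 'START --- ./a\nalpha\nEND --- ./a\nSTART --- ./b\nbeta\nEND --- ./b\n' > data.txt

# Each START bumps the counter and opens the next sequential file.
awk '/^START/ { filename = sprintf("out%04d.txt", ++count) ; next }
     /^END/   { next }
     { print > filename }' data.txt

cat out0001.txt   # -> alpha
cat out0002.txt   # -> beta
```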
Hai Vu
  • Hi Hai Vu, it worked and created file names like ./body2 and ./body1, but if instead I have something like ./body1/heat rather than ./body1 in the file, then the ./body1/heat file doesn't get created. Do we need to add something to the script you provided? – James bond Aug 13 '13 at 05:53

perl solution,

perl -MFile::Basename -MFile::Path -ne '
  ($a) = /^START.+?(\S+)$/;
  $b = /^END/; 
  $a..$b or next; 
  if ($a){ mkpath(dirname $a); open STDOUT,">",$a; }
  $a||$b or print;
' file
mpapec
  • What if I have "START --- ./body1/b1" instead of "START --- ./body1"; how can we dump into a "body1_b1" file name? – James bond Aug 13 '13 at 07:32
  • Hi mpapec, it worked, but that was just an example. What if I have multiple levels of hierarchy, like "START --- ./body1/b1/bb1/bbb1"? How do I tackle this, or what needs to be changed in the above script each time based on the hierarchy? – James bond Aug 13 '13 at 08:04
  • @Jamesbond - I'm just curious: do you understand that perl script? Could you enhance or otherwise modify it in future if you had to? – Ed Morton Aug 13 '13 at 12:39

To get automatically generated incremental file names:

awk '
/^END/   { inBlock=0 }
inBlock  { print > outfile }
/^START/ { inBlock=1; outfile = "outfile" ++count }
' file

To use the file names from your input:

awk '
/^END/   { inBlock=0 }
inBlock  { print > outfile }
/^START/ {
    inBlock=1
    outdir = outfile = $NF
    sub(/\/[^\/]+$/,"",outdir)
    system("mkdir -p \"" outdir "\"")
}
' file
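A quick check of the directory-creating variant above, using the question's nested paths (the line contents are hypothetical). The mkdir -p runs via system() on the START line, so the directory exists before the first print opens the file:

```shell
# One hypothetical block with a nested target path.
cat > input.txt <<'EOF'
START --- ./body2/b2
hello2
END --- ./body2/b2
EOF

awk '
/^END/   { inBlock=0 }
inBlock  { print > outfile }
/^START/ {
    inBlock=1
    outdir = outfile = $NF
    sub(/\/[^\/]+$/,"",outdir)
    system("mkdir -p \"" outdir "\"")
}
' input.txt

cat ./body2/b2   # -> hello2
```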

The problem @JamesBond was having in the comments below was that I wasn't escaping the "/" within the character list in the sub(), so I've updated my answer above to do that. There's absolutely no reason why that should need to be escaped, but apparently both nawk and /usr/xpg4/bin/awk require it:

$ cat file
the
quick/brown
dog

$ gawk '/[/]/' file
quick/brown

$ nawk '/[/]/' file
nawk: nonterminated character class [
 source line number 1
 context is
         >>> /[/ <<< ]/

$ /usr/xpg4/bin/awk '/[/]/' file
/usr/xpg4/bin/awk: /[/: [ ] imbalance or syntax error  Context is:
>>>     /[/     <<<

and gawk doesn't care either way:

$ gawk --lint --posix '/[/]/' file
quick/brown

$ gawk --lint '/[/]/' file        
quick/brown

$ gawk --lint --posix '/[\/]/' file
quick/brown

$ gawk --lint '/[\/]/' file        
quick/brown

They all work just fine if I escape the backslash without putting it in a character list:

$ /usr/xpg4/bin/awk '/\//' file    
quick/brown

$ nawk '/\//' file             
quick/brown

$ gawk '/\//' file
quick/brown

So I guess that's something worth remembering for portability in future!

Ed Morton
  • Hello Ed, when using the first script I get this error: "awk: syntax error near line 3 awk: bailing out near line 3". When running the second one I get "Unmatched '." – James bond Aug 13 '13 at 16:24
  • See my earlier comment to you in the thread below anubhava's answer about you using old, broken awk. – Ed Morton Aug 13 '13 at 16:28
  • Yeah, I did see that. Is there any other way I can install awk, or any other workaround? – James bond Aug 13 '13 at 16:31
  • The first one worked but the second one didn't. Another query: can't we have the START/END lines also appended within the files? – James bond Aug 13 '13 at 16:36
  • Of course, just re-arrange the 3 blocks of code so the START comes first and the END last. In what way did the second one "not work"? I already told you how to work around your old, broken awk problem. – Ed Morton Aug 13 '13 at 18:00
  • Yes, I did change the awk version as you said; it worked for the first one but not for the second, as it errored out with "Unmatched '." – James bond Aug 14 '13 at 03:10
  • Are you SURE you did a copy/paste of the script and didn't try to re-type it yourself? There's no reason for that script to produce an error about unmatched '. copy/paste the window where you execute the script you're running and the error message you get into your original question if you'd like help debugging it. – Ed Morton Aug 14 '13 at 04:56
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/35404/discussion-between-james-bond-and-ed-morton) – James bond Aug 14 '13 at 06:42

Using awk:

awk '/^START/ { out = sprintf("out%d", ++c); p=1; next }
     /^END/   { p=0; next }
     p        { print > out }' file

This will find and store each block between START and END in separate files named out1, out2, etc. (The next statements keep the START and END lines themselves out of the output files.)
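A self-contained check of this idea, using next so the START/END lines themselves stay out of the output files (the input lines are made up):

```shell
# Two hypothetical blocks.
printf 'START --- x\nfoo\nEND --- x\nSTART --- y\nbar\nEND --- y\n' > file

# p is set between START and END; ++c numbers the files from out1.
awk '/^START/ { out = sprintf("out%d", ++c); p=1; next }
     /^END/   { p=0; next }
     p        { print > out }' file

cat out1   # -> foo
cat out2   # -> bar
```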

anubhava

This is one way to do it in Bash.

#!/bin/bash

[ -n "$BASH_VERSION" ] || {
    echo "You need Bash to run this script."
    exit 1
}

shopt -s extglob || {
    echo "Unable to enable extglob shell option."
    exit 1
}

IFS=$' \t\n' ## Use default.

while read KEY DASH FILENAME; do
    if [[ $KEY == START && $DASH == --- && -n $FILENAME ]]; then
        CURRENT_FILENAME=$FILENAME
        DIRNAME=${FILENAME%%+([^/])}
        if [[ -n $DIRNAME ]]; then
            mkdir -p "$DIRNAME" || {
                echo "Unable to create directory $DIRNAME."
                exit 1
            }
        fi
        exec 4>"$CURRENT_FILENAME" || {
            echo "Unable to open $CURRENT_FILENAME for output."
            exit 1
        }
        for (( ;; )); do
            IFS= read -r LINE || {
                echo "End of file reached finding END block of $CURRENT_FILENAME."
                exec 4>&-
                exit 1
            }
            read -r KEY DASH FILENAME <<< "$LINE"
            if [[ $KEY == END && $DASH == --- && $FILENAME == "$CURRENT_FILENAME" ]]; then
                break
            else
                echo "$LINE" >&4
            fi
        done
        exec 4>&-
    fi
done

Make sure you save the script in UNIX file format, then run it as bash script.sh < file.

konsolebox
  • A shell is an environment from which to call tools. It has programming language constructs to let you sequence calling those tools. It is not intended to itself be a tool for things like parsing text files - there are other tools for that, e.g. awk, perl, etc. If you doubt that, compare the length and complexity of the above shell script to the scripts written in the tools designed to do this job. – Ed Morton Aug 13 '13 at 12:34
  • @EdMorton Sorry, but I have to disagree with that. That may apply to other shells, but not generally to Bash. Each tool, including Bash, has its strengths and weaknesses, and it's anyone's option to make the best use of it. I actually could have used awk for this; unfortunately it seems it was also a requirement from the OP to create a directory if it doesn't exist, which is still possible in awk but could be a dirty solution, since awk can only pass a single string to system() and have the shell reinterpret it, which sometimes causes syntax errors. – konsolebox Aug 13 '13 at 12:49
  • I agree that each tool has its strengths and weaknesses. Parsing text files is, by design, a strength of awk and a weakness of shell. Manipulating files and processes is, by design, a strength of shell and a weakness of awk. Therefore parse text files with awk and manipulate files and processes with shell. When you need to do both things, just use both tools, each for the appropriate part of the problem. The above shell script is 5 times the size of, and much more complex than, the awk+shell script I posted to do the same job. QED? – Ed Morton Aug 13 '13 at 13:14
  • Consider a command like system("echo \"" something "\""). If 'something' turns out to have a value containing a ", it causes a syntax error in the called sh. And that's only the basics of it. There are many possible unseen troubles because of this, unlike in Bash, where you can do "${VAR[@]}" and the arguments are certainly passed as they are, whatever they may be. One quick solution is to replace ["`$\] with their quoted forms, but that's only for a single argument. How about multiple arguments? If you're using Bash you wouldn't even have to go to the trouble. – konsolebox Aug 13 '13 at 13:26
  • You can't simply replace chars with their quoted form as they may already be quoted in the input. If you actually had problematic characters in your file names then you'd still use awk to parse the input files but pipe the awk output to shell to create the output files/dirs rather than simply calling shell via system() or otherwise handle it with an appropriate mix of shell and awk. Either way, it does not make sense to parse the input file with bash. You did remind me to quote the directory name in my script though so thanks for that! – Ed Morton Aug 13 '13 at 13:48
  • Piping the output to the shell would still be the same and you're already using two process just to make one task work. Anyway we have our ways and opinions. If you think bash musn't be used with text parsing then it's up to you. But clearly awk has a flaw and care is really needed when passing arguments to an external command. By the way can you give an example source for output where chars in a string to be passed as an argument is already quoted? Cause I don't think there is, and awk doesnt quote it by default. With user data probably although it's a very uncommon practice. – konsolebox Aug 13 '13 at 14:04
  • User data is what I'm thinking of and yes, it's uncommon but so is having quotation marks, etc. in a file name. Yes, one more process will be used instead of writing more than (since without more code the above shell script doesn't handle these edge case either) 5 times the code which is absolutely worth the tradeoff. We'll just have to agree to disagree and people reading this can form their own opinion of the right approach. – Ed Morton Aug 13 '13 at 14:17
  • By the way, piping the output to shell would not be the same as invoking shell from awk via system() since awk would print the file names as-is to stdout and then shell can read/use them exactly as if it was reading them from the input file so shell is dealing with the characters that are problematic to shell, not awk. – Ed Morton Aug 13 '13 at 14:29
  • If the shell reads the arguments line by line with the read command, it won't have a problem, but if the shell reads them as a script, then there would be the same problem. Commands, even in other languages like C, always call other binaries with separate argument strings, so it's wrong for awk to call a process with arguments separated from one string only. And trying to make the shell 'read' the arguments by lines is already a hack, e.g.: print dirname | "bash -c \"read && mkdir -p \\\"$REPLY\\\"\"" – konsolebox Aug 13 '13 at 14:40

I guess you need to see this:

perl -lne 'print if((/START/../END/) and ($_!~/START/ and $_!~/END/))' your_file

Tested below:

> cat temp
START --- ./body1
##########################

123body1
abcbody1

##########################
END --- ./body1

START --- ./body2
##########################

123body2
defbody2

##########################
END --- ./body2
> perl -lne 'print if((/START/../END/) and ($_!~/START/ and $_!~/END/))' temp
##########################

123body1
abcbody1

##########################
##########################

123body2
defbody2

##########################
> 
Vijay

This might work for you:

csplit -z file '/^START/' '{*}'

Files will be named xx00, xx01, and so on.
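A quick sketch of this approach (GNU csplit assumed; the '{*}' repeat count is a GNU extension, and -z elides the empty piece produced before the first START). The input lines are made up:

```shell
# Two hypothetical blocks.
printf 'START --- ./body1\nhello1\nEND --- ./body1\nSTART --- ./body2\nhello2\nEND --- ./body2\n' > file

# Split at every line beginning with START, as many times as possible.
csplit -z file '/^START/' '{*}'

grep -l hello2 xx*   # shows which piece holds the second block
```

Note that, unlike the awk answers above, each piece keeps its START and END lines.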

potong