I have a very long text file (300-500 MB or more) with thousands of lines, like:

blavbl
[code]
sdasdasd
asdasd
...
[/code]

line X
line Y
etc
...

[code]
...
[/code]

blabla

[code]

[/code]

I want to split the text into pieces, each containing the lines between [code] and [/code]. I have the following code, which does the job (partially) but is very slow:

#!/bin/bash

function split {
        file="$1"
        start="$2"
        end="$3"

        nfodata=$(cat "$file")
        IFS=$'\n' read -d '' -a nfoarray <<< "$nfodata"

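        arr=()
        inside=0        # separate flag, so "$start" keeps holding the opening tag

        for line in "${nfoarray[@]}"
        do
                # the quoted part of the pattern is matched literally
                if [[ "$line" =~ ^"$start" ]]; then
                        arr+=("$line")
                        inside=1
                        continue
                fi

                if [[ "$line" =~ ^"$end" ]]; then
                        inside=0
                        break   # stops at the first closing tag
                fi

                if [[ $inside == 1 ]]; then
                        arr+=("$line")
                        continue
                fi
        done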

        printf "%s\n" "${arr[@]}"
}

split "$myfile" "[code]" "[/code]"

As I wrote, it is very slow, and I don't know whether there is a better or faster approach.

The final result should be an array containing the portions of text between [code] and [/code].

Snake Eyes
  • `faster approach` In steps: Step 1: Do not use Bash. Step 2: Use something else. Step 3: Use AWK or Python or Perl or literally anything else. – KamilCuk May 26 '22 at 10:33
  • Please, post the related expected output. Don't post it as a comment, an image, a table or a link to an off-site service but use text and include it to your original question. Thanks. – James Brown May 26 '22 at 10:40
  • `sed` or `awk` are well suited to the task. See https://stackoverflow.com/questions/38972736/how-to-print-lines-between-two-patterns-inclusive-or-exclusive-in-sed-awk-or. – fpmurphy May 26 '22 at 11:01

1 Answer

Using sed:

sed '/^\[code\]$/,/^\[\/code\]$/!d;//d'

Using awk:

awk  '
/^\[\/code\]$/ {--c} c
/^\[code\]$/ {++c}'

Both methods require the tags to alternate cleanly: no nested, repeated, or unclosed tags.
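Since both one-liners rely on that assumption, it may be worth checking the input first. A rough sketch of such a check follows; the `check_tags` name and the messages are made up for illustration and are not part of the commands above. It assumes each tag sits alone on its line, as in the patterns used here.

```shell
#!/bin/bash
# Hypothetical helper (not from the answer): report tags that do not
# alternate cleanly. Returns non-zero if any problem is found.
check_tags() {
    awk '
        /^\[code\]$/ {
            # an opening tag while a block is already open: nested or repeated
            if (inside) { printf "line %d: [code] inside an open block\n", NR; bad = 1 }
            inside = 1
        }
        /^\[\/code\]$/ {
            # a closing tag with no block open
            if (!inside) { printf "line %d: [/code] without [code]\n", NR; bad = 1 }
            inside = 0
        }
        END {
            if (inside) { print "end of file: unclosed [code]"; bad = 1 }
            exit (bad ? 1 : 0)
        }
    ' "$1"
}
```

It could be used as `check_tags "$myfile" && sed ... "$myfile"`, so the extraction only runs on input that passes the check.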

This prints all lines inside the tags, excluding the tags themselves. For example:

sdasdasd
asdasd
...
...
<empty line>
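The question asks for a bash array with one element per block, which the flat output above does not give directly. One possible way to get there, sketched under some assumptions: the `code_blocks` function name is hypothetical, the tags alternate cleanly, the text never contains the ASCII record separator byte (octal 036) used here as a delimiter, and bash is 4.4 or newer (for `mapfile -d`). The sample file only stands in for the real input.

```shell
#!/bin/bash
# Hypothetical helper: print the body of each [code]...[/code] block,
# terminated by the ASCII record separator (octal 036).
# Assumes that byte never occurs in the text itself.
code_blocks() {
    awk '
        /^\[\/code\]$/ { inside = 0; printf "%s\036", buf; next }
        inside         { buf = (n++ ? buf "\n" : "") $0 }
        /^\[code\]$/   { inside = 1; buf = ""; n = 0 }
    ' "$1"
}

# Small sample standing in for the real file
sample=$(mktemp)
printf '%s\n' 'blavbl' '[code]' 'sdasdasd' 'asdasd' '[/code]' \
              'line X' '[code]' 'more' '[/code]' > "$sample"

# One array element per block; -d needs bash 4.4 or newer
mapfile -d $'\036' -t blocks < <(code_blocks "$sample")
rm -f "$sample"

printf 'found %d blocks\n' "${#blocks[@]}"
printf 'first block:\n%s\n' "${blocks[0]}"
```

awk makes a single pass over the file and bash only splits on the delimiter, so the per-line bash loop from the question is avoided. For the real input, pass "$myfile" instead of "$sample".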
dan