I want to write a bash script that identifies a tag within a text by matching a multi-line pattern, so that I can use the identifying tag to process the nested tags later. I've searched through multiple questions, but they all fall short in one way or another, which makes it difficult to progress. What I have managed so far is to match the pattern and get the matched lines; however, they come out as a single output (I believe). First, here is the sample text file I am testing with.

random words to put here: dresser car street space 
*
********************************************************************************
********************************************************************************
-->
         interested data: name="someFile_1.txt"random data
      endMultilinePattern
   <!--****************Random comment***************-->
      startMultilinePattern id="someFileTag_2"
         interested data: name="someFile_2.txt"random data
      endMultilinePattern
   <!--****************Random comment***************-->
      startMultilinePattern id="someFileTag_3"
        interested data: name="someFile_3.txt"random data
      endMultilinePattern      
   some random data body
      some random nested data filepath="/" uuid="randomcharacters"random data
   some random data body
 more random data
 endMultilinePattern
      startMultilinePattern id="someFileTag_2"
         interested data: name="error_someFileTag_2.txt"random data
      endMultilinePattern
   <!--****************Random comment***************-->

Here are some outputs I've gotten and the answers that led to them. I may simply not know how to use the commands properly. First, the id I am interested in is the one in startMultilinePattern id="someFileTag_2"; I will use that id further down in the file to match other tags that use it. Second, I want to grab the name attribute in the interested data: name="..."random data tag so that I can look up that file in the filesystem for further processing. In this question, all I want to do right now is match startMultilinePattern ... multi-line match ... endMultilinePattern and then grab the file name within the interested data: name="..."random data tag. Here we go:

The following uses grep's -P option for Perl-compatible regular expressions. Although it produces the proper output, I can't seem to read it into an array and output each multi-line match separately.
Src: grep (bash) multi-line pattern

$ grep -Pzon "((startMultilinePattern )(.|\n)*?(endMultilinePattern))" test.txt | while read -a grepOut; do POS=$((POS+1)) && echo "0=${grepOut[0]}, 1=${grepOut[1]}, 2=${grepOut[2]}, 3=${grepOut[3]}}";done
0=1:startMultilinePattern, 1=id="someFileTag_2", 2=, 3=}
0=interested, 1=data:, 2=name="someFile_2.txt"random, 3=data}
0=endMultilinePattern1:startMultilinePattern, 1=id="someFileTag_3", 2=, 3=}
0=interested, 1=data:, 2=name="someFile_3.txt"random, 3=data}
0=endMultilinePattern1:startMultilinePattern, 1=id="someFileTag_2", 2=, 3=}
0=interested, 1=data:, 2=name="error_someFileTag_2.txt"random, 3=data}

# grep command by itself provides the following output: 
1:startMultilinePattern id="someFileTag_2"
         interested data: name="someFile_2.txt"random data
      endMultilinePattern1:startMultilinePattern id="someFileTag_3"
        interested data: name="someFile_3.txt"random data
      endMultilinePattern1:startMultilinePattern id="someFileTag_2"
         interested data: name="error_someFileTag_2.txt"random data
      endMultilinePattern

Using sed, which should presumably be more suitable, I found this interesting answer, but I have not been able to make it work. It uses some funky start keywords I don't understand. Src: https://unix.stackexchange.com/questions/112132/how-can-i-grep-patterns-across-multiple-lines

sed -n '/\startMultilinePattern /{:start /endMultilinePattern/!{N;b start};/startMultilinePattern .*\n.*\n.*endMultilinePattern/p}' test.txt

Additionally, the following sed command supposedly works, as it appears in numerous answers, but perhaps it relies on old functionality. I can't get it to work, as the output isn't what I intended. It includes part of the text I DON'T WANT, i.e., <some random data body ..... Src: https://unix.stackexchange.com/a/112134/388443

$ sed -e '/startMultilinePattern /,/endMultilinePattern/!d' test.txt
      startMultilinePattern id="someFileTag_2"
         interested data: name="someFile_2.txt"random data
      endMultilinePattern
      startMultilinePattern id="someFileTag_3"
        interested data: name="someFile_3.txt"random data
      endMultilinePattern
      startMultilinePattern id="someFileTag_2"
         interested data: name="error_someFileTag_2.txt"random data
      endMultilinePattern

There are other answers with their own ways of doing this. Some use awk, but I don't know awk, so I didn't try them. I also cannot use pcregrep because I don't have root permissions to install it. From what I understand, grep -P is more or less equivalent to pcregrep. Ideas?

LeanMan
  • [Don't Parse XML/HTML With Regex.](https://stackoverflow.com/a/1732454/3776858) I suggest to use an XML/HTML parser (xmlstarlet, xmllint ...). – Cyrus May 21 '20 at 06:13
  • Thanks I'll check it out but XML isn't a requirement just the specific context of the question. The filetype and XML tags are circumstantial rather than my only input files. I would treat the file as a general text therefore am looking for a general solution of regex regardless of file type. I will likely want to reuse this solution for other types of situation as I work through it. – LeanMan May 21 '20 at 06:17
  • To reiterate, what I am looking for is a method that can match a multiline pattern, so that I can take each occurrence of the pattern within a given file and extract further data. Looking at the post, it's quite amusing, but later down it talks about complexity, and XML complexity > regex complexity is why you don't use regex for XML. I understand that, but I believe my input file is too simplified and controlled to be too complex for regex. See: https://stackoverflow.com/a/1758162/10421103 – LeanMan May 21 '20 at 06:30

1 Answer

Would you please try the following:

str="$(<"test.txt")"            # slurps the whole file into the variable str
pattern='startMultilinePattern id="([^"]+)"[[:space:]]+interested data: name="([^"]+)"(.*)'
while [[ $str =~ $pattern ]]; do
    echo "${BASH_REMATCH[1]}"   # prints the id
    echo "${BASH_REMATCH[2]}"   # prints the filename
    str="${BASH_REMATCH[3]}"    # updates the variable str with the remaining substring
done

Output with the provided example:

someFileTag_2
someFile_2.txt
someFileTag_3
someFile_3.txt
someFileTag_2
error_someFileTag_2.txt

You can store the ids and the filenames in arrays or an associative array for further usage.
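
As a minimal sketch of that idea, the loop below collects the matches into two parallel indexed arrays. The array names ids and names are hypothetical, and a shortened copy of the sample text is inlined so that the snippet runs on its own; with a real file you would keep str="$(<"test.txt")" instead.

```bash
#!/usr/bin/env bash
# Shortened sample input, inlined only so the sketch is self-contained.
str='      startMultilinePattern id="someFileTag_2"
         interested data: name="someFile_2.txt"random data
      endMultilinePattern
      startMultilinePattern id="someFileTag_3"
        interested data: name="someFile_3.txt"random data
      endMultilinePattern'

pattern='startMultilinePattern id="([^"]+)"[[:space:]]+interested data: name="([^"]+)"(.*)'

ids=()      # hypothetical name: collects every matched id
names=()    # hypothetical name: collects the matching file names
while [[ $str =~ $pattern ]]; do
    ids+=("${BASH_REMATCH[1]}")
    names+=("${BASH_REMATCH[2]}")
    str="${BASH_REMATCH[3]}"    # continue with the text after the match
done

# Walk both arrays by index; ids[i] belongs to names[i].
for i in "${!ids[@]}"; do
    printf '%s -> %s\n' "${ids[i]}" "${names[i]}"
done
```

Parallel indexed arrays are used here rather than an associative array keyed by id because the full sample contains someFileTag_2 twice, and a second entry with the same key would overwrite the first.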

[Explanation]

  • It first reads the whole file into a variable str including newline characters to enable multiline pattern matching.
  • The variable pattern is a regular expression that matches a substring starting with startMultilinePattern, followed by the id, whitespace (including newline characters), interested data and the name. The capture groups are stored in the shell array BASH_REMATCH: ${BASH_REMATCH[1]} holds the id, ${BASH_REMATCH[2]} the name, and ${BASH_REMATCH[3]} the remaining substring after the match.
  • The expression $str =~ $pattern tests whether the string $str matches the regex $pattern. Because str is replaced with the remaining substring on every iteration, the while loop scans the entire file until no further match is found.
  • If the provided example is simplified and your actual file has a variation in the tags, you may need to tweak pattern accordingly.
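
To see the =~ operator and BASH_REMATCH in isolation, here is a small sketch on a single line taken from the sample. The variable names line, re, whole and id are made up for the illustration.

```bash
#!/usr/bin/env bash
line='startMultilinePattern id="someFileTag_2"'
re='id="([^"]+)"'    # one capture group: the text between the quotes

# The regex variable is left unquoted on the right-hand side of =~;
# quoting it would make bash treat it as a literal string.
if [[ $line =~ $re ]]; then
    whole="${BASH_REMATCH[0]}"   # the entire matched text
    id="${BASH_REMATCH[1]}"      # the first parenthesized group
    echo "$whole"                # prints: id="someFileTag_2"
    echo "$id"                   # prints: someFileTag_2
fi
```

Index 0 always holds the whole match, and each further index holds the corresponding parenthesized group, which is why the answer's pattern ends in (.*) to capture the rest of the text for the next iteration.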
tshiono
  • Sure I'll try it out, what does $str =~ $pattern do? – LeanMan May 21 '20 at 08:29
  • Yes it does as you state, I'll have to do some more testing to see if it provides the output in a programmatic usable way that serves for my intended purpose. Thank you for your answer! – LeanMan May 21 '20 at 09:39
  • @LeanMan thank you for the response. I've added an explanation in my answer. Hope it will help. – tshiono May 21 '20 at 10:53