Awk multiline non-greedy matching workaround

Question

I'm trying to extract the contents of an HTML list using awk. Some list entries are multi-line.

Example input list:

<ul>
    <li>
        <b>2021-07-21:</b> Lorem ipsum 
    </li>
    <li>
        <b>2021-07-19:</b> Lorem ipsum 
    </li>
    <li><b>2021-07-10:</b> Lorem ipsum</li>
</ul>

Command I'm using:

awk -v RS="" '{match($0, /<li>(.+)<\/li>/, entry); print entry[1]}' file.html

Current output:

        <b>2021-07-21:</b> Lorem ipsum 
    </li>
    <li>
        <b>2021-07-19:</b> Lorem ipsum 
    </li>
    <li><b>2021-07-10:</b> Lorem ipsum

Desired output:

        <b>2021-07-21:</b> Lorem ipsum 
        <b>2021-07-19:</b> Lorem ipsum 
    <b>2021-07-10:</b> Lorem ipsum

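I know the issue is that the list entries are not separated by empty lines, so with RS="" the whole list is read as a single record and the greedy .+ runs from the first <li> to the last </li>. I thought of using non-greedy matching, but apparently awk doesn't support it. Is there a possible workaround?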

[Don't Parse XML/HTML With Regex.](https://stackoverflow.com/a/1732454/3776858) I suggest using an XML/HTML parser (xmlstarlet, xmllint ...). - Cyrus, Jul 24 '21 at 14:53
@Cyrus This is a small part of a huge awk script. Adding a dependency would be undesirable. - S9oXavyF, Jul 24 '21 at 14:56

Ed Morton · Accepted Answer · 2021-07-24T15:22:46.793

score 8

With GNU awk for multi-char RS and \s shorthand for [[:space:]]:

$ awk -v RS='\\s*</?li>\\s*' '!(NR%2)' file
<b>2021-07-21:</b> Lorem ipsum
<b>2021-07-19:</b> Lorem ipsum
<b>2021-07-10:</b> Lorem ipsum

I assume you either don't really want the leading white space shown in the desired output in your question, or you don't care whether it's present.
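To see why selecting the even-numbered records works, it can help to print every record with its number. The following is only a sketch: `show_records` is a hypothetical helper name, and `[ \t\n]` is swapped in for `\s` so that it does not depend on GNU awk's shorthand. It still assumes an awk that accepts a regular expression as RS (gawk and mawk do).

```shell
# Hypothetical helper: print each record as "NR: [record]" so the
# alternation between "outside an <li>" and "inside an <li>" is visible.
show_records() {
  # Same RS idea as above: optional whitespace, <li> or </li>, optional whitespace.
  awk -v RS='[ \t\n]*</?li>[ \t\n]*' '{ printf "%d: [%s]\n", NR, $0 }' "$@"
}

# Usage: show_records file
```

Every `<li>` or `</li>` ends a record, so the text inside a list item always lands in an even-numbered record, while the odd-numbered ones hold whatever sits between items (`<ul>`, an empty string between `</li>` and the next `<li>`, and `</ul>`). That is what `!(NR%2)` filters on.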

edited Jul 24 '21 at 15:22

answered Jul 24 '21 at 15:17

Ed Morton


RavinderSingh13 · Answer 2 · 2021-07-24T15:33:37.113

With your shown samples, please try the following awk code. Written and tested in GNU awk.

awk -v RS='</li>' '
match($0,/<li>.*/){
  val=substr($0,RSTART,RLENGTH)
  gsub(/<li>\n*[[:space:]]*|\n*[[:space:]]*$/,"",val)
  print val
}
' Input_file

Explanation: Adding detailed explanation for above.

awk -v RS='</li>' '              ##Starting awk program from here and setting RS as </li> here.
match($0,/<li>.*/){              ##Matching from <li> to the end of the record here (in awk, . also matches newlines).
  val=substr($0,RSTART,RLENGTH)  ##Creating val which has matched regex value here.
  gsub(/<li>\n*[[:space:]]*|\n*[[:space:]]*$/,"",val)  ##Globally substituting <li> followed by 0 or more new lines followed by 0 or more spaces, OR trailing new lines or spaces, with an empty string in val.
  print val                      ##Printing val here.
}
' Input_file                     ##Mentioning Input_file name here.

dawg · Answer 3 · Jul 24 '21 at 19:13 · score 1

Well, here is a Perl version:

perl -0777 -nE 'say $1  while(/<li>\s*([\s\S]*?)\s*<\/li>/g)' file
<b>2021-07-21:</b> Lorem ipsum
<b>2021-07-19:</b> Lorem ipsum
<b>2021-07-10:</b> Lorem ipsum
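The non-greedy `*?` used here has no direct awk equivalent, but the same "stop at the nearest closing tag" behaviour can be approximated with plain string searches. Below is a minimal sketch in POSIX awk, not a tested replacement for the command above: `extract_li` is a hypothetical function name, and it assumes the tags appear literally as `<li>` and `</li>` (no attributes, no nested lists).

```shell
# Hypothetical helper: emulate non-greedy <li>...</li> matching with index().
extract_li() {
  awk '
    { buf = buf $0 "\n" }                      # slurp the whole input into buf
    END {
      while ((start = index(buf, "<li>")) > 0) {
        buf  = substr(buf, start + 4)          # drop everything up to and including <li>
        stop = index(buf, "</li>")             # nearest closing tag gives the shortest match
        if (stop == 0) break                   # unterminated item: give up
        item = substr(buf, 1, stop - 1)
        gsub(/^[ \t\n]+|[ \t\n]+$/, "", item)  # trim leading and trailing whitespace
        print item
        buf = substr(buf, stop + 5)            # continue after </li>
      }
    }
  ' "$@"
}

# Usage: extract_li file
```

Because `index()` always returns the first occurrence, each item ends at the closest `</li>`, which is the effect a non-greedy quantifier would have had. The trade-off is that the whole file is held in memory.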

answered Jul 24 '21 at 19:13

dawg

