
I'm trying to extract the contents of an HTML list using awk. Some list entries are multi-line.

Example input list:

<ul>
    <li>
        <b>2021-07-21:</b> Lorem ipsum 
    </li>
    <li>
        <b>2021-07-19:</b> Lorem ipsum 
    </li>
    <li><b>2021-07-10:</b> Lorem ipsum</li>
</ul>

Command I'm using:

awk -v RS="" '{match($0, /<li>(.+)<\/li>/, entry); print entry[1]}' file.html

Current output:

        <b>2021-07-21:</b> Lorem ipsum 
    </li>
    <li>
        <b>2021-07-19:</b> Lorem ipsum 
    </li>
    <li><b>2021-07-10:</b> Lorem ipsum

Desired output:

        <b>2021-07-21:</b> Lorem ipsum 
        <b>2021-07-19:</b> Lorem ipsum 
    <b>2021-07-10:</b> Lorem ipsum

I know the issue is that the list entries are not separated by empty lines. I thought of using non-greedy matching, but apparently awk doesn't support it. Is there a workaround?

S9oXavyF
  • [Don't Parse XML/HTML With Regex.](https://stackoverflow.com/a/1732454/3776858) I suggest using an XML/HTML parser (xmlstarlet, xmllint, ...). – Cyrus Jul 24 '21 at 14:53
  • @Cyrus This is a small part of a huge awk script. Adding a dependency would be undesirable. – S9oXavyF Jul 24 '21 at 14:56

3 Answers


With GNU awk for multi-char RS and \s shorthand for [[:space:]]:

$ awk -v RS='\\s*</?li>\\s*' '!(NR%2)' file
<b>2021-07-21:</b> Lorem ipsum
<b>2021-07-19:</b> Lorem ipsum
<b>2021-07-10:</b> Lorem ipsum

I assume you either don't really want the leading white space shown in the desired output in your question, or you don't care whether it's present.
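To see why `!(NR%2)` selects the item text, it can help to print every record with its number. The sketch below is an illustration, not part of the original command: the sample file and the `show_records` name are hypothetical, and `[ \t\n]` is substituted for `\s` as an assumption so that it also runs in awks that lack `\s` (a regex RS is still required). On this sample, records 2, 4 and 6 hold the item text, while the odd records hold `<ul>`, the empty strings between a closing and an opening tag, and `</ul>`.

```shell
# Hypothetical sample file reproducing the list from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<ul>
    <li>
        <b>2021-07-21:</b> Lorem ipsum
    </li>
    <li>
        <b>2021-07-19:</b> Lorem ipsum
    </li>
    <li><b>2021-07-10:</b> Lorem ipsum</li>
</ul>
EOF

# Print each record as "NR: [record]".
# [ \t\n] stands in for \s here (an assumption, for awks without \s).
show_records() {
  awk -v RS='[ \t\n]*</?li>[ \t\n]*' '{ printf "%d: [%s]\n", NR, $0 }' "$1"
}

show_records "$sample"
```

The even/odd alternation only holds while nothing but white space sits between one `</li>` and the next `<li>`, and while there are no nested lists.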

Ed Morton

With your shown samples, please try the following awk code, written and tested in GNU awk.

awk -v RS='</li>' '
match($0,/<li>.*/){
  val=substr($0,RSTART,RLENGTH)
  gsub(/<li>\n*[[:space:]]*|\n*[[:space:]]*$/,"",val)
  print val
}
' Input_file

Explanation: a line-by-line explanation of the above.

awk -v RS='</li>' '              ## Start the awk program with RS set to </li>.
match($0,/<li>.*/){              ## Match from <li> to the end of the current record.
  val=substr($0,RSTART,RLENGTH)  ## Store the matched text in val.
  gsub(/<li>\n*[[:space:]]*|\n*[[:space:]]*$/,"",val)  ## Remove <li> plus any white space after it, and any trailing white space, from val.
  print val                      ## Print val.
}
' Input_file                     ## Name of the input file.
RavinderSingh13

Well, here is a Perl version:

perl -0777 -nE 'say $1  while(/<li>\s*([\s\S]*?)\s*<\/li>/g)' file
<b>2021-07-21:</b> Lorem ipsum
<b>2021-07-19:</b> Lorem ipsum
<b>2021-07-10:</b> Lorem ipsum
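Perl's non-greedy `*?` has no direct awk equivalent, but a similar shortest-match effect can be approximated in plain awk by looking for the first `</li>` after each `<li>` with `index()`. This is only a sketch under stated assumptions: no nested lists, tags written exactly as `<li>` and `</li>` (no attributes), and a file small enough to hold in memory. The `extract_li` name and the sample file are hypothetical.

```shell
# Hypothetical sample file reproducing the list from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<ul>
    <li>
        <b>2021-07-21:</b> Lorem ipsum
    </li>
    <li>
        <b>2021-07-19:</b> Lorem ipsum
    </li>
    <li><b>2021-07-10:</b> Lorem ipsum</li>
</ul>
EOF

# Emulate a non-greedy <li>...</li> match using only index() and substr().
extract_li() {
  awk '
    { buf = buf $0 "\n" }                      # collect the whole file in buf
    END {
      while ((s = index(buf, "<li>")) > 0) {
        buf = substr(buf, s + 4)               # drop everything through <li>
        e = index(buf, "</li>")                # first closing tag = shortest match
        if (e == 0) break                      # unterminated item: stop
        item = substr(buf, 1, e - 1)
        sub(/^[ \t\n]+/, "", item)             # trim leading white space
        sub(/[ \t\n]+$/, "", item)             # trim trailing white space
        print item
        buf = substr(buf, e + 5)               # continue after </li>
      }
    }
  ' "$1"
}

extract_li "$sample"
```

Because it avoids a regex RS and `\s`, this variant should not depend on GNU awk extensions, at the cost of reading the whole input before printing anything.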
dawg