I'm trying to extract the content between certain HTML tags. I have been referring most recently to the question "How to print lines between two patterns, inclusive or exclusive (in sed, AWK or Perl)?". I've tried two or three of the suggestions there, plus a suggestion from another page, and I cannot get any of them to work.

The regex <\s*p(\s+.*?>|>).*?<\s*/\s*p\s*> works inside of an online sed editor, but it doesn't work in my GNU shell.

The pattern sed -n '/PAT1/,/PAT2/{/PAT2/!p}' FILE written as sed -n '/<p>/,/<\/p>/p' FILE seems to fail silently, as it just returns everything in the file.
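
A possible explanation, sketched with hypothetical sample input: sed tests the end address of a range only on lines after the one that matched the start address. When <p> and </p> share a line, the range therefore stays open until some later line contains </p>, and if none does, everything to the end of the file is printed.

```shell
# Hypothetical input: both tags on one line, followed by unrelated lines.
same_line=$(printf '%s\n' '<p>one</p>' 'noise' 'more noise' |
  sed -n '/<p>/,/<\/p>/p')

# Hypothetical input: each tag on its own line.
own_lines=$(printf '%s\n' '<p>' 'one' '</p>' 'noise' |
  sed -n '/<p>/,/<\/p>/p')

printf 'tags on one line:\n%s\n\n' "$same_line"   # all three input lines
printf 'tags on own lines:\n%s\n' "$own_lines"    # <p>, one, </p> only
```

In the first case the range opens on line 1 and never closes, which matches the "returns everything" symptom.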

The pattern awk '/PAT1/{flag=1; next} /PAT2/{flag=0} flag' file in my shell as awk '/<p>/{flag=1; next}/<\/p>/{flag=0} flag' file returns the file without the matches, but it also contains the rest of the (non-matching) file.

Andrew
  • sed's `/pat1/,/pat2/` only works properly if they are different lines. `\s`, `*?`, `|`, etc are not standard sed syntax but would work in Perl. – jhnc Jan 29 '23 at 05:02
  • Try to add a minimal failing test case to your question along with the code you tried, actual output, and desired output. – jhnc Jan 29 '23 at 05:04
  • Please [Don't Parse XML/HTML With Regex.](https://stackoverflow.com/a/1732454/3776858) I suggest to use an XML/HTML parser (xmlstarlet, xmllint ...). – Cyrus Jan 29 '23 at 07:18
  • It is impossible for that regexp to work in any sed, online or otherwise, as it's trying to use PCRE constructs (`.*?`) while sed only supports BRE or ERE. You may get the output you expect for some specific sample input but that doesn't mean it works. – Ed Morton Jan 29 '23 at 16:32
  • Please [edit] your question to replace "pattern" by string-or-regexp, full-or-partial, and word-or-line wherever it occurs and provide a [mcve] containing concise, testable sample input (make sure to include regexp metachars and undesirable substring matches) and expected output so we can help you solve whichever problem you're asking for help with as there is no general solution for all "patterns", see [how-do-i-find-the-text-that-matches-a-pattern](https://stackoverflow.com/questions/65621325/how-do-i-find-the-text-that-matches-a-pattern) for details. – Ed Morton Jan 29 '23 at 16:36
  • Perhaps massage the data into an easier structure i.e. place every `<p>` and `</p>` on a separate line and then tackle the problem e.g. `sed -E 's/<\/?p>/\n&\n/g;H;$!d;x;s/(<p>\n)\n/\1/g;s/\n(\n<\/p>)/\1/g' file|sed -n '/<p>/,/<\/p>/{//!p}'` – potong Jan 31 '23 at 09:40
  • I'm just capturing text between two strings. If i can match between characters, why can't i do it with word groups? isn't that what groups are for? though, requiring nested groups, this pattern is admittedly complex. Most people (in my professional opinion incorrectly) believe it not to be normative, but it is more practical to do it in sed on html files where you know the structure. – Andrew Jan 31 '23 at 15:35
  • here is the match in plain language: |1. match the beginning of the html tag '<p>' operator. |3 match all containing text between two groups with '.*' |4. match closing html tag '</p>' – Andrew Jan 31 '23 at 15:36
  • @potong that looks like an interesting solution, but doesn't fit with spec. =/ – Andrew Jan 31 '23 at 15:37
  • here's what i have so far: sed -e s/(?<=/<p>.*(?=/</p>/) here's an attempt for another datatype using grep: /(?<=/MHhGRkUw/).*(?=/MHhGRkVG/)/ – Andrew Jan 31 '23 at 15:40
  • another attempt: 's/<=?/PAT¹/.*,/PAT²/p' – Andrew Jan 31 '23 at 17:52

1 Answer

awk '/<p>/{flag=1; next}/<\/p>/{flag=0} flag' file

This solution assumes <p> and </p> are on their own lines, so it will work as expected for e.g.

<p>
This is paragraph
</p>
<i>
This is not paragraph
</i>
<p>
This is another paragraph
</p>

but not

<p>This is paragraph</p><i>This is not paragraph</i><p>This is another paragraph</p>
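
The difference between the two inputs can be checked with a minimal sketch like the following. The function name between_p is hypothetical and simply wraps the command above so it can read from a pipe.

```shell
# Wrap the awk command from this answer (between_p is a made-up helper name).
between_p() { awk '/<p>/{flag=1; next} /<\/p>/{flag=0} flag'; }

# Tags on their own lines: the lines between <p> and </p> are printed.
multi=$(printf '%s\n' '<p>' 'This is paragraph' '</p>' \
  '<i>' 'This is not paragraph' '</i>' \
  '<p>' 'This is another paragraph' '</p>' | between_p)

# Everything on one line: the line matches /<p>/, so "next" skips it
# before the trailing "flag" pattern can print anything.
single=$(printf '%s\n' \
  '<p>This is paragraph</p><i>This is not paragraph</i><p>This is another paragraph</p>' |
  between_p)

printf 'multi-line result:\n%s\n' "$multi"
printf 'single-line result: [%s]\n' "$single"
```

The single-line case produces no output at all, because the only input line is consumed by the first rule.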

Note that using regular expressions to process HTML is generally a bad idea, as HTML is a Chomsky Type-2 contraption, whilst regular expressions are designed for working with Chomsky Type-3 contraptions. Therefore I suggest using hxselect; if you are allowed to install tools, you might use it like so

hxselect -i -c -s '\n' 'p' < file

where -i means match case-insensitively, -c means print just the content (i.e. do not include the opening and closing tags), -s '\n' separates the found items with a newline character, and p is the CSS3 selector describing the tag to find (in this case all <p> tags).

Edit: if there are absolutely no newlines in your file and there are no nested p tags, then you might try using GNU AWK in the following way

awk 'BEGIN{RS="</?p>"}NR%2==0' file

and then hope it will work as intended.
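
To see why NR%2==0 picks out the paragraphs, here is a rough sketch on a hypothetical one-line sample. Every <p> or </p> ends a record, so records alternate between text outside and text inside a paragraph. In this sample the text before the first tag is record 1, which makes the even-numbered records the paragraph contents. A multi-character RS is treated as a regular expression by GNU AWK; POSIX leaves that case unspecified, so other awk implementations may behave differently.

```shell
# Hypothetical one-line input with text before the first <p>.
sample='<html><p>First</p><i>not a paragraph</i><p>Second</p></html>'

# Records: 1 "<html>", 2 "First", 3 "<i>not a paragraph</i>", 4 "Second", 5 "</html>"
paragraphs=$(printf '%s\n' "$sample" | awk 'BEGIN{RS="</?p>"} NR%2==0')

printf '%s\n' "$paragraphs"
```

A nested or unclosed p tag shifts the alternation, which is why this only suits input whose structure you already know.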

Daweo
  • i'd like to pattern match more than just html tags. matching a range between 2 string patterns is also useful for networking and analyzing exif data. if you could please stay on topic and suggest how to find the above on the same line. – Andrew Jan 29 '23 at 13:20
  • interesting theory, but theories themselves aren't proof. i'd rather stick to sed because it's available everywhere and because i'm abstracting it to also take other data types as input. – Andrew Jan 29 '23 at 13:26
  • @Andrew I extended my answer with GNU `AWK` which would (probably) work for **certain subset** of legal HTML files – Daweo Jan 29 '23 at 18:35