Select only the first n matched elements from an HTML file using bash

Question

Using this command:

sed -n '/<article class.*article--nyheter/,/<\/article>/p' news2.html > onlyArticles.html

This gives me every article tag in my HTML document, which is more than 50 articles.

Sample input:

<article class="article column large-12 small-12 article--nyheter">
    ... variable number of lines of data
</article>

<article class="article column large-12 small-12 article--nyheter">
    ... variable number of lines of data
</article>

<article class="article column large-12 small-12 article--nyheter">
    ... variable number of lines of data
</article>

<article class="article column large-12 small-12 article--nyheter">
    ... variable number of lines of data
</article>

I only want x articles, for example just the top 2.

Output:

<article class="article column large-12 small-12 article--nyheter">
    ... variable number of lines of data
</article>

<article class="article column large-12 small-12 article--nyheter">
    ... variable number of lines of data
</article>

This is just an example. What I am trying to achieve is to select only the first x matching nodes.

Is there any way to do this? I cannot simply use head or tail, because I need to extract whole matching elements, not a fixed number of lines.

Please [don't parse XML/HTML with regex](https://stackoverflow.com/a/1732454/3776858). I suggest using an XML/HTML parser (xmlstarlet, xmllint, ...). – Cyrus, Nov 05 '21 at 21:12

LMC · Accepted Answer · 2021-11-05T21:58:37.487

xmllint with an XPath expression can select tags by position:

xmllint --html --recover --xpath '//article[position()<=2]' tmp.html 2>/dev/null
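
Not part of the original answer: a minimal sketch that wraps the command above in a shell function so the count can be passed as an argument. The function name first_articles is hypothetical, and the sketch assumes xmllint from libxml2 is installed. The expression is written as (//article)[position()<=N] so that positions are counted across the whole document; without the parentheses, position() is evaluated among each parent's own article children.

```shell
# Hypothetical helper (not part of xmllint): print the first N <article>
# elements of an HTML file.
#   $1 = number of articles to keep
#   $2 = path to the HTML file
first_articles() {
    # The parentheses make position() count over all articles in document order.
    xmllint --html --recover --xpath "(//article)[position()<=$1]" "$2" 2>/dev/null
}
```

It could then be called as `first_articles 2 news2.html > onlyArticles.html`.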

Should probably add the `--html` option. – Shawn Nov 05 '21 at 21:51
I get a lot of these errors when I run the above command: HTML parser error : htmlParseEntityRef: expecting ';' – mohsinali1317 Nov 05 '21 at 21:55
Added option suggested by @Shawn and `--recover` option that will try to overcome html inconsistencies. – LMC Nov 05 '21 at 21:59
The only issue is that encoding of text is wrong. – mohsinali1317 Nov 05 '21 at 22:48
do you have a sample? – LMC Nov 05 '21 at 22:58
The source file has this: "Flere blir 100 år – Tordis (102) tror hun vet noe av oppskriften", and the output file has this: "Flere blir 100 Ã¥r â Tordis (102) tror hun vet noe av oppskriften" – mohsinali1317 Nov 06 '21 at 10:03
Try `echo 'cat //article[position()<=2]' | xmllint --shell tmp.html 2>/dev/null` – LMC Nov 06 '21 at 15:51

potong · Answer 2 · 2021-11-06T11:02:35.670

This might work for you (GNU sed):

sed -En '/<article/{:a;p;n;/<\/article>/!ba;p;x;s/^/x/;/x{2}/{x;q};x}' file

Turn off implicit printing and turn on extended regexps with -En.

Match and print the lines between <article and </article>, then increment a counter in the hold space and quit once the required number of occurrences has been printed.

Alternative:

cat <<\! | sed -Enf - file
/<article/{
:a
p
n
/<\/article>/!ba
p            
x
s/^/x/
/x{2}/{
x     
q     
}
x
}
!
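
Not from the original answer: for systems without GNU sed, the same count-and-quit idea can be sketched in awk. The function name first_blocks is hypothetical. The sketch assumes the opening and closing article tags each sit on their own line, as in the sample input, and that the count is at least 1.

```shell
# Hypothetical helper: print the first N <article>...</article> blocks.
#   $1 = number of blocks to keep (assumed >= 1)
#   $2 = path to the HTML file
first_blocks() {
    awk -v max="$1" '
        /<article/ { inside = 1 }      # an opening tag starts a block
        inside     { print }           # print every line while inside a block
        /<\/article>/ && inside {      # a closing tag ends the block
            inside = 0
            if (++count >= max) exit   # stop once enough blocks were printed
        }
    ' "$2"
}
```

For example, `first_blocks 2 news2.html > onlyArticles.html` would keep the top 2 articles. Lines outside the article blocks, including the blank lines between them, are not printed.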

I get this "/ – mohsinali1317 Nov 06 '21 at 10:02
@mohsinali1317 are you using GNU sed and also enclosing the commands within single quotes? – potong Nov 06 '21 at 10:24
No I am not using GNU sed. – mohsinali1317 Nov 06 '21 at 10:26
are you sure that it has to be in this format? – mohsinali1317 Nov 06 '21 at 11:17
I get this sed: -: No such file or directory – mohsinali1317 Nov 06 '21 at 11:20
