Using this command:

sed -n '/<article class.*article--nyheter/,/<\/article>/p' news2.html > onlyArticles.html 

I get all the article tags in my HTML document. There are 50+ articles.

Sample input:

<article class="article column large-12 small-12 article--nyheter">
    ... variable number of lines of data
</article>

<article class="article column large-12 small-12 article--nyheter">
    ... variable number of lines of data
</article>

<article class="article column large-12 small-12 article--nyheter">
    ... variable number of lines of data
</article>

<article class="article column large-12 small-12 article--nyheter">
    ... variable number of lines of data
</article>

I want only the first x articles, for example just the top 2.

Output:

<article class="article column large-12 small-12 article--nyheter">
    ... variable number of lines of data
</article>

<article class="article column large-12 small-12 article--nyheter">
    ... variable number of lines of data
</article>

This is just an example. What I am trying to achieve is to select only the first x matching nodes.

Is there any way to do it? I cannot simply use head or tail, because I need to extract whole matching elements, not a fixed number of lines.

tripleee
mohsinali1317

2 Answers


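xmllint with an XPath expression can select tags by position. The parentheses make position() count across the whole document rather than among the siblings of each parent element:

xmllint --html --recover --xpath '(//article)[position()<=2]' tmp.html 2>/dev/null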
LMC

This might work for you (GNU sed):

sed -En '/<article/{:a;p;n;/<\/article>/!ba;p;x;s/^/x/;/x{2}/{x;q};x}' file

Turn off implicit printing (-n) and turn on extended regexps (-E).

Match and print the lines from <article to </article>, then increment a counter in the hold space and quit once the required number of articles has been printed.
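The same count-and-quit idea can also be sketched in awk rather than sed, which makes the number of articles a parameter. This is only an illustration: the function name first_articles, the variables inside, n and max, and the throwaway sample file are all made up here, and it assumes each closing </article> tag ends up on a line of its own or at least after the opening tag, as in the sample input.

```shell
#!/bin/sh
# Hypothetical sample file, used only to demonstrate the function below.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<div>
<article class="article column large-12 small-12 article--nyheter">
    first
</article>
<p>noise between articles</p>
<article class="article column large-12 small-12 article--nyheter">
    second
</article>
<article class="article column large-12 small-12 article--nyheter">
    third
</article>
</div>
EOF

# first_articles N FILE: print the first N matching <article> blocks of FILE.
first_articles() {
    awk -v max="$1" '
        # Opening tag: start printing.
        /<article class.*article--nyheter/ { inside = 1 }
        # Print every line while inside an article.
        inside { print }
        # Closing tag: stop printing, count the article, exit after max.
        inside && /<\/article>/ { inside = 0; if (++n >= max) exit }
    ' "$2"
}

first_articles 2 "$sample"
```

Calling first_articles 2 on the sample prints the first and second blocks and stops reading before the third, so nothing after the second closing tag is scanned.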

Alternative:

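cat <<\! | sed -Enf - file
# On an opening tag, start printing the article.
/<article/{
:a
p
n
# Keep looping until the closing tag is in the pattern space.
/<\/article>/!ba
p
# Swap to the hold space and add one x per finished article.
x
s/^/x/
# After 2 articles, swap back and quit.
/x{2}/{
x
q
}
x
}
!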
potong