Html Agility Pack search all nodes and save them

Question

I need to search an entire website for entries containing "00:00-00:01" and replace that text with "", as in the examples below.

<td id="tb"> Fr, 3.Sep.2021 00:00-00:01 </td>...<td id="tb"> Fr,3.Sep.2021 </td>

or

<td class="tbda">Fr, 3.Sep.2021 00:00-00:01</td>...<td class="tbda">Fr, 3.Sep.2021 </td>

or

<b>Fr, 3.Sep.2021 00:00-00:01</b>...<b>Fr, 3.Sep.2021</b>

A single one is no problem, but how can I find all of them, and how can I save the path to each one?

score 0 · Answer 1 · answered Jan 27 '22 at 22:05

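One way is to use a regular expression (shown here with Python's re module; text holds the page source):

import re
# a time such as "00:00" is five characters, and the sample cells may have spaces around the text
re.findall(r'<td\s+id="tb">\s*(\w+,\s*\d+\.\w+\.2021\s+[0-9:]{5}-[0-9:]{5})\s*</td>', text)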

But you want more detail: how each match was found and where. So find all the matching tags first, then the content between them, and save that content together with its HTML tag, like below:

  <div>
      <tr>  <!-- this is the start tag -->
        <td id="tb">Fr, 3.Sep.2021 00:00-00:01</td>  <!-- this is the content -->
      </tr>  <!-- this is the end tag -->
      ... more tr ...
  </div>
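
The find-then-save idea above can be sketched in Python with only the standard library. This is a minimal sketch, not a full HTML parser: the sample string, the names PATTERN, find_entries and strip_time_range, and the use of a character offset as the "where" are all assumptions made for illustration. It assumes the time range is the last thing inside a simple tag with no nested tags.

```python
import re

# Assumed sample input, built from the three snippets in the question.
html = (
    '<td id="tb"> Fr, 3.Sep.2021 00:00-00:01 </td>'
    '<td class="tbda">Fr, 3.Sep.2021 00:00-00:01</td>'
    '<b>Fr, 3.Sep.2021 00:00-00:01</b>'
)

# Hypothetical pattern: opening tag, date text, the time range, same closing tag.
PATTERN = re.compile(
    r'<(?P<tag>\w+)(?P<attrs>[^>]*)>'      # opening tag name and raw attributes
    r'(?P<date>[^<]*?)\s*00:00-00:01\s*'   # date text followed by the time range
    r'</(?P=tag)>'                         # closing tag with the same name
)

def find_entries(text):
    """Return one record per match: tag name, attributes, offset and date text."""
    return [
        {
            "tag": m.group("tag"),
            "attrs": m.group("attrs").strip(),
            "offset": m.start(),  # character position of the match in the source
            "date": m.group("date").strip(),
        }
        for m in PATTERN.finditer(text)
    ]

def strip_time_range(text):
    """Remove the time range but keep the tag, its attributes and the date."""
    return PATTERN.sub(
        lambda m: f'<{m.group("tag")}{m.group("attrs")}>'
                  f'{m.group("date").strip()}</{m.group("tag")}>',
        text,
    )

entries = find_entries(html)      # what was found, and where
cleaned = strip_time_range(html)  # the same markup without "00:00-00:01"
```

Each record in entries keeps the tag name and attributes, so it can be saved next to the cleaned text. A character offset is not an XPath; if a real node path is needed, an HTML parser is the better tool.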

The idea is described in "How to convert an XML file to nice pandas dataframe?".
