
I would like to parse data from websites in Bash using sed or awk (feel free to suggest a different way to process the data).

Here is a sample of the markup.

<tbody>
        <tr>
            <td class="text-left">111</td><td class="text-center">
                <a href="/path1.htm">AAA</a>
            </td><td class="text-center">
                <a href="/path2.htm" class="tp-link-policy">BBB</a>
            </td><td class="text-center">
                Updated October, 2016
            </td>
        </tr><tr>
            <td class="text-left">CCC</td><td class="text-center">
                <a href="/path3.htm">
            .
            .
            .
            .
        </tr>
</tbody>

Usually, when I used preg_match in PHP, I had no problem with newlines, but in Bash I need to change the way I think about regular expressions completely. Do you recommend preparing the data first so that sed and awk can read it, that is, deleting all newlines and then reinserting them in a different way, depending on the data structure I want to process?

For example, I would put each <tr> on its own line, so the result would look like the following. Is this the right approach, or should I abandon this way of thinking? It would work, but I do not feel comfortable manipulating data like this.

<tbody>
<tr><td class="text-left">111</td><td class="text-center"><a href="/path1.htm">AAA</a></td><td class="text-center"><a href="/path2.htm" class="tp-link-policy">BBB</a></td><td class="text-center">Updated October, 2016</td></tr>
<tr><td class="text-left">CCC</td><td class="text-center"><a href="/path3.htm">....</tr></tbody>

Output should be, for example:

111|AAA|BBB|Updated October, 2016
Benjamin W.
Pavol Travnik

1 Answer

In the end, I used xmllint.

xmllint --html --shell <file>

Inside the xmllint shell, I ran this command to find the XPath of the node I was looking for.

grep <text>

Once you know the structure of your HTML file, you can query the whole file with an XPath expression.

xmllint --html --xpath <xpath> <file>

However, using Python and Beautiful Soup is much more efficient.
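
For illustration, here is a minimal sketch of the Python route. Note that it swaps in the standard library's html.parser instead of Beautiful Soup so that it runs without installing anything; the class name RowExtractor and the SAMPLE string (copied from the first complete row in the question) are my own, hypothetical names. It joins the text of each <td> in a <tr> with "|", and it is not tested against real-world pages.

```python
from html.parser import HTMLParser


class RowExtractor(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row.

    Hypothetical helper for illustration; not part of any library.
    """

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text fragments of the current cell, or None outside <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text inside nested tags such as <a> also lands here.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            # Collapse newlines and indentation inside the cell to single spaces.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


# Assumption: the first complete row from the question, used as test input.
SAMPLE = """
<tbody>
        <tr>
            <td class="text-left">111</td><td class="text-center">
                <a href="/path1.htm">AAA</a>
            </td><td class="text-center">
                <a href="/path2.htm" class="tp-link-policy">BBB</a>
            </td><td class="text-center">
                Updated October, 2016
            </td>
        </tr>
</tbody>
"""

parser = RowExtractor()
parser.feed(SAMPLE)
parser.close()

lines = ["|".join(row) for row in parser.rows]
print("\n".join(lines))  # prints: 111|AAA|BBB|Updated October, 2016
```

Because the parser tracks tags rather than lines, the newlines inside the cells do not matter, so there is no need to reshape the input first.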

Pavol Travnik