
I want to extract each HTML table from a list of links. The code I use is the following:

wget -O - "https://example.com/section-1/table-name/financial-data/" | xmllint --html --xpath '//*[@id="financial-data"]/div/table/tbody' - 2>/dev/null >> /Applications/parser/output.txt

This works perfectly fine. However, since this is not the only table I want to extract, it will be difficult to identify which financial data belongs to which table. In this scenario it only parses one table, which is appended to the output file, where the STDOUT looks like this:

<tbody>

                    <tr class="text-right">
                      <td class="text-left">Sep 08, 2017</td>
                      <td>4605.16</td>     
                      <td>4661.00</td>
                      <td>4075.18</td>
                      <td>4228.75</td>
                      <td>2,700,890,000</td>
                      <td>76,220,200,000</td>
                    </tr>

                    <tr class="text-right">
                      <td class="text-left">Sep 07, 2017</td>
                      <td>4589.14</td>     
                      <td>4655.04</td>
                      <td>4491.33</td>
                      <td>4599.88</td>
                      <td>1,844,620,000</td>
                      <td>75,945,000,000</td>
                    </tr>
...
</tbody>

But I am looking for this:

<tbody>

                    <tr class="text-right">
                      <td>TABLE-NAME</td>
                      <td class="text-left">Sep 08, 2017</td>
                      <td>4605.16</td>     
                      <td>4661.00</td>
                      <td>4075.18</td>
                      <td>4228.75</td>
                      <td>2,700,890,000</td>
                      <td>76,220,200,000</td>
                    </tr>

                    <tr class="text-right">
                      <td>TABLE-NAME</td>
                      <td class="text-left">Sep 07, 2017</td>
                      <td>4589.14</td>     
                      <td>4655.04</td>
                      <td>4491.33</td>
                      <td>4599.88</td>
                      <td>1,844,620,000</td>
                      <td>75,945,000,000</td>
                    </tr>
...
</tbody>

Where TABLE-NAME is the name of the specific asset. The name can be extracted either with the XPath /html/body/div[3]/div/div[1]/div[3]/div[1]/h1/text(), which is found on the same page as the table, or from the link itself (/table-name/).

I cannot figure out the syntax.
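For the second route, the name can be pulled out of the link itself with plain shell parameter expansion, no extra tools needed; a minimal sketch using the example URL from above:

```shell
# Extract "table-name" from the URL using shell parameter expansion.
url="https://example.com/section-1/table-name/financial-data/"
trimmed=${url%/financial-data/*}   # strip the trailing "/financial-data/" part
name=${trimmed##*/}                # keep only the last remaining path segment
echo "$name"                       # prints: table-name
```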

NB: I purposely omitted the -q flag from the wget command as I want to see what is happening in the Terminal at the moment the script is executed.

Thanks!


UPDATE

According to @DanielHaley this can be done with XMLStarlet; however, when I read through the documentation I could not find an example of how to use it.

What is the correct syntax? Do I first have to parse the HTML table via xmllint --html --xpath and then apply xmlstarlet afterwards?

This is what I've found so far:

-i or --insert <xpath> -t (--type) elem|text|attr -n <name> -v (--value) <value>
-a or --append <xpath> -t (--type) elem|text|attr -n <name> -v (--value) <value>
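To see what these edit operations do, here is a tiny illustration on a made-up document (not the real table), assuming xmlstarlet is installed:

```shell
# -i/--insert  adds the new node BEFORE the matched node (as a sibling).
# -a/--append  adds the new node AFTER the matched node (as a sibling).
# -s/--subnode adds the new node INSIDE the matched node (as its last child).
echo '<row><a/></row>' | xmlstarlet ed -i /row/a -t elem -n td -v X   # td before <a/>
echo '<row><a/></row>' | xmlstarlet ed -a /row/a -t elem -n td -v X   # td after <a/>
echo '<row><a/></row>' | xmlstarlet ed -s /row   -t elem -n td -v X   # td inside <row>
```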

NEW UPDATE

Following this link, I came across a script that adds a subnode easily, like this:

wget -O - "https://example.com/section-1/table-name/financial-data/" |
xmllint --html --xpath '//*[@id="financial-data"]/div/table/tbody' - 2>/dev/null |
xmlstarlet ed --subnode "/tbody/tr" --type elem -n td -v "Hello World" >> /Applications/parser/output.txt

Which appends the following to the output file:

<tbody>

                    <tr class="text-right">
                      <td class="text-left">Sep 08, 2017</td>
                      <td>4605.16</td>     
                      <td>4661.00</td>
                      <td>4075.18</td>
                      <td>4228.75</td>
                      <td>2,700,890,000</td>
                      <td>76,220,200,000</td>
                      <td>Hello World</td>
                    </tr>

                    <tr class="text-right">
                      <td class="text-left">Sep 07, 2017</td>
                      <td>4589.14</td>     
                      <td>4655.04</td>
                      <td>4491.33</td>
                      <td>4599.88</td>
                      <td>1,844,620,000</td>
                      <td>75,945,000,000</td>
                      <td>Hello World</td>
                    </tr>
...
</tbody>

So far so good; however, this inserts a fixed text string passed via the -v option, in this case "Hello World". I want to replace this string with the actual name of the asset. As stated previously, the TABLE-NAME is found on the same page as the table and can be accessed via the other XPath, hence I tried the following code:

wget -O - "https://example.com/section-1/table-name/financial-data/" |
header=$(xmllint --html --xpath '/html/body/div[3]/div/div[1]/div[3]/div[1]/h1' -) |
xmllint --html --xpath '//*[@id="financial-data"]/div/table/tbody' - 2>/dev/null |
xmlstarlet ed --subnode "/tbody/tr" --type elem -n td -v "$header" >> /Applications/parser/output.txt

Here you can clearly see that I tried declaring a variable $header that should contain the name of the asset. This does not work and leaves my output file empty, probably because the assignment is wrong or the pipe's syntax is incorrect.

How can I insert the result of the second XPath (the one that references the name of the asset) into the newly created subnode <td>? A variable is the first thing that came to mind; can it be done another way?
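For what it's worth, the reason the attempt above leaves the file empty is that each stage of a pipeline runs in its own subshell, so a variable assigned in one stage is invisible to the others. A minimal demonstration:

```shell
# A variable assigned inside a pipeline stage lives in a subshell and
# disappears when that stage ends:
echo hello | header=$(cat) | echo "header is: $header"
# prints "header is: " -- $header is empty in the last stage

# Assign via command substitution in the current shell instead:
header=$(echo hello)
echo "header is: $header"
# prints "header is: hello"
```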

Ava Barbilla
  • You are redirecting to `>> /Applications/parser/output.txt`. Is it a solution for you to redirect to `>> /Applications/parser/table-name.txt` ? – Walter A Sep 09 '17 at 20:03
  • Hi @WalterA I don't want to change the name of the file itself, instead, given that the tables are appended to that one file named **output.txt**, I want to add some sort of heading to each table inside of that file in order to identify each of them. The wget command will append approx. over 1160 different tables to one file with just the table information, which is why it would come handy to know which table belongs to which dataset. I proposed using the **TABLE-NAME** as a new row like `TABLE-NAME`. Can you help me with this? – Ava Barbilla Sep 10 '17 at 09:46
  • 1
    Are you able to use xmlstarlet instead of xmllint? – Daniel Haley Sep 11 '17 at 03:33
  • Hi @DanielHaley I managed to install xmlstarlet via Homebrew, so yes I am able to use it instead. I'm having a look at the documentation and must say I'm kind of new to this. What do you have in mind? Maybe you could provide an answer; it would really help me out! – Ava Barbilla Sep 11 '17 at 16:17
  • According to the [documentation](http://xmlstar.sourceforge.net/doc/UG/xmlstarlet-ug.html), it says there is a command `-a` or `--append` but it does not provide an example. I assume it might be something similar to `-d` or `-i`. Will I first have to parse out the table and then apply the `-a` command to append the **TABLE-NAME** at the end of each ``? I don't mind it being at the beginning or at the end as long as each row includes the name. On the other hand, how can I parse multiple XPaths? I.e. one XPath is for the table and the other is for the name, which are on the same webpage. – Ava Barbilla Sep 11 '17 at 17:11

3 Answers


You should insert the additional column before appending the output to output.txt. Make sure the table name you need is stored in a variable. You want to do something like:

tbl=testtbl
echo '<tbody>
                    <tr class="text-right">
                      <td class="text-left">Sep 08, 2017</td>
                      <td>4605.16</td>
                      <td>4661.00</td>
                      <td>4075.18</td>
                      <td>4228.75</td>
                      <td>2,700,890,000</td>
                      <td>76,220,200,000</td>
                    </tr>

                    <tr class="text-right">
                      <td class="text-left">Sep 07, 2017</td>
                      <td>4589.14</td>
                      <td>4655.04</td>
                      <td>4491.33</td>
                      <td>4599.88</td>
                      <td>1,844,620,000</td>
                      <td>75,945,000,000</td>
                    </tr>
' | sed 's#.*<tr.*#&\n      <td>'"${tbl}"'</td>#'

In the sed command the usual slashes are replaced by '#', so you do not need to escape the slash in </td>.
When you have a file alltables.txt with the approx. 1160 table names, you can make a loop like this:

# Note: the \n in the sed replacement requires GNU sed; BSD/macOS sed treats it literally.
while IFS= read -r tbl; do
   wget -O - "https://example.com/section-1/${tbl}/financial-data/" |
      xmllint --html --xpath '//*[@id="financial-data"]/div/table/tbody' - 2>/dev/null |
      sed 's#.*<tr.*#&\n      <td>'"${tbl}"'</td>#' >> /Applications/parser/output.txt
done < alltables.txt
Walter A
  • Hi @WalterA Thanks for your answer :) I was looking for something shorter and perhaps simpler like `xmlstarlet`. Are you familiar with it? – Ava Barbilla Sep 12 '17 at 02:02
  • You will need a variable `tbl` and some loop for all your solutions.I do not know `xmlstarlet`. – Walter A Sep 12 '17 at 20:03
  • Do you know, for instance, how to pass the `xmllint --html --xpath '/html/body/div[3]/div/div[1]/div[3]/div[1]/h1'` to a variable? Or do you suggest another approach which can be used via the command `-v`? – Ava Barbilla Sep 12 '17 at 22:31

You could probably do this with the ed (edit) command in xmlstarlet, but I don't know xmlstarlet well enough to give you an easy answer.

Also, like you said, it looks like you'd have to pass the HTML through either xmllint or the xmlstarlet fo command before handing it to xmlstarlet ed. It doesn't look like ed supports --html.
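For the record, that intermediate step might look something like this sketch (using the example URL from the question; the -H/--html and -R/--recover flags of xmlstarlet fo parse the HTML and recover it into well-formed markup):

```shell
# Hedged sketch: let xmlstarlet fo turn the HTML into something ed can edit.
wget -O - "https://example.com/section-1/table-name/financial-data/" |
xmlstarlet fo -H -R 2>/dev/null |
xmlstarlet ed --subnode "//tbody/tr" --type elem -n td -v "TABLE-NAME"
```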

What I would do is use the xmlstarlet tr (transform) command with an XSLT stylesheet.

It's very verbose, but it's much safer than trying to parse HTML/XML with regex. It's also a lot easier to extend.

Here's the XSLT. I added comments to try to help you understand what's happening.

XSLT 1.0 (stylesheet.xsl)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes" omit-xml-declaration="yes"/>
  <xsl:strip-space elements="*"/>

  <!--Parameter to capture the table name. This is set on the command line.-->
  <xsl:param name="tablename"/>

  <!--Identity transform. Will basically output attributes/nodes without 
  change if not matched by a more specific template.-->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!--Template matching the root element. I do this to narrow the scope of what's 
  being processed.-->
  <xsl:template match="/*">
    <!--Process tbody.-->
    <xsl:apply-templates select=".//*[@id='financial-data']/div/table/tbody"/>
  </xsl:template>

  <!--Match tr elements so we can add the new td with the table name.-->
  <xsl:template match="tr">
    <!--Output the tr element.-->
    <xsl:copy>
      <!--Process any attributes.-->
      <xsl:apply-templates select="@*"/>
      <!--Create new td element.-->
      <td><xsl:value-of select="$tablename"/></td>
      <!--Process any children of tr.-->
      <xsl:apply-templates/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>

command line

wget -O - "https://example.com/section-1/table-name/financial-data/" | 
xml tr --html stylesheet.xsl -p tablename="/html/body/div[3]/div/div[1]/div[3]/div[1]/h1"

I was able to test this locally by using cat on a local html file instead of wget. Let me know if you want me to add the test file/result to my answer.

Daniel Haley
  • Hi @DanielHaley Thanks for your support. I added what I've found so far to my question that brings me really close to my result. Could you update your answer as to how the replacement should be done? I will gladly accept your solution if you can help me accomplish this final step! Best regards :) – Ava Barbilla Sep 12 '17 at 21:59
  • Any news about the script? – Ava Barbilla Sep 13 '17 at 19:33

This script works but is inefficient; it needs some editing:

name_query="html/body/div[3]/div/div[1]/div[3]/div[1]/h1/text()"

# Use xargs to TRIM result.
header=$(wget -O - "https://example.com/section-1/name-1/financial-data/" |
    xmllint --html --xpath "$name_query" - 2>/dev/null |
    xargs)

wget -O - "https://example.com/section-1/name-1/financial-data/" |
    xmllint --html --xpath '//*[@id="financial-data"]/div/table/tbody' - 2>/dev/null |
    xmlstarlet ed --subnode "/tbody/tr" --type elem -n td -v "$header" >> /Applications/parser/output.txt

This makes two requests:

  1. Fetch the name and pass it to variable $header
  2. Get the table and append a subnode <td>$header</td>

Hence, this writes the following to my output.txt file:

<tbody>

                    <tr class="text-right">
                      <td class="text-left">Sep 08, 2017</td>
                      <td>4605.16</td>     
                      <td>4661.00</td>
                      <td>4075.18</td>
                      <td>4228.75</td>
                      <td>2,700,890,000</td>
                      <td>76,220,200,000</td>
                      <td>Name 1</td>
                    </tr>

                    <tr class="text-right">
                      <td class="text-left">Sep 07, 2017</td>
                      <td>4589.14</td>     
                      <td>4655.04</td>
                      <td>4491.33</td>
                      <td>4599.88</td>
                      <td>1,844,620,000</td>
                      <td>75,945,000,000</td>
                      <td>Name 1</td>
                    </tr>
...
</tbody>

It's relatively slow because this could actually be done with a single request, but I can't figure out how.
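One possible way to get it down to a single request (a sketch, assuming the page fits comfortably in a shell variable) would be to fetch the page once and feed the same content to both extractions:

```shell
url="https://example.com/section-1/name-1/financial-data/"
name_query='/html/body/div[3]/div/div[1]/div[3]/div[1]/h1/text()'

# Fetch the page once and keep it in a variable.
page=$(wget -O - "$url")

# First extraction: the asset name (xargs trims the surrounding whitespace).
header=$(printf '%s' "$page" |
    xmllint --html --xpath "$name_query" - 2>/dev/null |
    xargs)

# Second extraction: the table, with the name added to each row.
printf '%s' "$page" |
    xmllint --html --xpath '//*[@id="financial-data"]/div/table/tbody' - 2>/dev/null |
    xmlstarlet ed --subnode "/tbody/tr" --type elem -n td -v "$header" >> /Applications/parser/output.txt
```

The same idea should work with a temporary file instead of a variable if the page turns out to be too large.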

Ava Barbilla