I'm trying to extract some information from a website. The response I'm getting from the HTTP request is:

{"table_html": "\n
<div class='index-currency-table'>\n
    <!--http://1000hz.github.io/bootstrap-validator/#validator-usage-->\n
    <div class=\"row\">\n    
        <div class=\"col-xs-12\">\n        
            <table class=\"table--exchange table--exchange--responsive\">\n            
                <thead>\n                
                    <tr>\n                    
                        <th scope=\"col\">Currency</th>\n
                        <th scope=\"col\">Nominal</th>\n
                        <th scope=\"col\">The bank buys</th>\n
                        <th scope=\"col\">The bank sells</th>\n
                        <th scope=\"col\">BNB</th>\n
                    </tr>\n
                </thead>\n
                <tbody>\n                \n                    
                    <tr>\n                        
                        <td data-table-header=\"Currency\">\n                          
                            <a href=\"/en/rates-indexes/currency-rates/USD/\" target=\"_self\" title=\"United States Dollar\">
                                <span class=\"flag-icon flag-icon-us\"></span> USD
                            </a>\n
                        </td>\n
                        <td data-table-header=\"Nominal\">1</td>\n                        \n
                        <td data-table-header=\"The bank buys\">1.581200</td>\n
                        <td data-table-header=\"The bank sells\">1.646100</td>\n                        \n
                        <td data-table-header=\"BNB\">1.614390</td>\n
                    </tr>\n                \n
                </tbody>\n
            </table>\n
        </div>
        <!--col-->\n
    </div>
    <!--row-->\n
</div>\n\n"}

I want to get the buy and sell rate values (1.581200, 1.646100). Given that the HTML is represented as a string, what would be the best approach here? Regex looks like the simplest solution to me, but I don't think it's the best. Is there a way to parse the string back into HTML, or to convert the whole thing to proper JSON?

var regex = /[\d|,|.\+]+/g;

var string = "result.table_html";
var matches = string.match(regex);  
retro
2 Answers

I am confused about your sample input. It looks like JSON, but it's not. For this example I tweaked it to be valid JSON.

Best to use an HTML parser. You did not specify where the JavaScript is running. Here is an example for JavaScript running in the browser:

let input = '{"table_html": "\\n<div class=\'index-currency-table\'>\\n    <!--http://1000hz.github.io/bootstrap-validator/#validator-usage-->\\n    <div class=\\"row\\">\\n            <div class=\\"col-xs-12\\">\\n                    <table class=\\"table--exchange table--exchange--responsive\\">\\n                            <thead>\\n                                    <tr>\\n                                            <th scope=\\"col\\">Currency</th>\\n                        <th scope=\\"col\\">Nominal</th>\\n                        <th scope=\\"col\\">The bank buys</th>\\n                        <th scope=\\"col\\">The bank sells</th>\\n                        <th scope=\\"col\\">BNB</th>\\n                    </tr>\\n                </thead>\\n                <tbody>\\n                \\n                                        <tr>\\n                                                <td data-table-header=\\"Currency\\">\\n                                                      <a href=\\"/en/rates-indexes/currency-rates/USD/\\" target=\\"_self\\" title=\\"United States Dollar\\">                                <span class=\\"flag-icon flag-icon-us\\"></span> USD                            </a>\\n                        </td>\\n                        <td data-table-header=\\"Nominal\\">1</td>\\n                        \\n                        <td data-table-header=\\"The bank buys\\">1.581200</td>\\n                        <td data-table-header=\\"The bank sells\\">1.646100</td>\\n                        \\n                        <td data-table-header=\\"BNB\\">1.614390</td>\\n                    </tr>\\n                \\n                </tbody>\\n            </table>\\n        </div>        <!--col-->\\n    </div>    <!--row-->\\n</div>\\n\\n"}';

try {
  // parse string input to an object:
  let json = JSON.parse(input);
  // create an empty DOM element:
  let el = document.createElement( 'html' );
  // add json.table_html string to element:
  el.innerHTML = json.table_html;
  // select the "buys" `td` by data name:
  let buys = el.querySelector('td[data-table-header="The bank buys"]').innerHTML;
  // ditto for "sells" `td`:
  let sells = el.querySelector('td[data-table-header="The bank sells"]').innerHTML;
  console.log('buys: ' + buys);
  console.log('sells: ' + sells);
} catch(e) {
  console.log(e);
}

Output:

buys: 1.581200
sells: 1.646100

If your JavaScript runs in Node.js, you can use a different HTML parser, such as https://www.npmjs.com/package/node-html-parser
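
One way to keep the lookup independent of the parser is to isolate it in a small function. The sketch below is only an illustration: `readRates` is a hypothetical name, and it assumes the parsed root exposes `querySelector` and that matched elements expose `textContent`. Browser DOM nodes do; whether a particular Node.js parser offers the same interface should be checked against its documentation.

```javascript
// Hypothetical helper: works with any parsed root that exposes querySelector().
function readRates(root) {
  const cell = (header) => {
    const td = root.querySelector(`td[data-table-header="${header}"]`);
    // Return null instead of throwing when the cell is missing.
    return td ? td.textContent.trim() : null;
  };
  return { buys: cell('The bank buys'), sells: cell('The bank sells') };
}

// Browser usage (assumes `json.table_html` holds the HTML string):
//   const doc = new DOMParser().parseFromString(json.table_html, 'text/html');
//   console.log(readRates(doc));
```

Returning `null` for a missing cell lets the caller decide how to handle a changed page layout, rather than failing on a `null` dereference.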

Peter Thoeny

You could try to add the characters around the numbers you are looking for to the regex:

/<td data-table-header="The bank [^"]+">([\d,\.\+]+)<\/td>/g

Since the pattern for the numbers is wrapped with ( ) they will be stored in the first capture group.

You will have to use matchAll() to access the capture groups:

var matches = Array.from(
  string.matchAll(regex), // returns an iterator of all matches
  m => m[1]);             // select first capture group from each match

Working example: https://regex101.com/r/znX4Vu/1
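
For completeness, here is a self-contained sketch of the same idea that can be run as-is. The shortened `table_html` value and the names `response`, `result`, and `rates` are assumptions made for illustration, and the pattern adds a second capture group so each number can be keyed by its header.

```javascript
// Assumed input: a response whose table_html field holds the markup as a string.
const response = JSON.stringify({
  table_html:
    '<td data-table-header="The bank buys">1.581200</td>\n' +
    '<td data-table-header="The bank sells">1.646100</td>\n' +
    '<td data-table-header="BNB">1.614390</td>',
});

const result = JSON.parse(response);

// Group 1 captures the header suffix ("buys" or "sells"), group 2 the number.
const regex = /<td data-table-header="The bank ([^"]+)">([\d,.+]+)<\/td>/g;

// Build an object keyed by header suffix from the two capture groups.
const rates = Object.fromEntries(
  Array.from(result.table_html.matchAll(regex), (m) => [m[1], m[2]])
);

console.log(rates);
```

The BNB cell is skipped because its header does not start with "The bank ". A regex like this depends on the exact attribute order and quoting in the markup, so it is more fragile than an HTML parser if the page changes.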

Good Night Nerd Pride
  • You mean something like this: `var regex = /([\d|,|.\+]+)<\/td>/g; var string = "result.table_html"; var matches = string.match(regex);` Looks like I'm doing it wrong, because the result I get is null. Any idea where my mistake is? – retro Dec 12 '20 at 14:03
  • I had best result with const regex = /(?:>)([\d|,|.\+]+) – retro Dec 12 '20 at 14:23
  • thanks for the update but now I receive an empty array. var re = /([\d|,|.\+]+)<\/td>/g; var matches = Array.from(result.table_html.matchAll(re), m => m[1]); console.log(matches) => [] – retro Dec 12 '20 at 14:36
  • @retro I can't reproduce your problem: https://jsfiddle.net/rdyz8o4j/ – Good Night Nerd Pride Dec 12 '20 at 14:52
  • @retro btw I also fixed your regex, because it had unwanted `|` pipe characters in a character class. – Good Night Nerd Pride Dec 12 '20 at 14:53
  • I really can't understand why some of the examples, including yours, would not work. Maybe it's the IDE. – retro Dec 12 '20 at 15:28
  • I'm still trying with const regex = /(?:>)([\d|,|.\+]+)/g; but can't get rid of the ">" sign before all the numbers. – retro Dec 12 '20 at 17:12
  • @retro try to reproduce your problem with jsfiddle. Nobody wants to piece it together from comments. – Good Night Nerd Pride Dec 12 '20 at 18:44
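
The symptoms described in the comments above can be reproduced in a few lines. This is a sketch under assumptions: the one-cell `html` string stands in for the real `result.table_html`, and the variable names are made up for illustration.

```javascript
// Assumed stand-in for the real result.table_html value.
const html = '<td data-table-header="The bank buys">1.581200</td>';

// A quoted name is just the text "result.table_html"; it contains no </td>,
// so match() finds nothing and returns null.
const fromLiteral = "result.table_html".match(/([\d,.+]+)<\/td>/g);

// With the g flag, match() returns whole matches and drops capture groups,
// so the leading ">" stays in each result.
const wholeMatches = html.match(/(?:>)([\d,.+]+)/g);

// matchAll() exposes the groups; m[1] is the number without ">".
const groupsOnly = Array.from(html.matchAll(/(?:>)([\d,.+]+)/g), (m) => m[1]);

console.log(fromLiteral, wholeMatches, groupsOnly);
```

Inside a character class, `|` is a literal pipe rather than alternation, which is why the patterns here use `[\d,.+]` without it.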