2

I'm working on a C++ program and I need to read a piece of meta-data from a TIF file. The meta-data is a string that looks like the following:

<GDALMetadata>
  <Item name="BANDWIDTH"></Item>
  <Item name="CENTER_FILTER_WAVELENGTH"></Item>
  <Item name="DATA_SET_ID">&amp;quot;LRO-L-LOLA-4-GDR-V1.0&amp;quot;</Item>
  <Item name="FILTER_NAME"></Item>
  <Item name="INSTRUMENT_ID">&amp;quot;LOLA&amp;quot;</Item>
  <Item name="INSTRUMENT_NAME">&amp;quot;LUNAR ORBITER LASER ALTIMETER&amp;quot;</Item>
  <Item name="MISSION_NAME"></Item>
  <Item name="NOTE"></Item>
  <Item name="PRODUCER_INSTITUTION_NAME">&amp;quot;GODDARD SPACE FLIGHT CENTER&amp;quot;</Item>
  <Item name="PRODUCT_CREATION_TIME">2017-09-15</Item>
  <Item name="START_TIME">2009-07-13T17:33:17</Item>
  <Item name="STOP_TIME">2016-11-29T05:48:19</Item>
  <Item name="OFFSET" sample="0" role="offset">1737400</Item>
  <Item name="SCALE" sample="0" role="scale">0.5</Item>
</GDALMetadata>

I need to extract the scale value (which in this case is 0.5). My first attempt was to use regex as follows:

float scale = 1;
std::regex rgx("*<Item name=\"SCALE\"*>(.*?)</Item>*");
std::smatch match;       
if (std::regex_search(metadata.begin(), metadata.end(), match, rgx)) {
    scale = static_cast<float>(std::atof(match.str().c_str()));
};

This did not work, and I'm unsure why. I'm very inexperienced with regex.

Obviously this looks like HTML but as I only need this one specific field I was thinking it should be simpler to simply try to extract that directly.

bobble bubble
Chris Gnam
  • 5
    That looks like XML. Use an XML parser, not ad hoc regular expressions. – Barmar Jun 01 '23 at 23:06
  • 1
    You have a bunch of extra `*` characters in the regexp. Why? The only one that belongs is in `.*?` – Barmar Jun 01 '23 at 23:07
  • Kind of reminds me of this question: [RegEx match open tags except XHTML self-contained tags](https://stackoverflow.com/q/1732348/4581301) – user4581301 Jun 01 '23 at 23:38
  • @Barmar yes you're right, I was being foolish. I'm using rapidxml and its working great. Thanks! – Chris Gnam Jun 02 '23 at 03:31

4 Answers

2

You can extract the substring between the two delimiters role="scale"> and </Item> using metadata.find() and metadata.substr(). First remove every </Item> that occurs before role="scale">, so that the first remaining </Item> is the one that closes the SCALE item; then take the text between the two delimiters.

#include <string>

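float scale = 1;
// Strip every </Item> that precedes role="scale">, so the next </Item>
// found is the one closing the SCALE item. Note: this modifies metadata.
while (metadata.find("</Item>") < metadata.find("role=\"scale\">")) {
  metadata.replace(metadata.find("</Item>"), 7, "");
}
if (metadata.find("role=\"scale\">") != std::string::npos && metadata.find("</Item>") != std::string::npos) {
  // 13 is the length of role="scale">
  scale = std::stof(metadata.substr(metadata.find("role=\"scale\">") + 13, metadata.find("</Item>") - metadata.find("role=\"scale\">") - 13));
}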
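
As a variation on the same idea, the removal loop can be avoided by telling find() where to start searching. The following is only a sketch: find_scale_text is a hypothetical helper name, and it assumes the value is always introduced by role="scale"> exactly as in the sample metadata.

```cpp
#include <string>

// Hypothetical helper (not part of the answer above): returns the text
// between role="scale"> and the next </Item>, or an empty string if
// either delimiter is missing. The input is left unmodified.
std::string find_scale_text(const std::string& metadata) {
    const std::string open = "role=\"scale\">";
    const std::string close = "</Item>";

    const std::string::size_type start = metadata.find(open);
    if (start == std::string::npos) return "";

    const std::string::size_type value_begin = start + open.size();
    // Begin the search for the closing tag at the value itself, so any
    // earlier </Item> is never considered.
    const std::string::size_type value_end = metadata.find(close, value_begin);
    if (value_end == std::string::npos) return "";

    return metadata.substr(value_begin, value_end - value_begin);
}
```

The returned string can then be passed to std::stof() after checking that it is not empty.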
EthanSteel
2

Your regex string literal should be this instead:

"<Item name=\"SCALE\"[^>]*>(.*?)<\\/Item>"

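In other words, drop the leading and trailing *, you don't need them; a * has to follow something to repeat, so a pattern that starts with one is invalid. Use [^>]* instead of a bare * to skip everything after "SCALE" up to, but not including, the closing >. The / in </Item> is escaped in the regex itself (hence \\/ in the string literal); std::regex does not require this escape, but it is harmless.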
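
To see why the original pattern failed before any matching took place, it can help to check whether std::regex accepts a pattern at all. This is a small sketch; pattern_compiles is a hypothetical helper name, and it relies on the standard library implementation rejecting a pattern that begins with * by throwing std::regex_error, which is the behavior I would expect from the common implementations.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if std::regex accepts the pattern,
// false if constructing it throws std::regex_error.
bool pattern_compiles(const std::string& pattern) {
    try {
        std::regex rgx(pattern);
        return true;
    } catch (const std::regex_error&) {
        return false;
    }
}
```

Calling it with the pattern from the question should return false, while the corrected pattern should return true, with or without the escaped slash.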

That being said, match.str() will return the entire substring that matched the regex, not the value in the (.*?) group as you are expecting. Thus, std::atof() will receive an invalid string and fail. To extract just the group value, use match[1].str() instead.
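
The difference between the whole match and the capture group can be made visible with a short sketch. ScaleMatch and match_scale are hypothetical names used only for illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical struct holding both views of the same match.
struct ScaleMatch {
    std::string whole;  // match[0]: everything the pattern matched
    std::string group;  // match[1]: only the text captured by (.*?)
};

// Returns empty strings in both fields if the pattern does not match.
ScaleMatch match_scale(const std::string& metadata) {
    static const std::regex rgx("<Item name=\"SCALE\"[^>]*>(.*?)</Item>");
    std::smatch match;
    ScaleMatch result;
    if (std::regex_search(metadata, match, rgx)) {
        result.whole = match.str();     // same as match[0].str()
        result.group = match[1].str();
    }
    return result;
}
```

For the SCALE line of the sample metadata, whole holds the entire <Item ...>0.5</Item> element, which is why std::atof() yields 0 for it, while group holds just 0.5.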

Lastly, consider using std::stof() instead of atof().

Try this:

float scale = 1;
std::regex rgx("<Item name=\"SCALE\"[^>]*>(.*?)<\\/Item>");
std::smatch match;       
if (std::regex_search(metadata.cbegin(), metadata.cend(), match, rgx)) {
    scale = std::stof(match[1].str());
}
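
One practical reason to prefer std::stof() is error reporting: std::atof() returns 0 when nothing can be converted, whereas std::stof() throws. The sketch below wraps that in a hypothetical helper, parse_scale_or, which falls back to a caller-supplied default.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: converts text to float, returning fallback when
// std::stof() reports that no conversion was possible or that the value
// is out of range for float.
float parse_scale_or(const std::string& text, float fallback) {
    try {
        return std::stof(text);
    } catch (const std::invalid_argument&) {
        return fallback;  // text does not start with a number
    } catch (const std::out_of_range&) {
        return fallback;  // number does not fit in a float
    }
}
```

With this, a failed extraction leaves the scale at its default of 1 instead of silently becoming 0.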

Remy Lebeau
2

I believe that relying on std::regex assumes too much about the future format of the metadata, given that it is XML. In XML, attributes can appear in a different order, and elements can be reordered or contain line breaks, without changing the meaning.
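
To illustrate the concern, here is a small sketch that applies the corrected pattern suggested elsewhere in this thread to equivalent XML written differently. regex_finds_scale is a hypothetical helper name.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if the regex from the other answer finds a
// SCALE item in the given XML text.
bool regex_finds_scale(const std::string& xml) {
    static const std::regex rgx("<Item name=\"SCALE\"[^>]*>(.*?)<\\/Item>");
    return std::regex_search(xml, rgx);
}
```

The pattern matches when name is the first attribute and follows <Item after a single space, but it stops matching as soon as the attributes are reordered or a line break appears inside the tag, even though an XML parser treats all three forms as the same element.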

I would lean towards using a library that can parse and handle XML, such as libxml2 or boost::property_tree.

The following example parses your metadata and prints the scale.

#include <string>
#include <sstream>   // std::istringstream
#include <iostream>
#include <boost/property_tree/ptree.hpp>
#include <boost/property_tree/xml_parser.hpp>

std::string metadata = R"(
<GDALMetadata>
  <Item name="BANDWIDTH"></Item>
  <Item name="CENTER_FILTER_WAVELENGTH"></Item>
  <Item name="DATA_SET_ID">&amp;quot;LRO-L-LOLA-4-GDR-V1.0&amp;quot;</Item>
  <Item name="FILTER_NAME"></Item>
  <Item name="INSTRUMENT_ID">&amp;quot;LOLA&amp;quot;</Item>
  <Item name="INSTRUMENT_NAME">&amp;quot;LUNAR ORBITER LASER ALTIMETER&amp;quot;</Item>
  <Item name="MISSION_NAME"></Item>
  <Item name="NOTE"></Item>
  <Item name="PRODUCER_INSTITUTION_NAME">&amp;quot;GODDARD SPACE FLIGHT CENTER&amp;quot;</Item>
  <Item name="PRODUCT_CREATION_TIME">2017-09-15</Item>
  <Item name="START_TIME">2009-07-13T17:33:17</Item>
  <Item name="STOP_TIME">2016-11-29T05:48:19</Item>
  <Item name="OFFSET" sample="0" role="offset">1737400</Item>
  <Item name="SCALE" sample="0" role="scale">0.5</Item>
</GDALMetadata>)";

using namespace boost::property_tree;

int main() {
    std::istringstream input( metadata );
    ptree tree;
    read_xml(input, tree);
    auto items = tree.get_child("GDALMetadata", ptree());
    for (const auto& f: items) {
        auto p = f.second;
        std::string name = p.get<std::string>("<xmlattr>.name", "");
        if ( name=="SCALE" ) { 
            std::cout << "Scale: "<< p.data() << std::endl;
        }
    }
}

This prints:

Scale: 0.5

Godbolt: https://godbolt.org/z/K94EW9YMf

Something Something
  • 1
    I ended up going with rapidxml instead but I agree that using an actual xml parser was the correct solution here. Thanks! – Chris Gnam Jun 02 '23 at 03:31
1

I would do this the old-fashioned way: read lines until you find the one containing "SCALE", then parse that line in more detail:

std::istringstream metadata_stream(metadata_string);
std::string metadata_text_line;
bool found = false;
// std::getline takes the stream first, then the string to fill.
while (!found && std::getline(metadata_stream, metadata_text_line))
{
    if (metadata_text_line.find("SCALE") != std::string::npos)
    {
        static const char    key_text[] = "\"scale\">";
        std::string::size_type position = metadata_text_line.find(key_text);
        if (position != std::string::npos)
        {
             // sizeof counts the terminating '\0', hence the -1.
             std::string::size_type value_start_position = (position + sizeof(key_text) - 1U);
             // Search for '<' starting at the value, not at the beginning of the line.
             std::string::size_type end_position = metadata_text_line.find('<', value_start_position);
             std::string scale_text = metadata_text_line.substr(value_start_position,
                  end_position - value_start_position);
             found = true;
             //...
        }
    }
}

This code presents a general idea or solution; there may be issues with it.

Thomas Matthews