
Please consider the following line from an XML file (generated from a third-party source):

<record ObTime="2017-05-10T23:30" data_value="Ocean Park "The Sea WX"  WA US" />

As you can see, the data_value attribute has a quoted string inside its value, which causes XML validators to giggle and explode.

Any given XML file could have thousands of lines. Is there a way to apply REGEX to a whole file? And, what would the REGEX be to replace quotes with something more benign?

Mi-Creativity
fslap

2 Answers


There may be other, better solutions, but this is how I made it work:

  • Use preg_match_all with a regex that captures every data_value value; the full matches are stored in the array $matches[0].
  • The regex (?<=data_value=").*(?=" \/>) captures everything between data_value=" and " />. The positive lookbehind and lookahead anchor the match without consuming those delimiters, so each match is exactly the value of one data_value attribute. Because . does not match a newline by default, each match stays on its own line.
  • Loop through the items in $matches[0] and do the following:
    1. Replace every double quote " in the match with % (or any other string, even an empty one, that causes no further problems), and store the result in a temporary variable $str.
    2. Then replace the original match in the whole data string with the modified version, $str.

PHP code:
Remember that because the data consists of XML tags, you need to use "view source" in the browser to see the output; alternatively, you can use var_dump instead of echo.

<?php
$data = '<record ObTime="2017-05-10T23:30" data_value="Ocean Park "The Sea WX"  WA US" />
<record ObTime="2017-11-10T23:30" data_value="Some Other "Demo Text"  In Here" />';

// Capture the raw value of every data_value attribute; full matches land in $matches[0].
preg_match_all('#(?<=data_value=").*(?=" \/>)#i', $data, $matches);

foreach ($matches[0] as $match) {
    // Swap the inner double quotes for a harmless placeholder.
    $str = str_replace('"', '%', $match);
    // Put the cleaned value back into the full data string.
    $data = str_replace($match, $str, $data);
}
echo $data;
?>

Output:

<record ObTime="2017-05-10T23:30" data_value="Ocean Park %The Sea WX%  WA US" />
<record ObTime="2017-11-10T23:30" data_value="Some Other %Demo Text%  In Here" />
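For a file with thousands of lines, the same idea can probably be done in one pass with preg_replace_callback instead of a match-then-loop. The sketch below is only an illustration: the function name escapeDataValueQuotes and the file name records.xml are made up, it assumes one record per line with data_value as the last attribute, and it swaps the % placeholder for the XML entity &quot; so the original text survives parsing.

```php
<?php
// Hypothetical helper (not from the answer above): escapes stray double
// quotes inside data_value attributes in a single pass.
// Assumes one <record ... /> per line and data_value as the last attribute.
function escapeDataValueQuotes(string $xml): string
{
    $result = preg_replace_callback(
        '#(?<=data_value=").*(?=" />)#',
        function (array $m): string {
            // $m[0] is the raw attribute value; escape its inner quotes.
            return str_replace('"', '&quot;', $m[0]);
        },
        $xml
    );

    // preg_replace_callback returns null on a PCRE error; fall back to the input.
    return $result ?? $xml;
}

$data = '<record ObTime="2017-05-10T23:30" data_value="Ocean Park "The Sea WX"  WA US" />' . "\n"
      . '<record ObTime="2017-11-10T23:30" data_value="Some Other "Demo Text"  In Here" />';

echo escapeDataValueQuotes($data);

// For a file on disk (the path is a placeholder):
// $fixed = escapeDataValueQuotes(file_get_contents('records.xml'));
```

Reading the whole file into a string and running one replacement over it is what "applying a regex to a whole file" amounts to in PHP; the callback touches each attribute value once instead of rescanning the full string per match.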

Mi-Creativity
  • Thank you very much. I'll look at applying this to each file before parsing. – fslap May 11 '17 at 19:54
  • You're welcome and I'm glad it helped.. enjoy coding! – Mi-Creativity May 11 '17 at 20:03
  • for the new xml data sample, use this regex **`(?<=data_value=")[^=]+(?=" (?:\w+=)?)`** [**Regex Demo**](https://regex101.com/r/uhnJH6/3) instead of the one in my answer, it can capture both `data_value` attributes – Mi-Creativity May 11 '17 at 21:29
  • And if you only want to capture the second `data_value` , use **`(?<=data_value=")[^=]+(?=" \/>)`** instead, [**regex Demo**](https://regex101.com/r/uhnJH6/4) – Mi-Creativity May 11 '17 at 21:47
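For the `[^=]+` regex suggested in the comments above, a rough sketch of how it might be applied is shown below. The sample records are invented for illustration (the regex101 data is not reproduced here): one has data_value before another attribute, the other has it last. Note the assumption baked into `[^=]+`: the value itself must not contain an `=` character.

```php
<?php
// Invented sample data: data_value appears first in one record, last in the other.
$data = '<record data_value="Ocean Park "The Sea WX"  WA US" ObTime="2017-05-10T23:30" />' . "\n"
      . '<record ObTime="2017-11-10T23:30" data_value="Some Other "Demo Text"  In Here" />';

// Regex taken from the comment above; [^=]+ stops at the next "=",
// then backtracks to the closing quote that is followed by a space.
$pattern = '#(?<=data_value=")[^=]+(?=" (?:\w+=)?)#';

$fixed = preg_replace_callback(
    $pattern,
    function (array $m): string {
        // Escape only the quotes inside the captured attribute value.
        return str_replace('"', '&quot;', $m[0]);
    },
    $data
);

echo $fixed;
```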

Using the regex below, you can match just those inner double quotes and then modify them as needed:

(?:="|"\s+(?:\w+="|\/>))(*SKIP)(?!)|"

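The left side of the alternation matches the quotes that legitimately delimit attribute values. By following it with (*SKIP)(?!), you force the engine to discard that match and resume scanning after it, so only the right side, a bare ", ever produces a match.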


PHP code (removing quotes):

echo preg_replace('~(?:="|"\s+(?:\w+="|\/>))(*SKIP)(?!)|"~', '', $xml);
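If the quotes should be kept rather than removed, the same pattern can presumably be used with the XML entity &quot; as the replacement. This is a small sketch on the sample line from the question; the variable names are arbitrary.

```php
<?php
$xml = '<record ObTime="2017-05-10T23:30" data_value="Ocean Park "The Sea WX"  WA US" />';

// Left branch: quotes that open or close an attribute; (*SKIP)(?!) throws
// these away and restarts matching after them.
// Right branch: any quote that is left over, i.e. one inside a value.
$pattern = '~(?:="|"\s+(?:\w+="|\/>))(*SKIP)(?!)|"~';

// Replace the inner quotes with the XML entity instead of deleting them.
$clean = preg_replace($pattern, '&quot;', $xml);

echo $clean;
```

Escaping rather than deleting keeps the original text recoverable: an XML parser turns &quot; back into a double quote when it reads the attribute.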
Community
revo
  • That's a nice one, never heard of this `(*SKIP)(?!)` before, Up Voted! – Mi-Creativity May 11 '17 at 05:20
  • Wow... SKIP is cool. I'm brand new to REGEX so everything is like magic to me... but this is the first I've seen of SKIP. Thanks for your answer! – fslap May 11 '17 at 19:55
  • @revo, It looks like the quote in the opening tag is caught as well. This demo has more verbose data to look at. https://regex101.com/r/toFV9f/4 – fslap May 11 '17 at 20:30
  • You may want to change `\/>` part in regex to `[\/?]>`. – revo May 11 '17 at 20:56
  • This is why you should always provide an [**MCVE**](https://stackoverflow.com/help/mcve): the data sample in the question ***differs*** from the data you have put in the regex101 example, and that changes the rules for both revo's answer and mine. – Mi-Creativity May 11 '17 at 20:58
  • It doesn't work if there is no ending space like this `` – TomSawyer Aug 26 '18 at 06:34
  • @TomSawyer Change `\s+` to `\s*`. – revo Aug 26 '18 at 06:37
  • Thank you. I want to convert it to Go, but it's a pain since `(*SKIP)` isn't available :( – TomSawyer Aug 26 '18 at 07:18
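For engines that lack backtracking verbs such as `(*SKIP)`, one commonly used workaround is to capture the "keep as-is" branch in a group and decide in a callback what to do with each match. The sketch below shows that idea in PHP only; it is not revo's code, the function name escapeInnerQuotes is made up, and it folds in the `\s*` and `[\/?]>` tweaks from the comments above. The pattern uses nothing beyond alternation and one capture group, so it should be easier to port, though the callback part would need whatever the target language offers for inspecting submatches.

```php
<?php
// Hypothetical helper: same goal as the (*SKIP)(?!) pattern, without the verb.
function escapeInnerQuotes(string $xml): string
{
    // Group 1: quotes that delimit attributes (="  or  " followed by the
    // next attribute, "/>" or "?>"). The bare " alternative is everything else.
    $pattern = '~(="|"\s*(?:\w+="|[/?]>))|"~';

    $result = preg_replace_callback(
        $pattern,
        function (array $m): string {
            // If group 1 took part in the match, return the text unchanged;
            // otherwise the match is a stray inner quote, so escape it.
            return (($m[1] ?? '') !== '') ? $m[0] : '&quot;';
        },
        $xml
    );

    // preg_replace_callback returns null on a PCRE error; fall back to the input.
    return $result ?? $xml;
}

$xml = '<?xml version="1.0"?>' . "\n"
     . '<record ObTime="2017-05-10T23:30" data_value="Ocean Park "The Sea WX"  WA US"/>';

echo escapeInnerQuotes($xml);
```

Like the original pattern, this still misfires on a value that happens to contain a quote followed by something that looks like the start of an attribute, so it is best treated as a cleanup step for known-bad input rather than a general XML repair tool.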