So... I have been using gsub in R for close to a decade now, and I think I just hit on the weirdest issue I have ever seen.
I am trying to parse the method from a Bruker Mass Spectrometer .d file (actually a complex folder with many files, some of them human readable). The method file (an example can be downloaded from https://wetransfer.com/downloads/c29a29ee8c074d1e8002c3c93cace61320230818104101/b73f0c ) is a UTF-8 encoded xml file. Where it gets weird is that this xml has only a few fields, and the actual method is contained as a single very long character string within one of them (field "ModuleMethodData"). That string is essentially a second xml, escaped and wrapped inside the first: in it, the following characters are written as follows:
&amp; = &
&gt; = >
&lt; = <
Since a) I know very little about xml, b) the xml's structure is going to follow a very specific, predictable pattern, and c) when working with similar xmls in the past, I have always needed very limited and precise information which I could fish out quickly with a regex, my first attempt was to fish out the nested xml then parse it using grep and gsub. However, doing this I am running into an unexpected difficulty:
> fl <- ... # path to hystar.method file
> meth <- readLines(fl)
> inst <- gsub(" *</?DeviceName> *", "", grep("<DeviceName>", meth, value = TRUE))
> lc <- gsub(" *</?ModuleMethodData[^>]+> *", "", grep("<ModuleMethodData[^>]+>", meth, value = TRUE)) # Get nested xml
> lc <- gsub("\\&amp;", "&", gsub("\\&gt;", ">", gsub("\\&lt;", "<", lc)))
> lc <- gsub("\\&amp;", "&", gsub("\\&gt;", ">", gsub("\\&lt;", "<", lc))) # For some weird reason I have to do this a second time!!!
My issue is that when trying to re-introduce <, > and & using gsub(...), I have to do 2 rounds, because only some of the target character groups are replaced the first time over. I have no idea why the regexes initially match only some instances, nor why they do catch the remaining instances after the first round of gsub (possibly some silent cleaning up of encoding happening in the background?). The regexes do not overlap or clash, so normally I would expect each of them to replace all of its matches in a single round. Loading the file in Notepad++ does not reveal any hidden characters.
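To make the behavior easier to reproduce without the file, here is a minimal sketch. The string `x` and the helper `unescape()` are made up for illustration and are not taken from the real method file. It assumes, without my having verified it against the file, that some of the nested xml is escaped twice (for example `&amp;lt;` rather than `&lt;`); under that assumption one round of replacements would only peel off one layer of escaping, which would look exactly like what I am seeing.

```r
# Hypothetical stand-in for the nested xml (NOT from the real file):
# the first tag is escaped once, the second tag is escaped twice.
x <- "&lt;a&gt; &amp;lt;b&amp;gt;"

# Same replacement order as in my code: &lt; first, then &gt;, then &amp;.
# fixed = TRUE treats the patterns as literal strings, not regexes.
unescape <- function(s) {
  s <- gsub("&lt;", "<", s, fixed = TRUE)
  s <- gsub("&gt;", ">", s, fixed = TRUE)
  gsub("&amp;", "&", s, fixed = TRUE)
}

once  <- unescape(x)     # "<a> &lt;b&gt;"  (second tag still escaped)
twice <- unescape(once)  # "<a> <b>"
```

If that assumption holds for the real file, something like `grepl("&amp;(lt|gt|amp);", lc)` on the freshly extracted `lc` should return TRUE.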
Assuming others can reproduce this weird behavior, does anyone know how to deal with this and what could be the cause?