
So... I have been using gsub in R for close to a decade now, and I think I just hit on the weirdest issue I have ever seen.

I am trying to parse the method from a Bruker Mass Spectrometer .d file (actually, it's a complex folder with many files, some human readable). The method file (an example can be downloaded from https://wetransfer.com/downloads/c29a29ee8c074d1e8002c3c93cace61320230818104101/b73f0c ) is a utf-8 encoded xml file. Where it gets weird is that this xml has only a few fields, and the actual method is contained as a single very long character string within one of them (field "ModuleMethodData"). That string is essentially a second xml with a different encoding wrapped inside the first: in it, the following characters are escaped as follows (a rough sketch of the structure is given after the list):

  • `&amp;` = &
  • `&gt;` = >
  • `&lt;` = <
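
For illustration, the file presumably looks something like this (a purely hypothetical sketch: apart from DeviceName and ModuleMethodData, every tag name, attribute and value below is invented):

<?xml version="1.0" encoding="utf-8"?>
<method>
  <DeviceName>LC Pump</DeviceName>
  <ModuleMethodData moduleId="1">&lt;LCMethod&gt;&lt;FlowRate&gt;0.300&lt;/FlowRate&gt;&lt;/LCMethod&gt;</ModuleMethodData>
</method>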

Since a) I know very little about xml, b) the xml's structure is going to follow a very specific, predictable pattern, and c) when working with similar xmls in the past, I have always needed very limited and precise information which I could fish out quickly with a regex, my first attempt was to pull out the nested xml and then parse it using grep and gsub. However, in doing this I ran into an unexpected difficulty:

> fl <- ... # path to hystar.method file
> meth <- readLines(fl)
> inst <- gsub(" *</?DeviceName> *", "", grep("<DeviceName>", meth, value = TRUE)) # Get device name
> lc <- gsub(" *</?ModuleMethodData[^>]*> *", "", grep("<ModuleMethodData[^>]+>", meth, value = TRUE)) # Get nested xml
> lc <- gsub("\\&amp;", "&", gsub("\\&gt;", ">", gsub("\\&lt;", "<", lc)))
> lc <- gsub("\\&amp;", "&", gsub("\\&gt;", ">", gsub("\\&lt;", "<", lc))) # For some weird reason I have to do this a second time!!!

My issue is that when trying to re-introduce <, > and & using gsub(...), I have to do two rounds because only some of the target character groups are replaced the first time around. I have no idea why the regexes initially match only some of the instances, yet catch the remaining ones after the first round of gsub (possibly some silent cleanup of the encoding happening in the background?). The regexes do not overlap or clash, so normally I would expect each one to replace all of its matches in a single round. Loading the file in Notepad++ does not reveal any hidden characters.
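
A quick way to see which entities remain before and after each round is to count the matches (a minimal diagnostic sketch, assuming lc is the character vector extracted above):

> ents <- c("&lt;", "&gt;", "&amp;")
> sapply(ents, function(e) sum(lengths(regmatches(lc, gregexpr(e, lc, fixed = TRUE))))) # entity counts before round 1
> lc1 <- gsub("\\&amp;", "&", gsub("\\&gt;", ">", gsub("\\&lt;", "<", lc)))
> sapply(ents, function(e) sum(lengths(regmatches(lc1, gregexpr(e, lc1, fixed = TRUE))))) # nonzero counts here mean round 1 missed some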

Assuming others can reproduce this weird behavior, does anyone know how to deal with this and what could be the cause?

user3005996
  • `&amp;` and such are "HTML entities", and can be decoded programmatically using (e.g.) answers from https://stackoverflow.com/q/5060076/3358272. I can't get to the sample data, it appears to be behind a paywall. Please provide sample data (or update the link) we can use. – r2evans Aug 18 '23 at 12:25
  • I just verified that both answers work, and with a vector of 363 such entities (those listed at https://www.freeformatter.com/html-entities.html), `unescape_html2` worked over 150x faster. (When working on a single string with all of the entities, they were the same speed, so the improvement in `unescape_html2` is when dealing with vectors of strings.) – r2evans Aug 18 '23 at 12:32
  • Thank you very much! I should add that I was not so much interested in a fix - the obvious answer was to actually learn to use the XML package, which works well - but in the question of why a character string which does not look weirdly encoded (or if it is, I cannot see it in Notepad++) still behaves such that the same character string in different contexts will be matched or not by a simple regex. – user3005996 Aug 18 '23 at 20:01
  • The most direct answer to my original issue with correctly reading the nested xml was to use XML::xmlToList. Surprise! The nested xml was properly encoded (and again parsable as a list) once accessed with this method. Sometimes bad habits learned as a beginner are hard to outgrow :( – user3005996 Aug 18 '23 at 20:05
  • You should not use [regex on X|HTML](https://stackoverflow.com/a/1732454/1422451). Use proper DOM libraries like R's XML to parse content. Please post sample of XML in body of question and not as link that can go dead for future readers. – Parfait Aug 18 '23 at 21:00
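
For reference, a minimal sketch of the two approaches named in the comments above; XML::xmlToList and the xml2 entity-decoding trick come from the comments and the linked question, while fl and the example string are placeholders:

# Approach 1: parse the outer xml properly with the XML package;
# text nodes come back with &lt;, &gt; and &amp; already decoded
m <- XML::xmlToList(fl) # fl = path to the hystar.method file

# Approach 2: decode HTML entities in an arbitrary string via xml2,
# following the answers at https://stackoverflow.com/q/5060076/3358272
unescape_html <- function(str) {
  xml2::xml_text(xml2::read_html(paste0("<x>", str, "</x>")))
}
unescape_html("&lt;a&gt;1 &amp; 2&lt;/a&gt;") # returns "<a>1 & 2</a>"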

0 Answers