It all depends on what you want to do with the data, i.e., how you want to process it.
For example, let's assume your interest is in parsing all XML tags as separate strings, then you can extract the tags using regular expression and the function str_extract
:
library(stringr)
str_extract_all(dat, "<(d2lm:[^>]*)>.*</\\1>|<d2lm:[^>]*>")
This regex works even if the XML element names are variable:
str_extract_all(dat, "<([^>]*)>.*</\\1>|<[^>]*>")
The result is a list:
[[1]]
[1] "<d2lm:d2LogicalModel extensionVersion=\"2.0\" extensionName=\"NTIS Published Services\" \nmodelBaseVersion=\"2\" xmlns:ns4=\"http://www.thalesgroup.com/NTIS/Datex2Extensions/1.0Beta1\" \nxmlns:ns3=\"http://datex2.eu/schema/2/2_0/inrix\" xmlns:d2lm=\"http://datex2.eu/schema/2/2_0\">"
[2] "<d2lm:exchange>"
[3] "<d2lm:supplierIdentification>"
[4] "<d2lm:country>gb</d2lm:country>"
[5] "<d2lm:nationalIdentifier>NTIS</d2lm:nationalIdentifier>"
[6] "<d2lm:payloadPublication xsi:type=\"d2lm:SituationPublication\" lang=\"en\" \nxmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\">"
[7] "<d2lm:feedType>Event Data</d2lm:feedType>"
[8] "<d2lm:publicationTime>2020-05-10T00:00:44.778+01:00</d2lm:publicationTime>"
[9] "<d2lm:publicationCreator>"
[10] "<d2lm:country>gb</d2lm:country>"
[11] "<d2lm:nationalIdentifier>NTIS</d2lm:nationalIdentifier>"
[12] "<d2lm:situation version=\"\" id=\"2922904\">"
[13] "<d2lm:headerInformation>"
[14] "<d2lm:areaOfInterest>national</d2lm:areaOfInterest>"
To turn the list into a dataframe:
datDF <- data.frame(tags = unlist(str_extract_all(dat, "<(d2lm:[^>]*)>.*</\\1>|<d2lm:[^>]*>")))
EDIT:
If you want to have a dataframe with the text values between XML start tag and XML end tag, you can extract these tags and values along these lines:
datDF <- data.frame(
tags = unlist(str_extract_all(dat, "<([^>]*)>(?=[^>]*</\\1>)")),
values = unlist(str_extract_all(dat, "(?<=<([^>]{1,100})>).*(?=</\\1>)"))
)
datDF
tags values
1 <d2lm:country> gb
2 <d2lm:nationalIdentifier> NTIS
3 <d2lm:feedType> Event Data
4 <d2lm:publicationTime> 2020-05-10T00:00:44.778+01:00
5 <d2lm:country> gb
6 <d2lm:nationalIdentifier> NTIS
7 <d2lm:areaOfInterest> national
Is this--roughly--what you had in mind?
DATA:
dat <- '<d2lm:d2LogicalModel extensionVersion=\"2.0\" extensionName=\"NTIS Published Services\"
modelBaseVersion=\"2\" xmlns:ns4=\"http://www.thalesgroup.com/NTIS/Datex2Extensions/1.0Beta1\"
xmlns:ns3=\"http://datex2.eu/schema/2/2_0/inrix\" xmlns:d2lm=\"http://datex2.eu/schema/2/2_0\">
<d2lm:exchange><d2lm:supplierIdentification><d2lm:country>gb</d2lm:country>
<d2lm:nationalIdentifier>NTIS</d2lm:nationalIdentifier></d2lm:supplierIdentification></d2lm:exchange>
<d2lm:payloadPublication xsi:type=\"d2lm:SituationPublication\" lang=\"en\"
xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\"><d2lm:feedType>Event Data</d2lm:feedType>
<d2lm:publicationTime>2020-05-10T00:00:44.778+01:00</d2lm:publicationTime><d2lm:publicationCreator>
<d2lm:country>gb</d2lm:country><d2lm:nationalIdentifier>NTIS</d2lm:nationalIdentifier>
</d2lm:publicationCreator><d2lm:situation version=\"\" id=\"2922904\"><d2lm:headerInformation>
<d2lm:areaOfInterest>national</d2lm:areaOfInterest>'