I'm reading a .dat file and trying to convert it to a data frame so that I can save it in a readable format. However, I have no clue how to convert the .dat data. I'm a beginner in R. Any help will be highly appreciated.

Code so Far:

data <- readLines("Day8.dat")

print(data)

Output So Far:

[1] "<d2lm:d2LogicalModel extensionVersion=\"2.0\" extensionName=\"NTIS Published Services\" 
modelBaseVersion=\"2\" xmlns:ns4=\"http://www.thalesgroup.com/NTIS/Datex2Extensions/1.0Beta1\" 
xmlns:ns3=\"http://datex2.eu/schema/2/2_0/inrix\" xmlns:d2lm=\"http://datex2.eu/schema/2/2_0\"> 
<d2lm:exchange><d2lm:supplierIdentification><d2lm:country>gb</d2lm:country> 
<d2lm:nationalIdentifier>NTIS</d2lm:nationalIdentifier></d2lm:supplierIdentification></d2lm:exchange> 
<d2lm:payloadPublication xsi:type=\"d2lm:SituationPublication\" lang=\"en\" 
xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\"><d2lm:feedType>Event Data</d2lm:feedType> 
<d2lm:publicationTime>2020-05-10T00:00:44.778+01:00</d2lm:publicationTime><d2lm:publicationCreator> 
<d2lm:country>gb</d2lm:country><d2lm:nationalIdentifier>NTIS</d2lm:nationalIdentifier> 
</d2lm:publicationCreator><d2lm:situation version=\"\" id=\"2922904\"><d2lm:headerInformation> 
<d2lm:areaOfInterest>national</d2lm:areaOfInterest>
....

Thanks

target021
  • Does the [following](https://stackoverflow.com/questions/17198658/how-to-parse-xml-to-r-data-frame) answer your question? – tpetzoldt May 30 '20 at 10:54
  • Yes, the output needs to be just like that; however, the tags need to be the column names. Thanks – target021 May 30 '20 at 11:58
  • Can you point us to a better example? The above example is missing some parts, which makes it unusable. A typical approach would include tools from packages `xml2` or `XML`. Regex should be avoided. –  May 30 '20 at 12:59

1 Answer

It all depends on what you want to do with the data, i.e., how you want to process it. For example, if you want to extract all XML tags as separate strings, you can use a regular expression with the function str_extract_all from the stringr package (dat is defined under DATA at the end of this answer):

library(stringr)
str_extract_all(dat, "<(d2lm:[^>]*)>.*</\\1>|<d2lm:[^>]*>")

This more general regex also works when the XML element names vary, i.e., when they do not all start with d2lm:

str_extract_all(dat, "<([^>]*)>.*</\\1>|<[^>]*>")

The result is a list:

[[1]]
 [1] "<d2lm:d2LogicalModel extensionVersion=\"2.0\" extensionName=\"NTIS Published Services\" \nmodelBaseVersion=\"2\" xmlns:ns4=\"http://www.thalesgroup.com/NTIS/Datex2Extensions/1.0Beta1\" \nxmlns:ns3=\"http://datex2.eu/schema/2/2_0/inrix\" xmlns:d2lm=\"http://datex2.eu/schema/2/2_0\">"
 [2] "<d2lm:exchange>"                                                                                                                                                                                                                                                                           
 [3] "<d2lm:supplierIdentification>"                                                                                                                                                                                                                                                             
 [4] "<d2lm:country>gb</d2lm:country>"                                                                                                                                                                                                                                                           
 [5] "<d2lm:nationalIdentifier>NTIS</d2lm:nationalIdentifier>"                                                                                                                                                                                                                                   
 [6] "<d2lm:payloadPublication xsi:type=\"d2lm:SituationPublication\" lang=\"en\" \nxmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\">"                                                                                                                                                    
 [7] "<d2lm:feedType>Event Data</d2lm:feedType>"                                                                                                                                                                                                                                                 
 [8] "<d2lm:publicationTime>2020-05-10T00:00:44.778+01:00</d2lm:publicationTime>"                                                                                                                                                                                                                
 [9] "<d2lm:publicationCreator>"                                                                                                                                                                                                                                                                 
[10] "<d2lm:country>gb</d2lm:country>"                                                                                                                                                                                                                                                           
[11] "<d2lm:nationalIdentifier>NTIS</d2lm:nationalIdentifier>"                                                                                                                                                                                                                                   
[12] "<d2lm:situation version=\"\" id=\"2922904\">"                                                                                                                                                                                                                                              
[13] "<d2lm:headerInformation>"                                                                                                                                                                                                                                                                  
[14] "<d2lm:areaOfInterest>national</d2lm:areaOfInterest>"   

To turn the list into a dataframe:

datDF <- data.frame(tags = unlist(str_extract_all(dat, "<(d2lm:[^>]*)>.*</\\1>|<d2lm:[^>]*>")))

EDIT:

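If you want a dataframe with the text values between an XML start tag and its end tag, you can extract these tags and values along these lines. Note that both extractions must return the same number of matches; otherwise data.frame() stops with an "arguments imply differing number of rows" error: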

datDF <- data.frame(
  tags = unlist(str_extract_all(dat, "<([^>]*)>(?=[^>]*</\\1>)")),
  values = unlist(str_extract_all(dat, "(?<=<([^>]{1,100})>).*(?=</\\1>)"))
) 
datDF
                       tags                        values
1            <d2lm:country>                            gb
2 <d2lm:nationalIdentifier>                          NTIS
3           <d2lm:feedType>                    Event Data
4    <d2lm:publicationTime> 2020-05-10T00:00:44.778+01:00
5            <d2lm:country>                            gb
6 <d2lm:nationalIdentifier>                          NTIS
7     <d2lm:areaOfInterest>                      national

Is this--roughly--what you had in mind?
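If the goal is a table whose column names are the tag names, one possible variation is to capture the tag name and its value in a single pass with str_match_all, so the two columns cannot end up with different lengths. The following is only a minimal sketch under some assumptions: dat_small is a hypothetical, shortened stand-in for dat, the names leafDF and wideDF are made up for illustration, and the pattern only picks up elements without attributes that contain plain text. For complete, well-formed files, a real XML parser is likely the more robust choice.

```r
library(stringr)

# Hypothetical, shortened stand-in for `dat`
dat_small <- "<d2lm:country>gb</d2lm:country><d2lm:feedType>Event Data</d2lm:feedType>"

# One pass over the text: group 1 = element name, group 2 = text between
# the start tag and the matching end tag (elements with attributes are skipped)
m <- str_match_all(dat_small, "<([^<> /]+)>([^<>]*)</\\1>")[[1]]

# Long format: one row per element, namespace prefix dropped from the name
leafDF <- data.frame(
  tag = sub("^d2lm:", "", m[, 2]),
  value = m[, 3],
  stringsAsFactors = FALSE
)

# Wide format: one row, one column per tag; make.unique() keeps repeated
# tag names apart (a second "country" would become "country.1")
wideDF <- as.data.frame(
  as.list(setNames(leafDF$value, make.unique(leafDF$tag))),
  stringsAsFactors = FALSE
)
wideDF
```

Because tag and value come from the same match matrix, they always have the same number of rows, which sidesteps the row-count mismatch that two separate str_extract_all calls can produce.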

DATA:

dat <- '<d2lm:d2LogicalModel extensionVersion=\"2.0\" extensionName=\"NTIS Published Services\" 
modelBaseVersion=\"2\" xmlns:ns4=\"http://www.thalesgroup.com/NTIS/Datex2Extensions/1.0Beta1\" 
xmlns:ns3=\"http://datex2.eu/schema/2/2_0/inrix\" xmlns:d2lm=\"http://datex2.eu/schema/2/2_0\"> 
<d2lm:exchange><d2lm:supplierIdentification><d2lm:country>gb</d2lm:country> 
<d2lm:nationalIdentifier>NTIS</d2lm:nationalIdentifier></d2lm:supplierIdentification></d2lm:exchange> 
<d2lm:payloadPublication xsi:type=\"d2lm:SituationPublication\" lang=\"en\" 
xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\"><d2lm:feedType>Event Data</d2lm:feedType> 
<d2lm:publicationTime>2020-05-10T00:00:44.778+01:00</d2lm:publicationTime><d2lm:publicationCreator> 
<d2lm:country>gb</d2lm:country><d2lm:nationalIdentifier>NTIS</d2lm:nationalIdentifier> 
</d2lm:publicationCreator><d2lm:situation version=\"\" id=\"2922904\"><d2lm:headerInformation> 
<d2lm:areaOfInterest>national</d2lm:areaOfInterest>'
Chris Ruehlemann
  • Hi, thanks. The output needs to be in the form of a table, e.g., the column name is "area of interest" and the value is "national", and so on – target021 May 30 '20 at 11:55
  • Hi, thanks for the help. Can you help with applying this to the file at https://drive.google.com/file/d/1y7IMpsnrCXSZXXFU4F6SUvDUFGeWPAnt/view?usp=sharing? It shows the error "arguments imply differing number of rows: ,". Thanks – target021 Jun 01 '20 at 10:42