I'm trying to convert a .dat file to a data frame using the code below, but it fails with the error "arguments imply differing number of rows". Any help would be highly appreciated. The data file is attached: https://drive.google.com/file/d/1y7IMpsnrCXSZXXFU4F6SUvDUFGeWPAnt/view?usp=sharing

Code So Far:

library(XML)
require(plyr)
library(stringr)
dat <- readLines("NTISDATD-Events-2020-05-10-Day8.dat")

datDF <- data.frame(
    tags = unlist(str_extract_all(dat, "<([^>]*)>(?=[^>]*</\\1>)")),
    values = unlist(str_extract_all(dat, "(?<=<([^>]{1,100})>).*(?=</\\1>)"))
) 
datDF

Desired Output:

                       tags                        values
 1            <d2lm:country>                            gb
 2 <d2lm:nationalIdentifier>                          NTIS
 3           <d2lm:feedType>                    Event Data
 4    <d2lm:publicationTime> 2020-05-10T00:00:44.778+01:00
 5            <d2lm:country>                            gb
 6 <d2lm:nationalIdentifier>                          NTIS
 7     <d2lm:areaOfInterest>                      national

Many Thanks

aliahmed
  • Hi aliah. There are several problems here. Effectively, `dat` is a vector of character strings, each of which is a complete XML document. First, you are trying to extract the text using regex instead of an XML parser, even though you have loaded and attached the XML package. Second, even if you use XML to read the data, each XML document contains different fields, so you won't be able to create a data frame from them unless there are certain fields you wish to pull out of each document to coerce into a data frame. This is certainly possible, but no simple function will do it for you. – Allan Cameron Jun 01 '20 at 16:41
  • Thanks Allan, any help with this would be appreciated. Is there any function that can turn the data into a data frame? Thanks – aliahmed Jun 01 '20 at 16:45
  • aliah, you simply **cannot turn this into a data frame with a simple function**. The data **will not fit into a data frame.** Unless you can edit your question to specify the _exact_ columns you want in your data frame, it cannot be done in a sensible way. For example, try `as.data.frame(dat)` and you'll see that just telling R to turn something into a data frame doesn't give the result you want. – Allan Cameron Jun 01 '20 at 16:54
  • To follow up on Allan's comment. I tried parsing the entire file into a data.frame with the `xml2` package, but I gave up after 10 minutes. It's too slow to do the whole thing. – Ian Campbell Jun 01 '20 at 18:22
  • Thanks Ian Campbell. As a beginner in R, I tried some functions but did not get anywhere. Could I get the script? I'm totally fine with it taking longer to run. I've also tried reducing the data as a check (https://drive.google.com/file/d/1-vJ568Y_IctWisyd44mD-k2pQ73IL5rC/view?usp=sharing), but I get the same error. If the script works with that file, that would be good. Many thanks – aliahmed Jun 03 '20 at 17:18
  • Thanks for the help @AllanCameron. I've tried reducing the data to get rid of the error, but the error is the same: "arguments imply differing number of rows: 1074, 12". Any help would be much appreciated. Thanks – aliahmed Jun 03 '20 at 17:20
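
For readers following the comments: the error means the two columns passed to `data.frame()` have different lengths (1074 and 12 in the reduced file), which happens when the two regexes do not match the same number of items. A rough sketch of the parser-based route suggested above, using the `xml2` package, is given below. It collects every leaf element of each document into a long tags/values table. The toy input, its namespace URI, the wrapper element `d2lm:doc` and the helper name `leaves_to_df` are all made up for illustration, and this has not been run against the attached file, so it may well be slow or need adjusting there.

```r
library(xml2)

# Toy stand-in for `dat`: each element is one complete XML document.
# The namespace URI and the <d2lm:doc> wrapper are placeholders,
# not taken from the real feed.
dat <- c(
  paste0('<d2lm:doc xmlns:d2lm="http://example.com/d2lm">',
         '<d2lm:country>gb</d2lm:country>',
         '<d2lm:nationalIdentifier>NTIS</d2lm:nationalIdentifier>',
         '</d2lm:doc>'),
  paste0('<d2lm:doc xmlns:d2lm="http://example.com/d2lm">',
         '<d2lm:feedType>Event Data</d2lm:feedType>',
         '</d2lm:doc>')
)

# Hypothetical helper: parse one document and return one row per leaf element.
leaves_to_df <- function(txt, doc_id) {
  doc <- read_xml(txt)
  # Leaf elements are those with no child elements.
  leaves <- xml_find_all(doc, "//*[not(*)]")
  data.frame(
    doc = rep(doc_id, length(leaves)),          # which input line the row came from
    tags = xml_name(leaves, ns = xml_ns(doc)),  # element name, qualified with its prefix
    values = xml_text(leaves),
    stringsAsFactors = FALSE
  )
}

# Parse each document separately, then stack the per-document tables.
datDF <- do.call(
  rbind,
  lapply(seq_along(dat), function(i) leaves_to_df(dat[[i]], i))
)
datDF
```

Because tags and values are read from the same set of nodes, the two columns always have the same length, which sidesteps the "differing number of rows" error. The result is a long table rather than one column per field, since the fields differ between documents.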

0 Answers