
I want to read my XML into a data frame in R. My initial data file is 14 GB, so my first attempt to read the file didn't work out:

library(XML)
f  <- xmlParse("Final.xml")
r  <- xmlRoot(f)
df <- xmlToDataFrame(f)

The problem is that it always runs out of memory.

I've also seen the question:

How to read large (~20 GB) xml file in R?

I tried to use the approach from Martin Morgan, which I didn't fully understand but tried to apply to my dataset:

library(XML)
branchFunction <- function() {
  store <- new.env()
  func <- function(x, ...) {
    ns <- getNodeSet(x, path = "//Sentiment")
    value <- xmlValue(ns[[1]])
    print(value)
    # if storing something ...
    # store[[some_key]] <- some_value
  }
  getStore <- function() { as.list(store) }
  list(ROW = func, getStore = getStore)
}

myfunctions <- branchFunction()

xmlEventParse(
  file = "Inputfile.xml",
  handlers = NULL,
  branches = myfunctions
)

myfunctions$getStore()

I would have to do that for every column separately, and the structure I'm getting from the output is not useful.
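For illustration, the branch function could instead capture every child element of a ROW at once and store it per row, rather than running one XPath per column. This is only a sketch: the tiny inline demo file stands in for the real data, and it is untested at 14 GB scale.

```r
library(XML)

# Tiny stand-in file so the sketch is runnable; with the real data,
# point xmlEventParse at the actual file instead.
tmp <- tempfile(fileext = ".xml")
writeLines(c("<ROWSET>",
             "<ROW><PostId>19203</PostId><Sentiment>Neutral</Sentiment></ROW>",
             "<ROW><PostId>1903</PostId><Sentiment>Neutral</Sentiment></ROW>",
             "</ROWSET>"), tmp)

branchFunction <- function() {
  store <- new.env()
  n <- 0L
  ROW <- function(node, ...) {
    n <<- n + 1L
    # grab every child element of this ROW in one go
    store[[as.character(n)]] <- sapply(xmlChildren(node), xmlValue)
  }
  getStore <- function() as.list(store)
  list(ROW = ROW, getStore = getStore)
}

myfunctions <- branchFunction()
invisible(xmlEventParse(tmp, handlers = list(), branches = myfunctions))
rows <- myfunctions$getStore()
```

Each entry of `rows` is then a named character vector with one element per field of the corresponding ROW.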

The structure of my data looks like this:

<ROWSET>
<ROW>
    <Field1>21706</Field1>
    <PostId>19203</PostId>
    <ThreadId>38</ThreadId>
    <UserId>1397</UserId>
    <TimeStamp>1407351854</TimeStamp>
    <Upvotes>0</Upvotes>
    <Downvotes>0</Downvotes>
    <Flagged>f</Flagged>
    <Approved>t</Approved>
    <Deleted>f</Deleted>
    <Replies>0</Replies>
    <ReplyTo>egergeg</ReplyTo>
    <Content>dsfg</Content>
    <Sentiment>Neutral</Sentiment>
</ROW>
<ROW>
    <Field1>217</Field1>
    <PostId>1903</PostId>
    <ThreadId>8</ThreadId>
    <UserId>197</UserId>
    <TimeStamp>1407351854</TimeStamp>
    <Upvotes>0</Upvotes>
    <Downvotes>0</Downvotes>
    <Flagged>f</Flagged>
    <Approved>t</Approved>
    <Deleted>f</Deleted>
    <Replies>0</Replies>
    <ReplyTo>sdrwer</ReplyTo>
    <Content>wer</Content>
    <Sentiment>Neutral</Sentiment>
</ROW>
<ROW>
    <Field1>21306</Field1>
    <PostId>19103</PostId>
    <ThreadId>78</ThreadId>
    <UserId>13497</UserId>
    <TimeStamp>1407321854</TimeStamp>
    <Upvotes>0</Upvotes>
    <Downvotes>0</Downvotes>
    <Flagged>f</Flagged>
    <Approved>t</Approved>
    <Deleted>f</Deleted>
    <Replies>0</Replies>
    <ReplyTo>tzjtj</ReplyTo>
    <Content>rtgr</Content>
    <Sentiment>Neutral</Sentiment>
</ROW>
</ROWSET>
Carlo
  • You should provide a *complete* minimal example of your XML, otherwise it's really hard for others to be able to test their suggested solution. – Thomas Jun 11 '15 at 13:30
  • I expanded the example XML; it seems that part of the XML got cut off when I posted it. – Carlo Jun 11 '15 at 13:38

1 Answer


In your case, since you are dealing with a big dataset, you should indeed use xmlEventParse, which relies on SAX, i.e. the Simple API for XML. The advantage over xmlParse is that you will not load the whole XML tree into R, which can exhaust memory when the data is really big.

I don't have a big dataset at hand, so I cannot test under real conditions, but you can try this code snippet:

xmlDoc <- "Final.xml"
result <- NULL

# branch function to use with xmlEventParse: ROW is called once
# per complete <ROW> element
row.sax <- function() {
    ROW <- function(node) {
        children <- xmlChildren(node)
        # drop the whitespace text nodes between the child elements
        children[which(names(children) == "text")] <- NULL
        # note: rbind-ing row by row copies `result` on every call,
        # which can get slow on very large files
        result <<- rbind(result, sapply(children, xmlValue))
    }
    branches <- list(ROW = ROW)
    return(branches)
}

# call xmlEventParse; its return value itself is not used
xmlEventParse(xmlDoc, handlers = list(), branches = row.sax(),
              saxVersion = 2, trim = FALSE)

# and here is your data.frame
result <- as.data.frame(result, stringsAsFactors = FALSE)
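Every field comes back as character text. As a follow-up, `type.convert` can guess the natural type of each column; this is a sketch using a small stand-in data frame, since the real `result` depends on the file:

```r
# Stand-in for the `result` produced above, so the snippet is self-contained
result <- data.frame(PostId  = c("19203", "1903"),
                     Upvotes = c("0", "0"),
                     Content = c("dsfg", "wer"),
                     stringsAsFactors = FALSE)

# let R guess a type per column; as.is = TRUE keeps strings as
# character instead of converting them to factors
result[] <- lapply(result, type.convert, as.is = TRUE)
```

Numeric-looking columns such as PostId and Upvotes become integer, while free-text columns such as Content stay character.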

Let me know how it runs!

eblondel
  • I tried to run the code on a 4 GB dataset. The problem is that xmlEventParse creates only an empty list. It worked well on a small dataset but wasn't able to create the data frame on the big dataset. @eblondel – Carlo Jun 18 '15 at 13:10
  • ``xmlEventParse`` does return an empty list, but you should not care about what ``xmlEventParse`` returns (you can wrap the ``xmlEventParse`` call in ``invisible()``). With the function ``row.sax``, we feed the ``result`` object. Using ``xmlEventParse`` avoids duplicating the memory used in R (the XML tree plus the resulting ``data.frame``); the code I provided leaves you with only the ``data.frame``. Do you get an error with the big dataset? Let me know so I can help you. – eblondel Jun 18 '15 at 16:04
  • On a small XML file, xmlEventParse creates a large matrix; the problem is that on the big file the object created by xmlEventParse is NULL (empty) instead of a large matrix. Furthermore, no error or warning occurs when xmlEventParse finishes. Also, the process finishes too fast: it should take at least 20 min, but it stops/finishes after 5 minutes. @eblondel – Carlo Jun 19 '15 at 10:40
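If anyone hits the same slowdown: growing `result` with `rbind` inside the handler copies the whole matrix on every row, so an alternative is to append each row to a list and bind once at the end. A sketch (the tiny inline demo file stands in for the real data; untested at multi-GB scale):

```r
library(XML)

# Tiny stand-in file so the sketch runs; point xmlEventParse at the
# real file instead.
tmp <- tempfile(fileext = ".xml")
writeLines(c("<ROWSET>",
             "<ROW><PostId>19203</PostId><Upvotes>0</Upvotes></ROW>",
             "<ROW><PostId>1903</PostId><Upvotes>3</Upvotes></ROW>",
             "</ROWSET>"), tmp)

rows <- list()

row.sax <- function() {
  ROW <- function(node) {
    children <- xmlChildren(node)
    children[which(names(children) == "text")] <- NULL
    # append to a list instead of rbind-ing a growing matrix
    rows[[length(rows) + 1L]] <<- sapply(children, xmlValue)
  }
  list(ROW = ROW)
}

invisible(xmlEventParse(tmp, handlers = list(), branches = row.sax(),
                        saxVersion = 2, trim = FALSE))

# single bind at the end
result <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
```

Appending to a list only copies the list of pointers, not the accumulated rows, so the total cost stays far below repeated `rbind` calls.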