I want to read my XML file into a data frame in R. The file is 14 GB, so my first attempt, which parses the whole file at once, didn't work:
f=xmlParse("Final.xml")
df=xmlToDataFrame(f)
r=xmlRoot(f)
The problem is that R always runs out of memory.
I've also seen the question:
How to read large (~20 GB) xml file in R?
I tried to use the approach from Martin Morgan, which I didn't fully understand, and applied it to my dataset like this:
library(XML)
branchFunction <- function() {
    store <- new.env()
    # called once for every complete <ROW> node
    func <- function(x, ...) {
        ns <- getNodeSet(x, path = "//Sentiment")
        value <- xmlValue(ns[[1]])
        print(value)
        # if storing something ...
        # store[[some_key]] <- some_value
    }
    getStore <- function() { as.list(store) }
    list(ROW = func, getStore = getStore)
}
myfunctions <- branchFunction()
xmlEventParse(
    file = "Inputfile.xml",
    handlers = NULL,
    branches = myfunctions
)
myfunctions$getStore()
With this approach I would have to repeat the extraction for every column separately, and the structure of the output I get is not useful.
The structure of my data looks like this:
<ROWSET>
<ROW>
<Field1>21706</Field1>
<PostId>19203</PostId>
<ThreadId>38</ThreadId>
<UserId>1397</UserId>
<TimeStamp>1407351854</TimeStamp>
<Upvotes>0</Upvotes>
<Downvotes>0</Downvotes>
<Flagged>f</Flagged>
<Approved>t</Approved>
<Deleted>f</Deleted>
<Replies>0</Replies>
<ReplyTo>egergeg</ReplyTo>
<Content>dsfg</Content>
<Sentiment>Neutral</Sentiment>
</ROW>
<ROW>
<Field1>217</Field1>
<PostId>1903</PostId>
<ThreadId>8</ThreadId>
<UserId>197</UserId>
<TimeStamp>1407351854</TimeStamp>
<Upvotes>0</Upvotes>
<Downvotes>0</Downvotes>
<Flagged>f</Flagged>
<Approved>t</Approved>
<Deleted>f</Deleted>
<Replies>0</Replies>
<ReplyTo>sdrwer</ReplyTo>
<Content>wer</Content>
<Sentiment>Neutral</Sentiment>
</ROW>
<ROW>
<Field1>21306</Field1>
<PostId>19103</PostId>
<ThreadId>78</ThreadId>
<UserId>13497</UserId>
<TimeStamp>1407321854</TimeStamp>
<Upvotes>0</Upvotes>
<Downvotes>0</Downvotes>
<Flagged>f</Flagged>
<Approved>t</Approved>
<Deleted>f</Deleted>
<Replies>0</Replies>
<ReplyTo>tzjtj</ReplyTo>
<Content>rtgr</Content>
<Sentiment>Neutral</Sentiment>
</ROW>
</ROWSET>
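For illustration, here is a minimal sketch of how a single branch handler might collect all child elements of a ROW at once, instead of one column per pass. The names `rowCollector` and `readRows` are hypothetical and not part of the XML package. The sketch assumes that every ROW contains the same child elements in the same order, and that the parsed values (as character strings) fit into memory even though the raw file does not.

```r
library(XML)

# Hypothetical helper: returns a branch handler that stores each <ROW>
# as a named character vector, plus a function to bind them together.
rowCollector <- function() {
    store <- new.env()
    n <- 0L
    ROW <- function(node, ...) {
        kids <- xmlChildren(node)
        # drop whitespace-only text nodes, keep the element children
        kids <- kids[names(kids) != "text"]
        vals <- sapply(kids, function(k) {
            v <- xmlValue(k)
            if (length(v) == 0) NA_character_ else v
        })
        n <<- n + 1L
        # zero-padded key so that ls() returns rows in document order
        store[[sprintf("%012d", n)]] <- vals
    }
    getStore <- function() {
        rows <- mget(ls(store), envir = store)
        # assumes identical fields in identical order in every ROW
        df <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
        rownames(df) <- NULL
        df
    }
    list(ROW = ROW, getStore = getStore)
}

# Hypothetical wrapper: stream the file and return one data frame.
readRows <- function(file) {
    collector <- rowCollector()
    xmlEventParse(file = file, handlers = NULL,
                  branches = list(ROW = collector$ROW))
    collector$getStore()
}
```

Calling `df <- readRows("Inputfile.xml")` would then give one row per ROW element and one character column per child element (Field1, PostId, ..., Sentiment). Numeric columns would still need converting, for example with `as.integer(df$PostId)`.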