
I'm stuck with my code. A subset of my XML file writes to CSV correctly, but the actual file, which is 18 MB, does not, even though the code is exactly the same. I've tried many things without success. I asked a different question here two days ago and have worked my way up to this point. Please help. My code:

library(XML)

doc <- xmlTreeParse("PrimaryCatchments.xml", options = HUGE)
top <- xmlRoot(doc)
# text value of every element two levels below the root
extract <- xmlSApply(top, function(x) xmlSApply(x, xmlValue))
extract_df <- data.frame(t(extract), row.names = NULL)
my.df <- data.frame(lapply(extract_df, as.character), stringsAsFactors = FALSE)
write.csv(my.df, "Extract.csv", row.names = FALSE)

EDIT: With the subset of data, the above code writes name, ExtendedData, and coordinates into three separate columns. But with the full large XML file it writes columns named Placemark, Placemark1, Placemark2, ..., Placemark1016, so there are 1016 columns of data, and some values begin with "c(". How do I get my large XML file to write 3 columns and 1016 rows?

  • Please be specific about the issue. What do you mean by "doesn't write to csv correctly", "it's just not working", "it's a mess", or "really bad"? Is data spread out across rows or columns? Any error message? If the code works for the smaller subset but not the full file, chances are it's the data in the full file and not the code. You may need to compare both XML datasets and look for patterns in the coordinates element. Also, recall that CSVs are comma separated, so if the parsed data contains commas, it may be split across columns. – Parfait Apr 05 '15 at 23:59
  • @Parfait I have edited my question. Please help me; I've tried everything. Both files use the same code, one is just much smaller. – CuriousBeing Apr 06 '15 at 00:20
  • Thanks for that clarification. Now, did you compare both XML files, the subset and the full file? I am assuming your link is the subset? Check whether the structures are the same, especially at the very beginning and end. Is the root node the only element wrapping the entire document in the full file? I am guessing there is another outer element, so the parser only iterates the Placemark elements, the next level down. See this [SO explanation](http://stackoverflow.com/questions/19743394/how-does-xmlsapply-work). – Parfait Apr 06 '15 at 00:50
  • @Parfait Yes, both my files are the same. You can have a look; here is the [link](https://www.dropbox.com/sh/xrngvpp7v2v1hgl/AABDIFe8xSmL-CkgIW7_Wqtua?dl=0). I'm doing this using a plyr function now. I suspect it is the coordinates column, because it is too long and has many commas, so something weird is happening there. – CuriousBeing Apr 06 '15 at 01:00

1 Answer


The problem was indeed the size of the XML file. However, I have a workaround that involves multiple steps. Essentially, you slice the large XML file into smaller pieces (e.g., 10 files of roughly 100 Placemark elements each) and have R iterate through each of the smaller XML files.

STEP 1: XSL SCRIPTING

To slice the file, you can run an XSLT stylesheet (.xsl file) for each slice, selecting Placemark elements by their position.

FIRST 100 Placemarks

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

<xsl:template match="/">
    <xsl:element name="kml">
        <xsl:for-each select="(//Placemark)[100 &gt;= position()]"><xsl:copy-of select="."/></xsl:for-each>
    </xsl:element>
</xsl:template>

</xsl:stylesheet>

NEXT 101 - 200 Placemarks

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

<xsl:template match="/">
    <xsl:element name="kml">
        <xsl:for-each select="(//Placemark)[(200 &gt;= position()) and (101 &lt;= position())]"><xsl:copy-of select="."/></xsl:for-each>
    </xsl:element>
</xsl:template>

</xsl:stylesheet>

...And so on until 1016 Placemarks

STEP 2: XML TRANSFORMATION

Next, use R, Python, PHP, VBA, or another language to transform the large XML file with the ten separate XSL files above. R may have an xslt package for this (I used Python's lxml module); a rough R sketch is shown below. After each transformation, save the smaller XML file separately (~1.5 MB each): PrimaryCatchments_100.xml, PrimaryCatchments_200.xml, etc. Each file will look like its 18 MB parent but contain only 100 Placemarks.
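As an illustration only (the actual transform here was done with Python's lxml), below is a minimal R sketch assuming the xml2 and xslt packages are installed and the slice stylesheets are saved as slice_100.xsl, slice_200.xsl, ..., slice_1016.xsl (hypothetical file names):

# minimal sketch, assuming the xml2 and xslt packages are installed
library(xml2)
library(xslt)

# one stylesheet per slice; the slice_*.xsl names are hypothetical
slices <- c(seq(100, 900, by = 100), 1016)
stylesheets <- paste0("slice_", slices, ".xsl")

doc <- read_xml("PrimaryCatchments.xml")

for (xsl in stylesheets) {
    style   <- read_xml(xsl)
    result  <- xml_xslt(doc, style)    # apply the transform
    outfile <- sub("^slice_(.*)\\.xsl$", "PrimaryCatchments_\\1.xml", xsl)
    write_xml(result, outfile)         # e.g. PrimaryCatchments_100.xml
}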

STEP 3: R XML PARSING

Then, have R parse and extract each smaller XML file (consider a loop or lapply(); a sketch of the looped version follows the code below).

doc<-xmlTreeParse("PrimaryCatchments_100.xml", options=OLDSAX)
top<-xmlRoot(doc)
extract <- xmlSApply(top, function(x) xmlSApply(x, xmlValue)) 
extract_df <- data.frame(t(extract),row.names=NULL)
my.df <- data.frame(lapply(extract_df, as.character), stringsAsFactors=FALSE)
write.csv(my.df,"Extract_100.csv", row.names=FALSE)
...
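For reference, a minimal sketch of the looped version, assuming the slice files follow the PrimaryCatchments_<n>.xml naming above:

# looped extraction over all ten slices (file names assumed as above)
library(XML)

slices <- c(seq(100, 900, by = 100), 1016)

invisible(lapply(slices, function(i) {
    doc <- xmlTreeParse(sprintf("PrimaryCatchments_%d.xml", i), options = OLDSAX)
    top <- xmlRoot(doc)
    extract <- xmlSApply(top, function(x) xmlSApply(x, xmlValue))
    extract_df <- data.frame(t(extract), row.names = NULL)
    my.df <- data.frame(lapply(extract_df, as.character), stringsAsFactors = FALSE)
    write.csv(my.df, sprintf("Extract_%d.csv", i), row.names = FALSE)
}))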

STEP 4: R CSV APPENDING

Finally, use R to append all 10 CSV extract files together. Note that the maximum character count of an Excel cell is 32,767, and some coordinates values run to more than 54,000 characters, which forces overflow into new cells when the CSV is viewed in a spreadsheet.

extract100df<-read.csv("Extract_100.csv")
extract200df<-read.csv("Extract_200.csv")
extract300df<-read.csv("Extract_300.csv")
extract400df<-read.csv("Extract_400.csv")
extract500df<-read.csv("Extract_500.csv")
extract600df<-read.csv("Extract_600.csv")
extract700df<-read.csv("Extract_700.csv")
extract800df<-read.csv("Extract_800.csv")
extract900df<-read.csv("Extract_900.csv")
extract1016df<-read.csv("Extract_1016.csv")

finaldf <- rbind(extract100df, extract200df, extract300df, extract400df, extract500df,
                 extract600df, extract700df, extract800df, extract900df, extract1016df)
write.csv(finaldf, "FinalExtractData.csv", row.names = FALSE)
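Equivalently, a shorter sketch that reads and row-binds the ten CSVs in one pass (file names assumed as above):

# one-pass read and append (file names assumed as above)
slices <- c(seq(100, 900, by = 100), 1016)
finaldf <- do.call(rbind, lapply(slices, function(i) {
    read.csv(sprintf("Extract_%d.csv", i), stringsAsFactors = FALSE)
}))
write.csv(finaldf, "FinalExtractData.csv", row.names = FALSE)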
Parfait
  • You are absolutely right. My file was indeed wrong; it had some extra attributes. Thank you so much for your help. I will try out your code now. – CuriousBeing Apr 06 '15 at 03:55