I am trying to convert an XML file to a data frame, but the result has only a few columns, each holding a lot of concatenated information.

library(XML)

# LOADING TRANSFORMED XML INTO R DATA FRAME
doc <- xmlParse("SRR12545290.xml") # https://www.ncbi.nlm.nih.gov/sra/?term=SRR12545290
xmldf <- xmlToDataFrame(doc)
head(xmldf)

This shows only:

 │EXPERIMENT                                                                                                               
1│SRX903458416S amplicon of salmon: distal intestinal digestaSRP279301Illumina 16S metagenomic targeted sequenci…
 │SUBMISSION
1│SRA1118818
 │STUDY                                                                                                                    
1│SRP279301PRJNA660116ArcticFloraDiet with or without functional feed ingredients were fed to salmon through fres…
 │SAMPLE                                                                                                                   
1│SRS7285186SAMN15936598FW-Ref749906gut metagenome['Distal intestinal digesta of Atlantic salmon', 'Distal intestinal dige…
 │Pool                  │RUN_SET                          
1│SRS7285186SAMN15936598│SRR12545290SRS7285186SAMN15936598

Instead, I want to get all of the information present in the XML file as separate fields, such as geographic location, host name, etc.

shanky
  • All of the data is there; the child nodes are just getting concatenated into the seven parent nodes. Because this file does not have a regular repeating structure, what exactly are you looking for: 1 row with 200 columns? – Dave2e Oct 12 '21 at 22:13
  • @Dave2e In short, yes: I need 1 row with 200 columns. – shanky Oct 13 '21 at 09:08

1 Answer

Here is an approach that parses the entire XML file (using the xml2 package) to obtain the values of all of the leaf nodes along with their paths.
I am not sure if this is what you were looking for, but it is a start.

library(xml2)
library(dplyr)    
doc<-read_xml("SRR12545290.xml")


#find all the nodes
allnodes <- doc %>% xml_find_all( '//*')

#find the leaves: nodes that have no child elements
#(xml_length() is applied to allnodes directly so the indices line up with allnodes)
leafs <- which(xml_length(allnodes) == 0)

#get the value in the leafs
value <- (allnodes %>% xml_text())[leafs]

#get the path to the leaves to identify the source
name <- (allnodes %>% xml_path())[leafs]
   
#clean up naming
name <- gsub("/EXPERIMENT_PACKAGE_SET/EXPERIMENT_PACKAGE/", "", name)

#final result
data.frame(name, value)
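
If a single row with one column per leaf is what is needed, as requested in the comments, the long name/value table can be turned sideways. The following is only a minimal sketch under stated assumptions: the inline XML is a made-up stand-in for the real file (its tag names and values are hypothetical), and `long` and `wide` are illustrative names that do not appear in the code above.

```r
library(xml2)

# Hypothetical stand-in for SRR12545290.xml, so the sketch runs on its own
doc <- read_xml(paste0(
  "<SET><PKG>",
  "<EXPERIMENT><TITLE>16S amplicon</TITLE></EXPERIMENT>",
  "<SAMPLE>",
  "<ATTR><TAG>host</TAG><VALUE>Salmo salar</VALUE></ATTR>",
  "<ATTR><TAG>geo_loc_name</TAG><VALUE>Norway</VALUE></ATTR>",
  "</SAMPLE>",
  "</PKG></SET>"
))

# Same idea as above: every element, then keep those without child elements
allnodes <- xml_find_all(doc, "//*")
leafs <- which(xml_length(allnodes) == 0)

# Long form: one row per leaf, with its path and its text
long <- data.frame(
  name = xml_path(allnodes)[leafs],
  value = xml_text(allnodes)[leafs],
  stringsAsFactors = FALSE
)

# Wide form: a 1 x n data frame, one column per leaf path
wide <- as.data.frame(t(long$value), stringsAsFactors = FALSE)
names(wide) <- long$name

dim(wide)
```

Because xml_path() adds a position index to repeated siblings (for example `ATTR[1]`, `ATTR[2]`), the column names stay unique. All columns come out as character, so any numeric fields would need converting afterwards.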
Dave2e