How can I parse this data using XML?

Question

I have a dataset that can be downloaded from http://mips.helmholtz-muenchen.de/proj/ppi/. At the bottom of the page it says "You can get the full dataset".

Then I tried to use the XML package:

library(XML)
doc <- xmlTreeParse("path to/allppis.xml", useInternal = TRUE)
root <- xmlRoot(doc)

but the result seems to be empty.

What do I want?

If I open the allppis.xml file downloaded from that website, I want to extract specific lines into a text file: the ones that start with <fullName> and end with </fullName>.

For example, if I open that file I can see this:

<fullName>S100A8;CAGA;MRP8; calgranulin A (migration inhibitory factor-related protein 8)</fullName>

Then I want to have this

Proteins                   description 
S100A8;CAGA;MRP8     calgranulin A (migration inhibitory factor-related protein 8)

You need to download and unzip the file first; then you can parse it. [This shows a way](http://stackoverflow.com/questions/23899525/using-r-to-download-zipped-data-file-extract-and-import-csv). So try `temp <- tempfile() ; download.file("http://mips.helmholtz-muenchen.de/proj/ppi/data/mppi.gz", temp) ; unz(temp, "allppis.xml")`, and then `doc <- xmlTreeParse(temp, useInternal = TRUE) ; root <- xmlRoot(doc)` — user20650, Nov 15 '16 at 21:30
There is also this package, which may be useful: https://www.bioconductor.org/packages/release/bioc/html/RpsiXML.html — user20650, Nov 15 '16 at 21:30
@user20650 Now when I type doc I can see that the XML is inside it, but where is it saved? Can you help me get the exact output I want? — Learner Algorithm, Nov 15 '16 at 22:28
Okay, so you can download it. I do not know how to parse this, hence just the comment ^^ to help with the download. Did you look to see if RpsiXML had a schema? — user20650, Nov 15 '16 at 22:30
@user20650 Yes, I am familiar with this package. Most of these packages are written for a publication and I am not able to dig into them. However, I really appreciate your great help, and I will wait to see if someone can help me with the parsing. — Learner Algorithm, Nov 15 '16 at 22:32
Please post a snippet of XML in body of post. Requiring us to download external zipped files may not encourage participation. — Parfait, Nov 16 '16 at 01:25
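
As a rough sketch of the download-then-parse step discussed in these comments: the helper `read_gz_xml()` below is hypothetical (not part of any package), and it assumes the downloaded archive is a plain gzip file, which the `.gz` extension suggests but which is not confirmed here. It decompresses through a base R `gzfile()` connection and hands the text to `xml2::read_xml()`. The demonstration runs on a tiny stand-in file so it works offline; the real download is left commented out.

```r
library(xml2)

# Hypothetical helper: read a gzip-compressed XML file via a base R connection
read_gz_xml <- function(path) {
  con <- gzfile(path, open = "rt")
  on.exit(close(con))
  read_xml(paste(readLines(con, warn = FALSE), collapse = "\n"))
}

# Offline demonstration on a small stand-in file (made-up content)
tmp <- tempfile(fileext = ".gz")
con <- gzfile(tmp, open = "wt")
writeLines("<entrySet><fullName>TRP3; calcium influx channel protein</fullName></entrySet>", con)
close(con)

doc <- read_gz_xml(tmp)
first_name <- xml_text(xml_find_first(doc, ".//fullName"))

# For the real data (URL from the comment above; assumes plain gzip):
# download.file("http://mips.helmholtz-muenchen.de/proj/ppi/data/mppi.gz", tmp, mode = "wb")
# doc <- read_gz_xml(tmp)
```

The parsed document lives only in the R session (here in `doc`); nothing is written to disk unless you save it explicitly, for example with `write_xml()`.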

score 2 · Answer 1 · answered Nov 16 '16 at 05:07

I think you want something like this (the question is not very clear IMO). I also think the main issue was default namespaces, which are definitely a royal pain:

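library(xml2)
library(purrr)
library(dplyr)
library(stringi)

doc <- read_xml("allppis.xml")

# xml2 exposes the file's default namespace under the prefix "d1";
# rename it to the shorter "x" for use in the XPath below
ns <- xml_ns_rename(xml_ns(doc), d1="x")

xml_find_all(doc, ".//x:proteinInteractor/x:names/x:fullName", ns) %>%
  xml_text() %>%
  # split at the first "; " only: identifiers on the left, description on the right
  stri_split_fixed("; ", n=2, simplify=TRUE) %>%
  # as_data_frame() is deprecated; go through a plain data frame instead
  as.data.frame(stringsAsFactors=FALSE) %>%
  setNames(c("Proteins", "Description")) %>%
  as_tibble() %>%
  mutate(Proteins=trimws(Proteins),
         Description=trimws(Description))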
## # A tibble: 3,628 × 2
##             Proteins                                                    Description
##                <chr>                                                          <chr>
## 1   S100A8;CAGA;MRP8  calgranulin A (migration inhibitory factor-related protein 8)
## 2  S100A9;CAGB;MRP14 calgranulin B (migration inhibitory factor-related protein 14)
## 3  S100A9;CAGB;MRP14 calgranulin B (migration inhibitory factor-related protein 14)
## 4   S100A8;CAGA;MRP8  calgranulin A (migration inhibitory factor-related protein 8)
## 5   S100A8;CAGA;MRP8  calgranulin A (migration inhibitory factor-related protein 8)
## 6  S100A9;CAGB;MRP14 calgranulin B (migration inhibitory factor-related protein 14)
## 7  S100A9;CAGB;MRP14 calgranulin B (migration inhibitory factor-related protein 14)
## 8   S100A8;CAGA;MRP8  calgranulin A (migration inhibitory factor-related protein 8)
## 9               TRP3                                 calcium influx channel protein
## 10            IP3R-3                  inositol 1,4,5-trisphosphate receptor, type 3
## # ... with 3,618 more rows

You'll need to clean that up a bit (View() the resultant data frame to see what I mean).
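
To see the default-namespace issue mentioned in this answer in isolation, here is a minimal sketch on a made-up document. The namespace URI and the content are placeholders, not taken from the real file. An XPath with unprefixed element names only matches elements in no namespace, so it finds nothing until a prefix is supplied or the namespace is stripped.

```r
library(xml2)

# Toy document with a default namespace (placeholder URI, made-up content)
doc <- read_xml('<entrySet xmlns="http://example.org/toy">
  <proteinInteractor>
    <names><fullName>TRP3; calcium influx channel protein</fullName></names>
  </proteinInteractor>
</entrySet>')

# Unprefixed names do not match namespaced elements: empty result
no_ns <- xml_find_all(doc, ".//fullName")
n_without_prefix <- length(no_ns)

# xml2 registers the default namespace under the prefix "d1"
ns <- xml_ns(doc)
with_ns <- xml_text(xml_find_all(doc, ".//d1:fullName", ns))

# Alternative: strip the default namespace (modifies doc in place),
# after which the unprefixed XPath works
xml_ns_strip(doc)
after_strip <- xml_text(xml_find_all(doc, ".//fullName"))
```

This is probably also why the original attempt appeared to return nothing: the elements are there, but queries that ignore the namespace do not match them.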

Thank you very much! I have a few concerns: 1- sometimes I see no protein ID, only a description. Also, is it possible to have the `db=` and `id=` for each protein in another column? I will definitely accept your answer. Thanks again — Learner Algorithm, Nov 16 '16 at 10:53
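
One possible way to address both points in this comment, sketched on a made-up document. It assumes each `proteinInteractor` carries its `db` and `id` attributes on an `xref/primaryRef` child; that layout, the namespace URI, and the `DB1`/`ID0001` values are illustrative assumptions, so check the element names against the real file. Working per interactor with `xml_find_first()` keeps the columns aligned, and entries whose `fullName` has no `"; "` separator get `NA` in the Proteins column instead of having the description land there.

```r
library(xml2)

# Made-up stand-in; the xref/primaryRef layout and attribute values are assumptions
doc <- read_xml('<entrySet xmlns="http://example.org/toy">
  <proteinInteractor>
    <names><fullName>S100A8;CAGA;MRP8; calgranulin A</fullName></names>
    <xref><primaryRef db="DB1" id="ID0001"/></xref>
  </proteinInteractor>
  <proteinInteractor>
    <names><fullName>calcium influx channel protein</fullName></names>
    <xref><primaryRef db="DB1" id="ID0002"/></xref>
  </proteinInteractor>
</entrySet>')
ns <- xml_ns(doc)

# One row per interactor; xml_find_first() yields NA where a child is missing
interactors <- xml_find_all(doc, ".//d1:proteinInteractor", ns)
full_name <- xml_text(xml_find_first(interactors, "./d1:names/d1:fullName", ns))
ref <- xml_find_first(interactors, "./d1:xref/d1:primaryRef", ns)

# Position of the first "; " (-1 when the entry has no identifier part)
pos <- as.integer(regexpr("; ", full_name, fixed = TRUE))
has_id <- !is.na(pos) & pos > 0

result <- data.frame(
  Proteins = ifelse(has_id, trimws(substr(full_name, 1, pos - 1)), NA_character_),
  Description = trimws(ifelse(has_id,
                              substr(full_name, pos + 2, nchar(full_name)),
                              full_name)),
  db = xml_attr(ref, "db"),
  id = xml_attr(ref, "id"),
  stringsAsFactors = FALSE
)
```

From there, `write.table(result, "proteins.txt", sep = "\t", row.names = FALSE, quote = FALSE)` would write the table to a text file.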
