I am pretty new to R. Yesterday I scraped a website that requires login; the page is in XML format, like below.

<result status="success">
  <code>1</code>
  <note>success</note>
  <teacherList>
    <teacher id="D95">
      <name>Mary</name>
      <department id="420">
        <name>Math</name>
      </department>
      <department id="421">
        <name>Statistics</name>
      </department>
    </teacher>
    <teacher id="D73">
      <name>Adam</name>
      <department id="412">
        <name>English</name>
      </department>
    </teacher>
  </teacherList>
</result> 

I have just converted the XML to a list.

library(XML)
library(rvest)
library(plyr)
library(dplyr)
library(httr)
library(pipeR)
library(xml2)

url.address <- "http://xxxxxxxxxxxxxxxxx"
session <-html_session(url.address)
form <-html_form(read_html(url.address))[[1]]
filled_form <- set_values(form,
                          "userid" = "id",
                          "Password" = "password")
s <- submit_form(session,filled_form)
z = read_xml(s$response)
z1 = as_list(z)
z2 <- z1$teacherList

Now I need to extract data from a list and make it as a data frame. By the way, some people belong to 2 departments, but some only belong to 1. A part of the list z2 looks like below:

z2[[1]]

$name
$name[[1]]
[1] "Mary"


$department
$department$name
$department$name[[1]]
[1] "Math"


attr(,"id")
[1] "420"

$department
$department$name
$department$name[[1]]
[1] "statistics"


attr(,"id")
[1] "421"

attr(,"id")
[1] "D95"

When I extracted them one by one, it took too long:

attr(z2[[1]],"id")

"D95"

z2[[1]][[1]][[1]] 

"Mary"

z2[[1]][[2]][[1]][[1]] 

"Math"

attr(z2[[1]][[2]], "id") 

"420"

z2[[1]][[3]][[1]][[1]] 

"statistics"

attr(z2[[1]][[3]], "id")

"421"

attr(z2[[2]],"id")

"D73"

z2[[2]][[1]][[1]] 

"Adam"

z2[[2]][[2]][[1]][[1]]

"English"

attr(z2[[2]][[2]],"id")

"412"

So I tried to write a loop:

for (x in 1:2){
  for (y in 2:3){
  a <- attr(z2[[x]],"id")
  b <- z2[[x]][[1]][[1]]
  d <- z2[[x]][[y]][[1]][[1]]
  e <- attr(z2[[x]][[y]],"id")
  g <- cbind(print(a),print(b),print(d),print(e))
  }}

but it doesn't work at all, since some people belong to only one department. The result I expected:

[image: expected output table, not reproduced here]

Any advice would be appreciated!

dput(head(z2, 10))

structure(list(teacher = structure(list(name = list("Mary"), 
    department = structure(list(name = list("Math")), .Names = "name", id = "420"), 
    department = structure(list(name = list("statistics")), .Names = "name", id = "421")), .Names = c("name", 
"department", "department"), id = "D95"), teacher = structure(list(
    name = list("Adam"), department = structure(list(name = list(
        "English")), .Names = "name", id = "412")), .Names = c("name", 
"department"), id = "D73"), teacher = structure(list(name = list(
    "Kevin"), department = structure(list(name = list("Chinese")), .Names = "name", id = "201")), .Names = c("name", 
"department"), id = "D101"), teacher = structure(list(name = list(
    "Nana"), department = structure(list(name = list("Science")), .Names = "name", id = "205")), .Names = c("name", 
"department"), id = "D58"), teacher = structure(list(name = list(
    "Nelson"), department = structure(list(name = list("Music")), .Names = "name", id = "370")), .Names = c("name", 
"department"), id = "D14"), teacher = structure(list(name = list(
    "Esther"), department = structure(list(name = list("Medicine")), .Names = "name", id = "361")), .Names = c("name", 
"department"), id = "D28"), teacher = structure(list(name = list(
    "Mia"), department = structure(list(name = list("Chemistry")), .Names = "name", id = "326")), .Names = c("name", 
"department"), id = "D17"), teacher = structure(list(name = list(
    "Jack"), department = structure(list(name = list("German")), .Names = "name", id = "306")), .Names = c("name", 
"department"), id = "D80"), teacher = structure(list(name = list(
    "Tom"), department = structure(list(name = list("French")), .Names = "name", id = "360")), .Names = c("name", 
"department"), id = "D53"), teacher = structure(list(name = list(
    "Allen"), department = structure(list(name = list("Spanish")), .Names = "name", id = "322")), .Names = c("name", 
"department"), id = "D18")), .Names = c("teacher", "teacher", 
"teacher", "teacher", "teacher", "teacher", "teacher", "teacher", "teacher", 
"teacher"))
Ching
  • It will not be possible to help unless you provide a reproducible example of your data. try `dput(head(z2, 10))` and paste the result into your question. – lmo Sep 06 '17 at 15:42
  • @lmo sorry! Just added :) – Ching Sep 06 '17 at 16:28
  • please do not paste images of code. And please read [how to make a great reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – C8H10N4O2 Sep 06 '17 at 16:30
  • @lmo Just upload it now. I am sorry that I haven't figured out how to post the output, so I uploaded the image. Sorry for the inconvenience. – Ching Sep 06 '17 at 17:01
  • @C8H10N4O2 Hi! I am very sorry, just started using it two days ago. I know this should not be my excuses. I will try to figure out how to do asap. – Ching Sep 06 '17 at 17:02
  • @Ching I modified the code to fit your second example. If you run against a new problem with the data structure, please ask a new question on SO. Remember to `dput` an example of your data and also add a link to this question so that people can refer to it. – lmo Sep 07 '17 at 11:45

2 Answers

This was a bit crazy to construct, but I think it more or less conforms with the desired output posted in a previous version of the post. I had to use sapply within the lapply function to pull out the second ID variable.

do.call(rbind,             # rbind list of data.frames output by lapply
        lapply(unname(z2), # loop through list, first drop outer names
               function(x) { # begin lapply function
                 temp <- unlist(x) # unlist inner elements to a vector
                 data.frame(name=temp[names(temp) == "name"], # subset on names
                            dept=temp[names(temp) == "department.name"], # subset on dept
                            id=attr(x, "id"), # extract one id
                            id2=unlist(sapply(x, attr, "id")), # extract other id
                            row.names=NULL) # end data.frame function, drop row.names
                            })) # end lapply function, lapply, and do.call

this returns

     name       dept   id id2
1    Mary       Math  D95 420
2    Mary statistics  D95 421
3    Adam    English  D73 412
4   Kevin    Chinese D101 201
5    Nana    Science  D58 205
6  Nelson      Music  D14 370
7  Esther   Medicine  D28 361
8     Mia  Chemistry  D17 326
9    Jack     German  D80 306
10    Tom     French  D53 360
11  Allen    Spanish  D18 322
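
To see why subsetting on names works, it can help to inspect the intermediate values for a single teacher. The sketch below rebuilds one element by hand (x is a hypothetical stand-in for z2[[1]], copied from Mary's entry in the dput output) and prints what unlist and sapply return for it.

```r
# Hypothetical stand-in for z2[[1]] (Mary's entry from the dput output)
x <- structure(
  list(name = list("Mary"),
       department = structure(list(name = list("Math")), id = "420"),
       department = structure(list(name = list("statistics")), id = "421")),
  id = "D95")

# unlist() flattens the nesting and joins nested names with ".",
# which is where the "department.name" label comes from
temp <- unlist(x)
print(names(temp))

# attr() returns NULL for elements without an "id" attribute (here, name),
# so only the department ids survive the unlist()
dept_ids <- unlist(sapply(x, attr, "id"))
print(unname(dept_ids))

# The teacher's own id sits on the outer element
print(attr(x, "id"))
```

Because name has length 1 and the department vectors have length 2, data.frame recycles the teacher-level values so that each department gets its own row.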

The structure of the second list differs from the initial example in two ways. First, one level of nesting is removed: the depth of the new list is one less than that of the initial example, as if you had provided z2[[1]] for the initial list. Second, the second example is missing what I called id initially (values such as D95 and D101).

With a bit of manipulation of the original code, I got this to work with

lapply(list(z3), # wrap z3 in a list to restore the expected depth
       function(x) { # begin lapply function
           temp <- unlist(x) # unlist inner elements to a vector
           data.frame(name=temp[names(temp) == "name"], # subset on names
                      dept=temp[names(temp) == "department.name"], # subset on dept
                      # id=attr(x, "id"), # extract one id
                      id2=unlist(sapply(x, attr, "id")), # extract other id
                      row.names=NULL) # end data.frame function, drop row.names
       })

The changes to the code address what I mentioned before. z2 is replaced by list(z3) as the first argument to lapply, which constructs the needed list depth. Also, the line id=attr(x, "id"), in the inner function has been commented out, as id does not exist in this data.

lmo
  • this is very neat! Thank you so much :) – Ching Sep 07 '17 at 01:07
  • I tried to use the do.call function to solve another one which has less structure, but it went error like "Error in data.frame(names(temp) == "name", division = temp[names(temp) == : arguments imply differing number of rows: 1, 0 ". Would you kindly tell me which parts go wrong? Even just a hint will be great :) I just list it above – Ching Sep 07 '17 at 02:55
  • @Ching I modified the code to work with your second example. It is important to understand your underlying data structure when working on a problem. Here, you should have seen that the id variable was missing from this data when you printed a small example on your screen. – lmo Sep 07 '17 at 11:43
  • Hi! It didn't work when I use list(z3), but when I do it with unname(z3) it works again. That's pretty strange. I did notice the difference of structure. – Ching Sep 07 '17 at 15:36
  • Wait! I knew why it didn't work, the second structure I listed above was the wrong one (which I didn't notice earlier). Thank you so much! You really save me. This structure really drives me crazy :p – Ching Sep 07 '17 at 15:42
  • Would you be kindly teach me how to extract the data from one of the attr that I just listed in 0908 question. I will do the rest by myself. Thank you! :) – Ching Sep 08 '17 at 02:25
  • Please delete this new material and ask a new question. In your new question, `dput` a portion of the data. It is much easier to work with. – lmo Sep 08 '17 at 09:38
  • I just did. Actually my new listed loop is more complicated (I just realized!). So the do.call function I tried before didn't output the right information I need. – Ching Sep 09 '17 at 07:41
  • I use the most simple and ineffective way to do my new question. This is easier for me to understand, but I would still like to learn do.call function if you are willing to give me instruction. I am very thankful! Here's the new question if you want to have a look: https://stackoverflow.com/questions/46128164/extract-data-from-a-nested-list-with-loops?noredirect=1#comment79220279_46128164 – Ching Sep 09 '17 at 11:47
  • @IMO, please I want your assistance on how to extract the output in the list and how to make on the following r-code: > simplex_output, and I got the following result : [[1]]$`params` E tau tp nn 1 1 1 1 2 [[1]]$model_output time obs pred pred_var 1 2 45 NaN NaN 2 3 96 NaN NaN [[1]]$stats num_pred rho mae rose 1 1 NaN 27.41743 27.41743, I want to extract only the mae value and this is from one dataset and I want to make it for many datasets in a loop? any advice, I would be appreciated? – Stackuser Feb 20 '20 at 00:20

XML is generally really easy to deal with in R.

Use library(XML) and library(plyr) to avoid having to write loops:

Step one is to read in the XML

I saved your sample XML as a .xml file called Demo.xml. You can also pass xmlParse a URL.

rawXML <- xmlParse("Demo.xml")

Then convert XML to list:

xmlList <- xmlToList(rawXML)

Then convert list to data frame with plyr

df1 <- ldply(xmlList, data.frame)

This is the general process. If you provide sample data, we can refine it to match your specific use case.

Here's the resulting summary output. Is this what you're looking for?

 str(df1)
'data.frame':   4 obs. of  12 variables:
 $ .id                        : chr  "code" "note" "teacherList" ".attrs"
 $ X..i..                     : Factor w/ 2 levels "1","success": 1 2 NA 2
 $ teacher.name               : Factor w/ 1 level "Mary": NA NA 1 NA
 $ teacher.department.name    : Factor w/ 1 level "Math": NA NA 1 NA
 $ teacher.department..attrs  : Factor w/ 1 level "420": NA NA 1 NA
 $ teacher.department.name.1  : Factor w/ 1 level "Statistics": NA NA 1 NA
 $ teacher.department..attrs.1: Factor w/ 1 level "421": NA NA 1 NA
 $ teacher..attrs             : Factor w/ 1 level "D95": NA NA 1 NA
 $ teacher.name.1             : Factor w/ 1 level "Adam": NA NA 1 NA
 $ teacher.department.name.2  : Factor w/ 1 level "English": NA NA 1 NA
 $ teacher.department..attrs.2: Factor w/ 1 level "412": NA NA 1 NA
 $ teacher..attrs.1           : Factor w/ 1 level "D73": NA NA 1 NA
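
The summary above has one row per top-level element, with the teachers spread across columns. If the goal is one row per teacher and department, a possible alternative is to query the XML nodes directly instead of converting to a list first. The following is only a sketch, assuming the xml2 package is installed and the document has the shape of the sample in the question; xml_string and teacher_rows are hypothetical names used for illustration.

```r
library(xml2)

# Sample document copied from the question (hypothetical variable name)
xml_string <- '<result status="success">
  <code>1</code>
  <note>success</note>
  <teacherList>
    <teacher id="D95">
      <name>Mary</name>
      <department id="420"><name>Math</name></department>
      <department id="421"><name>Statistics</name></department>
    </teacher>
    <teacher id="D73">
      <name>Adam</name>
      <department id="412"><name>English</name></department>
    </teacher>
  </teacherList>
</result>'

doc <- read_xml(xml_string)

# Build one small data frame per <teacher> node; teacher-level values
# are recycled across however many departments that teacher has
teacher_rows <- lapply(xml_find_all(doc, "//teacher"), function(node) {
  depts <- xml_find_all(node, "./department")
  data.frame(
    name = xml_text(xml_find_first(node, "./name")),
    dept = xml_text(xml_find_all(node, "./department/name")),
    id   = xml_attr(node, "id"),
    id2  = xml_attr(depts, "id"),
    stringsAsFactors = FALSE
  )
})

# Stack the per-teacher data frames into one
df2 <- do.call(rbind, teacher_rows)
print(df2)
```

Working on nodes this way avoids depending on how duplicate names such as department are handled during list conversion, at the cost of writing the XPath expressions by hand.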
Mako212