0

I currently have data with the following structure:

<table class="tableizer-table">
  <tr class="tableizer-firstrow">
    <th>Date</th>
    <th>First Term - 2014</th>
    <th>NA</th>
  </tr>
  <tr>
    <td>Classroom</td>
    <td>2-A</td>
    <td>NA</td>
  </tr>
  <tr>
    <td>Tutor</td>
    <td>Julian Anderson</td>
    <td>NA</td>
  </tr>
  <tr>
    <td>First Name</td>
    <td>Last Name</td>
    <td>Grade</td>
  </tr>
  <tr>
    <td>Alan</td>
    <td>Blacksmith</td>
    <td>54</td>
  </tr>
  <tr>
    <td>Andrew</td>
    <td>Cotton</td>
    <td>78</td>
  </tr>
  <tr>
    <td>Anne</td>
    <td>King</td>
    <td>85</td>
  </tr>
  <tr>
    <td>Helga</td>
    <td>Blackwood</td>
    <td>89</td>
  </tr>
  <tr>
    <td>Joshua</td>
    <td>Hagan</td>
    <td>53</td>
  </tr>
  <tr>
    <td>Location</td>
    <td>First Floor</td>
    <td>NA</td>
  </tr>
  <tr>
    <td>NA</td>
    <td>NA</td>
    <td>NA</td>
  </tr>
  <tr>
    <td>NA</td>
    <td>NA</td>
    <td>NA</td>
  </tr>
  <tr>
    <td>NA</td>
    <td>NA</td>
    <td>NA</td>
  </tr>
  <tr>
    <td>NA</td>
    <td>NA</td>
    <td>NA</td>
  </tr>
  <tr>
    <td>NA</td>
    <td>NA</td>
    <td>NA</td>
  </tr>
  <tr>
    <td>NA</td>
    <td>NA</td>
    <td>NA</td>
  </tr>
  <tr>
    <td>Classroom</td>
    <td>6-B</td>
    <td>NA</td>
  </tr>
  <tr>
    <td>Tutor</td>
    <td>Thomas Rodriguez</td>
    <td>NA</td>
  </tr>
  <tr>
    <td>First Name</td>
    <td>Last Name</td>
    <td>Grade</td>
  </tr>
  <tr>
    <td>Andrew</td>
    <td>Herrera</td>
    <td>77</td>
  </tr>
  <tr>
    <td>Brian</td>
    <td>Paredes</td>
    <td>72</td>
  </tr>
  <tr>
    <td>Mathew</td>
    <td>Hill</td>
    <td>82</td>
  </tr>
  <tr>
    <td>Melanie</td>
    <td>Streme</td>
    <td>87</td>
  </tr>
  <tr>
    <td>Michael</td>
    <td>Blacksmith</td>
    <td>91</td>
  </tr>
  <tr>
    <td>Steven</td>
    <td>Ji</td>
    <td>57</td>
  </tr>
  <tr>
    <td>Thomas</td>
    <td>Doberti</td>
    <td>96</td>
  </tr>
  <tr>
    <td>Location</td>
    <td>Second Floor</td>
    <td>NA</td>
  </tr>
</table>

The objective is to make the data look like this:

<table class="tableizer-table">
  <tr class="tableizer-firstrow">
    <th>First Name</th>
    <th>Last Name</th>
    <th>Grade</th>
    <th>Tutor</th>
    <th>Classroom</th>
    <th>Date</th>
    <th>Location</th>
  </tr>
  <tr>
    <td>Alan</td>
    <td>Blacksmith</td>
    <td>54</td>
    <td>Julian Anderson</td>
    <td>2-A</td>
    <td>First Term - 2014</td>
    <td>First Floor</td>
  </tr>
  <tr>
    <td>Andrew</td>
    <td>Cotton</td>
    <td>78</td>
    <td>Julian Anderson</td>
    <td>2-A</td>
    <td>First Term - 2014</td>
    <td>First Floor</td>
  </tr>
  <tr>
    <td>Anne</td>
    <td>King</td>
    <td>85</td>
    <td>Julian Anderson</td>
    <td>2-A</td>
    <td>First Term - 2014</td>
    <td>First Floor</td>
  </tr>
  <tr>
    <td>Helga</td>
    <td>Blackwood</td>
    <td>89</td>
    <td>Julian Anderson</td>
    <td>2-A</td>
    <td>First Term - 2014</td>
    <td>First Floor</td>
  </tr>
  <tr>
    <td>Joshua</td>
    <td>Hagan</td>
    <td>53</td>
    <td>Julian Anderson</td>
    <td>2-A</td>
    <td>First Term - 2014</td>
    <td>First Floor</td>
  </tr>
  <tr>
    <td>Andrew</td>
    <td>Herrera</td>
    <td>77</td>
    <td>Thomas Rodriguez</td>
    <td>6-B</td>
    <td>First Term - 2014</td>
    <td>Second Floor</td>
  </tr>
  <tr>
    <td>Brian</td>
    <td>Paredes</td>
    <td>72</td>
    <td>Thomas Rodriguez</td>
    <td>6-B</td>
    <td>First Term - 2014</td>
    <td>Second Floor</td>
  </tr>
  <tr>
    <td>Mathew</td>
    <td>Hill</td>
    <td>82</td>
    <td>Thomas Rodriguez</td>
    <td>6-B</td>
    <td>First Term - 2014</td>
    <td>Second Floor</td>
  </tr>
  <tr>
    <td>Melanie</td>
    <td>Streme</td>
    <td>87</td>
    <td>Thomas Rodriguez</td>
    <td>6-B</td>
    <td>First Term - 2014</td>
    <td>Second Floor</td>
  </tr>
  <tr>
    <td>Michael</td>
    <td>Blacksmith</td>
    <td>91</td>
    <td>Thomas Rodriguez</td>
    <td>6-B</td>
    <td>First Term - 2014</td>
    <td>Second Floor</td>
  </tr>
  <tr>
    <td>Steven</td>
    <td>Ji</td>
    <td>57</td>
    <td>Thomas Rodriguez</td>
    <td>6-B</td>
    <td>First Term - 2014</td>
    <td>Second Floor</td>
  </tr>
  <tr>
    <td>Thomas</td>
    <td>Doberti</td>
    <td>96</td>
    <td>Thomas Rodriguez</td>
    <td>6-B</td>
    <td>First Term - 2014</td>
    <td>Second Floor</td>
  </tr>
</table>

As you can see, the objective is to transfer the information located above and below the data of the students as new columns. I am confident that I can achieve this in Excel via the if formula; but I was wondering if the same could be achieved in R; I would be grateful for your help on the coding that I would need for this procedure.

I am confident that the procedure would require me to detect certain strings and extract the data located a certain number of cells from this location and paste them as a new column. Finally deleting the rows that have an NA in the third column.

dena76
  • 17
  • 4
  • Have a look at, e.g., `rvest` for the extraction of tabular data from a html table. – mrub Jan 15 '16 at 07:26
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454?s=1|1.2391#1732454 – IRTFM Jan 15 '16 at 08:48

1 Answers1

0

In fact there is no need to detect strings if one uses rvest as mrub suggested.

I copied your entire original HTML into a character vector I name data (to verify, the number of character inside it is 3998.

I let rvest do the work of converting the HTML table into a dataframe.

library(rvest)
tbl <- html_table(read_html(data))[[1]]

From there it's data manipulation. I identify "Classroom" and "Location" as the start and end boundary of your data chunks, and put each chunk as an element in a list (you can have longer data with this parameter and my approach below will still work).

st <- which(tbl[, 1] == "Classroom")
en <- which(tbl[, 1] == "Location")
list.form <- lapply(1:length(st), function(x) tbl[st[x]:en[x], ])

Then I create a function that can format each chunk into a table in the format similar to your desired final output.

shape.tab <- function(in.df) {
   # data to be put into new columns are those with NA in current 3rd column
   new.cols <- which(is.na(in.df[, 3]))

   # preserve rest of table already in right format
   df <- in.df[-new.cols, ]
   # set row as header name
   names(df) <- df[1, ]
   df <- df[-1, ]

   # create the new columns
   add.on <- matrix(rep(in.df[new.cols, 2], each=nrow(df)), nrow = nrow(df))
   colnames(add.on) <- in.df[new.cols, 1]

   # combine
   return(cbind(df, add.on))
}

Sample output of that function in R

> print(shape.tab(list.form[[1]]))
  First Name  Last Name Grade Classroom           Tutor    Location
4       Alan Blacksmith    54       2-A Julian Anderson First Floor
5     Andrew     Cotton    78       2-A Julian Anderson First Floor
6       Anne       King    85       2-A Julian Anderson First Floor
7      Helga  Blackwood    89       2-A Julian Anderson First Floor
8     Joshua      Hagan    53       2-A Julian Anderson First Floor

Now that it can work for one portion, we apply it on the whole data to form the full final table

df.res <- do.call(rbind, lapply(list.form, shape.tab))

And simply output the dataframe into HTML with help from xtable library

library(xtable)
html.out <- print(xtable(df.res), type = "html",
                  include.rownames=F,
                  html.table.attributes="border=0")

The HTML may not be in exact format you want, but the appearance in a browser will be the same (bar a column or two transposed). If the column sequence matters, you can manipulate the data frame before printing it into HTML.

<!-- html table generated in R 3.2.2 by xtable 1.8-0 package --><!-- Fri Jan 15 17:08:22 2016 --><table border=0><tr> <th> First Name </th> <th> Last Name </th> <th> Grade </th> <th> Classroom </th> <th> Tutor </th> <th> Location </th>  </tr>  <tr> <td> Alan </td> <td> Blacksmith </td> <td> 54 </td> <td> 2-A </td> <td> Julian Anderson </td> <td> First Floor </td> </tr>\n  <tr> <td> Andrew </td> <td> Cotton </td> <td> 78 </td> <td> 2-A </td> <td> Julian Anderson </td> <td> First Floor </td> </tr>\n  <tr> <td> Anne </td> <td> King </td> <td> 85 </td> <td> 2-A </td> <td> Julian Anderson </td> <td> First Floor </td> </tr>\n  <tr> <td> Helga </td> <td> Blackwood </td> <td> 89 </td> <td> 2-A </td> <td> Julian Anderson </td> <td> First Floor </td> </tr>\n  <tr> <td> Joshua </td> <td> Hagan </td> <td> 53 </td> <td> 2-A </td> <td> Julian Anderson </td> <td> First Floor </td> </tr>\n  <tr> <td> Andrew </td> <td> Herrera </td> <td> 77 </td> <td> 6-B </td> <td> Thomas Rodriguez </td> <td> Second Floor </td> </tr>\n  <tr> <td> Brian </td> <td> Paredes </td> <td> 72 </td> <td> 6-B </td> <td> Thomas Rodriguez </td> <td> Second Floor </td> </tr>\n  <tr> <td> Mathew </td> <td> Hill </td> <td> 82 </td> <td> 6-B </td> <td> Thomas Rodriguez </td> <td> Second Floor </td> </tr>\n  <tr> <td> Melanie </td> <td> Streme </td> <td> 87 </td> <td> 6-B </td> <td> Thomas Rodriguez </td> <td> Second Floor </td> </tr>\n  <tr> <td> Michael </td> <td> Blacksmith </td> <td> 91 </td> <td> 6-B </td> <td> Thomas Rodriguez </td> <td> Second Floor </td> </tr>\n  <tr> <td> Steven </td> <td> Ji </td> <td> 57 </td> <td> 6-B </td> <td> Thomas Rodriguez </td> <td> Second Floor </td> </tr>\n  <tr> <td> Thomas </td> <td> Doberti </td> <td> 96 </td> <td> 6-B </td> <td> Thomas Rodriguez </td> <td> Second Floor </td> </tr>   </table>
Ricky
  • 4,616
  • 6
  • 42
  • 72
  • Hello Ricky,Thank you for your answer ! I will give this a try right away;thanks to mrub as well! – dena76 Jan 15 '16 at 23:18
  • Hello everybody; I was wondering if this question is still receiving attention. I was wondering if this methodology would also work when a row appears only a single time for a classroom (For example , lets imagine that the second classroom of the table has an extra row called "Garden" in the first column and "Yes" in the second column. Yet this row doesn't repeat for other classrooms. Thank you very much for your help! – dena76 Jan 20 '16 at 20:58
  • No it won't; it assumes same structure for all classrooms. – Ricky Jan 21 '16 at 14:53
  • Thanks for the help Ricky! – dena76 Jan 22 '16 at 23:08