1

I'm working in a project with a large amount of tables that are stored in an HTML. In the process of scraping I'm having to deal with the following problem.

Some of the tables that I am scraping look like this

I had to input a fill = TRUE parameter in this code for those lines that are merged cells ("chicken" and "chicken without bones"), at the time that I'm importing the DFs:

read_html(link) %>%
    html_nodes(node) %>%
    html_table(fill = T, header = T, dec = ",")

but this generated for me tables like this:

   df <- data.frame(year = c("chicken",2000,2001,2002,"chicken without bones",2003,2004,2005, "chicken without bones and feet", 2006, 2007, 2008), 
                 weight = c("chicken",5,6,4,"chicken without bones",2,1,3,"chicken without bones and feet", 1, 1.5, 2)
                 )

Trying to find a way to make my tables look this way:

df2 <- data.frame(year = c(2000,2001,2002, 2003, 2004, 2005,2006,2007, 2008), number = c(5,6,4,2,1,3,1,1.5, 2), 
                 new_variable = c("chicken","chicken","chicken","chicken without bones","chicken without bones",
                                  "chicken without bones","chicken without bones and feet","chicken without bones and feet","chicken without bones and feet" )
                 )

I'm struggling with R and still have no idea how to do this with my 1.028.974 tables scraped. Obs.: The tables doesn't have a pattern of this occurring; because of that I need a code that identifies the fill nodes, gets their values as characters and turns it into a new column values until the next fill happens.

Thanks for the attention !!

G5W
  • 36,531
  • 10
  • 47
  • 80
  • 1
    Welcome to StackOverflow! Please read the info about [how to ask a good question](http://stackoverflow.com/help/how-to-ask) and how to give a [reproducible example](http://stackoverflow.com/questions/5963269). This will make it much easier for others to help you. – Sotos May 18 '18 at 12:31
  • Oh man thanks for the attention and sorry for my poor english typing. Trying to fix and make a reproducible example ! – Pedro Coutinho May 18 '18 at 13:08
  • This is probably overfitted to the example you provided, but try `cbind(df[c(FALSE, TRUE, TRUE, TRUE),], new_var = rep(as.character(df[c(TRUE, FALSE, FALSE, FALSE),]$year), each = 3))` – Sotos May 18 '18 at 13:55
  • Ohhh, thanks man !! It worked but as you said it is very specific for this example !! I need something for auto recognize the repetition and turn it into a new colunm; cuz i've got a lot of tables, and each one of them has its own format. How could i obtain this vector c(FALSE, TRUE, TRUE, TRUE) ? – Pedro Coutinho May 18 '18 at 14:20

1 Answers1

0

You can try this -

library(dplyr)
library(zoo)

df %>%
  mutate_if(is.factor, as.character) %>%
  mutate(new_variable = ifelse(grepl("\\D+", year), year, NA),
         new_variable = na.locf(new_variable)) %>%
  filter(!grepl("\\D+", year))

Output is:

  year weight                   new_variable
1 2000      5                        chicken
2 2001      6                        chicken
3 2002      4                        chicken
4 2003      2          chicken without bones
5 2004      1          chicken without bones
6 2005      3          chicken without bones
7 2006      1 chicken without bones and feet
8 2007    1.5 chicken without bones and feet
9 2008      2 chicken without bones and feet

Sample data:

df <- structure(list(year = structure(c(10L, 1L, 2L, 3L, 11L, 4L, 5L, 
6L, 12L, 7L, 8L, 9L), .Label = c("2000", "2001", "2002", "2003", 
"2004", "2005", "2006", "2007", "2008", "chicken", "chicken without bones", 
"chicken without bones and feet"), class = "factor"), weight = structure(c(8L, 
6L, 7L, 5L, 9L, 3L, 1L, 4L, 10L, 1L, 2L, 3L), .Label = c("1", 
"1.5", "2", "3", "4", "5", "6", "chicken", "chicken without bones", 
"chicken without bones and feet"), class = "factor")), class = "data.frame", row.names = c(NA, 
-12L))

#                             year                         weight
#1                         chicken                        chicken
#2                            2000                              5
#3                            2001                              6
#4                            2002                              4
#5           chicken without bones          chicken without bones
#6                            2003                              2
#7                            2004                              1
#8                            2005                              3
#9  chicken without bones and feet chicken without bones and feet
#10                           2006                              1
#11                           2007                            1.5
#12                           2008                              2
Prem
  • 11,775
  • 1
  • 19
  • 33