How do I clean this dataframe

Question

I have data that has variables stored in the following format:

             V2                             V3
1 Price :  33,990          Size : 16, 17 & 18.5"
2 Price :  30,830      Size : 13, 16, 18 & 19.5"
3 Price :  48,560             Sizes : 21 & 21.5"
4 Price :  33,790 Size : 17.5, 18.5, 19.5 & 21.5
5 Price :  37,990       Size : 17.5, 18.5 & 19.5
6 Price :  43,690      Size : 17.5, 18.5 & 19.5"

The variables I need are Price and Size among others. What is the cleanest way in R to convert this raw data into a format that looks like this:

            Price        Size
1          33,990       16, 17 & 18.5"
2          30,830       13, 16, 18 & 19.5"
3          48,560       21 & 21.5"
4          33,790       17.5, 18.5, 19.5 & 21.5
5          37,990       17.5, 18.5 & 19.5
6          43,690       17.5, 18.5 & 19.5"

Also, the variable name for the third row is misspelled as Sizes instead of Size. How can I handle this problem as there are other variables with the same error?

Edit: I cannot use a column specific strategy (for eg. using gsub()) as the variables in a given column are inconsistent. Concretely,

                                           V20
1                        Grips : Bontrager SSR
2                  Headset : 1-1/8" threadless
3                                             
4          Brakeset : Tektro alloy linear-pull
5            Brakeset : HL 280 mechanical disc
6 Brakeset : Tektro M290 hydraulic disc brakes

column V20 has 3 unique variables, Grips, Headset, Brakeset and a blank. The tidy dataframe should look something like:

           Grips        Headset              Brakeset
1   Bontrager SSR       NA                   NA
2              NA       1-1/8" threadless    NA
3              NA       NA                   NA
4              NA       NA                   Tektro alloy linear-pull
5              NA       NA                   HL 280 mechanical disc
6              NA       NA                   Tektro M290 hydraulic disc brakes

This is an oversimplification as I have assumed that Brakeset has no value for the first 3 rows. This may or may not be the case as the value may be stored in a different column. If a particular row has no value for a given variable, NA's are to be used. I hope the question is clear.

You can remove "Price : " and "Size : " (is Sizes a typo?) and then rename the columns. You may also want to replace "," with ".". `gsub` can be your workhorse here. — Roman Luštrik, Dec 02 '16 at 12:25
If price and size are inconsisted then `strsplit(x, split = ':')[[1]][2]` would work. — Andrey Kolyadin, Dec 02 '16 at 12:28
@RomanLuštrik Another problem that I forgot to mention is the a given column might contain multiple variable names. For instance the example shown has 6 rows with the price and the Size variable. However, it possible that another row has another variable X instead of Size in column V5. — Green Noob, Dec 02 '16 at 12:30
Possible duplicate of [Replace multiple arguments with gsub](http://stackoverflow.com/questions/15253954/replace-multiple-arguments-with-gsub) — Hack-R, Dec 02 '16 at 12:33
@GreenNoob even gsub() shall work in this scenario. ".*" shall take care of it. as in my answer — joel.wilson, Dec 02 '16 at 12:39
If you want to keep the variable information (the stuff before ":"), you should probably use `strsplit`. If you don't care about keeping it, the suggested `gsub` methods will more or less get the job done. — lmo, Dec 02 '16 at 12:47
Oversimplifying your example is just as bad as undersimplifying. To receive the right help, please provide a full example with multiple columns showing the variety of data that you will actually encounter. — Pierre L, Dec 02 '16 at 13:24

joel.wilson · Answer 1 · 2016-12-02T13:26:23.763

3

library(tidyr)
# convert = T automatically converts to integer/numeric
df$Price <- separate(df, Price, into = c("x","y"), sep = ":", convert = T)[,2]
df$Size  <- separate(df, Size, into = c("x","y"), sep = ":")[,2]

# with gsub()
# irrespective of what is appearing before ":", gsub() shall take care of it
df$Price <- trimws(gsub(".*\\:", "",df$Price)) # this should work

# I'm using the below data to explain. This is obtained after using separate() once.
df1
          x                                  y
1    Grips                       Bontrager SSR
2    Grips                       Bontrager SSR
3  Headset                    1-1/8 threadless
4 Brakeset   Tektro M290 hydraulic disc brakes

# need to add a unique key to the data
> df1[["id"]] <- 1:nrow(df1)
> df1
          x                                  y id
1    Grips                       Bontrager SSR  1
2    Grips                       Bontrager SSR  2
3  Headset                    1-1/8 threadless  3
4 Brakeset   Tektro M290 hydraulic disc brakes  4

# using spread() from tidyr package
> spread(df1, x, y)
  id                          Brakeset          Grips           Headset 
1  1                               <NA>  Bontrager SSR              <NA>
2  2                               <NA>  Bontrager SSR              <NA>
3  3                               <NA>           <NA>  1-1/8 threadless
4  4  Tektro M290 hydraulic disc brakes           <NA>              <NA>

edited Dec 02 '16 at 13:26

answered Dec 02 '16 at 12:32

joel.wilson

8,243
5
28
48

I have 97 unique variables. Is there a better way to do this? Also, how do I ensure that values are assigned to the correct variable? For example, say I have a row in a given column with the value `Size : 60` and the next row with the value `Headset : Threadless`. I need the Size variable in the first row to equal 60 and the Size variable in the second row to equal NA. – Green Noob Dec 02 '16 at 12:50
1

got it...i shall update my answer with my undersstanding – joel.wilson Dec 02 '16 at 13:23
Thanks. What is the use of the id field? Also, will spread work if the variable is stored in different columns? For instance, say that the 'Headset' variable is stored in column V19 of row 10. However, the 'Headset' variable for row 11 is stored in column V15. Will the respective values for Headset be captured? – Green Noob Dec 02 '16 at 13:37
too broad question.. without a sample of your data I can't comment. I have given the different options who have. You can definitely try them out and later let me know – joel.wilson Dec 02 '16 at 13:43
Did it help with your question? Mind accepting the answer? – joel.wilson Dec 04 '16 at 20:21

Jaap · Answer 2 · 2016-12-03T18:42:51.933

A possible alternative approach:

# create a list of the needed columnnames for the desired dataframe
nameslist <- lapply(mydf, function(x) unique(trimws(gsub('\\:.*', '', x[x != '']))))

# create a new dataframe with this list
mydf2 <- mydf[, rep(names(mydf), lengths(nameslist))]
names(mydf2) <- unlist(nameslist)

# create an array index of which values need to be included
idx <- mapply(function(x,y) grepl(x, y), x = names(mydf2), y = mydf2)

# replace the other values with 'NA'
mydf2[!idx] <- NA

# use gsub to remove everything before ':'
mydf2[] <- lapply(mydf2, function(x) trimws(gsub('.*\\:', '', x)))

which gives:

   Price                    Size         Grips          Headset                          Brakeset
1 33,990           16, 17 & 18.5 Bontrager SSR             <NA>                              <NA>
2 30,830       13, 16, 18 & 19.5          <NA> 1-1/8 threadless                              <NA>
3 48,560               21 & 21.5          <NA>             <NA>                              <NA>
4 33,790 17.5, 18.5, 19.5 & 21.5          <NA>             <NA>          Tektro alloy linear-pull
5 37,990       17.5, 18.5 & 19.5          <NA>             <NA>            HL 280 mechanical disc
6 43,690       17.5, 18.5 & 19.5          <NA>             <NA> Tektro M290 hydraulic disc brakes

Used data:

mydf <- structure(list(V2 = c("Price :  33,990", "Price :  30,830", "Price :  48,560", "Price :  33,790", "Price :  37,990", "Price :  43,690"), 
                       V3 = c("Size : 16, 17 & 18.5", "Size : 13, 16, 18 & 19.5", "Size : 21 & 21.5", "Size : 17.5, 18.5, 19.5 & 21.5", "Size : 17.5, 18.5 & 19.5", "Size : 17.5, 18.5 & 19.5"), 
                       V4 = c("Grips : Bontrager SSR", "Headset : 1-1/8 threadless", "", "Brakeset : Tektro alloy linear-pull", "Brakeset : HL 280 mechanical disc", "Brakeset : Tektro M290 hydraulic disc brakes")), 
                  .Names = c("V2", "V3", "V4"), class = "data.frame", row.names = c(NA, -6L))

Thanks a lot. I have assumed that the parameter lst is names(mydf2) in the mapply function. I hope my understanding is correct. How do I deal with duplicates in the namelist? Is there a way to combine all variables with identical names into one variable? (akin to the aggregate function for rows) — Green Noob, Dec 03 '16 at 06:03
`lst` is indeed the same as `names(mydf2)` (forgot to change that in the last iteration of my code); updated my answer — Jaap, Dec 03 '16 at 18:51

How do I clean this dataframe

2 Answers2