1

I have a file with large range of non-standardised mixed imperial and metric measurements, which I want to standardise and republish.

A sample of that range looks like this:

df  <- data.frame(Measurements =c("1.25m", "2 Feet", "3 Inches", "5.5 cm"))

|Measurements|
|1.25m       |
|2 Feet      |
|3 Inches    |
|5.5 cm      |

which I want to look like this:

|Measurements|MM_Conversion|
|1.25m       |1200mm
|2 Feet      |609.6mm
|3 Inches    |76.2mm
|5.5 cm      |55mm

I can't use measurements::conv_unit or units::set_unit because they both seem to require numeric input values. Is there a straightforward way of doing this which can parse both the value and the string, and convert accordingly?

EDIT 1: Having an issue whereby Conv_Unit can't convert NA values. If the initial vector instead was: df <- data.frame(Measurements =c(NA, 1.25m", "2 Feet", "3 Inches", "5.5 cm")), how would you get around it?

rsylatian
  • 429
  • 2
  • 14
  • One way I can think of is to create a custom function and then using `apply` to convert all the measurements – SmitM Aug 28 '18 at 17:30
  • Thought about that too, but it becomes unwieldy because there are 20 different measurements. – rsylatian Aug 28 '18 at 17:37

2 Answers2

4

it can be (easily) done, but you have to fix the units in your measurements first, since the accepted length-units from measurements::conv_unit

# accepted units
# $length
# [1] "angstrom" "nm"       "um"       "mm"       "cm"       "dm"       "m"        "km"       "inch"     "ft"       "yd"       "fathom"   "mi"       "naut_mi" 
# [15] "au"       "light_yr" "parsec"   "point" 

so, Inches have to become "inch", and "feet" should become "ft" (perform some regex-magic ;-) ).. but then...

library(tidyverse)
df  <- data.frame( Measurements =c( "1.25m", "2 ft", "3 inch", "5.5 cm" ) )

df %>% 
  #extract the numeric and the unit-parts from the string
  mutate( num_part = as.numeric( stringr::str_extract( Measurements, "\\d+\\.*\\d*" ) ), 
          unit_part = stringr::str_extract( Measurements, "[a-zA-Z]+" ) ) %>%
  #perform a rowwise operation
  rowwise() %>% 
  #convert the units to mm, row-by-row
  mutate( in_mm = conv_unit( num_part, unit_part, "mm" ) )

# Source: local data frame [4 x 4]
# Groups: <by row>
#   # A tibble: 4 x 4
#   Measurements num_part unit_part  in_mm
#   <fct>           <dbl> <chr>      <dbl>
# 1 1.25m            1.25 m         1250  
# 2 2 ft             2    ft         610. 
# 3 3 inch           3    inch        76.2
# 4 5.5 cm           5.5  cm          55  
Wimpel
  • 26,031
  • 1
  • 20
  • 37
  • +1 Nice use of `rowwise`, but I try to avoid it as much as possible, since it's sort of counter-intuitive in the `dplyr` sense. – acylam Aug 28 '18 at 18:05
  • @avid_useR true, but because `conv_unit` needs "fixed" units, I thought I'd cut corners on this one ;-) – Wimpel Aug 28 '18 at 18:06
  • Elegant answer,never used `rowWise` before. Just so I can understand what's going on, could you explain what exactly is being 'grouped here', and how the 'fixed' units affects the 'conv_unit' function? Commented it out, and finding the error message unhelpful. – rsylatian Aug 28 '18 at 18:53
  • 1
    @rsylatian `rowwise` performs the operation on row-by-row... this is necessary, because `conv_unit(x, from, to)` needs the `from` to be a string, not a vector. In the answer from @avid_useR, this issue is resolved using `map`. In this case, every row (because I set `rowwise()` ), `conv_unit` uses the parameters `x = num_part`, and `from = unit_part` from that specific row. – Wimpel Aug 29 '18 at 04:48
3

We can use extract from tidyr to separate the value and unit and feed that into conv_unit using map2:

df <- data.frame(Measurements =c(NA, "1.25m", "2 Feet", "3 Inches", "5.5 cm"))

library(tidyverse)
library(stringr)
library(measurements)

df %>%
  extract(Measurements, c("value", "unit"), 
          regex = "^([\\d.]+)\\s*([[:alpha:]]+)$", 
          remove = FALSE, convert = TRUE) %>%
  mutate(unit = str_replace_all(unit, c(Feet="ft", Inches="inch")),
         MM_Conversion = paste0(map2(value, unit, ~if(!is.na(.x)) conv_unit(.x, .y, "mm") else NA), "mm"))

Result:

  Measurements value unit MM_Conversion
1         <NA>    NA <NA>          NAmm
2        1.25m  1.25    m        1250mm
3       2 Feet  2.00   ft       609.6mm
4     3 Inches  3.00 inch        76.2mm
5       5.5 cm  5.50   cm          55mm

or use filter if NAs should not appear in the final output:

df %>%
  extract(Measurements, c("value", "unit"), 
          regex = "^([\\d.]+)\\s*([[:alpha:]]+)$", 
          remove = FALSE, convert = TRUE) %>%
  filter(!is.na(Measurements)) %>%
  mutate(unit = str_replace_all(unit, c(Feet="ft", Inches="inch")),
         MM_Conversion = paste0(map2(value, unit, ~conv_unit(.x, .y, "mm")), "mm"))

Result:

  Measurements value unit MM_Conversion
1        1.25m  1.25    m        1250mm
2       2 Feet  2.00   ft       609.6mm
3     3 Inches  3.00 inch        76.2mm
4       5.5 cm  5.50   cm          55mm

Notice how I manually abbreviated the original units to make conv_unit work. It would be one step less if the original units were already in abbreviated form.

acylam
  • 18,231
  • 5
  • 36
  • 45
  • Elegant answer too, but what is the map2 function doing here? – rsylatian Aug 28 '18 at 18:55
  • 1
    the `conv_unit` function is not vectorized, so we need to map each element of `value` and the corresponding `unit` to `conv_unit`. – acylam Aug 28 '18 at 19:00
  • Apologies for pushing my luck a bit, but say for example instead it was `Measurements = c(NA, "1.25m", "2 Feet", "3 Inches", "5.5 cm")` , the answer breaks. Found a non-piped [solution](https://stackoverflow.com/a/45288319/5195054), but can't get it to work with your answer. Is there any way of incorporating `not_na <- !is.na(df$Measurements)` into it? – rsylatian Aug 29 '18 at 13:29
  • @rsylatian Thanks for pointing that out. Do you want any `NA` rows to stay `NA`, or just simply be removed? – acylam Aug 29 '18 at 13:38
  • 1
    @rsylatian See my update. It should now work with `NA` values. If instead you want to remove them from the output, there is another fix. – acylam Aug 29 '18 at 13:43