0

My goal is to change the format of values within a df.

The df has around 1500 columns. There are multiple formats that are unusable, i.e. feet & inches, $ signs, + & - prefixes, commas, etc.

EX:

 x         y        z        q
+25.00    75'6"    1,000    $1,500,000
-10.00    67'10"   5,000    $300,000
+09.11    72'3"    1,500    $500,000
+01.37    65'9"    27,000   $3,000,000


I have failed spectacularly when attempting to write code for this. Do I need a format function & a do loop to run over the df? I'd like to return all of these values as type numeric with no extra characters.

I appreciate any help. Thank you.

  • 1
    So you want to remove all the symbols? How will you know what each means? – iod Nov 26 '19 at 15:59
  • It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. Show what you attempted to do; it's easier for us to start with something. – MrFlick Nov 26 '19 at 16:00
  • iod - Hi sorry I should have been more clear. The variables where this matters is the + & -. If the "-" is the prefix, the numeric value should be negative. – Vinnie Palazeti Nov 26 '19 at 16:20

1 Answers1

0

You can gsub everything that isn't a number (using [:digit:], and adding period and minus to keep that) and replace it with nothing. Pass that through a as.numeric and put it all into a sapply to get it back as a matrix:

sapply(data, function(x) as.numeric(gsub("[^[:digit:].-]","",x)))
          x    y     z       q
[1,]  25.00  756  1000 1500000
[2,] -10.00 6710  5000  300000
[3,]   9.11  723  1500  500000
[4,]   1.37  659 27000 3000000

(if you just do the gsub, without sapply, you get back each row as a single string of numbers. There may be a better way to avoid that, but I'm not sure what it is.)

Following suggestions from Gregor, here's a variant of this solution where I replace foot-inch formats with a decimal point for better readbility:

sapply(data, function(x) {x<-gsub("'(\\d*)''",".\\1",x)
     as.numeric(gsub("[^[:digit:].-]","",x))})
          x    y     z       q
[1,]  25.00 75.6  1000 1500000
[2,] -10.00 67.1  5000  300000
[3,]   9.11 72.3  1500  500000
[4,]   1.37 65.9 27000 3000000

(note that in my data, the inch symbol was replaced with '' (two apostrophes) -- you'll need to replace it with whatever your data has there.)

One last option, where I change the feet and inches into cm, to make it decimal:

sapply(data, function(x) {
           if(any(grepl("'",x))) {inches<-strsplit(x,split="\\'")
             x<-unlist(lapply(inch, function(y) as.numeric(y[1])*30.48+as.numeric(y[2])*2.54))
             x}
         as.numeric(gsub("[^[:digit:].-]","",x))
        }
       )

          x       y     z       q
[1,]  25.00 2301.24  1000 1500000
[2,] -10.00 2067.56  5000  300000
[3,]   9.11 2202.18  1500  500000
[4,]   1.37 2004.06 27000 3000000
iod
  • 7,412
  • 2
  • 17
  • 36
  • 2
    Do note that this may have unintended information loss applied indiscrimately. Columns `z` and `q` looks good. Column `y` is pretty useless. Column `x` is the absolute value of its original meaning. Custom solutions would be needed to retain all the information. – Gregor Thomas Nov 26 '19 at 16:05
  • 1
    My comment is for OP's consideration as well as for anyone else reading this question and seeing this answer. It's possible OP hasn't thought through what will be useful in the next steps of their analysis. – Gregor Thomas Nov 26 '19 at 16:09
  • Gregor - I pointed out the same thing in my comment to the OP under his question. Also, I revised it a bit to get at least the periods and minuses kept – iod Nov 26 '19 at 16:11
  • Thank you for the replies & the help. Is there a method to transform this data into meaningful information? If not, is there a method to drop all columns with variables containing ' or " ? – Vinnie Palazeti Nov 26 '19 at 16:16
  • I added a variant of my solution that transforms foot-inch format into a decimal format (i.e., 6'2" becomes 6.2). Does that solve your problem? – iod Nov 26 '19 at 16:17
  • An alternative will be to convert feet/inches to centimeters. Would that be helpful? – iod Nov 26 '19 at 16:20
  • iod - I think 6.2 would work fine! Thank you I really appreciate it – Vinnie Palazeti Nov 26 '19 at 16:23
  • Also, keep in mind that this will mess with any sorting you might do -- 6.2 is bigger than 6.12, but 6'2" is smaller than 6'12" – iod Nov 26 '19 at 16:31
  • One last point: note that 6'10" becomes 6.1, just like 6'1". I'm not sure how you can get around this... – iod Nov 26 '19 at 16:42
  • @iod - maybe it would be best to transform the inches into centimeters. Can this be done within the same function? – Vinnie Palazeti Dec 01 '19 at 03:16
  • @VinniePalazeti sorry about the delay - didn't see your comment. Added a solution including changing inches into cm. – iod Dec 04 '19 at 03:07
  • Really appreciate all of the help. Hope all is well. Thank you – Vinnie Palazeti Dec 04 '19 at 04:48