I have a CSV data file called test_20171122.

image of dataset

Often, datasets that I work with were originally in Accounting or Currency format in Excel and later converted to a CSV file.

I am looking into the optimal way to clean data from an accounting format "$##,###" to a number "####" in R using gsub().

My trouble is iterating gsub() across all columns of a dataset. My first instinct was to run gsub() on the whole dataframe (below), but it seems to alter the data in a counterproductive way.

gsub("\\$", "", test_20171122)
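The call above is counterproductive because gsub() expects a character vector, so it coerces the whole data frame with as.character() and each column collapses into a single deparsed string. A minimal sketch of that behavior, assuming a small made-up data frame called toy (a hypothetical stand-in, not the asker's data):

```r
# Hypothetical stand-in for the real data; the values are invented.
toy <- data.frame(
  a = c("$1,234", "$56"),
  b = c("$7,890", "$12"),
  stringsAsFactors = FALSE
)

# gsub() coerces its input with as.character(), so the result is a plain
# character vector with one element per column, not a data frame.
res <- gsub("\\$", "", toy)

is.data.frame(res)  # FALSE
length(res)         # 2, one string per column
```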

The following code is a for loop that seems to get the job done.

for (i in seq_along(test_20171122)) {
  clean1 <- gsub("\\$", "", test_20171122[[i]])  # strip dollar signs from column i
  clean2 <- gsub(",", "", clean1)                # strip thousands separators
  test_20171122[, i] <- clean2                   # write the cleaned column back
}

I am trying to figure out the optimal way of cleaning a dataframe using gsub(). I feel like sapply() would work but it seems to break the structure of the dataframe when I run the following code:

test_20171122 <- sapply(test_20171122,function(x) gsub("\\$","",x))
test_20171122 <- sapply(test_20171122,function(x) gsub("\\,","",x))
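The structure breaks because sapply() simplifies its result: when every column yields a vector of the same length, the output is a character matrix rather than a data frame, and a second sapply() over that matrix then visits each cell individually. A rough sketch, assuming a made-up data frame toy and a hypothetical helper strip (neither is from the original post):

```r
# Hypothetical stand-in for the real data; the values are invented.
toy <- data.frame(
  a = c("$1,234", "$56"),
  b = c("$7,890", "$12"),
  stringsAsFactors = FALSE
)

# Illustrative helper: remove dollar signs and commas in one pass.
strip <- function(x) gsub("[$,]", "", x)

m <- sapply(toy, strip)  # simplified to a character matrix
l <- lapply(toy, strip)  # stays a list with one element per column

is.matrix(m)      # TRUE
is.data.frame(m)  # FALSE
is.list(l)        # TRUE

# A second sapply() walks the matrix cell by cell and flattens it.
v <- sapply(m, strip)
length(v)         # 4
is.null(dim(v))   # TRUE
```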
A5C1D2H2I1M1N2O1R2T1
Brandon
  • `dat[] <- lapply(dat, function(x) as.numeric(gsub("[$,]", "", x)))` I think. `lapply` will be better because it won't try to force everything into a matrix output. – thelatemail Nov 22 '17 at 22:56
  • Just remember that `sapply` `s`implifies, `l`apply retains a `l`ist structure. And since `data.frame`s are just fancy `list`s, you are usually better off with `lapply` when working with a `data.frame`. – thelatemail Nov 22 '17 at 23:06
  • Probably a duplicate of this question https://stackoverflow.com/a/32625825/496803 , though the suggestion to use `apply(dat, 2, ...)` in the accepted answer is not what I would do. – thelatemail Nov 22 '17 at 23:15
  • Forgive me for making a duplicate post :(. Just ran the following code: `test_20171122 <- data.frame(lapply(test_20171122, function(x) as.numeric(gsub("[$,]", "", x))))`. It worked like a charm! Thanks @thelatemail – Brandon Nov 22 '17 at 23:22
  • You don't necessarily need the `data.frame` in there. Just `test[] <- lapply(test, FUN)` will do it. Note the `[]`, which means you're overwriting the contents of `test` while keeping its structure, without any conversion. – thelatemail Nov 22 '17 at 23:24
  • @Brandon, for future reference sharing data as an image is discouraged; try to use `dput()` or some other strategy to share your data in a way that can be easily used by those trying to help you. Read [here](https://stackoverflow.com/help/mcve) and [here](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example-aka-mcve-minimal-complete-and-ver) to learn more about minimal reproducible examples. – duckmayr Nov 22 '17 at 23:25
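As a rough illustration of the dput() suggestion in the last comment: dput() prints R code that rebuilds an object, so pasting its output into a question lets others recreate the data exactly. The data frame toy below is invented for the example, and the round trip through capture.output() is only there to show that the printed text is enough to rebuild it.

```r
# Hypothetical sample; in practice this would be a small slice of the real data,
# e.g. dput(head(test_20171122)).
toy <- data.frame(
  price = c("$1,234", "$56"),
  stringsAsFactors = FALSE
)

# Prints R code that recreates toy; this printed text is what gets pasted
# into a question.
dput(toy)

# Rebuild the object from the printed text to show nothing is lost.
txt <- capture.output(dput(toy))
rebuilt <- eval(parse(text = txt))
all.equal(rebuilt, toy)
```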

2 Answers

You can use the following pattern in gsub: "[$,]"

Example:

df <- data.frame(
  V1 = c("$1,234.56", " $ 23,456.70"),
  V2 = c("$89,101,124", "15,234")
)
df
#             V1          V2
# 1    $1,234.56 $89,101,124
# 2  $ 23,456.70      15,234

df[] <- lapply(df, function(x) as.numeric(gsub("[$,]", "", x)))
df
#         V1       V2
# 1  1234.56 89101124
# 2 23456.70    15234
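The empty brackets in `df[] <-` matter: lapply() returns a plain list, and assigning into `df[]` replaces the columns while keeping the data frame itself. The function(x) wrapper is needed because lapply() calls a function once per column, and here that function chains two steps (gsub(), then as.numeric()). A small sketch of the difference, assuming a made-up data frame toy and a hypothetical helper to_number:

```r
# Hypothetical stand-in for the example data; the values are invented.
toy <- data.frame(
  V1 = c("$1,234.56", "$2"),
  V2 = c("15,234", "$3"),
  stringsAsFactors = FALSE
)

# Illustrative helper: strip "$" and "," and convert to numeric.
to_number <- function(x) as.numeric(gsub("[$,]", "", x))

# Without [], the result is whatever lapply() returns: a plain list.
as_list <- lapply(toy, to_number)

# With [], only the columns are replaced; the object stays a data frame.
kept <- toy
kept[] <- lapply(kept, to_number)

is.data.frame(as_list)  # FALSE
is.data.frame(kept)     # TRUE
```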
A5C1D2H2I1M1N2O1R2T1
  • Two questions: 1) Why the "function(x)"? 2) Is there a reason to write "df[]" instead of just "df"? – Emil Krabbe Mar 23 '21 at 12:49

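A solution using the purrr function map_df (note that gsub() returns character vectors, so the columns stay character unless the call is wrapped in as.numeric()):

library(purrr)
clean_df <- map_df(test_20171122, ~ gsub("[$,]", "", .x))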
jayb