Using gsub() on a dataframe

Question

I have a CSV data file called test_20171122.

image of dataset

Often, datasets that I work with were originally in Accounting or Currency format in Excel and later converted to a CSV file.

I am looking into the optimal way to clean data from an accounting format "$##,###" to a number "####" in R using gsub().

My trouble is iterating gsub() across all columns of a dataset. My first instinct was to run gsub() on the whole dataframe (below), but that seems to alter the data in a counterproductive way.

gsub("\\$", "", test_20171122)
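A minimal sketch of why this call misbehaves, assuming a small made-up data frame (`df` is hypothetical, not the data in the question). gsub() coerces a non-character argument with as.character(), and for a data frame that gives one string per column instead of one value per cell:

```r
# Hypothetical stand-in for the real data
df <- data.frame(a = c("$1,000", "$2"), b = c("$3", "$4"),
                 stringsAsFactors = FALSE)

out <- gsub("\\$", "", df)

length(out)         # one element per column, not per cell
is.character(out)   # a plain character vector
is.data.frame(out)  # the data frame structure is gone
print(out)          # each element is the text of a whole column
```

So the result is no longer a table at all, which is why a column-by-column approach is needed.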

The following code is a for loop that seems to get the job done.

for (i in seq_along(test_20171122)) {
  # strip dollar signs, then commas, from column i
  clean1 <- gsub("\\$", "", test_20171122[[i]])
  clean2 <- gsub(",", "", clean1)
  test_20171122[[i]] <- clean2
}

I am trying to figure out the optimal way of cleaning a dataframe using gsub(). I feel like sapply() would work but it seems to break the structure of the dataframe when I run the following code:

test_20171122 <- sapply(test_20171122,function(x) gsub("\\$","",x))
test_20171122 <- sapply(test_20171122,function(x) gsub("\\,","",x))

`dat[] <- lapply(dat, function(x) as.numeric(gsub("[$,]", "", x)) )` I think. `lapply` will be better because it won't try to force everything into a matrix output. — thelatemail, Nov 22 '17 at 22:56
Just remember that `sapply` `s`implifies, `l`apply retains a `l`ist structure. And since `data.frame`s are just fancy `list`s, you are usually better off with `lapply` when working with a `data.frame`. — thelatemail, Nov 22 '17 at 23:06
Probably a duplicate of this question https://stackoverflow.com/a/32625825/496803 , though the suggestion to use `apply(dat, 2, ...)` in the accepted answer is not what I would do. — thelatemail, Nov 22 '17 at 23:15
Forgive me for making a duplicate post :(. Just ran the following code: `test_20171122 <- data.frame(lapply(test_20171122, function(x) as.numeric(gsub("[$,]", "", x))))`. It worked like a charm! Thanks @thelatemail — Brandon, Nov 22 '17 at 23:22
You don't necessarily need the `data.frame` in there. Just `test[] <- lapply(test, FUN)` will do it. Note the `[]` which will just mean you're overwriting the contents of the structure of `test`, without any conversion. — thelatemail, Nov 22 '17 at 23:24
@Brandon, for future reference sharing data as an image is discouraged; try to use `dput()` or some other strategy to share your data in a way that can be easily used by those trying to help you. Read [here](https://stackoverflow.com/help/mcve) and [here](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example-aka-mcve-minimal-complete-and-ver) to learn more about minimal reproducible examples. — duckmayr, Nov 22 '17 at 23:25
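The difference between sapply(), lapply(), and the `[] <-` assignment discussed in the comments can be seen in a small sketch. The data frame `df` and the helper `clean` are made up for illustration:

```r
# Hypothetical data, standing in for the asker's file
df <- data.frame(a = c("$1,000", "$2"), b = c("$3", "$4"),
                 stringsAsFactors = FALSE)
clean <- function(x) as.numeric(gsub("[$,]", "", x))

m <- sapply(df, clean)   # simplified to a numeric matrix
l <- lapply(df, clean)   # plain named list, one element per column

df2 <- df
df2[] <- lapply(df2, clean)  # fills the existing data frame in place

is.matrix(m)        # TRUE
is.data.frame(l)    # FALSE
is.data.frame(df2)  # TRUE
```

This would also explain the trouble with two chained sapply() calls: the first returns a matrix, and the second then walks that matrix element by element rather than column by column.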

score 8 · Answer 1 · answered Feb 20 '18 at 17:14

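You can strip both characters in one pass by using the pattern "[$,]" in gsub(). Inside a character class, $ is a literal dollar sign, so it needs no escaping.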

Example:

df <- data.frame(
  V1 = c("$1,234.56", " $ 23,456.70"),
  V2 = c("$89,101,124", "15,234")
)
df
#             V1          V2
# 1    $1,234.56 $89,101,124
# 2  $ 23,456.70      15,234

df[] <- lapply(df, function(x) as.numeric(gsub("[$,]", "", x)))
df
#         V1       V2
# 1  1234.56 89101124
# 2 23456.70    15234
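One thing worth checking after a conversion like this is whether any values failed to parse, since as.numeric() turns them into NA with a warning. A rough sketch, assuming a made-up vector `raw` that contains one entry the pattern cannot handle:

```r
# Hypothetical column with one value that is not a plain amount
raw <- c("$1,234.56", "15,234", "n/a")

# Convert, silencing the coercion warning so we can inspect NAs ourselves
num <- suppressWarnings(as.numeric(gsub("[$,]", "", raw)))

# Values that were present in the input but did not become numbers
bad <- raw[is.na(num) & !is.na(raw)]
print(bad)
```

Anything listed in `bad` needs its own cleaning rule before the column can be trusted as numeric.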

A5C1D2H2I1M1N2O1R2T1

Two questions: 1) Why the "function(x)" ?? also 2) Is there a reason to write " df[] " instead of just "df" ? – Emil Krabbe Mar 23 '21 at 12:49
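On the first question in the comment above, a sketch may help. lapply() passes each column as the first unnamed argument of the function it is given, but gsub() takes the text as its third argument and the result still needs as.numeric(), so an anonymous function is a convenient wrapper. The same result can be reached without one by naming the other gsub() arguments (the objects `a` and `b` below are only illustrative names):

```r
df <- data.frame(
  V1 = c("$1,234.56", " $ 23,456.70"),
  V2 = c("$89,101,124", "15,234"),
  stringsAsFactors = FALSE
)

# Version from the answer: one anonymous function does both steps
a <- df
a[] <- lapply(a, function(x) as.numeric(gsub("[$,]", "", x)))

# Without function(x): each column fills gsub()'s remaining argument, x
b <- df
b[] <- lapply(b, gsub, pattern = "[$,]", replacement = "")
b[] <- lapply(b, as.numeric)

all.equal(a, b)
```

As for the second question, `df[] <-` keeps the object a data frame and replaces only its columns, whereas `df <- lapply(...)` would replace it with a plain list.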

score 0 · Answer 2 · answered Jan 13 '21 at 12:26

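A solution using the purrr function map_df. Note that gsub() returns character vectors, so the columns stay as text unless the call is wrapped in as.numeric():

library(purrr)
clean_df <- map_df(test_20171122, ~ gsub("[$,]", "", .x))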

jayb
