0

I'm just learning R (and dplyr), and have what is likely a simple problem, though I have spent hours researching with no answer. I have two columns, A (character) and B (integer), in a data frame.

The data frame contains additional rows due to poor spelling in the original data (in column A) and so there are many rows that must be collapsed into a single row (whilst also collapsing their integer values (via sum()) in column B).

For example, there may be three rows that should instead be one:

CAR.............MPG

Mazda.........5

Mazzda...... 2

Mzda.......... 1

Should be

CAR...........MPG

Mazda........ 8

I'm using dplyr, and trying to find the manner in which I can collapse rows with similar characters (e.g. three letters), and pipe back into the original table.

Any thoughts or directions, preferably using base R or dplyr would be appreciated.

Ben Bolker
  • 211,554
  • 25
  • 370
  • 453
  • 1
    This will help you get an appropriate answer fast. https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Hector Haffenden Oct 20 '18 at 16:39

1 Answers1

1

Using the input DF shown reproducibly in the Note at the end, use soundex or one of the other functions in the phonics package to derive a key for each CAR and then summarize by key:

library(dplyr)
library(phonics)

DF %>% 
  group_by(key = soundex(CAR)) %>%
  summarize(CAR = toString(CAR), MPG = sum(MPG)) %>%
  ungroup %>%
  select(-key)

giving:

# A tibble: 1 x 2
  CAR                   MPG
  <chr>               <int>
1 Mazda, Mazzda, Mzda     8

Note

Lines <- "CAR MPG
Mazda 5
Mazzda 2
Mzda 1"
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE, strip.white = TRUE)
Mr. T
  • 11,960
  • 10
  • 32
  • 54
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341