Question about removing certain pieces of information in cells in R df?

Question

I have a df in R containing data on the voting behavior of political parties in the Russian Duma. See the attached photo.

Each column currently contains the percentage and number of votes. For example, in the first row of the first column, UR_yes, we see 95.8% and 228 гол. (that is, 228 votes in English). In each column I want only the latter figure, without гол., so that each cell contains just one number. Using the first column as an example, this would be 228 in the first cell, 234 in the second, 235 in the third, and so on. I am dealing with a lot of entries (~15,000 across 15 separate df), so editing this by hand in Excel would be difficult. Is there a way to automate this process in R? Any advice would be appreciated.

It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. Please do not share data as images; otherwise we need to retype everything to test any possible suggestion. — MrFlick, Dec 06 '21 at 04:17
Can you share as code in the text of your question the output of `dput(head(YOUR_TABLE))`? That will generate code that creates a perfect replica of the first 6 rows of your data, plenty for people to test potential solutions on without making them re-type the data and make guesses about your data format. — Jon Spring, Dec 06 '21 at 06:21

score 3 · Answer 1 · answered Dec 06 '21 at 04:30

You could use `separate()` from tidyr:

library(tibble)
library(tidyr)

dat <- tibble::tribble(
    ~ UR_yes,
    "95.8%, 228 гол.",
    "98.3%, 234 гол.",
    "98.7%, 235 гол."
)


dat %>% 
    tidyr::separate(UR_yes, into = c("perc_votes", "num_votes"), sep = "[%, ]+", extra = "drop")
# A tibble: 3 × 2
    perc_votes num_votes
    <chr>      <chr>    
  1 95.8       228      
  2 98.3       234      
  3 98.7       235
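
Since the question asks for the vote count alone, as a number, here is a minimal sketch of one way to get there. It assumes every cell follows the same "percent, count гол." layout as the sample rows; `res` is just an illustrative name. `extra = "drop"` discards the trailing "гол." piece, and `convert = TRUE` asks `separate()` to convert the new columns from character to a suitable type.

```r
library(tidyr)

# Sample data in the same layout as the question (assumed to hold for every cell)
dat <- data.frame(
  UR_yes = c("95.8%, 228 гол.", "98.3%, 234 гол.", "98.7%, 235 гол."),
  stringsAsFactors = FALSE
)

# Split on runs of "%", "," and spaces; drop the trailing "гол." piece;
# convert = TRUE turns the new character columns into numeric types
res <- separate(
  dat, UR_yes,
  into = c("perc_votes", "num_votes"),
  sep = "[%, ]+",
  extra = "drop",
  convert = TRUE
)

# Keep only the vote counts
res$num_votes
```

If the percentages are not needed, `res$perc_votes` can simply be ignored or dropped afterwards.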

score 0 · Answer 2 · answered Dec 06 '21 at 06:25

Applying similar syntax to @Josh Gray's answer, you might sandwich it between pivot_longer / pivot_wider to deal with the 15 columns. Here's a sample with two:

dat <- tibble::tribble(
  ~ UR_yes, ~UR_no, 
  "95.8%, 228 гол.",   "5.8%, 28 гол.",   # made-up numbers that don't add up, just to save time
  "98.3%, 234 гол.",   "8.7%, 35 гол.",
  "98.3%, 234 гол.",   "8.7%, 35 гол." 
)

library(tidyverse)
dat %>%
  mutate(row = row_number()) %>%
  pivot_longer(-row) %>%
  separate(value, c("perc", "num"),  sep="[%, ]+", extra = "drop") %>%
  select(row, name, num) %>%
  pivot_wider(names_from = name, values_from = num)

Result:

# A tibble: 3 x 3
    row UR_yes UR_no
  <int> <chr>  <chr>
1     1 228    28   
2     2 234    35   
3     3 234    35

(There's probably a very concise way to do this in base R using regular expressions across the whole table.)
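
A base R version along those lines might look like the sketch below. It assumes every column holds strings of the form "percent, count гол."; `extract_votes` is a hypothetical helper name, not something from the answers above. The regular expression captures the run of digits that follows the last comma, and `lapply()` applies it to every column.

```r
# Sample data in the same layout as the question (assumed to hold for every cell)
dat <- data.frame(
  UR_yes = c("95.8%, 228 гол.", "98.3%, 234 гол."),
  UR_no  = c("5.8%, 28 гол.", "8.7%, 35 гол."),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep only the digits after the comma, as integers
extract_votes <- function(df) {
  df[] <- lapply(df, function(x) {
    as.integer(sub("^.*,\\s*(\\d+).*$", "\\1", x))
  })
  df
}

dat <- extract_votes(dat)
dat
```

With the 15 data frames collected in a list, something like `lapply(list_of_dfs, extract_votes)` would process them all in one go. Columns that do not follow the pattern (an ID or date column, say) would come out as `NA`, so they would need to be excluded first.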
