2

Scenario:

If I have this table, let's call it df:

survey_answer_1___1 survey_answer_1___2 survey_answer_1___3 survey_answer_2___1 survey_answer_2___2
1 1 0 1 0
0 1 0 0 0
0 0 0 1 0
1 1 1 0 0

Using R or Python, how do I split and transform df into survey_answer_1 and survey_answer_2 like this:

survey_answer_1:

1 2 3
2 3 1

survey_answer_2:

1 2
2 0

Where the column names of the new tables are extracted from df column names after '___'. The values in the new cells is the count of 1s in each column in df. This should be done automatically (tables should not be "hard-coded"), as there are many other columns in my data file that this should be applied on as well.

split() can be used to extract the numbers after '___' for column names. I tried implementing the rest using a dictionary, but it is not working.

user123
  • 307
  • 7
  • `x <- colSums(df); split(setNames(x, gsub('(\\d+)$|.', '\\1', names(x))), gsub('_{2,}\\d+', '', names(x)))` – rawr Dec 27 '22 at 22:54

5 Answers5

3

Here an R example where the new columns can be arbitrary values

df <- as.data.frame(matrix(c(1,0,0,1,1,1,0,1,0,0,0,1,1,0,1,0,0,0,0,0), 4, 5, dim=list(
  1:4, paste0("survey_answer_", c(1,1,1,2,2), "__", c(1,2,3,1,5)) )))

df 
#>   survey_answer_1__1 survey_answer_1__2 survey_answer_1__3 survey_answer_2__1
#> 1                  1                  1                  0                  1
#> 2                  0                  1                  0                  0
#> 3                  0                  0                  0                  1
#> 4                  1                  1                  1                  0
#>   survey_answer_2__5
#> 1                  0
#> 2                  0
#> 3                  0
#> 4                  0

var <- Map(c, names(df), strsplit(names(df), "__"))

result <- tapply(var, sapply(var,"[", 2), \(x) 
       setNames(colSums(df[sapply(x,"[",1)]) , sapply(x,"[",3)))

#to assign the resuilt list to new  datafrae variables:
list2env(result, environment())


survey_answer_1
#> 1 2 3 
#> 2 3 1
survey_answer_2
#> 1 5 
#> 2 0
Ric
  • 5,362
  • 1
  • 10
  • 23
  • Is it possible to add a column to each table that subtracts row sum from 500 (500 - sum of values in row)? – user123 Dec 28 '22 at 17:04
  • 1
    The results are vectors, so to convert to data frame and add 500-sum column you can do: `result <- tapply(var, sapply(var,"[", 2), \(x) setNames(colSums(df[sapply(x,"[",1)]) , sapply(x,"[",3))) |> lapply(\(x) data.frame(as.list(c(x, "500 - sum" = sum(x))), check.names = F )) ` – Ric Dec 28 '22 at 17:52
2

Using R / tidyverse, first dplyr::summarize() all columns to sums; then tidyr::pivot_longer(); then split() by survey_answer; then purrr::map() over the resulting list to drop all-NA columns :

library(dplyr)
library(tidyr)
library(purrr)

survey_dfs <- df %>% 
  summarize(across(everything(), sum)) %>% 
  pivot_longer(
    everything(), 
    names_to = c("survey_answer", ".value"), 
    names_sep = "___"
  ) %>% 
  split(.$survey_answer, drop = TRUE) %>% 
  map(\(d) select(d, where(\(col) !all(is.na(col))) & !survey_answer))

survey_dfs 
$survey_answer_1
# A tibble: 1 × 3
    `1`   `2`   `3`
  <dbl> <dbl> <dbl>
1     2     3     1

$survey_answer_2
# A tibble: 1 × 2
    `1`   `2`
  <dbl> <dbl>
1     2     0

This gives you a named list of dataframes, which is best practice in most cases. If you really want the resulting dataframes loose in the global environment, you can replace the map() call with an assign() call within purrr::iwalk():

df %>% 
  summarize(across(everything(), sum)) %>% 
  pivot_longer(
    everything(), 
    names_to = c("survey_answer", ".value"), 
    names_sep = "___"
  ) %>% 
  split(.$survey_answer, drop = TRUE) %>% 
  iwalk(\(d, dname) {
    d <- select(d, where(\(col) !all(is.na(col))) & !survey_answer)
    assign(dname, d, pos = 1)
  })

survey_answer_1
# A tibble: 1 × 3
    `1`   `2`   `3`
  <dbl> <dbl> <dbl>
1     2     3     1
zephryl
  • 14,633
  • 3
  • 11
  • 30
  • What does the \ in `map(\(d)` do? Your code seems to works except for the drop `NA` columns? – user123 Dec 28 '22 at 15:36
  • 1
    `\(x)` is anonymous function shorthand in R -- it's equivalent to `lambda x:` in Python. The code seems to drop `NA` columns with your example data; can you update your question with data that reproduces the problem you're having? – zephryl Dec 28 '22 at 15:51
  • I'm unfortunately unable to provide actual data, but the error is: "unexpected input..." and the arrow (`^`) points at the \ – user123 Dec 28 '22 at 15:53
  • 1
    What version of R are you using? The `\()` shorthand was introduced in 4.1.0. You could update, or change to the longhand version, i.e., `function(d)` instead of `\(d)`, and the same for `\(col)`. – zephryl Dec 28 '22 at 15:58
  • Thank you, I did the change and it works perfectly! I'm using version 4.0.x – user123 Dec 28 '22 at 16:00
0

For answer 1, you could do the following:

# grab correct columns
df_answer_1 = df[[col for col in df.columns if col.startswith('survey_answer_1')]] 

# change column names
df_answer_1.columns = [col[-1] for col in df_answer_1.columns]

# sum up columns
answer_1_sums = df_answer_1.sum()

You can do the same for answer 2.

jprebys
  • 2,469
  • 1
  • 11
  • 16
0

Assuming data is in csv:

survey_answer_1___1,survey_answer_1___2,survey_answer_1___3,survey_answer_2___1,survey_answer_2___2
1,1,0,1,0
0,1,0,0,0
0,0,0,1,0
1,1,1,0,0

Read data:

import csv

with open('input.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    df = [row for row in reader]

Process data:

from collections import defaultdict, Counter

dd = defaultdict(Counter)
for row in df:
    for k, v in row.items():
        key1, key2 = k.split('___')
        dd[key1][int(key2)] += int(v)

Print result:

for k in dd:
    print(k, sorted(dd[k].items()))
fen1x
  • 5,616
  • 6
  • 28
  • 39
0

In Python:

# raw data
df = {"survey_answer_1___1":[1,0,0,1], "survey_answer_1___2":[1,1,0,1], "survey_answer_1___3":[0,0,0,1], "survey_answer_2___1":[1,0,1,0], "survey_answer_2___2":[0,0,0,0]}
# sum up the answers
for k in df:
    sum_df[k] = sum(df[k])
# extract answer_1
survey_answer_1 = {[k[-1]:sum_df[k] for k in sum_df if k.startswith("survey_answer_1")]}
survey_answer_1
{'1': 2, '2': 3, '3': 1}
# extract answer_2
survey_answer_2 = {k[-1]:sum_df[k] for k in sum_df if k.startswith("survey_answer_2")}
survey_answer_2
{'1': 2, '2': 0}
Paul Smith
  • 454
  • 6
  • 11