Difficulty converting wide format to tidy format in dataset

Question

I am using Kaggles gun violence dataset. My goal is to use Tableau for a interactive visualization for some of the regions and specifics relating to gun crimes there. My goal is to turn this dataframe into tidy format. Link:

https://www.kaggle.com/jameslko/gun-violence-data/version/1

With that being the case, there are a couple columns formatted like this that I am having issues wrangling in R. There are around 20 or so columns, these 4 are formatted like this:

A little background: there can be more than one gun involved in a crime, and more than one participant. Due to this, these columns contain information for each gun/participant split by '||'. The 0:, 1: ... indicates details for that specific gun/participant.

My goal is to capture the unique instances in each column and disregard the 0:, 1:, 2:, ...

Here is my code so far:

df= read.csv("C:/Users/rmahesh/Desktop/gun-violence-data_01-2013_03-2018.csv")
df$incident_id = NULL
df$incident_url = NULL
df$source_url = NULL
df$participant_name = NULL
df$participant_relationship = NULL
df$sources = NULL
df$incident_url_fields_missing = NULL
df$participant_status = NULL
df$participant_age_group = NULL 
df$participant_type = NULL
df$incident_characteristics = NULL

#Subset of columns with formatting issues:
df2 = df[, c('gun_stolen', 'gun_type', 'participant_age', 'participant_gender')]

I have yet to run into an issue like this, and would love any help figuring out how to solve my problem. Any help would be greatly appreciated!

Edit1: I have created the first 3 rows of the columns in question. The format is identical more or less with some columns missing at times:

gun_stolen,gun_type,participant_age,participant_gender
0::Unknown||1::Unknown, 0::Unknown||1::Unknown, 0::25||1::31||2::33||3::34||4::33, 0::Male||1::Male||2::Male||3::Male||4::Male
0::Unknown||1::Unknown,0::22 LR||1::223 Rem [AR-15],0::51||1::40||2::9||3::5||4::2||5::15,0::Male||1::Female||2::Male||3::Female||4::Female||5::Male
0::Unknown,0::Shotgun,3::78||4::48,0::Male||1::Male||2::Male||3::Male||4::Male

Please post a sample of the data, rather than a screenshot and link for us to download the whole thing. Also, you've tagged this `tidyverse` but aren't using any `tidyverse` functions—what's the plan there? — camille, Jul 18 '18 at 16:18
Also, from the `[tidyverse]` wiki: "DO NOT USE if your question relates to one or two components of the tidyverse, such as dplyr or ggplot2. Use *those* tags, and tag with `r` as well for a better response." I'm editing accordingly. — camille, Jul 18 '18 at 16:19
@camille Thanks for getting back. I removed tidyverse. Now to post the sample of the data, how should I go about doing this? — rmahesh, Jul 18 '18 at 16:29
Besides a sample of the data, you should also show the corresponding output you expect/want. (I know what tidy data means to me, but interpretations differ and generally this is an important thing to include in a question. Some general guidance on asking R questions: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/28481250#28481250) — Frank, Jul 18 '18 at 16:34
@Frank Thanks for responding. The output as I stated would be to have all the unique occurrences in a row. Conceptually, I am having a difficult time figuring out how to do this. As for data, how should I go about sending that? — rmahesh, Jul 18 '18 at 16:39
1. There are suggestions on posting data in Frank's link. 2. *I* removed the tidyverse tag, you removed tidyr and dplyr, both of which were fine to have here — camille, Jul 18 '18 at 16:40
Here's an example: https://stackoverflow.com/questions/47380610 The code at the top can be copy pasted by anyone into their R session so we are all looking at the same input object. The second code block shows the exact output the OP wants corresponding to that input. You don't need to provide the exact data you're working with to illustrate the problem; and it is helpful (in my opinion) to show the desired output, not just describe it. — Frank, Jul 18 '18 at 16:43
If I understand your description correctly, though, maybe `cSplit` from splitstackshape would work: https://stackoverflow.com/search?tab=votes&q=csplit%20is%3aa%20%5br%5d — Frank, Jul 18 '18 at 16:44
Can you give a small dataset that one can copy and also its expected output? — Onyambu, Jul 18 '18 at 17:06
@Onyambu I have given 3 rows from the dataset. The rest of the data is quite similar with missing values in some columns. — rmahesh, Jul 18 '18 at 17:35
@Frank I have given 3 rows from the dataset. The rest of the data is quite similar with missing values in some columns. — rmahesh, Jul 18 '18 at 17:35
@camille I have edited the post to include the two tags you mentioned. Also, included the first couple rows of the data. — rmahesh, Jul 18 '18 at 17:37
This is more or less a dupe of https://stackoverflow.com/questions/13773770/split-comma-separated-strings-in-a-column-into-separate-rows — alistaire, Jul 18 '18 at 22:19

score 2 · Answer 1 · answered Jul 18 '18 at 17:01

I think by tidying you mean split the contents of delimited columns and separate into rows. You can either take the first element or take each element as its own row.

df<-data.frame(instance=1:5, 
           gun_type=c("", "0::Unknown||1::Unknown", "", 
                      "0::Handgun||1::Handgun", ""), stringsAsFactors=FALSE)

df$first<-sapply(strsplit(df$gun_type, "\\|\\|"), '[', 1)
splitType<-strsplit(df$gun_type, "\\|\\|")
df.2<-df[rep(1:nrow(df), sapply(splitType, length)),]
df.2$splitType<-unlist(splitType)

If you want just the unique values then use:

splitTypeUnique<-sapply(splitType, unique)
df.2<-df[rep(1:nrow(df), sapply(splitTypeUnique, length)),]
df.2$splitType<-unlist(splitTypeUnique)

but you will have to do a little wrangling to get the unique part to work

Yes I sort of mean it like that. So for example say that the gun_type has the following classes: Unknown, Handgun, Shotgun, Assault Rifle. As it is imported, it is all stored in one row. I have merged two columns in creating a column called 'Number_affected". I would like to know how many times each class of 'gun_type' contributed to the 'number_affected'. Possibly make each of the types in 'gun_type' a separate columns? — rmahesh, Jul 18 '18 at 17:54

Aurèle · Accepted Answer · 2018-07-19T15:56:44.560

As Frank said in the comments, "tidy" can mean different things. Here we turn all specified columns in just two: one with the original column name ("key"), the other with the individual values after splitting the strings and removing the prefixes, one row for each ("value").

library(tidyr)
library(dplyr)
library(stringr)

myvars <- c('gun_stolen', 'gun_type', 'participant_age', 'participant_gender')

res <- as_tibble(df2) %>% 
  tibble::rowid_to_column() %>%
  # Split strings in selected columns at "||". This turns those columns in 
  # list-columns of character vectors
  mutate_at(myvars, str_split, pattern = fixed("||")) %>% 
  # Go from wide to long format: in the new 'key' column are the original column 
  # names, and 'value' is the one list-column of character vectors
  gather(key, value, one_of(myvars)) %>% 
  # unnest turns the 'value' list-column into a regular character column, with 
  # duplication of rows that contain a 'value' of length greater than 1
  unnest(value) %>% 
  filter(value != "") %>% 
  # Remove the "x::" prefixes
  mutate(value = str_split_fixed(value, fixed("::"), n = 2)[, 2]) %>% 
  # Deduplicate
  distinct() %>% 
  arrange(rowid, key, value)

# # A tibble: 732,017 x 3
#    rowid key                value  
#    <int> <chr>              <chr>  
#  1     1 participant_age    20     
#  2     1 participant_gender Female 
#  3     1 participant_gender Male   
#  4     2 participant_age    20     
#  5     2 participant_gender Male   
#  6     3 gun_stolen         Unknown
#  7     3 gun_type           Unknown
#  8     3 participant_age    25     
#  9     3 participant_age    31     
# 10     3 participant_age    33     
# # ... with 732,007 more rows

Also expanding on @Ben G's comment:

res %>% 
  count(key, value) %>% 
  arrange(key, desc(n))

# # A tibble: 141 x 3
#    key             value                n
#    <chr>           <chr>            <int>
#  1 gun_stolen      Unknown         132099
#  2 gun_stolen      Stolen            7350
#  3 gun_stolen      Not-stolen        1560
#  4 gun_stolen      ""                 355
#  5 gun_type        Unknown          98892
#  6 gun_type        Handgun          17609
#  7 gun_type        9mm               6040
#  8 gun_type        Shotgun           3560
#  9 gun_type        Rifle             3196
# 10 gun_type        22 LR             3093
# 11 gun_type        40 SW             2624
# 12 gun_type        380 Auto          2323
# 13 gun_type        45 Auto           2234
# 14 gun_type        38 Spl            1758
# 15 gun_type        223 Rem [AR-15]   1248
# 16 gun_type        12 gauge           975
# 17 gun_type        Other              892
# 18 gun_type        7.62 [AK-47]       854
# 19 gun_type        357 Mag            800
# 20 gun_type        25 Auto            601
# 21 gun_type        32 Auto            481
# 22 gun_type        ""                 356
# 23 gun_type        20 gauge           194
# 24 gun_type        44 Mag             192
# 25 gun_type        30-30 Win          105
# 26 gun_type        410 gauge           96
# 27 gun_type        308 Win             88
# 28 gun_type        30-06 Spr           71
# 29 gun_type        10mm                50
# 30 gun_type        16 gauge            30
# 31 gun_type        300 Win             23
# 32 gun_type        28 gauge             6
# 33 participant_age 19               10541
# 34 participant_age 20                9919
# 35 participant_age 18                9826
# 36 participant_age 21                9795
# 37 participant_age 22                9642
# 38 participant_age 23                9383
# 39 participant_age 24                9204
# 40 participant_age 25                8562
# 41 participant_age 26                7815
# 42 participant_age 17                7416
# 43 participant_age 27                7228
# 44 participant_age 28                6528
# 45 participant_age 29                6055
# 46 participant_age 30                5652
# 47 participant_age 31                5145
# 48 participant_age 32                5039
# 49 participant_age 16                4977
# 50 participant_age 33                4662
# # ... with 91 more rows

In full transparency, I haven't used tidyr that much so I am a bit lost as to what your code is doing. So for example say that the gun_type has the following classes: Unknown, Handgun, Shotgun, Assault Rifle. As it is imported, it is all stored in one row. I have merged two columns in creating a column called 'Number_affected". I would like to know how many times each class of 'gun_type' contributed to the 'number_affected'. Possibly make each of the types in 'gun_type' a separate columns? — rmahesh, Jul 18 '18 at 18:22
@rmahesh wrt to this code, I will add explanations. As to your desired output, I stand by what Frank said in the comments: we have possibly different definitions of "tidy", it is best to show an example of the desired output (it is a bit of extra work to create by hand but it will clarify things). — Aurèle, Jul 19 '18 at 08:36
@rmahesh, this is the perfect answer for creating a tidy data set from your data. The best way to answer your question (" I would like to know how many times each class of 'gun_type' contributed to the 'number_affected'.) is to then perform summary calculations on the tidy data set. For example, `filter(key == "gun_type") %>% group_by(value) %>% summarize(total = n())`. If you have questions about how to do this you can post another question . . . — Ben G, Jul 19 '18 at 12:39
@BenG That was more along the lines of what I was looking for thank you! — rmahesh, Jul 19 '18 at 15:43

Difficulty converting wide format to tidy format in dataset

2 Answers2