0

I need to merge multiple files into one data frame based on ID. Final Data frame is in form of map with ID's as below:

+-------+-----+-------------+-------------+---------------+
| Name  | ID  | CategoryID1 | CategoryID2 | CategoryID400 |
+-------+-----+-------------+-------------+---------------+
| name1 | ID1 |           0 |           1 |             0 |
| name2 | ID2 |           1 |           1 |             0 |
| name3 | ID3 |           0 |           0 |             0 |
| name4 | ID4 |           1 |           0 |             1 |
+-------+-----+-------------+-------------+---------------+

Those are binary variables (categories) and I need to assign 1 if it occurs no matter how many times. I have empty data frame (map) with col names and need to fill it with data merged from multiple files.

Data files to be merged and filled into one files looks as below. There can be replies so the same ID in 2 files may have assigned both categories but it does not matter, only important is it appears and 1 is to be assigned to master data frame.

+-------+-----+---------------------------------------------------------------+
| name1 | ID1 | CategoryID1 CategoryID4                                       |
| name2 | ID2 | CategoryID1 CategoryID2 CategoryID9                           |
| name3 | ID4 | CategoryID150 CategoryID200 CategoryID400                     |
| name4 | ID4 | CategoryID1 CategoryID4 CategoryID7 CategoryID15 CategoryID89 |
+-------+-----+---------------------------------------------------------------+

Creating a empty data frame is not problem, just wonder how to loop through the files. Important is raw files are \t separated for 3 columns but categories are separated by space.

Mario
  • 25
  • 6
  • 1
    Take it step by step: 1. read all files into a single dataframe - `df1 <- rbindlist(lapply(list.files(), fread ....)))` 2. Separate categories columns on tab and stack. 3. Then get unique rows, and convert long to wide format. – zx8754 Jun 28 '17 at 09:35

1 Answers1

0

There is a bunch of ways to implement string to dummies conversion but there is a library called splitstackshape exactly for these cases. See here for examples. Here some code that does what you wanted:

install.packages("splitstackshape")
library(splitstackshape)

# Read your original file (be sure to store all data inside it)
# data <- read.delim(...)
data <- structure(list(name = c("name1", "name2", "name3", "name4"), 
               id = c("ID1", "ID2", "ID4", "ID4"), categories = c("CategoryID1 CategoryID4", 
                                                                  "CategoryID1 CategoryID2 CategoryID9", "CategoryID150 CategoryID200 CategoryID400", 
                                                                  "CategoryID1 CategoryID4 CategoryID7 CategoryID15 CategoryID89"
               )), .Names = c("name", "id", "categories"), class = "data.frame", row.names = c(NA, -4L))

#  name  id                                                     categories
#1 name1 ID1                                       CategoryID1 CategoryID4
#2 name2 ID2                           CategoryID1 CategoryID2 CategoryID9
#3 name3 ID4                     CategoryID150 CategoryID200 CategoryID400
#4 name4 ID4 CategoryID1 CategoryID4 CategoryID7 CategoryID15 CategoryID89


# create a dataframe with dummies
data_dummies <- cSplit_e(data, "categories",
              sep = " ",
              type = "character",
              drop = TRUE,
              fill = 0)

# Clear column names
colnames(data_dummies) <- gsub("categories_", "", colnames(data_dummies))

data_dummies

#   name  id CategoryID1 CategoryID15 CategoryID150 CategoryID2 CategoryID200 CategoryID4 CategoryID400 CategoryID7 CategoryID89 CategoryID9
#1 name1 ID1           1            0             0           0             0           1             0           0            0           0
#2 name2 ID2           1            0             0           1             0           0             0           0            0           1
#3 name3 ID4           0            0             1           0             1           0             1           0            0           0
#4 name4 ID4           1            1             0           0             0           1             0           1            1           0
Alex Knorre
  • 620
  • 4
  • 15