0

This is not a easy problem for me, to be honest. I have searched quite a long time but there seems no similar question.

Here's how a few rows and columns of my data looks like:

                                V1        V2       V3
1 74c1c25f4b283fa74a5514307b0d0278 1#11:2241 1#10:249
2 08f5b445ec6b29deba62e6fd8b0325a6  20#7:249  20#5:83
3 4b7f6f4e2bf237b6cc58f57142bea5c0  4#16:249   24:913

So, the cells are in a format like "class(#subclass):value". I want to make a table like this:

                                V1      1#10     1#11    4#16    20#5    20#7    24
1 74c1c25f4b283fa74a5514307b0d0278       249     2241       0       0       0     0      
2 08f5b445ec6b29deba62e6fd8b0325a6         0        0       0      83     249     0
3 4b7f6f4e2bf237b6cc58f57142bea5c0         0        0     249       0       0   913

Because I haven't met this kind of data structure before, I am not sure if this is the best way to store it. But so far, this is the only table format I could come up with. If you have any suggestion about it, please leave a comment.

Then, I first parsed it as the following:

                            V1     V2_1_1 V2_1_2 V2_2_1 V3_1_1 V3_1_2 V3_2_1
1 74c1c25f4b283fa74a5514307b0d0278      1     11   2241      1     10    249
2 08f5b445ec6b29deba62e6fd8b0325a6     20      7    249     20      5     83
3 4b7f6f4e2bf237b6cc58f57142bea5c0      4     16    249     24     NA    913

Now, I don't know how to convert it to the table format I want. Any package in R can I use to do it?

two links are attached below

original data: https://www.dropbox.com/s/aqay5dn4r3m3kdp/temp1TrainPoiFile.R?dl=0

parsed data: https://www.dropbox.com/s/0oj8ic1pd2rew0h/temp3TrainPoiFile.R?dl=0

Thank you very much for you help. Please leave a comment if there's any question about it.

Thanks for Walt's and Jack's answer. I used tidyr to solve the problem. Below is how I did it.

Read file

source("temp1TrainPoiFile.R")

gather columns to key-value pair

temp2TrainPoiFile <-  temp1TrainPoiFile %>% gather( key=V1, value=data, -V1)

extract to two columns

temp3TrainPoiFile <- temp2TrainPoiFile %>% extract(col=data, into=c("class","value"), regex="(.*):(.*)")

adding row numbers

row <- 1:nrow(temp3TrainPoiFile)
temp3TrainPoiFile <- cbind(row, temp3TrainPoiFile)

spread key-value to two columns

TrainPoiFile <- temp3TrainPoiFile %>% spread(key=class, value=value, fill=0)
cheddar
  • 17
  • 9
  • You should post `dput(head(dfrm))`. At the moment the print representation of this data is highly ambiguous. For this kind of manipulation you will find it easier to work with character data rather than factor data. These are key-value data and you are trying to construct a dataset in "wide" format. It looks as though you "over-parsed" the data. You what to keep the octothorpes in their original association with the numbers. Column names are not supposed to start with decimal characters so you may be making more work than necessary in the future unless you accept leading "X"'s as column names. – IRTFM Jun 18 '16 at 01:42
  • Searching on key-value and wide brings up this possibly useful set of answers: http://stackoverflow.com/questions/29952780/most-efficient-way-to-read-key-value-pairs-where-values-span-multiple-lines – IRTFM Jun 18 '16 at 01:59
  • @42- thank you very much for you quick comment. I have attached two files above. I will check the link you posted. – cheddar Jun 18 '16 at 02:06

2 Answers2

1

This looks like a good example of the use of the tidyr package. Use gather to transform into a two column data frame using column V1 as the key and the other columns as the value column named data, extract to split the data column into class and value columns, and then spread to use the class column as new column names and the value columns as values. Code would look like:

library(tidyr)
library(dplyr)
class_table <- df %>% mutate(row = 1:nrow(.)) %>% 
               gather( key=V1, value=data, -c(V1,row)) %>%
               extract(col=data, into=c("class","value"), regex="(.*):(.*)") %>%
               spread(key=class, value=value, fill=0)

Edited to ensure uniqueness of row identifiers. mutate requires dplyr package.

WaltS
  • 5,410
  • 2
  • 18
  • 24
  • Thank you Walt. I did some search and finally found this beautiful package. I am learning it now. – cheddar Jun 18 '16 at 14:57
  • Hi, Walt. I used your code but encountered an Error `Error: Duplicate identifiers for rows (5832, 5898, 5964,...)`. I found this [link](http://stackoverflow.com/questions/25960394/unexpected-behavior-with-tidyr) answered the question and I solved it. But I don't understand his explanation about how this error happens. Could you explain it a little bit. – cheddar Jun 18 '16 at 15:17
  • The basic issue is that `spread` needs to have a set of columns which, taken together, uniquely identify each row in the result. In the example you linked to, the problem was that there are multiple values of the `feature.measures` for each value of `Species`. In your case, the error messages probably meant that you had duplicate values in the `V1` column. In the link problem, hadley (Hadley Wickam, the author of many of the RStudio packages) suggests adding row numbers to the input data. I've edited my code to include this. `row` column can be removed from result if not needed. – WaltS Jun 18 '16 at 16:45
0

Read in data

data <- source("temp1TrainPoiFile.R")[[1]]

Proper NAs

data[data == ""] <- NA

Reshape it into long format

data <- do.call(rbind, lapply(split(data, data[,"V1"]), function(n) {
                                  id <- n[,1]
                                  n <- na.omit(unlist(n[,-1]))
                                  n <- strsplit(n, ":")
                                  n <- do.call(rbind, lapply(n, function(m) data.frame(column = m[1], value = m[2])))
                                  n <- data.frame(id = id, n)
                                  n}))

Prepare for loop to insert the values into a newly created matrix

id <- unique(data[,"id"])
column <- unique(data[,"column"])
mat <- matrix(data = NA, nrow = length(id), ncol = length(column))
rownames(mat) <- id
colnames(mat) <- column

Insert the values

for(i in 1:nrow(data)) {
    mat[data[i, "id"], data[i, "column"]] <- data[i,"value"]}
RJ-
  • 2,919
  • 3
  • 28
  • 35
  • Thanks, Jack. I spent some time trying to understand your code. It's very smart. But I have to say I prefer using tidyr now because it seems to have more power to do with key-value pair data and it is simpler. Anyway, thx for bring yours up. – cheddar Jun 18 '16 at 14:56