Reading the length of the numbers separated by commas in a excel cell from R

Question

I have a .csv file, which I am reading from R. There is one column which contains the cells as : suppse

cell C1= 2,3 C2= 1,2,3,4 C3 = 1 and so on...

EDIT: C1 Stands for column C and 1st row.

I just want to read the length of the numbers in those cells from R. How one can do that?

Does anyone has any clue?

Reading the data from excel.

data=read.csv("location", header=T)

one of the columns of the data whose length cell I need to calculate.

V24

1,2,3,4




1,2,3,4

1,4,2,3


1,2,4,3




1,3,2,4

4,3,1,2

This data is too big; hence I can not paste it here.

snap shot of the data; 12 columns and 35 rows.

EDIT 1 :

dput(string_data)
structure(list(v_1 = c(NA, NA, NA, NA, 3L, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
1L, NA, NA, NA, NA, NA, 2L, NA, NA, NA, 1L, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0L, NA, NA, NA, NA, 2L
), v_2 = c(NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, 3L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 3L, NA, NA, 
NA, NA, NA, 2L, NA, NA, NA, 3L, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, 4L, NA, NA, NA, NA, 2L), v_3 = structure(c(1L, 
1L, 1L, 1L, 6L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 1L, 1L, 1L, 1L, 1L, 4L, 
1L, 1L, 1L, 7L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 5L, 1L, 1L, 1L, 1L, 2L), .Label = c("", "1,4", "2", 
"2,1", "2,4", "3", "4"), class = "factor"), v_4 = c(NA, NA, NA, 
NA, NA, 0L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0L, NA, NA, NA, 
NA, NA, NA, 0L, 0L, NA, NA, NA, NA, 0L, 2L, NA, 0L, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA), v_5 = c(NA, NA, NA, NA, NA, 0L, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, 2L, NA, NA, NA, NA, NA, NA, 
2L, 2L, NA, NA, NA, NA, 2L, 0L, NA, 0L, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA), v_6 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA), v_7 = c(NA, NA, NA, NA, 0L, NA, NA, NA, NA, NA, NA, NA, 
1L, NA, NA, NA, NA, 0L, NA, NA, 0L, NA, NA, 1L, NA, NA, 0L, 0L, 
NA, NA, NA, NA, 0L, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, 0L, 
NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA, NA, 1L, NA, NA), v_8 = c(NA, 
NA, NA, NA, 1L, NA, NA, NA, NA, NA, NA, NA, 0L, NA, NA, NA, NA, 
1L, NA, NA, 1L, NA, NA, 0L, NA, NA, 1L, 1L, NA, NA, NA, NA, 1L, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, 0L, 1L, NA, NA, NA, NA, 0L, 
NA, NA, NA, NA, NA, NA, 0L, NA, NA), v_9 = c(NA, NA, NA, NA, 
1L, NA, NA, NA, NA, NA, NA, NA, 4L, NA, NA, NA, NA, 1L, NA, NA, 
3L, NA, NA, 4L, NA, NA, 3L, 3L, NA, NA, NA, NA, 3L, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, 4L, 3L, NA, NA, NA, NA, 4L, NA, NA, NA, 
NA, NA, NA, 4L, NA, NA), v_10 = c(NA, 5L, NA, NA, NA, 0L, 3L, 
NA, 3L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, 3L, NA, NA, 3L, NA, NA, NA, 
NA, NA, NA), v_11 = c(NA, 0L, NA, NA, NA, 0L, 2L, NA, 2L, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, 2L, NA, NA, 2L, NA, NA, NA, NA, NA, NA
), v_12 = structure(c(1L, 4L, 1L, 1L, 1L, 1L, 2L, 1L, 3L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 3L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L
), .Label = c("", "3", "4", "4,1,3"), class = "factor")), .Names = c("v_1", 
"v_2", "v_3", "v_4", "v_5", "v_6", "v_7", "v_8", "v_9", "v_10", 
"v_11", "v_12"), class = "data.frame", row.names = c(NA, -58L
))

have you tried something? Please share your issues. Plus there are at least 2 questions there (how to read from excel to R, then how to count the length of the cell), I suggest you split the questions as the purpose of this forum is not to do your entire project. — RockScience, Feb 23 '15 at 04:06
@RockScience, I am not sure how to proceed in this matter, as I never came across this kind of problem before. — , Feb 23 '15 at 04:09
I dont understand why negative marking!!! This is genuine doubt! — , Feb 23 '15 at 04:11
@Irri can you add in your question: 1) the command you use to import the data 2) the output of `dput(data)` where `data` is the data that you already imported from the the csv. Your question is currently still unclear, hence the negative rating (It is not clear whether the problem comes from the fact that the data comes from a csv file or that you don't know how to count the number of data) — RockScience, Feb 23 '15 at 04:20
I have edited my question further. @akrun, C is a column name. by C1 I meant first cell in column . I hope it is clear now. — , Feb 23 '15 at 04:28
@akrun, that code is good, but it is also returning 1 for the cells where there are no values. How can I make appear NA where there are no values in the cells. — , Feb 23 '15 at 04:33
@RockScience, Now I think I have edited my question to be more precise. Do I still deserve negative marking now? — , Feb 23 '15 at 04:37
@Irri it is better. However next time I suggest strongly that you post directly a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). See for instance the data akrun created in his answer. It is not good that there are 20 comments below your question to understand what exactly you are trying to do. — RockScience, Feb 23 '15 at 04:45
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/71472/discussion-between-irri-and-rockscience). — , Feb 23 '15 at 09:37

A5C1D2H2I1M1N2O1R2T1 · Answer 1 · 2015-02-23T05:20:17.527

2

In base R, the function that is used by read.table is count.fields, which you can use like this (using @akrun's sample data):

count.fields(textConnection(dat1$V24), sep = ",", blank.lines.skip = FALSE)
#  [1] 0 4 1 0 0 1 4 0 4 0 0 3 1 0 0 1 3

Replacing 0 with NA should be pretty straightforward.

Note that this is not identical with @akrun's approach, as this is meant to count how many columns there should be in the dataset. As such, "" is not the same as an empty string, hence the "1" values that are in my results but not in @akrun's. You can use gsub("\\s+", "", dat$V24) to get rid of those.

edited Feb 23 '15 at 05:20

answered Feb 23 '15 at 04:48

A5C1D2H2I1M1N2O1R2T1

190,393
28
405
485

Ananda, Thank you very much for your useful addition. Now I know two ways to approach my problem. Sorry I can accept only one answer :) – Feb 23 '15 at 04:52
for( i in 1:ncol(data)) { count.fields(textConnection(data[,3*i]), sep = ",", blank.lines.skip = FALSE) } I tried above code. to make those all the columns as desired. But it is throwing following error. How can I get rid of that error? P.S. my excel sheets every third column is of containg cells with commas . – Feb 23 '15 at 05:31
@Irri, what is the error? Also, why do you need to count these fields? What are you ultimately doing? – A5C1D2H2I1M1N2O1R2T1 Feb 23 '15 at 05:37
Error is : Error in textConnection(data[, j]) : invalid 'text' argument. The numbers separated by commas represent number of correct questions to an answer. I need to see how many coorect options are there per question. – Feb 23 '15 at 05:49

akrun · Accepted Answer · 2015-02-24T10:46:08.780

0

We could remove the leading/lagging spaces (in case there are), with str_trim, count the number of elements with str_count assuming that the delimiter is ,, the rows that are empty can be found out by nzchar, and with a little arithmetic, we can make those elements NA

 library(stringr)
 dat1$V24 <- str_trim(dat1$V24)
 with(dat1, str_count(V24, ',')+1 * NA^!nzchar(V24))
 #[1] NA  4 NA NA NA NA  4 NA  4 NA NA  3 NA NA NA NA  3

Similar functions in stringi would be faster

 library(stringi)
 dat1$V24 <- stri_trim_both(dat1$V24)
 with(dat1, stri_count(V24, fixed= ',')+1 * NA^!nzchar(V24))
 #[1] NA  4 NA NA NA NA  4 NA  4 NA NA  3 NA NA NA NA  3

Update

If you want to do this to every 3rd column of the dataset

 indx <- seq(1, ncol(dat2), by=3)
 lapply(dat2[indx], function(x) {r1 <- str_trim(x)
            str_count(r1, ',')+1 * NA^!nzchar(r1) })
 #$V1
 #[1] 1 4 1 1 3

 #$V4
 #[1]  4  1  2  3 NA

where,

  dat2[indx]
  #      V1       V4
  #1       1  1,2,5,6
  #2 1,2,3,4        1
  #3      3       1,2
  #4       1 15,23,24
  #5   1,2,3

Update2

  lapply(dat3[indx], function(x) {r1 <- str_trim(x)
              str_count(r1, ',')+1 * NA^is.na(r1)})
  #$V1
  #[1]  1  4  1 NA  3

  #$V4
  #[1]  4 NA  2  3 NA

Update3

Based on the dput of string_data, there are only two columns (3 and 12) which are factor class and they are string elements. i.e. 2,4, 1,4 etc.

 indx1 <- sapply(string_data, is.factor)
 lapply(string_data[indx1], function(x){r1 <- str_trim(x)
              str_count(r1, ',')+1 * NA^!nzchar(r1)})
 #$v_3
 #[1] NA NA NA NA  1 NA NA NA NA NA NA NA NA NA NA  1 NA NA NA NA NA NA NA NA NA
 #[26] NA  1 NA NA NA NA NA  2 NA NA NA  1 NA NA NA NA NA NA NA NA NA NA NA NA NA
 #[51] NA NA  2 NA NA NA NA  2

#$v_12
#[1] NA  3 NA NA NA NA  1 NA  1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
#[26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA  1 NA
#[51] NA  1 NA NA NA NA NA NA

All the other variables are either integer or logical

data

 dat1 <- data.frame(V24=c('', '1,2,3,4', ' ', '', '', ' ',
 '1,2,3,4', '', '1,4,2,3', '', '', '1,2,4', ' ', '', '', ' ', 
 '1,3,2'), stringsAsFactors=FALSE)


 dat2 <- data.frame(V1=c('1', '1,2,3,4', '3 ', '1', '1,2,3'), V2=1:5, 
 V3=6:10, V4=c('1,2,5,6', '1', '1,2', '15,23,24', ' '), V6=11:15,
  stringsAsFactors=FALSE)

 dat3 <- data.frame(V1= c('1', '1,2,3,4', '3 ', NA, '1,2,3'), V2=1:5,
  V3=6:10, V4=c('1,2,5,6', NA, '1,2', '15,23,24', NA), 
  stringsAsFactors=FALSE)

edited Feb 24 '15 at 10:46

answered Feb 23 '15 at 04:42

akrun

874,273
37
540
662

can you please explain me lapply(dat2[indx], function(x) {r1 <- gsub('[0-9 ]+', '',x); (nchar(r1)+1)*(NA^!nzchar(r1))}) this line. Especially gsub line ; I dont understand what is [0-9] stand for there? – Feb 23 '15 at 05:57
Thank you for your explanation akrun. dat2 <- data.frame(V1=c('1', '1,2,3,4', '3 ', '1', '1,2,3'), V2=1:5, V3=6:10, V4=c('1,2,5,6', '1', '1,2', '15,23,24', ' '), V6=11:15, stringsAsFactors=FALSE) If you try with the above data it is still showing NA even if one value is there in that cell. – Feb 23 '15 at 06:36
yes I tried that but it is showing all the 2's where in the interest columns (column which is numbered as multiple of 3) . there are no values or small number of values. – Feb 23 '15 at 06:54
dat2 <- data.frame(V1=c('', '', ' ', '', '1,2,3'), V2=1:5, V3=6:10, V4=c('1,2,5,6', '', '1,2', '15,23,24', ' '), V6=11:15, stringsAsFactors=FALSE) Can you please try this data. – Feb 23 '15 at 07:00
indx <- seq(3, ncol(data), by=3) data[indx] <- lapply(data[indx], function(x) {r1 <- str_trim(x); str_count(r1, ',')+1 * NA^!nzchar(r1) }) Am I using the code correctly?? – Feb 23 '15 at 07:16
in my case indx in 3,6,9 only; multiple of 3. rest two columns have the integers entries. Does that help? I have posted screen shot of the data. – Feb 23 '15 at 07:26
I think the problem is in the type of the data. The way I am writing the data as an e.g. it is giving the desired result. But when I am reading from the excel it is not giving the desired result. When I am reading the data from the Excel the blank spaces are read as "NA" and when I am writing, it is reading as a proper blank space. – Feb 23 '15 at 07:43
@lrri Okay, In that case, the blank spaces would be anyway NA... Let me try with an example where NAs replace the empty strings or spaces. – akrun Feb 23 '15 at 07:45
@lrri Updated with `dat3` that mimics the data you mentioned – akrun Feb 23 '15 at 07:52
with dat3 I am getting v1 and v4 as all 1s. – Feb 23 '15 at 08:06
@Irri Not for me (as I showed in the post), Please copy/paste the data I created and the code. – akrun Feb 23 '15 at 08:08
I also got the same results when I am doing it in R completely. reading the data from the data.frame object. But same data I pasted in an excel file and I read that one. by deleting or keeping it same. Can you please try that data by pasting it in excel. – Feb 23 '15 at 09:32
Especially try deleting all the entries in columns v1 and/or v2 you will get all 2s. – Feb 23 '15 at 09:36
will you please check that? – Feb 23 '15 at 09:54
@lrri I am working on linux. I have only open office. Also, I showed both cases (empty quotes and with NAs). So, it may be better if you post the `dput` of the small dataset. i.e first read your excel file `yourdata <- read.csv(..., stringsAsFactors=FALSE)` and `dput(head(yourdata))` – akrun Feb 23 '15 at 13:36
I have added dput of the data in the question, please have a look. – Feb 24 '15 at 05:16

Reading the length of the numbers separated by commas in a excel cell from R

2 Answers2

Update

Update2

Update3

data