
Can anyone help me work out how to count the number of instances of a character in a cell, per row? I have a file with 10 million SNPs that I want to sort.

Direction
?????+-+-
?+-+-????
?-+-+??-+

Above is an example of one of many columns that I have. What I want to do is count the number of "?" characters in each row individually and add a new column with that count as a numerical value.

I'm a total beginner thrown in the deep end with this so any help would be appreciated.

Thanks.

zx8754
  • Do you really have a format like this? SNPs are usually 0s, 1s, 2s or NAs.. – F. Privé Jul 21 '17 at 17:51
  • Please read [How to make a great reproducible example in R?](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – M-- Jul 21 '17 at 17:52
  • Try `nchar(gsub("[^\\?]", "", Direction))` – juan Jul 21 '17 at 17:53
  • This is just one of the columns. The data indicates whether or not a particular SNP was found in a study (the data table has 15 studies in it) and the direction of its effect, so a Direction value might look like `???-++++--???--`. I am trying to find a way to count the "?" so I know how many studies that particular SNP was found in. At the moment the direction doesn't matter. – Tired Medic Jul 21 '17 at 18:10
  • @juan The escape character is not necessary here, `"[^?]"` is sufficient. – lmo Jul 21 '17 at 18:10
  • @Imo, Good to know! – juan Jul 21 '17 at 18:11

1 Answer


Two answers for you

a <- data.frame(direction = c("?????+-+-", "?+-+-????", "?-+-+??-+"),
                stringsAsFactors = FALSE)
# extract every literal "?" from each string and count the matches
a$return <- lengths(regmatches(a$direction, gregexpr("\\?", a$direction)))

or as per comments

a$return <- nchar(gsub("[^?]", "", a$direction))

Both return the same result (shown here with `str(a)`):

'data.frame':   3 obs. of  2 variables:
 $ direction: chr  "?????+-+-" "?+-+-????" "?-+-+??-+"
 $ return   : int  5 5 3
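
As a quick check of the two approaches, and of the comment about escaping, here is a minimal base R sketch. The variable names (`x`, `by_match`, `by_gsub`, and so on) are only illustrative. Outside a character class, `?` is a regex quantifier and has to be escaped (or matched with `fixed = TRUE`); inside `[...]` it is literal, so `"[^?]"` and `"[^\\?]"` give the same counts on this data.

```r
# Illustrative vector (same strings as the example above)
x <- c("?????+-+-", "?+-+-????", "?-+-+??-+")

# Approach 1: extract every literal "?" and count the matches per string
by_match <- lengths(regmatches(x, gregexpr("\\?", x)))

# Approach 2: delete everything that is not "?" and measure what is left
by_gsub <- nchar(gsub("[^?]", "", x))

# Same as approach 2, with the (unnecessary) escape inside the brackets
by_gsub_escaped <- nchar(gsub("[^\\?]", "", x))

# fixed = TRUE treats the pattern as plain text, so no escaping is needed
by_fixed <- lengths(regmatches(x, gregexpr("?", x, fixed = TRUE)))

# All four versions should agree
stopifnot(all(by_match == by_gsub),
          all(by_gsub == by_gsub_escaped),
          all(by_gsub == by_fixed))
print(by_match)  # 5 5 3
```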

There are many ways to do this, depending on what you're looking for.

Update

While it may not be base R, the packages in the tidyverse are useful for data wrangling and can be used to string together a few calls easily.

install.packages("dplyr")  # only needed once
library(dplyr)
df <- data.frame(Direction = c("???????????-?", "???????????+?", "???????????+?", "???????????-?"), stringsAsFactors = FALSE)
df %>% 
  mutate(qmark = nchar(gsub("[^?]", "", Direction)),  # number of "?"
         pos = nchar(gsub("[^+]", "", Direction)),    # number of "+"
         neg = nchar(gsub("[^-]", "", Direction)),    # number of "-"
         qminus = qmark - (pos + neg),                # "?" count minus "+" and "-" counts
         total = nchar(Direction))                    # length of the whole string


      Direction qmark pos neg qminus total
1 ???????????-?    12   0   1     11    13
2 ???????????+?    12   1   0     11    13
3 ???????????+?    12   1   0     11    13
4 ???????????-?    12   0   1     11    13
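
For readers who would rather stay in base R, the same columns can be sketched with a small helper. `count_char` is a hypothetical name introduced here for illustration, not a function from any package, and the sketch assumes the column is a character vector.

```r
# Hypothetical helper: count occurrences of a literal string in each element.
# fixed = TRUE means "?", "+" and "-" are matched as plain text, not as regex.
count_char <- function(x, ch) {
  lengths(regmatches(x, gregexpr(ch, x, fixed = TRUE)))
}

df <- data.frame(Direction = c("???????????-?", "???????????+?",
                               "???????????+?", "???????????-?"),
                 stringsAsFactors = FALSE)

df$qmark  <- count_char(df$Direction, "?")
df$pos    <- count_char(df$Direction, "+")
df$neg    <- count_char(df$Direction, "-")
df$qminus <- df$qmark - (df$pos + df$neg)  # "?" count minus "+" and "-" counts
df$total  <- nchar(df$Direction)
print(df)
```

Strings with no match get a count of 0, because `regmatches` returns an empty vector for them.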

However, if your dataset is 10 million lines long, you might want to use stringi instead, based on some benchmark testing.

install.packages("stringi")
library(stringi)
df %>% 
  mutate(qmark = stri_count(Direction, fixed = "?"),
         pos = stri_count(Direction, fixed = "+"),
         neg = stri_count(Direction, fixed = "-"), 
         qminus = qmark-(pos+neg))
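
A rough way to check this on your own machine is to time the approaches on a larger simulated vector. This is only a sketch: the vector `x` is randomly generated (100,000 strings, not real SNP data), timings will vary by machine and R version, and the stringi comparison only runs if that package is installed.

```r
set.seed(1)
# Simulated data: 100,000 strings of 13 characters drawn from "?", "+", "-"
n <- 1e5
x <- vapply(seq_len(n), function(i) {
  paste(sample(c("?", "+", "-"), 13, replace = TRUE), collapse = "")
}, character(1))

# Base R, regex character class
t_gsub <- system.time(res_gsub <- nchar(gsub("[^?]", "", x)))

# Base R, fixed-string matching
t_match <- system.time(
  res_match <- lengths(regmatches(x, gregexpr("?", x, fixed = TRUE)))
)

print(t_gsub["elapsed"])
print(t_match["elapsed"])

# stringi, only if it is available
if (requireNamespace("stringi", quietly = TRUE)) {
  t_stri <- system.time(res_stri <- stringi::stri_count_fixed(x, "?"))
  print(t_stri["elapsed"])
  stopifnot(all(res_stri == res_gsub))
}

# The methods should always agree on the counts
stopifnot(all(res_gsub == res_match))
```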
Geochem B
  • I don't know if I'm not asking the right questions or don't know enough about R to format things properly, but none of the suggestions are working: they neither do what I wanted nor do what you say they will do. Thank you for your help. – Tired Medic Jul 21 '17 at 18:25
  • As said in the comments, if you could provide a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) the answer will work for *your* data. I made a `data.frame` to provide a solution but your data may have a different structure and thus R will handle it differently. – Geochem B Jul 21 '17 at 18:29
  • okay I have got somewhere using "top$DirectionCount = nchar(gsub("[^\\?]", "", top$Direction))" Now I have the "?" in a new column of my table! Great stuff. – Tired Medic Jul 21 '17 at 18:31
  • Now I'd like to make a new column with that "?" count minus the count of + and - characters in the column. Is that also possible? – Tired Medic Jul 21 '17 at 18:32
  • Apologies if this is frustrating or ridiculously basic. – Tired Medic Jul 21 '17 at 18:33
  • Read through the [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) post, produce a sample of your data, and state exactly what you need. No one wants to answer the wrong question even if it's for fake internet points, and being new to R understanding data structure is essential. – Geochem B Jul 21 '17 at 18:35
  • Okay I will do that. Thanks for the advice. – Tired Medic Jul 21 '17 at 18:41
  • Okay I will do that. Thanks for the advice. Just as an update, I managed to do what I wanted, probably in the most roundabout way possible: `top$DirectionCount = nchar(gsub("[^\\?]", "", top$Direction))`, `top$DirectionCountneg = nchar(gsub("[^\\-]", "", top$Direction))`, `top$DirectionCountpos = nchar(gsub("[^\\+]", "", top$Direction))` to add 3 columns. Then I sequentially subtracted the values for "-" and "+" from the count for "?". I think I will look for a more elegant solution. – Tired Medic Jul 21 '17 at 18:47
  • I'm a recent grad who got into data work in college, so I understand needing to get some R work done even if it's over your head. There are some great intro resources like [r for data science](http://r4ds.had.co.nz/), or if you want something specific to [(meta)genomics](http://evomics.org/learning/programming/introduction-to-r/), people have written some tutorials because it's tough. Keep scripting and it will become easier. Also, use Ctrl+K or the ` symbols around code so it's easy to read. – Geochem B Jul 21 '17 at 18:50
  • What is the output from running `str(your data)`? – Geochem B Jul 21 '17 at 18:53
  • It outputs a list of the 26 column headers or variables with examples of the observed values in each column or variable. Oh and if the value is a character, number, integer etc – Tired Medic Jul 21 '17 at 19:01
  • can you paste it here or `dput(head(your data))`? There are issues with string matching if that information isn't known from your question. – Geochem B Jul 21 '17 at 19:03
  • ' $ Freq1 : num 0.9889 0.012 0.0104 0.9887 0.9887 ... $ FreqSE : num 0 0 0 0 0.0004 0 0 0 0 0.0012 ... $ MinFreq : num 0.9889 0.012 0.0104 0.9887 0.9869 ... $ MaxFreq : num 0.9889 0.012 0.0104 0.9887 0.9895 ... $ Effect : num -0.286 0.277 0.293 -0.292 -0.238 ... $ StdErr : num 0.0353 0.0347 0.037 0.039 0.0332 ... $ P.value : num 5.06e-16 1.43e-15 2.20e-15 6.83e-14 6.98e-13 ... $ Direction : chr "???????????-?" "???????????+?" "???????????+?" "???????????-?" ...' – Tired Medic Jul 21 '17 at 19:05