Subset data / extracting data based on first 7 letters

Question

I have a huge data set with genotypic information from different populations. I would like to sort the data by population, but I don't know how.

I would like to sort by "pedigree_dhl". I was using the following code, but I kept getting error messages.

newdata <- project[pedigree_dhl == CCB133$*1,  ]

Another problem is that 'pedigree_dhl' contains the names of the individual genotypes. Only the first 7 letters in the column 'pedigree_dhl' are the population name, in this example CCB133. How can I tell R that I want to extract the data for all rows that contain CCB133?

  Allele1 Allele2      SNP_name gs_entry pedigree_dhl
1       T       T ZM011407_0151      656    CCB133$*1
2       T       T ZM009374_0354      656    CCB133$*1
3       C       C ZM003499_0591      656    CCB133$*1
4       A       A ZM003898_0594      656    CCB133$*1
5       C       C ZM004887_0313      656    CCB133$*1
6       G       G ZM000583_1096      656    CCB133$*1

`substr` allows you to extract substrings of a character vector. You'll want to make sure your column is in fact a character vector and not a factor before using `substr` or you may get some unexpected results. For subsetting, search SO for `R subset` and you'll find many answers. The `subset()` function itself is quite useful for interactive session, while using the `[` operator is preferred in certain situations. — Chase, Apr 25 '12 at 16:08

score 9 · Accepted Answer · edited May 23 '17 at 10:32

You may want to consider `grepl`, as in the answer to "Using regexp to select rows in R dataframe". Adapted to your data:

df <- read.table(text="  Allele1 Allele2      SNP_name gs_entry pedigree_dhl
1       T       T ZM011407_0151      656    CCB133$*1
2       T       T ZM009374_0354      656    CCB133$*1
3       C       C ZM003499_0591      656    CCB133$*1
4       A       A ZM003898_0594      656    CCB133$*1
5       C       C ZM004887_0313      656    CCB133$*1
6       G       G ZM000583_1096      656    CCB133$*1", header=TRUE)

# put into df1 all rows where pedigree_dhl starts with CCB133$
# (^ anchors the start; \\$ is a literal dollar sign, because a bare $ means "end of string")
p1 <- '^CCB133\\$'
df1 <- subset(df, grepl(p1, pedigree_dhl))
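
One caveat worth spelling out: in a regular expression `$` anchors the end of the string and `*` is a quantifier, so a pattern containing a literal `$` needs escaping, or the regex engine has to be bypassed. Below is a minimal sketch of three ways to keep the rows whose `pedigree_dhl` starts with a population prefix. The data frame `toy` and its rows are made up for illustration (the second population name is borrowed from the asker's later comment), and `startsWith()` assumes R 3.3.0 or later.

```r
# Toy data: two populations in one column (rows are illustrative only).
toy <- data.frame(
  gs_entry     = c(656, 656, 4856, 4857),
  pedigree_dhl = c("CCB133$*1", "CCB133$*2", "UEBB015_B$1", "UEBB015_B$2"),
  stringsAsFactors = FALSE
)

# 1. Anchored regex: ^ pins the match to the start, \\$ is a literal dollar sign.
by_regex <- toy[grepl("^CCB133\\$", toy$pedigree_dhl), ]

# 2. Plain prefix test, no regex involved (R >= 3.3.0).
by_prefix <- toy[startsWith(toy$pedigree_dhl, "CCB133$"), ]

# 3. Compare the leading seven characters directly.
by_substr <- toy[substr(toy$pedigree_dhl, 1, 7) == "CCB133$", ]

# An unescaped "CCB133$" means "ends with CCB133" and matches no row here.
unescaped <- grepl("CCB133$", toy$pedigree_dhl)
```

All three subsets contain the same two `CCB133` rows; which form to prefer is mostly a matter of taste, though the non-regex variants avoid escaping mistakes.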

But your question implies that you may instead want to pull out the seven-letter name, or simply sort the rows by pedigree name, in which case it may be easier to keep all rows together in one sorted data frame. These three operations (subsetting, extracting a new column, and sorting) can each be carried out independently.

# If you want to create a new column based
# on the first seven letters of SNP_name (or any other variable)

df$SNP_7 <- substr(df$SNP_name, start=1, stop=7)
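
The same idea can be applied to `pedigree_dhl`, the column the question is actually about. Here is a minimal sketch that stores the prefix in a new column and then splits the data by it. The column names `prefix7` and `population`, the toy rows, and the assumption that the population name is everything before the first `$` are illustrative and not confirmed by the asker.

```r
toy <- data.frame(
  SNP_name     = c("ZM011407_0151", "ZM009374_0354", "ZM008929_0774"),
  pedigree_dhl = c("CCB133$*1", "CCB133$*2", "UEBB015_B$1"),
  stringsAsFactors = TRUE   # mimic the pre-4.0 read.table default
)

# Convert explicitly so the result does not depend on how the column was read in.
ped <- as.character(toy$pedigree_dhl)

# Fixed-width prefix, as in the question ("first 7 letters").
toy$prefix7 <- substr(ped, 1, 7)

# Alternative (assumption): the population is whatever precedes the first "$".
toy$population <- sub("\\$.*$", "", ped)

# One data frame per population, without a separate subset() call for each.
by_population <- split(toy, toy$population)
```

A fixed width of seven characters cuts `UEBB015_B` short, which is why the delimiter-based variant may be the safer choice when population names differ in length.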

# If you want to order by pedigree_dhl
# then you don't need to select out the rows into a new dataframe

df <- df[ with(df, order(df$pedigree_dhl)), ]

All this may be obvious -- I add them simply for completeness.

edited May 23 '17 at 10:32

Community

answered Apr 25 '12 at 16:26

daedalus


`read.table` as of 2.14, I think, now takes text argument directly, meaning no need for the `textConnection` – Tyler Rinker Apr 25 '12 at 16:50
Thanks @Tyler Rinker; old habits die hard. However I tried it with no success. I also looked into the `help` and see that the `file` parameter can be a path to a local file, a text connection, or a URL. Happy to see alternatives though, in an edit, as I would love a more elegant way of doing this. – daedalus Apr 25 '12 at 17:02
@gauden and @Tyler, many thanks for the good answer. There are a couple things I'd suggest: 1) the OP asked for a substring of `pedigree_dhl`, not `SNP_name`. 2) When ordering you can omit the `df$` and simply do `df[with(df, order(pedigree_dhl)), ]`. I considered making (or rather suggesting) these changes in the answer directly, but they are rather extensive, especially the first. – BenBarnes Apr 25 '12 at 19:50
@Ben I suspect gauden used the `SNP_name` because in the original post the OP had uploaded a data set that had the `pedigree_dhl` column under the rest of the data set (most likely due to the console window being too narrow). gauden probably used the `SNP_name` b/c getting the data into R was more of a pain. I have since edited the OP's original thread to have a more friendly data set. – Tyler Rinker Apr 25 '12 at 19:54
I'm a little bit confused, and I think my question was not clearly stated. I would like to extract all the rows which have a specific population name. In the example above, I would like to extract all rows which have 'pedigree_dhl' starting with CCB133. I have more names in the column 'pedigree_dhl', and each pop-name has $B* and some numbers indicating an indiv. genotype. I would like to extract the data based on the first 7 letters of 'pedigree_dhl'. Example rows (columns allele1, allele2, snp_name, gs_entry, Pedigree_DHL): `T T ZM008929_0774 4856 UEBB015_B$1`; `T T ZM011407_0151 4857 UEBB015_B$2` – marie Apr 25 '12 at 21:05
@Tyler Rinker, you are perfectly right. I used `SNP_name` for ease as well as to demonstrate that `grepl` could be much more powerful. @marie I have edited the example to refer more directly to `pedigree_dhl` and I hope it is clearer now. – daedalus Apr 25 '12 at 21:44
So I was using the following command, but it is not working: `vector = c("pedigree_dhl")` followed by `grep("^[CCB133].*", vector, value=TRUE)` – marie Apr 25 '12 at 21:49
In this regular expression you are using square brackets `[CCB133]`, which means match ANY ONE of the characters and not the whole character string. Also you seem to be applying this regex to the string "pedigree_dhl" rather than the dataframe. My suggestion: if your data is in a dataframe, say `df`, then this expression `subset(df, grepl("^CCB133.*", pedigree_dhl) )` will select the rows that you are looking for, i.e., any rows in `df` where `pedigree_dhl` starts with CCB133 and is followed by a string of any other characters. – daedalus Apr 25 '12 at 22:13
@gauden: thank you so much for your help. As I was running the command, another question arose: is there a way I can tell R to put the extracted data straight into an Excel file and not print it in the R console window? R keeps telling me that there is not enough memory space to print it all. – marie Apr 25 '12 at 23:25
@marie that is a different question :) I think we have solved your data selection problem here (and if this was helpful you may want to give the green tick or at least an upvote for the answer). Your large dataset problem is a common one and you may find these answers on [Stackoverflow](http://stackoverflow.com/questions/1727772/quickly-reading-very-large-tables-as-dataframes-in-r) and on [Crossvalidated](http://stats.stackexchange.com/questions/3754/how-to-read-large-dataset-in-r) a good start. – daedalus Apr 25 '12 at 23:54
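
The character-class point raised in the comments above can be checked directly. Here is a small sketch with made-up IDs (the vector `ids` is hypothetical). `[CCB133]` matches any single one of the characters C, B, 1 or 3, so `^[CCB133]` only requires the first character to be one of those, whereas `^CCB133` requires the whole prefix.

```r
ids <- c("CCB133$*1", "CXYZ$*1", "3ABC$*2", "UEBB015_B$1")

# Character class: the first character must be one of C, B, 1, 3.
loose <- grep("^[CCB133].*", ids, value = TRUE)

# Literal prefix: the string must begin with exactly "CCB133".
strict <- grep("^CCB133", ids, value = TRUE)

# Searching the column *name* instead of the column's values finds nothing.
on_name <- grep("^CCB133", c("pedigree_dhl"), value = TRUE)
```

Here `loose` keeps the first three IDs, `strict` keeps only `CCB133$*1`, and `on_name` is empty because the string "pedigree_dhl" itself does not start with CCB133.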
