
I want to select 3117 columns from a data frame. I tried selecting them by column name:

dataframe %>% 
  select(
    'AAACCTGAGCACGCCT-1',
    'AAACCTGAGCGCTTAT-1',
    'AAACCTGAGCGTTGCC-1',
    ......,
    'TTGGAACCACGGACAA-1'
  )

or

firstpickupnames <- ('AAACCTGAGCACGCCT-1','AAACCTGAGCGCTTAT-1','AAACCTGAGCGTTGCC-1',......,'TTGGAACCACGGACAA-1')

Either way, the R console just replied:

'AAACCTGAGCACGCCT-1','AAACCTGAGCGCTTAT-1','AAACCTGAGCGTTGCC-
1',......,'TTGGAACCACGGACAA-1'
+ )
+

What does this mean? Is there a limit on the number of columns that I can select in R?

Kim
Canary
  • [How to make a great R reproducible example?](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – markus Aug 08 '18 at 22:33
  • I doubt that you literally typed 3117 names inside a `select` function call. Perhaps it would help to describe exactly what you did do instead. My guess is that you are running into a difficulty in pasting large strings directly from the clipboard. – John Coleman Aug 08 '18 at 22:36
  • If `my_names <- c('AAACCTGAGCACGCCT-1','AAACCTGAGCGCTTAT-1', ...)` then `dataframe %>% select(my_names)` works. – John Coleman Aug 08 '18 at 22:54
  • I had a csv file containing the list of column names that I hoped to select from a data frame. I added quotation marks to the names and later opened the result as a txt file; that's how I "typed" 3117 names. When I reduce the number of columns to about 100, the selection works. So I'm not sure whether I cannot select 3117 columns at one time because of a limit on the number of columns. Thanks for reminding me. – Canary Aug 09 '18 at 01:40
  • Thanks to John Coleman's answer, I finally figured out that my problem was caused by incorrect input of long commands and not by a limit on the number of columns. – Canary Aug 20 '18 at 16:19

2 Answers


Without a reproducible example, it's difficult to know what exactly you're looking for, but dplyr::select() has several options for selecting columns, and dplyr::everything() might be what you're looking for:

library(dplyr)

# this reorders the column names, but keeps everything without having to name the columns specifically:
mtcars %>% 
  select(carb, gear, everything()) 

# from a list of column names:
keep_columns <- c('cyl','disp','hp')
mtcars %>% 
  select(one_of(keep_columns)) 

# specific names, and a range of names:
mtcars %>% 
  select(hp, qsec:gear) 

# You could also use `contains()`, `starts_with()`, `ends_with()`, or `matches()`. Note that chaining all of the following at once will leave you with no columns:
mtcars %>% 
  select(contains('t')) %>%
  select(starts_with('a')) %>% 
  select(ends_with('b')) %>% 
  select(matches('^m.+g$')) 
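
To illustrate the "from a list of column names" case at the scale in the question, here is a minimal sketch. It assumes dplyr 1.0.0 or later, where `all_of()` is available as the stricter successor to `one_of()`. The data frame `wide` and its `col_*` names are hypothetical stand-ins, not the asker's data. The name vector is built in code rather than pasted into the console.

```r
library(dplyr)

# Hypothetical wide data frame: 5000 columns named col_1 ... col_5000
wide <- as.data.frame(matrix(0, nrow = 2, ncol = 5000))
names(wide) <- paste0("col_", seq_len(5000))

# Character vector of the 3117 names to keep, built programmatically
keep_columns <- paste0("col_", seq_len(3117))

# all_of() errors if any requested name is missing;
# any_of() would silently skip missing names instead
picked <- wide %>%
  select(all_of(keep_columns))

ncol(picked)
```

If this runs without complaint, the number of selected columns is not the obstacle; how the names reach the console is.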
sbha
  • Thanks for your help! I used code like the one you listed: keep_columns <- c('cyl','disp','hp') – Canary Aug 09 '18 at 01:40
  • Thanks for your reply! I used code like the one you listed: keep_columns <- c('cyl','disp','hp'). But my problem is that I want to select >3000 columns. When I reduce the number of columns to about 100, the selection works. So I'm not sure whether the console stopped working because I tried to select too many columns at one time. If that is the case, I want to know how many columns I can select at one time. – Canary Aug 09 '18 at 01:48
  • Do the column names have any patterns that you could use to select? If all the columns you need are in a similar format to the samples from your question, you could use `select(matches('^[A-Z]+-[0-9]+$'))`. This regex matches column names that start with at least one capital letter, followed by a dash, and end with at least one digit. You can `select(matches())` using multiple patterns if needed: `select(matches('regex_pattern1'), matches('regex_pattern2'))`. You can also select using multiple ranges: `select(a1:a100, b1:b50, c101:c200)` if you know the column name order. – sbha Aug 09 '18 at 19:05
  • Or another idea: is it possible to drop the columns you don't need rather than select everything you need? – sbha Aug 09 '18 at 19:05
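
The two suggestions in these comments (select by pattern, or drop what you don't need) can be sketched as follows. The data frame `barcodes` and its `gene` and `note` columns are hypothetical; only the two barcode-style names come from the question. The sketch assumes dplyr 1.0.0 or later for `any_of()`. Note that `matches()` ignores case by default, so pass `ignore.case = FALSE` if the capital-letter requirement matters.

```r
library(dplyr)

# Hypothetical data: two barcode-named columns plus two unrelated ones.
# check.names = FALSE keeps the dash in the column names.
barcodes <- data.frame(
  gene = c("g1", "g2"),
  `AAACCTGAGCACGCCT-1` = c(1, 2),
  `TTGGAACCACGGACAA-1` = c(3, 4),
  note = c("x", "y"),
  check.names = FALSE
)

# Select by pattern: letters, a dash, then digits
by_pattern <- barcodes %>%
  select(matches("^[A-Z]+-[0-9]+$"))

# Or drop the few unwanted columns instead of listing the many wanted ones
without_dropped <- barcodes %>%
  select(-any_of(c("gene", "note")))

names(by_pattern)
names(without_dropped)
```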

The way that the console replies (with the + indicating that it is waiting for the rest of the expression) strongly suggests that you are running into a limit on how long a command the console can process (you are assembling the command by pasting from the clipboard) rather than an inherent limit on the number of columns that can be selected. The only mention of this limit that I could find in the documentation says "Command lines entered at the console are limited to about 4095 bytes."
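
As a rough back-of-the-envelope check, the sketch below estimates the size of the pasted command. It assumes all 3117 names look like the 18-character sample in the question, each wrapped in quotes and followed by a comma, and that they are pasted as a single line; the variable names are illustrative only.

```r
n_names <- 3117
example_name <- "'AAACCTGAGCACGCCT-1'"  # sample name from the question, with its quotes

# Bytes per name, plus one for the separating comma
approx_bytes <- n_names * (nchar(example_name, type = "bytes") + 1)

approx_bytes          # rough size of the pasted name list, in bytes
approx_bytes > 4095   # well beyond the documented console line limit
```

Under the same assumptions, about 100 names come to roughly 2100 bytes, which fits within the limit and is consistent with the smaller selection working.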

In the comments you said that the column names that you wanted to select were in a csv file. You didn't say much about the structure of the csv file, but suppose that it contains a single list of column names. As an example, I created a file named "colnames.csv" which has a single line:

Sepal.Width, Petal.Length

Note that there is no need to manually place quote marks around the column names in the text file. Then in the R console I typed:

col_names <- read.csv("colnames.csv", header = FALSE, strip.white = TRUE, stringsAsFactors = FALSE)
iris %>% select(one_of(as.character(col_names)))

which worked as expected. Even though this example only used 2 columns, there is no reason that it should fail with 3000+, since the number of columns per se wasn't the problem with what you were doing.

If the structure of the csv file is different from the example then you would need to adjust the call to read.csv and perhaps the way that you convert it to a character vector, but you should be able to tweak this approach to your situation.
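
As one example of such an adjustment, here is a minimal sketch for a file that lists one column name per line instead of all names on a single line. The file layout is an assumption, the file is written to a temporary path so the example is self-contained, and `all_of()` assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical names file: one column name per line, no header, no quotes
names_file <- tempfile(fileext = ".csv")
writeLines(c("Sepal.Width", "Petal.Length"), names_file)

# Read the names as a plain character vector and trim stray whitespace
wanted <- trimws(readLines(names_file))

# The names never pass through the console as a long literal command
selected <- iris %>%
  select(all_of(wanted))

names(selected)
```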

John Coleman