Select columns from a data frame

Question

I have a Data Frame made up of several columns, each corresponding to a different industry per country. I have 56 industries and 43 countries and I'd select only industries from 5 to 22 per country (18 industries). The big issue is that each industry per country is named as: AUS1, AUS2 ..., AUS56. What I shall select is AUS5 to AUS22, AUT5 to AUT22 .... A viable solution could be to select columns according to the following algorithm: the first column of interest, i.e., AUS5 corresponds to column 10 and then I select up to AUS22 (corresponding to column 27). Then, I should skip all the remaining column for AUS (i.e. AUS23 to AUS56), and the first 4 columns for the next country (from AUT1 to AUT4). Then, I select, as before, industries from 5 to 22 for AUT. Basically, the algorithm, starting from column 10 should be able to select 18 columns(including column 10) and then skip the next 38 columns, and then select the next 18 columns. This process should be repeated for all the 43 countries. How can I code that?

UPDATE, Example:

df=data.frame(industry = c("C10","C11","C12","C13"),
                   country = c("USA"),
                   AUS3 = runif(4),
                   AUS4 = runif(4),
                   AUS5 = runif(4),
                   AUS6 = runif(4),
                   DEU5 = runif(4),
                   DEU6 = runif(4),
                   DEU7 = runif(4),
                   DEU8 = runif(4))
#I'm interested only in C10-c11:

df_a=df %>% filter(grepl('C10|C11',industry))
df_a

#Thus, how can I select columns AUS10,AUS11, DEU10,DEU11 efficiently, considering that I have a huge dataset?

Better use column names rather than numbers to select, definitely less error prone, then consider `paste0('AU', 1:56)`. — jay.sf, Sep 18 '22 at 11:19
Thanks for your prompt reply. The only problem is that I have plenty of countries and each column is labeled as countrycode+industry number, e.g. AUS40. Is there any more efficient way for doing that? — Maximilian, Sep 18 '22 at 11:29
You should make some effort and create a [reproducible example](https://stackoverflow.com/a/5963610/6574038). — jay.sf, Sep 18 '22 at 11:33

jay.sf · Accepted Answer · 2022-09-18T12:27:01.710

1

Demonstrating the paste0 approach.

ctr <- unique(gsub('\\d', '', names(df[-(1:2)])))
# ctr <- c("AUS", "DEU")  ## alternatively hard-coded

ind <- c(10, 11)

subset(df, industry == paste0('C', 10:11), 
       select=c('industry', 'country', paste0(rep(ctr, each=length(ind)), ind)))
#   industry country     AUS10     AUS11     DEU10     DEU11
# 1      C10     USA 0.3376674 0.1568496 0.5033433 0.7327734
# 2      C11     USA 0.7421840 0.6808892 0.9050158 0.3689741

Or, since you appear to like grep you could do.

df[grep('10|11', df$industry), grep('industry|country|[A-Z]{3}1[01]', names(df))]
#   industry country     AUS10     AUS11     DEU10     DEU11
# 1      C10     USA 0.3376674 0.1568496 0.5033433 0.7327734
# 2      C11     USA 0.7421840 0.6808892 0.9050158 0.3689741

edited Sep 18 '22 at 12:27

answered Sep 18 '22 at 12:11

jay.sf

60,139
8
53
110

But the only 2 rows shall be C10 and C11 (as I did above) and the 4 remaining columns shall be AUS10 , AUS11 , DEU10, DEU11. The coding should be feasible for a large dataset – Maximilian Sep 18 '22 at 12:18
@Mike I see, then `subset` might be more convenient. see update. – jay.sf Sep 18 '22 at 12:22
+1, I like that regex approach to get the columns. Given OP wants to repeat this on up to around 20 columns for each country, there should be a way to adapt what you have slightly to get it to work for OP's real data? – diomedesdata Sep 18 '22 at 12:36
@diomedesdata The last part of the regex should just cover those columns, something like `[A-Z]{3}[3-4][0-9]|[A-Z]{3}2[0-3]|[A-Z]{3}5[0-6]` perhaps. – jay.sf Sep 18 '22 at 12:41

score 0 · Answer 2 · answered Sep 18 '22 at 12:23

If you have a big data set in memory, data.table could be ideal and much faster than alternatives. Something like the following could work, though you will need to play with select_ind and select_ctr as desired on the real dataset.

It might be worth giving us a slightly larger toy example, if possible.

library(data.table)
setDT(df)


select_ind <- paste0(c("C"), c("11","10"))
select_ctr <- paste0(rep(c("AUS", "DEU"), each = 2), c("10","11"))


df[grepl(paste0(select_ind, collapse = "|"), industry), # select rows
   ..select_ctr] # select columns    

       AUS10     AUS11     DEU10     DEU11
1: 0.9040223 0.2638725 0.9779399 0.1672789
2: 0.6162678 0.3095942 0.1527307 0.6270880

For more information, see Introduction to data.table.

Select columns from a data frame

2 Answers2