How to select specific columns from lots of dataframes in r

Question

I am handling 303 data frames and each data frame has different name as well as contains different columns. What I want is that select specific columns of which name contain "_CD" and "_NM" from those lots of data frames. Is there any good way to deal with it?

Note, those data frames were not listed.

Here is an example (I cannot bring up the real data but show you the fake with similar format)

table 1 name: AD_ITEM_CNT;

columns : CKND_CLSF_CD, SUBJ_CD, ITEM_CNT, UPDT_DT

table 2 name : BLAD_PE_XEXM

columns : PT_SBST_NO, CEXM_NM, CEXM_CD, CEXM_RST, CEX_RSLT_CMNT, LDNG_DT

table 3 name : BLAD_MR_HLTH

columns : PT_SBST_NO, RGST_DT, EDU_DGRE_CD, DRNK_YN, JOB_KIND_CD ...

The goal of this process is to create a table like the below

Table          |       column |  CD 
-----------------------------------
AD_ITEM_CNT    | CKND_CLSF_CD |  01
AD_ITEM_CNT    | CKND_CLSF_CD |  02
AD_ITEM_CNT    | CKND_CLSF_CD |  03
AD_ITEM_CNT    | CKND_CLSF_CD |  04
BLAD_PE_XEXM   | CEXM_CD      |  AVS
BLAD_PE_XEXM   | CEXM_CD      |  ABE
BLAD_PE_XEXM   | CEXM_CD      |  CVS
BLAD_PE_XEXM   | CEXM_CD      |  UVS

Yes, there is. Can you create a minimal reproducible example so it is easier for people to test their code? Maybe also show what you have tried? — markus, Feb 19 '20 at 22:34
I'm hoping that you have all 303 frames contained within a single `list` ... if not, I strongly suggest ... no, *urge* you to consider it: https://stackoverflow.com/a/24376207/. This might be as simple as `lapply(list_of_frames, function(x) x[, grepl("cy", names(x)), drop=FALSE])`. — r2evans, Feb 19 '20 at 22:47
@r2evans or make all the dataframes in the global workspace into a list with `my_list <- as.list(.GlobalEnv); my_list <- my_list[sapply(my_list, class) == "data.frame"]` — Allan Cameron, Feb 19 '20 at 22:56
AllanCameron, in my view, that's an acceptable method to get to the point of using a list-of-frames ... but if it is followed by anything that `assign`s them back to the global, them I'm definitely "out". — r2evans, Feb 19 '20 at 23:04
I am sorry for the late response. I added example and the specified the goal of this process. Thanks for your interest. — Sangwon Steve Lee, Feb 19 '20 at 23:11
Sangwon, I suggest you look at two things: (1) Good links on how to ask a question *well* so that it can quickly be tested and answered by others: https://stackoverflow.com/questions/5963269, https://stackoverflow.com/help/mcve, and https://stackoverflow.com/tags/r/info. (2) Less critically, how to format a question well so that it reads better (code text versus paragraph text, etc): https://stackoverflow.com/editing-help#syntax-highlighting. — r2evans, Feb 19 '20 at 23:17
Some specifics: your data is not in a format easily digestible, if done well, we can just copy and paste it into the R terminal and have nothing to worry about. The gold-standard for this is either programmatically building data (e.g., `data.frame(...)`) or the output from `dput(head(x))`, but some tables might be clear-enough if you paste the `data.frame` representation on the console (though there are many many times this does not work well enough). — r2evans, Feb 19 '20 at 23:19
BTW: inability to show real data is common, you just need to spend a moment and reduce the problem as much as you can and inject *fake* but *representative-enough* data. That is, if you need a string, integer, or floating-point, you may not need real or accurate numbers. — r2evans, Feb 19 '20 at 23:20
You say you want to *"select specific columns"* but it appears that you are counting them or something similar. Can you clarify what you have, what you need, how you think you need to get there, and any code you've tried so far? — r2evans, Feb 19 '20 at 23:22
Okay. so, the dataset that I am dealing with is based on clinical information. Each data frame contains different set of medical information. For instance, AD_ITEM_CNT table contains medical records of treatment prescription for each patients who have visited a hospital. — Sangwon Steve Lee, Feb 19 '20 at 23:26
Columns ending with "_CD" or "_NM" indicates factor formatted variables. — Sangwon Steve Lee, Feb 19 '20 at 23:29
Ddi you try r2evans' suggestion? Can you put all the data in a list so that it is easy to manage? You can then select specific column easily from it. — Ronak Shah, Feb 20 '20 at 01:49

score 0 · Answer 1 · answered Feb 20 '20 at 12:27

As @r2evans said, I more than encourage you to start with loading all your data.frames into one list. However, if you don't have any other option, you can do this:

First, lets create 5 data.frames:

for (i in 1:5) { 

  set.seed(123)
  assign(paste0("y",i), data.frame(A_CD = rnorm(1), 
                                   A_NM = rnorm(1), 
                                   A_GK = rnorm(1)))

}

Next, we can create a list of all of our variables (using ls()) and filter (Filter) for only the data.frames:

dfs <- Filter(function(x) is(x, "data.frame"), mget(ls()))

Lastly, we can filter the data.frames inside the list for the needed columns (have CD and NM in them):

dfs <- lapply(dfs, function(x) x[, grepl("CD|NM", names(x)), drop=FALSE])

Output:

dfs

$y1
        A_CD       A_NM
1 -0.5604756 -0.2301775

$y2
        A_CD       A_NM
1 -0.5604756 -0.2301775

$y3
        A_CD       A_NM
1 -0.5604756 -0.2301775

$y4
        A_CD       A_NM
1 -0.5604756 -0.2301775

$y5
        A_CD       A_NM
1 -0.5604756 -0.2301775

How to select specific columns from lots of dataframes in r

1 Answers1