How to subsetting dataframes based on a condition in R

Question

I have to data frames, returns.df and funds.df with identical dimensions:

returns1 <- c(0.1,0.2,0.5,0.9)
returns2 <- c(0.3,0.4,0.7,0.1)
returns.df <- data.frame(returns1,returns2)
returns.df

funds1 <- c("Fund A","Fund B","","Fund D")
funds2 <- c("Fund B","Fund C","","Fund A")
funds.df <- data.frame(funds1, funds2)
funds.df

I am trying to store 4 subsets of returns for each of the funds. For example, the subset Fund Returns for fund A would look like:

returns.FundA1 <- c(0.1,"","","")
returns.FundA2 <- c("","","",0.11)
returns.FundA.df <- data.frame(returns.FundA1, returns.FundA2)
returns.FundA.df

Basically I want to create a new data.frame where I just replace the Fund name in fund.df separately by its returns in returns.df, but in a new data.frame. Normally I would do this in excel across multiple sheets using a simple if function. But I am confident that this can be done much faster in R. I appreciate any inputs

score 0 · Accepted Answer · edited May 23 '17 at 12:16

0

It's personal preference, but I would reshape my data to long format, where everything is contained in the same object. Then subsetting (or performing operations for each fund) is easy. If you're doing the same thing for each fund, creating a separate dataframe for each is unnecessary, and it'll be a lot of work.

#create ID based on row number/name
#returns.df$id <- rownames(returns.df)
#funds.df$id <- rownames(funds.df)

returns.df$id <- 1:nrow(returns.df)
funds.df$id <- 1:nrow(funds.df)

#combine them (based on these IDs)
return.funds <- merge(returns.df, funds.df, by="id")

Then, we turn this combined, 'wide' dataset to long format. As we have two distinct measures that vary (return and fund, both have a first and a second value), we use data.table as it's melt-function is able to deal with multiple measure-columns. We set the pattern and names in the melt-function.

#turn to long
library(data.table)
return.funds.m <- melt(setDT(return.funds),
    measure.vars=patterns(c("returns","funds")),
    value.name=c("return","fund"))

Now our data looks like this:

   id variable return   fund
1:  1        1    0.1 Fund A
2:  2        1    0.2 Fund B
3:  3        1    0.5       
4:  4        1    0.9 Fund D
5:  1        2    0.3 Fund B
6:  2        2    0.4 Fund C
7:  3        2    0.7       
8:  4        2    0.1 Fund A
#example: select fund A

Say we wanted to get the mean return for both time-points for each fund, we could do this:

return.funds.m[,mean(returns), by=fund]

For how to run/examine multiple models, I refer you to this and this answer.

edited May 23 '17 at 12:16

Community

1
1

answered Feb 27 '16 at 12:12

Heroka

12,889
1
28
38

I actually thought of something like this. However, my dataset is enormous and fund names can change, which is why I find the best way to replace the fund name by its respective return in the same element. Also I am doing this, because I want to examine each fund by OLS regression seperately. – Oden Feb 27 '16 at 12:27
For the first part, I'm not sure what you mean. Can one fund have different names? For the second part, this can be done in data.table. – Heroka Feb 27 '16 at 12:33
I mean, I have multiple columns, where for example Fund A in columnC in row 2 can change it name to Fund B in columnD row 2. Wouldn't this cause a problem? I should perhaps tell the goal here. So I want to compute the arithmetic mean across each fund, over time. My idea was then create a subset of each funds, and then use the apply function. – Oden Feb 27 '16 at 12:45
I might not understand the problem correctly, but I don't think it will. I will add some more explanation in my answer. Still unsure what you mean by 'changed' names; the melt-function in data.table can handle multiple measures across multiple columns, as long as the naming is consistent. – Heroka Feb 27 '16 at 12:50
I think I might overthink it. I am trying to run your code – Oden Feb 27 '16 at 12:56
> > return.funds.m[fund=="Fund A",] Error: unexpected '>' in ">" > id variable return fund Error: unexpected symbol in "id variable" > 1: 1 1 0.1 Fund A Error: unexpected numeric constant in "1: 1 1" > 2: 4 2 0.1 Fund A Error: unexpected numeric constant in "2: 4 2" – Oden Feb 27 '16 at 12:58
Do your original data have rownames that are characters? If so, does it solve your problem if you create the id's by `1:nrow(...)`? – Heroka Feb 27 '16 at 13:01
The unexpected '>' happends if you try to run code like it's in console (with a '>' at the start, remove it from the line in the script – Heroka Feb 27 '16 at 13:04
I see. I will check the rowsnames – Oden Feb 27 '16 at 13:10
I do have problems with rownames – Oden Feb 27 '16 at 13:13
I added new codes, which uses row indices. Does that solve it? – Heroka Feb 27 '16 at 13:19
You're welcome! I don't want to beat my own drum, but if the question is solved to your satisfaction, could you consider accepting the answer? Keeps others from spending time on the question. – Heroka Feb 27 '16 at 13:37

How to subsetting dataframes based on a condition in R

1 Answers1