
Please help me! I have quite a big data set containing bank accounts.

(screenshot of the data set)

It is organised in the following way:

V1 - register number of a bank

V2 - date of account value record

V3 - account number

All remaining V columns hold the values themselves (in currency, metals, etc.)

I need to filter by account number, keeping every column in the table but only the rows for specific account numbers. Here is the code I use:

filelist = list.files(pattern = ".txt")

datalist = lapply(filelist, function(x)read.table(x, header=FALSE, sep = ";")) 

all_data = do.call("rbind", datalist) 

r_d <- rename(all_data, c("V1"="Number", "V2"="Dates", "V3"="Account"))
r_d$Account <- as.character(r_d$Account)
f_d <- filter(all_data, r_d$Account >= 42301 & r_d$Account <= 42315 |
    r_d$Account >= 20202 & r_d$Account <= 20210 |
    r_d$Account == 98010 | r_d$Account == 98015)

The problem is that the output of this code is a table containing only NAs. Everything becomes NA, even though those account numbers exist, and I am absolutely sure of that.

If I use Account in filter instead of r_d$Account, R tells me that the object Account does not exist, which I also do not understand.

Please, correct me.

Chingiz
  • If the filter function you are using is the one from package `dplyr`, try removing the "r_d$" and just write Account without quotes like so: `filter(all_data, Account >= 42301 ... ` – Pierre Lapointe Mar 26 '17 at 15:45
  • In this case, though @PLapointe's suggestion is valid, the code works identically; it will not work the same when your mid-pipe functions change the data.frame. However, since this is not a [reproducible question](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example), we're unable to do much. Problems: (1) image of data vice `dput(head(data,n=10))`; (2) use of `rename` is not correct; (3) use of `filter` is ill-advised; (4) you show `filelist` but is it at all relevant here? – r2evans Mar 26 '17 at 15:55
  • BTW: based on your treating of account numbers as ordinal vice categorical, some of your filtering will benefit from the `%in%` and `dplyr::between` functions, arguably making your filter much easier to read. – r2evans Mar 26 '17 at 15:56
  • If I use Account in filter instead of r_d$Account, R tells me that the object Account does not exist, which I also do not understand. – Chingiz Mar 26 '17 at 16:05
  • Why is the rename function not correct? I guess the problem starts there: R does not understand what to filter, and this is why it gives me only NAs. Am I right? – Chingiz Mar 26 '17 at 16:06

1 Answer


There are several things wrong with your code. The reason you are getting NAs is that you are passing NULLs all over the place. Did you ever look at r_d$Account? When you see problems in your code, start by going through it piecemeal, step by step; in this case you'll see that r_d$Account gives you NULL. Why? Because you did not rename the columns correctly. colnames(r_d) will be revealing.
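
A minimal sketch of that kind of check, using a made-up stand-in for the real data (the column values are invented; only the V1, V2, V3 names follow what read.table with header=FALSE produces):

```r
# Hypothetical stand-in for the imported data: read.table(header = FALSE)
# names the columns V1, V2, V3, ...
r_d <- data.frame(V1 = c(1481, 1481),
                  V2 = c("2017-01-01", "2017-02-01"),
                  V3 = c(42301, 98010))

cols <- colnames(r_d)              # the names the data frame actually has
missing_col <- is.null(r_d$Account)  # TRUE: no column is called "Account"

print(cols)
print(missing_col)
```

If the second value is TRUE, every later expression built on r_d$Account is working with NULL rather than with account numbers.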

First, either rename does non-standard evaluation with unquoted arguments, or rename_ takes a vector of character=character pairs. These might work (I can't know for certain, since I'm not about to transcribe your image of data; please provide copyable output from dput next time!):

# non-standard evaluation
rename(all_data, Number=V1, Dates=V2, Account=V3)

# standard-evaluation #1:
rename_(all_data, Number="V1", Dates="V2", Account="V3")

# standard-evaluation #2
rename_(all_data, .dots = c("Number"="V1", "Dates"="V2", "Account"="V3"))

From there, if you step through your code, you should see that r_d$Account is no longer NULL.
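
If the installed dplyr version gets in the way (the underscore variants such as rename_ were deprecated in later dplyr releases), a base R rename is one fallback. This is only a sketch on an invented data frame, and it assumes the first three columns are V1, V2, V3 in that order:

```r
# Hypothetical stand-in for all_data; the values are made up.
all_data <- data.frame(V1 = c(1481, 1481),
                       V2 = c("2017-01-01", "2017-02-01"),
                       V3 = c(42301, 98010))

# Base R rename: assign new names to the first three columns by position.
r_d <- all_data
names(r_d)[1:3] <- c("Number", "Dates", "Account")

print(names(r_d))
print(is.null(r_d$Account))  # FALSE once the rename has worked
```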

Second, is there a reason you create r_d but still reference all_data? There are definitely times when you need to do this kind of thing; this is not one of them, as it is too prone to problems (e.g., if the row order or dimensions of one of them change).

Third, because you convert $Account to character, it is really inappropriate to use inequality comparisons. Though it is certainly legal to do so ("1" < "2" is TRUE), it will run into problems, such as "11" < "2" is also TRUE, and "3" < "22" is FALSE. I'm not saying that you should avoid conversion to string; I think it is appropriate. Your use of account ranges is perplexing: an account number should be categorical, not ordinal, so selecting a range of account numbers is illogical.
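
The comparisons mentioned above can be tried directly at the console. A small sketch (the variable names are only for illustration):

```r
# String comparison works character by character, not by numeric value.
s1 <- "11" < "2"     # TRUE:  "1" sorts before "2"
s2 <- "3" < "22"     # FALSE: "3" sorts after "2"

# The same values compared as integers behave as expected.
n1 <- as.integer("11") < as.integer("2")   # FALSE
n2 <- as.integer("3") < as.integer("22")   # TRUE

# Comparing a string with a number silently converts the number to a string.
mixed <- "9" >= 42301   # TRUE, because "9" sorts after "42301"

print(c(s1, s2, n1, n2, mixed))
```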

Fourth, even assuming that account numbers should be ordinal and ranges make sense, your use of filter can be improved, but only if you either (a) accept that comparing stringified numbers is acceptable ("3" > "22"), or (b) keep them as integers. To begin with, you should not be referencing r_d$ within an NSE dplyr function. (Edit: you also need to group your logic with parentheses.) This is a literal translation of your code:

f_d <- filter(r_d, (Account >= 42301 & Account <= 42315) |
    (Account >= 20202 & Account <= 20210) |
    Account == 98010 | Account == 98015)

You can make this perhaps more readable with:

f_d <- filter(r_d,
              Account %in% c(98010, 98015) |
                between(Account, 42301, 42315) |
                between(Account, 20202, 20210)
              )
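
To see what this filter keeps, here is a sketch on an invented data frame. It assumes option (b), i.e. Account is left as an integer column, and the Number and Value columns are placeholders:

```r
library(dplyr)

# Made-up data: account numbers just inside and just outside each range.
r_d <- data.frame(
  Number  = 1481L,
  Account = c(20201L, 20202L, 20210L, 42301L, 42316L, 98010L, 98015L, 98020L),
  Value   = seq_len(8)
)

f_d <- filter(r_d,
              Account %in% c(98010L, 98015L) |
                between(Account, 42301L, 42315L) |
                between(Account, 20202L, 20210L))

print(f_d$Account)  # 20202 20210 42301 98010 98015
```

between() includes both end points, so 20202 and 20210 are kept while 20201 and 42316 are dropped.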

Perhaps a better way to do it, assuming $Account is character, would be to determine which accounts are appropriate based on some other criteria (open date, order date, something else from a different column), and once you have a vector of account numbers, do

filter(r_d,
       Account %in% vector_of_interesting_account_numbers)
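
If the ranges from the question really are the selection criterion, one way to get such a vector while keeping Account as character is to enumerate the numbers and convert them to strings. This is a sketch with invented data; the name wanted is hypothetical, and base R subsetting is used here in place of filter so that it runs without dplyr:

```r
# Enumerate the account numbers of interest, then convert to character
# so they can be matched against a character Account column.
wanted <- as.character(c(20202:20210, 42301:42315, 98010L, 98015L))

# Made-up data with Account stored as character.
r_d <- data.frame(Account = c("20201", "20205", "42315", "42316", "98015"),
                  Value   = 1:5,
                  stringsAsFactors = FALSE)

# %in% is an exact match, so no string ordering is involved.
f_d <- r_d[r_d$Account %in% wanted, ]

print(f_d$Account)  # "20205" "42315" "98015"
```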
r2evans
  • I got a lot of new, useful info from your answer! Another problem appeared, though :( R shows only account numbers 98010 and 98015, but it skips everything between 42301 and 42315 and between 20202 and 20210 :( Do you know why? – Chingiz Mar 26 '17 at 17:30
  • Oh, sorry, I didn't see your comment! I will do it with great pleasure! Where is this Accept button? – Chingiz May 15 '17 at 16:10
  • To the (upper-)left of the answer, immediately under the "up/down-vote" arrows. (Only the user who asked the question can see it.) – r2evans May 15 '17 at 16:12