I would appreciate some help on the following problem:

I have multiple huge log files (>1,000,000 entries each) that contain some lines (rows) of particular interest to me. I want to extract just these lines and write the result into a matrix that holds the information for more than one log file/participant. So I wrote a short piece of code to (1) create the subset and (2) run it within a loop, so that it is applied to all of the log files rather than just one.

  Result <- subset(df, df$columnOfInterest== "interestingCondition1" | df$columnOfInterest== "interestingCondition2" | df$columnOfInterest== "interestingCondition3")$columnOfInterest
  View(Result)

 1  interestingCondition1
 2  interestingCondition1
 3  interestingCondition2
 4  interestingCondition1
 5  interestingCondition1
 6  interestingCondition3
 7  interestingCondition2
 8  interestingCondition1
 9  interestingCondition1
10  interestingCondition1

Embedded in a loop:

WrongResult <- matrix(data=NA,nrow=TrialNumber, ncol=length(ListOfFiles))
vpncount <- 1
for (v in ListOfFiles){

  df<- read.delim(v, header = TRUE, sep='\t')
  WrongResult[,vpncount] <- subset(df, df$columnOfInterest== "interestingCondition1" | df$columnOfInterest== "interestingCondition2" | df$columnOfInterest== "interestingCondition3")$columnOfInterest

vpncount <- vpncount+1

}

When I run the code on one log file I get the result I want, but when I run it through the loop, it creates a matrix of the appropriate size that is filled with "random" numbers instead of the conditions I subset for.

Does anyone know why that happens and how to fix it? Any help is much appreciated!

EDIT:

I tried to create an example data frame. The first line of code (assigning the variable Result) works just as I want it to: it filters my data frame on the rows of columnOfInterest and puts them into a new matrix. But if I try to run it within a loop for more than one data frame I keep running into errors:

df <- data.frame(
  X = sample(1:10),
  columnOfInterest= sample(c("interestingCondition1", "interestingCondition2", "interestingCondition3", "NotinterestingCondition1"), 10, replace = TRUE)
)

View(df)

Result <- subset(df, df$columnOfInterest== "interestingCondition1" | df$columnOfInterest== "interestingCondition2" | df$columnOfInterest== "interestingCondition3")$columnOfInterest
View(Result)

WrongResult <- matrix(data=NA,nrow=280, ncol=20)
vpncount <- 1
for (v in 1:20){

  df<- read.delim(v, header = TRUE, sep='\t')
  WrongResult[,vpncount] <- subset(df, df$columnOfInterest== "interestingCondition1" | df$columnOfInterest== "interestingCondition2" | df$columnOfInterest== "interestingCondition3")$columnOfInterest

  vpncount <- vpncount+1

}

View(WrongResult)
Alex B.
  • I don't see your sapply function? – DJJ Jun 09 '18 at 09:48
  • Yes, you are right, I am really sorry. I tested a couple of solutions, and above I replaced the sapply line with the subset() one. Here is the other idea, which worked for me as well, but the problem of putting it into a loop (and so running it over multiple data frames) still exists. – Alex B. Jun 09 '18 at 11:52
  • WrongResult2 <- t(sapply(seq(length = nrow(df)), function(x) df[x, "columnOfIntertest"] %in% c("interstingCondition1", "interstingCondition2", "interstingCondition3"))) View(WrongResult2 ) – Alex B. Jun 09 '18 at 11:53
  • Please correct the typos in your sample data -- `columnOfInterest`, `interestingCondition1` – Martin Morgan Jun 09 '18 at 12:49
  • @MartinMorgan Done, sorry for that! – Alex B. Jun 09 '18 at 13:01

3 Answers

I don't remember how to do it with data.frame, so I'll use data.table. If you don't have the data.table package, install it first with install.packages("data.table").

library(data.table)
dt <- data.table(df)

Then you could rewrite your code in the following way

subset..table <- function(dt){
    dt[columnOfInterest %in% c("interestingCondition1",
                               "interestingCondition2",
                               "interestingCondition3"),columnOfInterest]
}


myfun <- function(x){
    ## x: character string giving a file name
    ## Purpose: read the file and subset it

    dt <- fread(x, header=TRUE, sep="\t")
    subset..table(dt)
}

res..list <- lapply(ListOfFiles, myfun)

Edit

For instance, using your example:

df <- data.frame(
  X = sample(1:10),
  columnOfInterest= sample(c("interestingCondition1",
    "interestingCondition2", "interestingCondition3", 
    "NotinterestingCondition1"), 10, replace = TRUE))


dt <- data.table(df)
subset..table(dt)

would yield

#[1] "interestingCondition2" "interestingCondition3" "interestingCondition1"
#[4] "interestingCondition2" "interestingCondition1" "interestingCondition2"
#[7] "interestingCondition3" "interestingCondition1" "interestingCondition3"

If you are satisfied with the function subset..table, then you just need to use the function myfun to get what you want. The function fread will automatically give you a data.table.

DJJ
  • I only see `fread()` as dependent on `data.table`. Can't you replace that with `read.delim()` as used by the OP? – AkselA Jun 09 '18 at 11:00
  • Unfortunately this did not work. The error message tells us, that columnOfInterest (object) was not found. Any idea, why that is? – Alex B. Jun 09 '18 at 11:50
  • How? I see nothing that is specific to `data.table` there. – AkselA Jun 09 '18 at 12:02
  • @AlexB.: If you give us a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) it will be easier to give you a working solution. – AkselA Jun 09 '18 at 12:07
  • @DJJ columnOfInterest is one column of `df`. @AkselA I am working on a reproducible example, since my data frame is too big to be copied directly. – Alex B. Jun 09 '18 at 12:26
  • you don't have to put the entire table just the smallest sample would be enough. read the reproducible example link. It might give you some hint. – DJJ Jun 09 '18 at 12:28
  • @DJJ: I have edited the question and added an example that hopefully illustrated the problem a bit better. – Alex B. Jun 09 '18 at 12:39
  • @DJJ Thanks a lot, I really appreciate it! It worked just fine for one data frame and probably for more as well, but I haven't tested it, because I took AkselA's solution. – Alex B. Jun 10 '18 at 08:37

In the tidyverse realm when you're processing a single data frame you'd like to filter() and then select() your original data, and for convenience add, using mutate(), the file name. A nice way to filter when there are several possible values is using %in%. So

library(tidyverse)

process_1_df <- function(df, id, condition)
    select(df, columnOfInterest) %>%                 # only interesting column
        filter(columnOfInterest %in% condition) %>%  # specific rows
        mutate(id = id)                              # add identifier

condition <- paste0("interestingCondition", 1:3)
process_1_df(df, "id", condition)

id is meant to be an identifier -- if the data.frame came from file 'foo.txt', then use "foo.txt" for the id. The original question tried to represent data from several files as a matrix, but that assumes that each file has the same number of interesting rows selected. Here the strategy is to create a data frame that contains the file that the interesting condition came from, and the value of the interesting condition. This data frame is useful when processing several files...

This works on the sample data set as:

> condition <- paste0("interestingCondition", 1:3)
> process_1_df(df, "id", condition)
       columnOfInterest id
1 interestingCondition2 id
2 interestingCondition2 id
3 interestingCondition3 id
4 interestingCondition1 id
5 interestingCondition3 id
6 interestingCondition1 id

You might extend this to process a file

process_1_file <- function(file_name, condition)
    read_csv(file_name) %>%                   # better: input only columnOfInterest
        process_1_df(file_name, condition)

As @DJJ suggests, a data.table implementation of process_1_file() is likely to be very compact and efficient -- fread(file_name)[columnOfInterest %in% condition, columnOfInterest]

To process several files, use the purrr package

library(purrr)
process_files <- function(file_names, condition)
    map(file_names, process_1_file, condition) %>%
        bind_rows()

dir(pattern="\\.csv$") %>% process_files(condition)   # pattern is a regular expression, not a glob

The end result is a single data frame, with a column of interesting conditions and another column indicating which log file the interesting condition came from. This 'long'-format data frame can now be processed / summarized as desired.
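
As one possible summary, and only as a sketch, base R's table() counts how often each condition occurs per file. The data frame `long` below is made-up stand-in data shaped like the result described above; the file names are hypothetical.

```r
# Hypothetical stand-in for the combined 'long' data frame
long <- data.frame(
  columnOfInterest = c("interestingCondition1", "interestingCondition2",
                       "interestingCondition1", "interestingCondition3"),
  id = c("a.txt", "a.txt", "b.txt", "b.txt"),
  stringsAsFactors = FALSE
)

# Cross-tabulate: one row per file, one column per condition
counts <- table(long$id, long$columnOfInterest)
counts
```

Each cell holds the number of interesting rows of that condition found in that file, so files with different numbers of matches are not a problem.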

Martin Morgan
  • Thanks a lot for your effort! I tried to use the code, but the result is not filtered. The resulting matrix consists of all 10 rows of the provided sample data frame, and NotinterestingCondition1 is still included. I can't really follow your code, since the mentioned packages and R in general are pretty new to me. Do you know why this happens? – Alex B. Jun 09 '18 at 13:19
  • did you ensure that the typos were corrected? I updated my post to demonstrate that it works. – Martin Morgan Jun 09 '18 at 13:41
  • Copying your updated code I still get the following error message: ' Error in filter_impl(.data, quo) : Evaluation error: argument "condition" is missing, with no default.' – Alex B. Jun 09 '18 at 13:46
  • did you define the condition variable, `condition <- paste0("interestingCondition", 1:3)` indicating the condition(s) of interest? – Martin Morgan Jun 09 '18 at 13:53
  • Yes I did and it works properly: `condition [1] "interestingCondition1" "interestingCondition2" "interestingCondition3`. Still the function is not working: `process_1_df(df, condition) Error in filter_impl(.data, quo) : Evaluation error: argument "condition" is missing, with no default.` – Alex B. Jun 09 '18 at 14:00
  • Sorry, I corrected my 'illustration' but not the original script; see the update – Martin Morgan Jun 09 '18 at 14:03
  • Nice, `process_1_df` works just as I would like it to, thanks a lot. The only thing I don't understand there is `id`. What is it used for and what should I edit there? But even more important is the usage for multiple data frames. My log files are `.txt` and not csv. What part of your code do I have to edit then? – Alex B. Jun 09 '18 at 14:21
  • I tried to add explanation for id to the answer, and the rationale for creating a data frame for each log file. You'll need to modify the step `read_csv(file_name)` to perform whatever operation will get your data from the text file into R as a data frame; only you know the format of the text file, so I can't help you there... – Martin Morgan Jun 09 '18 at 14:28
  • Thank you very much. Modifying `read_csv(file_name)` probably would have worked just fine, but I haven't tested it, because AkselA's solution worked fine for me as well. – Alex B. Jun 10 '18 at 08:39
  • not modify `read_csv()` but if AkselA's code works for your files then replace `read_csv(file_name)` with `read.delim(file_name, header = TRUE, sep="\t", stringsAsFactors=FALSE)` ; glad you found a solution anyway – Martin Morgan Jun 10 '18 at 12:13

Does anyone know why that happens?

Your loop is… not working. The reasons are a bit complex, but I've made a working example in base R using simple loops (no *apply functions). Hopefully you can follow along, and hopefully it represents your problem to a sufficient degree.
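
One likely cause of the "random" numbers, assuming an R version before 4.0.0 (where read.delim() defaults to stringsAsFactors=TRUE): the column is read as a factor, and assigning a factor into a matrix stores its integer level codes rather than its labels. A minimal sketch with made-up values:

```r
# Assumption: the column of interest was read in as a factor
f <- factor(c("ic1", "ic3", "ic2"))

m <- matrix(NA, nrow = 3, ncol = 1)
m[, 1] <- f                  # stores the integer level codes, not the labels
m[, 1]

m2 <- matrix(NA_character_, nrow = 3, ncol = 1)
m2[, 1] <- as.character(f)   # converting first keeps the labels
m2[, 1]
```

Reading the file with stringsAsFactors=FALSE avoids the conversion step. The second problem, that each file may yield a different number of rows, is separate from this one.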

Learn to walk before you run. Learn basic loops before you learn how to do it more succinctly with apply(), lapply() etc. Learn standard evaluation (i.e. regular use of the programming language R itself) before you delve into non-standard evaluation (data.table, tidyverse, purrr etc.)

First we'll create some data frames and write them to files

owd <- getwd()
dir.create("sotest")
setwd("sotest")

set.seed(1)

flist <- c("dtf1.txt", "dtf2.txt", "dtf3.txt")

for (i in 1:length(flist)) {
    dtf <- data.frame(
      X=sample(1:10),
      coi=sample(c("ic1", "ic2", "ic3", "nic1"), 10, replace=TRUE)
    )
    write.table(dtf, flist[i], row.names=FALSE, sep="\t")
}

After running this you should have a folder named "sotest" containing three tab-separated txt-files.

Then we'll get a list of available files, and loop over that.

flist <- list.files(pattern=".txt")
WrongResult <- list()
interesting <- c("ic1", "ic2", "ic3")

for (v in 1:length(flist)) {

    dtf <- read.delim(flist[v], header=TRUE, sep="\t", stringsAsFactors=FALSE)
    WrongResult[[v]] <- dtf[dtf$coi %in% interesting, "coi"]

}

WrongResult

setwd(owd)

I store the output in a list instead of a matrix, as the length of the object produced in each iteration of the loop isn't the same.
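
If you do need a matrix in the end, one option is to pad each list element with NA up to the longest length. A minimal sketch, where `results` is a hypothetical stand-in for the list produced by the loop:

```r
# Hypothetical stand-in for the list built in the loop (unequal lengths)
results <- list(c("ic1", "ic2"), c("ic3"), c("ic1", "ic2", "ic3"))

# Length of the longest element
n <- max(lengths(results))

# Pad every element with NA to length n; sapply() then returns a matrix
# with one column per file
padded <- sapply(results, function(x) c(x, rep(NA, n - length(x))))
padded
```

The padding NAs only mark missing entries, so row i of one column does not necessarily correspond to row i of another.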

AkselA
  • Thank you so much! Your solution worked really nicely for me and I appreciate it a lot that you took some extra time and effort to explain it to beginners and with an approach not involving a lot of new packages. I could follow it just fine and use it for my problem. Thanks a million! – Alex B. Jun 10 '18 at 08:40