-1

I have big .csv file. I would like to filter that file into a new table.

For example, I have .csv file as below:

   f1 f2  f3 f4  f5  f6 f7 f9  f10  f11 
t1  1  0  1   0  1  0   0  0   0    1 
t2  1  0  0   0  0  1   1  1   1    1 
t3  0  0  0   0  0  0   0  0   0    0 
t4  1  0  0   0  1  0   0  0   0    0 
t5  0  0  0   0  0  0   0  0   0    0 
t6  0  0  0   0  0  0   0  0   0    0 
  1. I have a table (as above)

  2. What I want to do is, I want to have new table for each row (meaning that, I will have new table for all rows. e.g, new table for row t1, new table for row t2, new table for row t3 and etc). As in this example, I should have 6 new tables.

  3. To develop new table for each row, there is a condition that need to meet. New table should look at every value in each column. And if the column has a same value with other column in other row (which is value 1), it should be grouped together.

As in this example, new table for t1 will consist t1,t2,t4 because value in column f1 have the same value (which is 1) with the value in f1 for row t2 and t4, also value in f5 is equal with the value in f5 for row t4, and value in f11 is equal with the value in f11 for row t2). So, thats mean, it will check every column. One of the output for should be like this:

       f1 f2  f3 f4  f5  f6 f7 f9  f10  f11 
    t1  1  0  1   0  1  0   0  0   0    1 
    t2  1  0  0   0  0  1   1  1   1    1 
    t4  1  0  0   0  1  0   0  0   0    0 
  1. As for t2, row t2 should be grouped with t4 because value in f1 in t1 and value f1 in t4 is equal. However, t2 should not consider the earlier row (as in this example, it should not consider t1). Output should be like this:

      f1 f2  f3 f4  f5  f6 f7 f9  f10  f11 
    t2  1  0  0   0  0  1   1  1   1    1 
    t4  1  0  0   0  1  0   0  0   0    0 
    
  2. Similar to other rows (row t3,t4,t5 and t6), it should look at every value in each column. And if the column has a same value with other column in other row (which is value 1), it should be grouped together.

  3. New table (with row and column header) should be then saved in a new .csv file. The file should be renamed using its row name. for example, as for t1, it should be saved as t1.csv.

  4. This is only a simple example. The proposed solution here will be applied in other big table of data. I need to read abc.csv file. Meaning that, it might be more than 100+ new table will be created (when I used original data).

so far i used this code:

a.files <- grep("^Task_vs_Files", dir(), value=TRUE) 
a.files

for(i in 1:length(a.files))
   dat <- read.table(file=a.files[i], header=T, sep=",", row.names=1) 


      (sapply(1:nrow(dat), function(x) if (dat[x,]==1)  #check row
            (sapply(1:nrow(dat), function(y) if (dat[,y]==1) #check column

            { 
                   write.csv( dat[(dat[[x,y]]==1 ) & (1:nrow(dat) >= x) , ] , file = paste("Files_", x) ) #save file based on row names
            } 
            else {NULL} ))

output from a.files:

[1] "Task_vs_Files_Proj.csv"  "Task_vs_Files_Whirr.csv"

dataset from one of the file (Task_vs_Files_Proj.csv)

       pom.xml. ZooKeeper.java HBase.java Hadoop.java. BasicServer.java. Abstract.java. HBaseRegion.java
WHIRR-25        1              0          1            0                 1              1                1
WHIRR-28        1              0          1            0                 0              1                0
WHIRR-55        0              0          1            0                 0              0                0
WHIRR-61        0              0          0            0                 0              1                0
WHIRR-76        0              0          1            0                 0              0                0
WHIRR-87        1              1          1            0                 0              1                1
WHIRR-92        1              0          0            1                 0              1                1

Appreciate help from the expert!

user1676484
  • 197
  • 2
  • 2
  • 11
  • So you are only interested in grouping by "1"-values in 'f1'? Also... there are no commas in that file. – IRTFM Sep 17 '12 at 05:53
  • Also, I think the column names are incorrect starting with `f9` (should be `f8, f9, f10`?). Are you only filtering on columns `f1, f5 and f10`? – BenBarnes Sep 17 '12 at 07:50
  • I think OP wants all cases that have common values in at least one column. – Roman Luštrik Sep 17 '12 at 08:50
  • @RomanLuštrik you may be right, but his sample is ambiguous, since the only rows with common column values are 1,2,3; and they "win" based on column 1 alone. A simple, clumsy solution would be to find all columns for which colSum>=2 and select every row whose value ==1 in at least one of those columns. – Carl Witthoft Sep 17 '12 at 12:06
  • Agreed, I think (s)he should elaborate a bit more. – Roman Luštrik Sep 17 '12 at 12:45
  • @Dwin Yes, generally I'm interested in grouping row(s) which have value same value 1 in each column. It is .csv file – user1676484 Sep 17 '12 at 22:54
  • @BenBarnes yes, sorry for that, it should be (fb,f9,f10)...I'm filtering all the columns from f1 to f10 – user1676484 Sep 17 '12 at 22:55
  • @CarlWitthoft what I'm trying to do is to chunk the big .csv table into a new table. New table should consists a row (in this example t1-t6) which have the same value (which is 1) in a column (f1-f10). In this example, t1,t2 and t4 should be grouped together because they have same value in f1,f5 and f11. – user1676484 Sep 17 '12 at 23:07
  • I've tried to edit the question so it makes more sense to a native English speaker, but you should revert it if I have failed to match your meaning. – IRTFM Sep 17 '12 at 23:11
  • 1
    The `sapply` function is basically the same as a for-loop. If you want to run the same procedure on all columns then you need to wrap another `sapply` around the code I gave you that runs through one column except you will be passing in an column index (perhaps named 'y') instead of the "f1". – IRTFM Sep 18 '12 at 06:44
  • @DWin, I tried to used another sapply and used column index (named 'y'). It runs without error, but there is no output appear. – user1676484 Sep 19 '12 at 02:11
  • You may need to assign the value to a name. Other than that without the code we are in the dark. – IRTFM Sep 19 '12 at 03:14
  • It's difficult to tell with that code in that form, because the comments may have messed up the logic. (When will people learn to put their edits inside the question box? That is what the edit button is for.) – IRTFM Sep 19 '12 at 04:18
  • @DWin edited. sorry for that. I learn a lot from you. – user1676484 Sep 19 '12 at 04:25

1 Answers1

1
 sapply(1:nrow(dat), function(x) if (dat[x, "f1"]==1) { 
           write.csv( dat[ (dat[["f1"]]==1 )& (1:nrow(dat) >= x) , ])
            } else {NULL} )
"","f1","f2","f3","f4","f5","f6","f7","f9","f10","f11"
"t1",1,0,1,0,1,0,0,0,0,1
"t2",1,0,0,0,0,1,1,1,1,1
"t4",1,0,0,0,1,0,0,0,0,0
"","f1","f2","f3","f4","f5","f6","f7","f9","f10","f11"
"t2",1,0,0,0,0,1,1,1,1,1
"t4",1,0,0,0,1,0,0,0,0,0
"","f1","f2","f3","f4","f5","f6","f7","f9","f10","f11"
"t4",1,0,0,0,1,0,0,0,0,0
[[1]]
NULL

Will need to construct file names:

invisible(
  sapply(1:nrow(dat), function(x) if (dat[x, "f1"]==1) { 
           write.csv( dat[ (dat[["f1"]]==1 )& (1:nrow(dat) >= x) , ] ,
                    file = paste0("fil_", x, ".csv") )
                                } else {NULL} )
         )   

If this is destined for Excel, as I fear it might be, note that the rownames are included but no column header will be created to designate the rownames.

IRTFM
  • 258,963
  • 21
  • 364
  • 487
  • 1
    You should devote more effort toward making a more complete example of the desired output for each set of criteria in your question. I've gone as far as I can on guesswork. – IRTFM Sep 18 '12 at 00:46
  • hi Dwin, thank you for the help. It is working, but this is not what I'm trying to do. Let me try to explain again. I'm trying my best to explain (sorry, i'm not native english speaker). 1. I have a table 2. I want to chunk a new table for each row (meaning that, I will have new table for row t1, new table for row t2, new table for row t3 and etc). ...to be continued – user1676484 Sep 18 '12 at 00:47
  • 3. To develop new table for each row, there is a condition that need to meet. The row should be grouped with other row which have the same value in column (as in my example, t1,t2,t4 should be grouped together because value in column f1 is equal to 1 for row t2 and t4, also value in f5 is equal with the value in f5 for row t4, and value in f11 is equal with the value in f11 for row t2). – user1676484 Sep 18 '12 at 00:47
  • Also, from the big table above, t2 also should should have another table because value in f1 in t1 and value f1 in t4 is equal. – user1676484 Sep 18 '12 at 00:49
  • When you say "should have another table" you are not showing us what it should like like.... and you SHOULD be doing this _in_the_question. .... NOT in some tiny comment. (Also I thought the first part of my code was doing exactly what you wanted but you say it aint't so which adds to the need for a COMPLETE EXAMPLE.) – IRTFM Sep 18 '12 at 00:52
  • So in this example you want three files created, the first one with 3 lines (t1,t2,t4), the second with two lines(t2,t4) and the third one with only one line (t4)? – IRTFM Sep 18 '12 at 01:40
  • as for this example, yes. the proposed solution here will be applied in other big table of data. Meaning that, it might be more than 100+ new table will be created (when I used original data). – user1676484 Sep 18 '12 at 01:48
  • The solution should be general. – IRTFM Sep 18 '12 at 01:51
  • as for my cases, do I need to replace "f1" with the first column name? – user1676484 Sep 18 '12 at 02:14
  • Yes, or you could refer to it by it's position. – IRTFM Sep 18 '12 at 02:16
  • I tried to replace it with the first column name, but I got an error msg "Error in if (dat[x, "pom.xml"] == 1) { : argument is of length zero" – user1676484 Sep 18 '12 at 02:19
  • i've changed the column name to column position, but it return NULL – user1676484 Sep 18 '12 at 03:06
  • You need to post str(dat) or dput(head(dat)). There a really good Q&A on posting reproducible examples: http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – IRTFM Sep 18 '12 at 03:35
  • what if I want to check all condition in column? As for this, from my understanding, if i used dat[,2], it will only check column 2. What if I want to check all columns? – user1676484 Sep 18 '12 at 05:49
  • You need to say what you mean by "check all columns". What test is being applied? – IRTFM Sep 18 '12 at 06:33
  • sorry I have edit the question, to make it more clearer. Appreciate your help! – user1676484 Sep 18 '12 at 06:38