I am working with a matrix, set_onco, of 206 rows x 196 columns, and I have a vector, genes_100 (it's actually a matrix, but I only use the first column), with 101 names.
Here's a snippet of how they look:
> set_onco[1:10,1:10]
V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
GLI1_UP.V1_DN COPZ1 C10orf46 C20orf118 TMEM181 CCNL2 YIPF1 GTDC1 OPN3 RSAD2 SLC22A1
GLI1_UP.V1_UP IGFBP6 HLA-DQB1 CCND2 PTH1R TXNDC12 M6PR PPT2 STAU1 IGJ TMOD3
E2F1_UP.V1_DN TGFB1I1 CXCL5 POU5F1 SAMD10 KLF2 STAT6 ENTPD6 VCAN HMGCS1 ANXA8
E2F1_UP.V1_UP RRP1B HES1 ADCY6 CHAF1B VPS37B GRSF1 TLX2 SSX2IP DNA2 CMA1
EGFR_UP.V1_DN NPY1R PDZK1 GFRA1 GREB1 MSMB DLC1 MYB SLC6A14 IFI44 IFI44L
EGFR_UP.V1_UP FGG GBP1 TNFRSF11B FGB GJA1 DUSP6 S100A9 ADM ITGB6 DUSP4
ERB2_UP.V1_DN NPY1R PDZK1 ANXA3 GREB1 HSPB8 DLC1 NRIP1 FHL2 EGR3 IFI44
FAM18B1
ERB2_UP.V1_UP CYP1A1 CEACAM5 FAM129A TNFRSF11B DUSP4 CYP1B1 UPK2 DAB2 CEACAM6 KIAA1199
GCNP_SHH_UP_EARLY.V1_DN SRRM2 KIAA1217 DEFA1 DLK1 PITX2 CCL2 UPK3B SEZ6 TAF15 EMP1
genes_100[1:10,1]
[1] AL591845.1 B3GALT6 RAP1GAP HSPG2 BX293535.1 RP1-159A19.1 IFI6 FAM76A FAM176B CSF3R
101 Levels: 5_8S_rRNA AC018470.1 AC091179.2 AC103702.3 AC138972.1 ACVR1B AL049829.5 AL137797.2 AL139260.2 AL450326.2 AL591845.1 AL607122.2 B3GALT6 BX293535.1 ... ZNF678
What I want to do is go through the matrix and count, for each row, how many of its entries match one of the names in genes_100.
To do that I wrote three nested for loops: the first moves down one row at a time, the second moves across the columns of that row, and the third loops over genes_100, checking for matches.
At the end I save in a matrix how many times genes_100 matched the terms in each row, along with the row names from set_onco (so that I know which one is which).
The code works and gives me the correct output, but it's really slow!
A snippet of the output:
head(result_matrix_100)
freq_100
[1,] "GLI1_UP.V1_DN" "0"
[2,] "GLI1_UP.V1_UP" "0"
[3,] "E2F1_UP.V1_DN" "0"
[4,] "E2F1_UP.V1_UP" "0"
[5,] "EGFR_UP.V1_DN" "0"
[6,] "EGFR_UP.V1_UP" "0"
I timed it with system.time() and I get:
user system elapsed
525.38 0.06 530.34
That is way too slow, since I have even bigger matrices to parse and in some cases I have to repeat this 10k times.
The code is:
result_matrix_100 <- matrix(nrow=0, ncol=2)
for (q in seq(1,nrow(set_onco),1)) {
  # reset the match counter at the start of each row
  freq_100 <- 0
  for (j in seq(1, length(set_onco[q,]),1)) {
    for (x in seq(1,101,1)) {
      if (as.character(genes_100[x,1]) == as.character(set_onco[q,j])) {
        freq_100 <- freq_100+1
      }
    }
  }
  # append the row name and its count to the result
  result_matrix_100 <- rbind(result_matrix_100, cbind(row.names(set_onco)[q], freq_100))
}
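For comparison, the same per-row count can probably be done without explicit loops. This is only a minimal sketch on toy data (the small set_onco and genes_100 below are made-up stand-ins, not my real objects). It assumes set_onco can be coerced to a character matrix and that the names in genes_100 are unique; with duplicated names the loop version would count a cell more than once, while %in% counts it once.

```r
# Toy stand-ins for the real objects (hypothetical data, not the actual gene sets)
set_onco <- matrix(c("COPZ1", "IFI6",  "HSPG2",
                     "CCND2", "M6PR",  "IFI6",
                     "HES1",  "ADCY6", "DNA2"),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(c("SET_A", "SET_B", "SET_C"), NULL))
genes_100 <- data.frame(gene = factor(c("IFI6", "HSPG2", "CSF3R")))

# Make sure we are working with a character matrix (no-op if it already is one)
set_onco <- as.matrix(set_onco)

# Convert the factor column to plain character once, outside any loop
genes <- as.character(genes_100[, 1])

# %in% returns a flat logical vector, so restore the matrix shape before summing
hits <- matrix(set_onco %in% genes, nrow = nrow(set_onco))

# Number of matching entries in each row
freq_100 <- rowSums(hits)

# A data frame keeps the counts numeric instead of turning them into strings
result_100 <- data.frame(set = rownames(set_onco), freq_100 = freq_100)
print(result_100)
```

The idea is that the string comparison happens once over the whole matrix instead of once per cell per gene, and nothing is grown with rbind() inside a loop.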
what would you suggest?
thanks in advance :)