I asked a question and received a great answer that solved my problem. However, I now want to modify the code. Here is my previous question:

finding similar strings in each row of two different data frame

I will try to explain the problem again, along with how I tried to deal with it.

The answer by Karsten W. gave me normalised data (each string in each element is assigned the number of its position), as follows (I did not change it):

normalize <- function(x, delim) {
    x <- gsub(")", "", x, fixed=TRUE)
    x <- gsub("(", "", x, fixed=TRUE)
    idx <- rep(seq_len(length(x)), times=nchar(gsub(sprintf("[^%s]",delim), "", as.character(x)))+1)
    names <- unlist(strsplit(as.character(x), delim))
    return(setNames(idx, names))
}
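To make the shape of the normalised data concrete, here is a minimal sketch on a made-up three-element vector (the input strings are an assumption for illustration, not taken from the real data). The function is copied unchanged from above so that the snippet runs on its own.

```r
# normalize() copied unchanged from above so the example is self-contained
normalize <- function(x, delim) {
    x <- gsub(")", "", x, fixed=TRUE)
    x <- gsub("(", "", x, fixed=TRUE)
    idx <- rep(seq_len(length(x)), times=nchar(gsub(sprintf("[^%s]",delim), "", as.character(x)))+1)
    names <- unlist(strsplit(as.character(x), delim))
    return(setNames(idx, names))
}

# Hypothetical toy column: three rows, strings separated by ";"
toy <- c("a;b", "c", "(d);e;f")
res <- normalize(toy, ";")
res
# a b c d e f
# 1 1 2 3 3 3
```

Each name is one of the split strings, and each value is the row the string came from.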

The second part was to apply the above function to each column separately, which is very time-consuming if I need to do it for 1000 columns. Instead of the commented-out calls below, I tried to use lapply:

# s1 <- normalize(df1[,1], ";")
# s2 <- normalize(df1[,2], ";")

Instead, I do this:

myS <- lapply(df1, normalize, ";")

I keep the other part as it is:

lookup <- normalize(df2[,1], ",")

Then, to check between the two, I modified the function to keep only the row numbers of df2 (I removed s[found] from it):

process <- function(s) {
    lookup_try <- lookup[names(s)]
    found <- which(!is.na(lookup_try))
    pos <- lookup_try[names(s)[found]]
    return(paste(pos, sep=""))
}

Then, whatever I do, I cannot get the output:

process(myS$sample1) ...

In the end, I need the data in a txt file or some other format that I can read. I used write.table, but it does not work. Is there a better way to do this, and how can I do it automatically?

nik
  • Haven't got the time to look very closely, but on the way out: did you have a look at the package plyr -> ddply and colwise? – Buggy Feb 27 '16 at 10:20
  • Is it a typo? `process(myS$sample_1)` instead of `...(myS$sample1)` – jogo Feb 27 '16 at 10:36
  • @jogo I am looking for a way to make it automatic. Thanks, I revised the above. – nik Feb 27 '16 at 10:39

1 Answer

It is a typo: `process(myS$sample_1)` instead of `...(myS$sample1)`.
I get:

> process(myS$sample_1)
[1] "4" "1" "4"

and

> lapply(myS, process)
$sample_1
[1] "4" "1" "4"

$sample_2
[1] "4"  "15" "16"

IMHO for the function process() it would be better to return an integer vector:

process <- function(s) {
  lookup_try <- lookup[names(s)]
  found <- which(!is.na(lookup_try))
  pos <- lookup_try[names(s)[found]]
  names(pos) <- NULL
  pos
}
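As a small illustration of what this version returns, the sketch below runs it on a made-up lookup vector and a made-up normalised sample (both are assumptions for the example, not the real data). Indexing a named vector by a name that is not present yields NA, which is what the `found` step filters out.

```r
# Hypothetical lookup: string -> row number in df2
lookup <- c(x = 1L, y = 2L, z = 2L)

# process() as defined above; it reads the global `lookup`
process <- function(s) {
  lookup_try <- lookup[names(s)]
  found <- which(!is.na(lookup_try))
  pos <- lookup_try[names(s)[found]]
  names(pos) <- NULL
  pos
}

# Hypothetical normalised sample: "q" does not occur in lookup
s <- c(y = 1L, q = 1L, x = 2L)
hits <- process(s)
hits
# [1] 2 1
```

The result is a plain integer vector, so it can be padded, sorted, or compared numerically without converting back from strings.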

For putting the result in a dataframe:

r <- lapply(myS, process)

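# Pad every result vector with NA up to the length of the longest one
m <- max(lengths(r))
r.matrix <- matrix(NA_integer_, m, length(r))
for (j in seq_along(r)) {
  x <- r[[j]]
  length(x) <- m  # extending a vector fills the new slots with NA
  r.matrix[, j] <- x
}
colnames(r.matrix) <- names(r)
r.df <- as.data.frame(r.matrix)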
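Since the question also asks for a text file, here is a minimal sketch of one way to write the padded result with write.table. The list `r` below is a made-up stand-in for `lapply(myS, process)`, the output goes to a temporary file rather than a real path, and the padding uses `length<-` inside lapply as a compact alternative to the loop; all of these are assumptions for illustration.

```r
# Hypothetical stand-in for lapply(myS, process): vectors of unequal length
r <- list(sample_1 = c(4L, 1L, 4L), sample_2 = c(4L, 15L, 16L, 2L))

# Pad each vector with NA to the longest length, then build the data frame
m <- max(lengths(r))
r.df <- as.data.frame(lapply(r, `length<-`, m))

# Write a tab-separated text file; NA cells are written as empty fields
out_file <- tempfile(fileext = ".txt")
write.table(r.df, out_file, sep = "\t", quote = FALSE,
            row.names = FALSE, na = "")

# Inspect the raw lines that were written
written <- readLines(out_file)
written
```

Writing NA as an empty field keeps the file readable in a spreadsheet; a different `na` string may suit other tools better.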
jogo
  • Do you know any better way to do it instead of applying such functions? I already liked your answer – nik Feb 27 '16 at 10:46
  • For me it works: `df1 <- read.table("df1.txt", sep="\t", header=TRUE, nrows=97, stringsAsFactors=FALSE); df2 <- read.table("df2.txt", sep="\t", header=TRUE, stringsAsFactors=FALSE)` ... I get a result without errors at `lapply(myS, process)`. I cannot reproduce the error you mentioned. Did you use `stringsAsFactors=FALSE` during import? It seems that you didn't. – jogo Feb 27 '16 at 13:15
  • `lapply(myS, process)` does not produce any error, but I cannot save the results, I cannot see the results, etc. – nik Feb 27 '16 at 13:27
  • `result <- lapply(myS, process); result` The result is a list. But you cannot save it in a dataframe as it is, because in a dataframe all column vectors have the same length. – jogo Feb 27 '16 at 13:32
  • I need to have the data in a txt file or something which I can read. I used write.table, but it does not work. – nik Feb 27 '16 at 13:33
  • http://stackoverflow.com/questions/16012930/convert-a-list-of-numeric-vectors-with-different-lengths-to-data-frame – jogo Feb 27 '16 at 13:36
  • I have tried that; it gives me an error: `Error in data.frame(Fraction_1 = c(393L, 674L, 79L, 2447L, 248L), Fraction_2 = c(2107L, : arguments imply differing number of rows: 5, 30, 51, 35` – nik Feb 27 '16 at 13:45
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/104722/discussion-between-jogo-and-mol). – jogo Feb 27 '16 at 13:45