
I have quite a lot of experience with C programming and am used to thinking in terms of pointers, so I can get good performance when dealing with huge amounts of data. It is not the same with R, which I am still learning.

I have a file with approximately 1 million lines, separated by '\n', and each line contains 1, 2 or more integers separated by ' '. I have been able to put together code which reads the file and puts everything into a list of lists. Some lines can be empty. I would then like to put the first number of each line, if it exists, into a separate list, just skipping empty lines, and the remaining numbers into a second list.

The code I post here is terribly slow (it was still running when I started writing this question, so I killed R). How can I get decent speed? In C this would be done almost instantly.

graph <- function() {
    x <- scan("result", what = "", sep = "\n")
    y <- strsplit(x, "[[:space:]]+") # split the numbers in each line on whitespace
    y <- lapply(y, FUN = as.integer) # convert from a list of character vectors to a list of integer vectors
    print("here we go")
    first <- c()
    others <- c()
    for (i in 1:length(y)) {
        if (length(y[[i]]) >= 1) {
            first <- c(first, y[[i]][1])
        }
        if (length(y[[i]]) >= 2) {
            others <- c(others, y[[i]][-1])
        }
    }
}

In a previous version of the code, where each line had at least one number and I was interested only in the first number of each line, I used this code (I have read everywhere that I should avoid for loops in languages like R):

yy <- rapply(y, function(x) head(x,1))

which takes about 5 seconds, so far better than the above, but still annoying compared to C.

EDIT: this is an example of the first 10 lines of my file:

42 7 31 3 
23 1 34 5 


1 
-23 -34 2 2 

42 7 31 3 31 4 

1
Nisba
  • Is your file a CSV? Also, could you share examples of your 'numbers', please? Perhaps you could say which of these might be a number in your file: "1 2", "1 23", "1 2 3". – PDE Oct 10 '17 at 14:53
  • @PDE No, it is just the format described above. I generate the file myself using a C program. If you prefer I can create a CSV file, but I would like to learn the code for my very problem. All the numbers you wrote are valid; to be precise, in my case I always have numbers from -74 to 50 and no more than 6 numbers in each line. I do not use a binary format because I want to be able to easily go through the data with emacs – Nisba Oct 10 '17 at 14:56
  • Is the loop the only slow part? – moodymudskipper Oct 10 '17 at 14:59
  • @Moody_Mudskipper yes – Nisba Oct 10 '17 at 15:00
  • @Nisba By the way, it would help you and the community a lot if you could share multiple examples of what your data looks like. When you say you have numbers from -74 to 50 and at most six numbers per line, we do not know whether you have tabular data with six columns per row, or data with one column of alphanumeric characters separated by spaces, or anything else. – PDE Oct 10 '17 at 15:23
  • @Nisba You could also try the approach in this StackOverflow question: https://stackoverflow.com/questions/8299978/splitting-a-string-on-the-first-space – PDE Oct 10 '17 at 15:46

5 Answers


Base R versus purrr

library(purrr) # for map() and %>%

your_list <- rep(list(list(1,2,3,4), list(5,6,7), list(8,9)), 100)

microbenchmark::microbenchmark(
  your_list %>% map(1),
  lapply(your_list, function(x) x[[1]])
)
Unit: microseconds
                                  expr       min        lq       mean    median         uq       max neval
                  your_list %>% map(1) 22671.198 23971.213 24801.5961 24775.258 25460.4430 28622.492   100
 lapply(your_list, function(x) x[[1]])   143.692   156.273   178.4826   162.233   172.1655  1089.939   100

microbenchmark::microbenchmark(
  your_list %>% map(. %>% .[-1]),
  lapply(your_list, function(x) x[-1])
)
Unit: microseconds
                                 expr     min       lq      mean   median       uq      max neval
       your_list %>% map(. %>% .[-1]) 916.118 942.4405 1019.0138 967.4370 997.2350 2840.066   100
 lapply(your_list, function(x) x[-1]) 202.956 219.3455  264.3368 227.9535 243.8455 1831.244   100

purrr is a package for convenience rather than performance, which is great except when performance is what you care about. This has been discussed elsewhere.


By the way, if you are good in C, you should look at the Rcpp package.
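As a sketch of what that looks like (this assumes the Rcpp package and a working C++ toolchain are installed; `firsts` is just an illustrative name), the first-element loop can be pushed down into compiled code:

```r
library(Rcpp)

# Compile a small C++ function that takes the list of integer vectors
# and returns the first element of each one (NA for empty lines).
cppFunction('
IntegerVector firsts(List y) {
    int n = y.size();
    IntegerVector out(n);
    for (int i = 0; i < n; ++i) {
        IntegerVector v = y[i];
        out[i] = v.size() > 0 ? v[0] : NA_INTEGER;
    }
    return out;
}')

y <- list(c(42L, 7L, 31L, 3L), integer(0), 1L)
firsts(y)  # 42, NA, 1
```

The compilation cost is paid once; after that the loop runs at C++ speed over the whole list.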

F. Privé
  • You're looping on only 4 elements though, so the overhead costs are amplified. OP, can you confirm the base solutions were faster on your full data, and to what extent? – moodymudskipper Oct 12 '17 at 20:08
  • Also your comparison is fair to test my solution against the base solution, but unfair to `map` because there's also an overhead due to the pipes (2 of them), and possibly the evaluation of the dot. – moodymudskipper Oct 12 '17 at 20:12
  • @Moody_Mudskipper I'm looping on 300 elements. You may increase the size if you want. – F. Privé Oct 13 '17 at 06:50

try this:

your_list <- list(list(1,2,3,4),
     list(5,6,7),
     list(8,9))

library(purrr)

first <- your_list %>% map(1)
# [[1]]
# [1] 1
# 
# [[2]]
# [1] 5
# 
# [[3]]
# [1] 8

other <- your_list %>% map(. %>% .[-1])    
# [[1]]
# [[1]][[1]]
# [1] 2
# 
# [[1]][[2]]
# [1] 3
# 
# [[1]][[3]]
# [1] 4
# 
# 
# [[2]]
# [[2]][[1]]
# [1] 6
# 
# [[2]][[2]]
# [1] 7
# 
# 
# [[3]]
# [[3]][[1]]
# [1] 9

Though you might want the following, as it seems to me those numbers would be better stored in vectors than in lists:

your_list %>% map(1) %>% unlist # as it seems map_dbl was slow
# [1] 1 5 8
your_list %>% map(~unlist(.x[-1]))
# [[1]]
# [1] 2 3 4
# 
# [[2]]
# [1] 6 7
# 
# [[3]]
# [1] 9
moodymudskipper
  • @Nisba isn't this what you want? – moodymudskipper Oct 10 '17 at 15:21
  • I am reading it right now, and it seems exactly what I am looking for; I will try the solution in a few minutes. You are right, using vectors is more suitable for my purpose. – Nisba Oct 10 '17 at 15:25
  • @Moody_Mudskipper Simply using `lapply(your_list, function(x) x[[1]])` should be faster – F. Privé Oct 10 '17 at 16:06
  • Apparently they don't differ much: https://groups.google.com/forum/#!topic/davis-rug/DIofOdFZgHI – moodymudskipper Oct 10 '17 at 16:18
  • this is the best solution so far, `your_list %>% map(. %>% .[-1] %>% unlist)` is pretty "fast" (5 seconds), however the first one, `map_dbl(1)`, takes about 1 minute. So far this is the best solution but it is far from being fast... :( – Nisba Oct 10 '17 at 19:41
  • @F.Privé you are right! This is the fastest solution! A few seconds!!!! You should create an answer – Nisba Oct 10 '17 at 19:44
  • I edited to use only standard map, and rewrote the last one (but it's doing strictly the same, just sexier syntax) – moodymudskipper Oct 10 '17 at 22:31
  • `sapply(your_list,function(x) x[[1]])` and `lapply(your_list,function(x) unlist(x[-1]))` would be the base R equivalent – moodymudskipper Oct 10 '17 at 22:35

Indeed, coming from C to R will be confusing (it was for me). What helps for performance is understanding that primitive types in R are all vectors implemented in highly optimized, natively-compiled C and Fortran, and you should aim to avoid loops when there's a vectorized solution available.

That said, I think you should load this with read.csv() (or, for space-separated data like yours, read.table()). This will give you a data frame on which you can perform vector-based operations.
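A sketch of that vectorized approach, under the constraint stated in the comments that each line has at most six numbers (the sample lines from the question stand in for the real file here):

```r
# Write the question's sample lines to a temporary file for illustration;
# in practice point read.table() at the real file instead.
f <- tempfile()
writeLines(c("42 7 31 3", "23 1 34 5", "", "", "1",
             "-23 -34 2 2", "", "42 7 31 3 31 4", "", "1"), f)

# fill = TRUE pads short rows with NA; blank lines are skipped by default.
# Supplying six column names fixes the column count at the stated maximum.
df <- read.table(f, fill = TRUE, col.names = paste0("V", 1:6))

first  <- df$V1                            # first number of each non-empty line
others <- as.vector(t(as.matrix(df[-1])))  # remaining numbers, in row order
others <- others[!is.na(others)]
first   # c(42, 23, 1, -23, 42, 1)
```

No explicit loop at all: the split, padding, and extraction are each a single vectorized operation on the whole table.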

For a better understanding, a concise (and humorous) read is http://www.burns-stat.com/pages/Tutor/R_inferno.pdf.

anthonyserious
  • Thank you I will try. I was looking for something like the book you suggested me, I will read it! – Nisba Oct 10 '17 at 15:18

I would try the stringr package. Something like this:

set.seed(3)
d <- replicate(3, sample(1:1000, 3))
d <- apply(d, 2, function(x) paste(c(x, "\n"), collapse = " "))
d
# [1] "169 807 385 \n" "328 602 604 \n" "125 295 577 \n"


require(stringr)
str_split(d, " ", simplify = T)
# [,1]  [,2]  [,3]  [,4]
# [1,] "169" "807" "385" "\n"
# [2,] "328" "602" "604" "\n"
# [3,] "125" "295" "577" "\n"

Even for large data it is fast:

d <- replicate(1e6, sample(1:1000, 3))
d <- apply(d, 2, function(x) paste(c(x, "\n"), collapse = " "))
system.time(s <- str_split(d, " ", simplify = T)) # ~0.77 sec
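Splitting that result into the two vectors the question asks for is then plain matrix indexing (a sketch continuing the example above; the trailing "\n" column turns into NA on integer conversion and is dropped together with any padding):

```r
library(stringr)

d <- c("169 807 385 \n", "328 602 604 \n", "125 295 577 \n")
s <- str_split(d, " ", simplify = TRUE)

first <- as.integer(s[, 1])                       # first number of each line
rest  <- suppressWarnings(as.integer(t(s[, -1]))) # row-wise flattening
rest  <- rest[!is.na(rest)]                       # drop the "\n" padding
first
# [1] 169 328 125
```

The `t()` is needed because `as.integer()` flattens a matrix column by column; transposing first keeps the remaining numbers in line order.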
minem
  • thanks, but what about splitting the each line in two list of numbers? One for the first column and one for the remaining? That is the slow part of my code – Nisba Oct 10 '17 at 15:16
  • why do you need lists? in R lists are much slower than vectors and matrices. – minem Oct 11 '17 at 05:48
  • That's a good point, in fact using arrays and F. Privé's solution now the code runs decently! – Nisba Oct 12 '17 at 11:45

Assuming the file is a CSV, and that all of the 'numbers' are strictly of the form `1 2` or `-1 2` (i.e., `1 2 3` or `1 23` are not allowed in the file), then one could start by coding:

# Install package `data.table` if needed
# install.packages('data.table')

# Load `data.table` package
library(data.table)

# Load the CSV, which has just one column named `my_number`.
# Then, coerce `my_number` into character format and remove negative signs.
DT <- fread('file.csv')[, my_number := as.character(abs(my_number))]

# Extract first character, which would be the first desired digit 
# if my assumption about number formats is correct.
DT[, first_column := substr(my_number, 1, 1)]

# The rest of the substring can go into another column.
DT[, second_column := substr(my_number, 2, nchar(my_number))]

Then, if you still really need to create two lists, you could do the following.

# Create the first list.
first_list <- DT[, as.list(first_column)]

# Create the second list.
second_list <- DT[, as.list(second_column)]
PDE
  • I think I can support multi-digit numbers with your solution if I create the file padding the numbers with zeros. Anyway, my rows are not always of the same length, so I will keep it in mind for the future, thank you! – Nisba Oct 10 '17 at 15:14
  • At least given my understanding that all you want is to store the first 'digit' of your 'number' as `first_list` and the rest of the 'number' as `second_list`, then my solution does not have a problem with your 'numbers' having different lengths. `second_column` is generated as the substring of your number starting from the second character (which, as I understand it, is an empty space) to the last character of that 'number' howsoever many characters that 'number' may have. – PDE Oct 10 '17 at 15:19
  • Oh, there was a misunderstanding: I need to store every first number in one list and all the other numbers in another list, not digits! – Nisba Oct 10 '17 at 15:21
  • @Nisba As I requested above, an example of what your data looks like would be nice. Currently, I assume your data has just one column per row: `my_number "1 2" "1 2 3" "1 23" "-1 2" "-1 23"` And I assume you want to generate `first_list` to look like: `"1" "1" "1" "1" "1"` .... and so on. And I assume you want to generate `second_list` would be ok if it says `" 2" " 2 3" " 23" " 2" " 23"`. – PDE Oct 10 '17 at 15:25
  • I edited my question for a better explanation of the format – Nisba Oct 10 '17 at 15:29