Questions tagged [multidplyr]

multidplyr is an R package by Hadley Wickham that enables parallel processing on partitioned data.frames. This tag should not be used for dplyr-only questions.

multidplyr complements Wickham's popular dplyr package and is part of the extended tidyverse ecosystem of packages.
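
As a rough illustration of parallel processing on a partitioned data frame, here is a minimal sketch. It assumes a recent multidplyr release that provides `new_cluster()`, `partition()`, `cluster_library()` and `collect()`. The data frame `df`, its columns `g` and `x`, and the choice of two workers are hypothetical.

```r
library(dplyr)
library(multidplyr)

# Start a small local cluster; two workers is an arbitrary choice
cluster <- new_cluster(2)

# Attach dplyr on every worker so its helpers are available there
cluster_library(cluster, "dplyr")

# Hypothetical toy data: four groups of 25 rows each
df <- data.frame(g = rep(c("a", "b", "c", "d"), each = 25), x = 1:100)

# Group first so each group stays on a single worker, then partition,
# summarise on the workers, and collect the results back locally
result <- df %>%
  group_by(g) %>%
  partition(cluster) %>%
  summarise(mean_x = mean(x)) %>%
  collect() %>%
  arrange(g)

print(result)
```

For data this small the communication overhead will likely outweigh any gain; partitioning tends to pay off only when the per-group work is expensive.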

51 questions
12 votes, 1 answer

Parallel computing, which alternative to tidyr::complete in dplyr?

I am trying to parallelise a pipe. In the pipe there is a tidyr command ("tidyr::complete"). This breaks the code when run in parallel, as the object class is not recognised. Is there an alternative in dplyr to…
MCS
9 votes, 1 answer

Replacement for parallel plyr with doMC

Consider a standard grouped operation on a data.frame: library(plyr) library(doMC) library(MASS) # for example nc <- 12 registerDoMC(nc) d <- data.frame(x = c("data", "more data"), g = c("group1", "group2")) y <- "some global object" res <-…
Devin
9 votes, 1 answer

multidplyr and group_by() and filter()

I have the following dataframe and my intention is to find all the IDs that have different USAGE but the same TYPE. ID <- rep(1:4, each=3) USAGE <-…
Justas Mundeikis
5 votes, 1 answer

Calling a function with arguments within dplyr::do using multidplyr

I'm trying to use multidplyr to speed up getting residuals from a regression fit. I've created a function that fits the regression model to get the residuals, which in addition to the data, gets two more arguments. Here's the function: func <-…
dan
3 votes, 1 answer

How to export custom functions to clusters in multidplyr?

Following on from questions here and here, I'm trying to get the latest version of multidplyr to work with a custom function. By way of reproducible example, I have tried: library(multidplyr) library(dplyr) cl <- new_cluster(3) df <- data.frame(Grp…
Will T-E
3 votes, 1 answer

R: What is a fast way to remove dominated rows from a table?

I'm looking for a fast way to remove all dominated rows from a table (preferably using parallel processing, to take advantage of multiple cores). By "dominated row", I mean a row that is less than or equal to another row in all columns. For example,…
kartik_subbarao
3 votes, 1 answer

how to split by multiple columns when using multidplyr

tl;dr How do I make "partition" from multidplyr split on multiple columns? Motivation: I was unhappy with using 1 of 32 cores for a hard-working summarize, so I am trying to use multidplyr. I am operating on multiple columns. Example: The vignette…
EngrStudent
3 votes, 1 answer

multidplyr : assign functions to cluster

(see working solution below) I want to use multidplyr to parallelize a function : calculs.R f <- function(x){ return(x+1) } main.R library(dplyr) library(multidplyr) source("calculs.R") d <- data.frame(a=1:1000,b=sample(1:2,1000),replace=T) result…
Xavier Prudent
3 votes, 0 answers

multidplyr and prophet for parallel grouped prediction: Error in checkForRemoteErrors(lapply(cl, recvResult))

I want to make parallel predictions using multidplyr and prophet. Consider the following data: library(tidyr) library(dplyr) library(multidplyr) library(prophet) ds = as.Date(c('2016-11-01', '2016-11-02', '2016-11-03', '2016-11-04', …
Eduardo
2 votes, 2 answers

Creating a frequency 2x2 table in R but replacing frequency data with numerical data from another variable

I am having trouble creating a table in the format required to run some analyses. Here is a simplified example of what my large dataset looks like: Sample <- c(1,2,2,3,3) Species <- c("sp1","sp2","sp3","sp1","sp1") Counts <-…
2 votes, 0 answers

parallelise group_walk operation with multidplyr

Is it possible to parallelise a dplyr::group_walk operation on grouped data using multidplyr? In this first attempt at a general question I won't provide a reprex, but if it helps I can. I have multiple time series for many individuals and I would…
mjrolland
2 votes, 0 answers

Can you parallelize panel maneuvers in R?

In my R script, I'm using the pmdplyr functions mutate_cascade() and tlag() to mutate my data, which contains over 3 million records, so the code is extremely slow but it works. In order to speed things up, I tried adding the parallel processing…
Tess
2 votes, 1 answer

How to join, group and summarise large dataframes in R with multidplyr and parallel

This question is similar to other problems with very large data in R, but I can't find an example of how to merge/join and then perform calculations on two dfs (as opposed to reading in lots of dataframes and using mclapply to do the calculations).…
leslie roberson
2 votes, 0 answers

Is there a way to parallelize tidyr?

I am using tidyr to complete a time series for balances and transactions; however, due to the number of individuals, computation is taking a significant amount of time. I have 16 cores and R is only using one. Is there any way to parallelize tidyr? …
Dominic Naimool
2 votes, 1 answer

R: Why is parallel (much) slower? What is the best strategy for using parallel to (left) join a large collection of big files?

I've read some questions on the subject as well as some tutorials but failed to resolve my problem, so I decided to ask myself. I have a large collection of big files of types, say, A, B, C, and I need to left join B and C with A on some conditions. I work…
Evgeny