
I have a large dataset with more than 250,000 observations, and I would like to use the TraMineR package for my analysis. In particular, I would like to use the commands `seqtree` and `seqdist`, which work fine when I use, for example, a subsample of 10,000 observations. The limit my computer can manage is around 20,000 observations.

I would like to use all the observations, and I do have access to a supercomputer that should be able to do just that. However, this doesn't help much, as the process runs on a single core only. Hence my question: is it possible to apply parallel computing techniques to the above-mentioned commands? Or are there other ways to speed up the process? Any help would be appreciated!
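For reference, a minimal sketch of the subsample workflow described above; the data frame `mydata`, the sequence columns `5:16`, and the covariate `sex` are hypothetical, not taken from the question:

```r
library(TraMineR)

## Draw a random subsample of 10,000 observations.
set.seed(1)
sub <- mydata[sample(nrow(mydata), 10000), ]

## Define the state sequences (columns 5:16 are hypothetical).
seqs <- seqdef(sub, var = 5:16)

## Pairwise optimal-matching distances, then a regression tree on them.
costs <- seqsubm(seqs, method = "CONSTANT", cval = 2)
dists <- seqdist(seqs, method = "OM", indel = 1, sm = costs)
tree  <- seqtree(seqs ~ sex, data = sub, diss = dists)
```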

Flow
  • Do the following three answers help? http://stackoverflow.com/questions/17085780/how-to-use-discrepancy-analysis-with-traminer-and-aggregated-sequence-data and http://stackoverflow.com/questions/15929936/problem-with-big-data-during-computation-of-sequence-distances-using-tramine and http://stats.stackexchange.com/questions/43540/how-to-randomly-select-5-of-the-sample – Matthias Studer Jul 04 '13 at 08:15
  • Dear Matthias, thanks for your answer. I am already using the sample procedure described in your links. What I am really looking for is a way to use multiple cores to speed up the distance computation in order to apply it to the entire dataset on the super-computer. I looked at some packages which allow you to do that, but they don't work for TraMineR. But I guess running multiple subsamples is fine as well. Thanks again. – Flow Jul 04 '13 at 08:29
  • The solutions I was suggesting are: aggregating identical sequences, using `seqdist(method="OMopt")`, and changing the time granularity (see here: http://stats.stackexchange.com/questions/43601/modifying-the-time-granularity-of-a-state-sequence) to have more identical sequences. Which limitation are you facing? CPU time or memory limit? – Matthias Studer Jul 04 '13 at 08:38
  • Thanks again. I'll look into the OMopt method. I already changed the time granularity (from monthly to annual data). Memory limit should not be an issue, but I have a CPU time limit of 7 days. Since 10,000 observations already take quite some time, and computing time seems to increase exponentially when adding more observations, I'm not sure if this is enough. But I will give it a try. – Flow Jul 04 '13 at 09:01
  • `seqdist` only computes distances between unique sequences. There are two factors that severely impact computation time: sequence length and the number of unique sequences. By reducing the time granularity, you affect both (see my edit of the answer below). Using trimesters may already have an impact. – Matthias Studer Jul 04 '13 at 09:15
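As a hedged illustration of the granularity change discussed in these comments: the TraMineRextras package provides a `seqgranularity()` helper; the sequence object `seqs`, the trimester span, and the aggregation method below are assumptions, not taken from the thread:

```r
library(TraMineR)
library(TraMineRextras)  # provides seqgranularity()

## Recode monthly sequences to trimesters, keeping the most frequent
## state within each 3-month span; sequences get shorter and more of
## them become identical, which speeds up seqdist.
seqs.trim <- seqgranularity(seqs, tspan = 3, method = "mostfreq")
```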

1 Answer


The internal `seqdist` function is written in C++ and has numerous optimizations. For this reason, if you want to parallelize `seqdist`, you need to do it in C++. The loop is located in the source file "distancefunctions.cpp"; look at the two loops around line 300 in the function "cstringdistance" (sorry, but all comments are in French). Unfortunately, the second important optimization is that memory is shared between all computations, and for this reason I think that parallelization would be very complicated.
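Since the C++ loop itself is hard to parallelize, one R-level workaround, in line with the subsampling idea from the comments, is to analyse several independent random subsamples in parallel. A sketch only; the data frame `mydata`, the column indices, and the core count are assumptions:

```r
library(parallel)
library(TraMineR)

## Compute OM distances for one random subsample of the data.
analyse_subsample <- function(seed, data, size = 10000) {
  set.seed(seed)
  sub  <- data[sample(nrow(data), size), ]
  seqs <- seqdef(sub, var = 5:16)  # hypothetical sequence columns
  seqdist(seqs, method = "OM", indel = 1,
          sm = seqsubm(seqs, method = "CONSTANT", cval = 2))
}

## One subsample per core; mclapply forks processes (Unix-alike
## systems), and mc.cores depends on the machine.
dist.list <- mclapply(1:8, analyse_subsample, data = mydata, mc.cores = 8)
```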

Apart from selecting a sample, you should consider the following optimizations (mentioned in the comments above):

  • Aggregate identical sequences, so that each distance is computed only once per pair of unique sequences (see the sketch below).
  • Use the optimized algorithm via `seqdist(method="OMopt")`.
  • Reduce the time granularity (see http://stats.stackexchange.com/questions/43601/modifying-the-time-granularity-of-a-state-sequence), which both shortens the sequences and yields more identical ones.
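A minimal sketch of the aggregation step, assuming the WeightedCluster package's `wcAggregateCases()` and hypothetical sequence columns 5:16 in a data frame `mydata`:

```r
library(TraMineR)
library(WeightedCluster)  # provides wcAggregateCases()

## Group identical cases and keep one representative per group.
agg <- wcAggregateCases(mydata[, 5:16])

## Build a weighted sequence object from the unique sequences only.
seqs.unique <- seqdef(mydata[agg$aggIndex, 5:16],
                      weights = agg$aggWeights)

## Distances are now computed between far fewer sequences.
dists <- seqdist(seqs.unique, method = "OM", indel = 1,
                 sm = seqsubm(seqs.unique, method = "CONSTANT", cval = 2))
```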

Matthias Studer
  • Thank you very much, that helps a lot! I'll just rely on subsampling techniques then. – Flow Jul 04 '13 at 08:34