
I have a large dataset with more than 250,000 observations, and I would like to use the TraMineR package for my analysis. In particular, I would like to use the commands `seqtree` and `seqdist`, which work fine when I use, for example, a subsample of 10,000 observations. The limit my computer can manage is around 20,000 observations.

I would like to use all the observations, and I do have access to a supercomputer that should be able to do just that. However, this doesn't help much, as the process runs on a single core only. Hence my question: is it possible to apply parallel computing techniques to the above-mentioned commands? Or are there other ways to speed up the process? Any help would be appreciated!
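For reference, a minimal sketch of the subsample workflow described above; the data frame `mydata`, the sequence columns `5:16`, and the covariate `sex` are hypothetical, not taken from the question:

```r
library(TraMineR)

## Draw a random subsample of 10,000 observations.
set.seed(1)
sub <- mydata[sample(nrow(mydata), 10000), ]

## Define the state sequences (columns 5:16 are hypothetical).
seqs <- seqdef(sub, var = 5:16)

## Pairwise optimal-matching distances, then a regression tree on them.
costs <- seqsubm(seqs, method = "CONSTANT", cval = 2)
dists <- seqdist(seqs, method = "OM", indel = 1, sm = costs)
tree  <- seqtree(seqs ~ sex, data = sub, diss = dists)
```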

Flow
  • Do the following three answers help? http://stackoverflow.com/questions/17085780/how-to-use-discrepancy-analysis-with-traminer-and-aggregated-sequence-data and http://stackoverflow.com/questions/15929936/problem-with-big-data-during-computation-of-sequence-distances-using-tramine and http://stats.stackexchange.com/questions/43540/how-to-randomly-select-5-of-the-sample – Matthias Studer Jul 04 '13 at 08:15
  • Dear Matthias, thanks for your answer. I am already using the sample procedure described in your links. What I am really looking for is a way to use multiple cores to speed up the distance computation in order to apply it to the entire dataset on the super-computer. I looked at some packages which allow you to do that, but they don't work for TraMineR. But I guess running multiple subsamples is fine as well. Thanks again. – Flow Jul 04 '13 at 08:29
  • The solutions I was suggesting are: aggregating identical sequences, using `seqdist(method="OMopt")`, and changing the time granularity (see here: http://stats.stackexchange.com/questions/43601/modifying-the-time-granularity-of-a-state-sequence) to have more identical sequences. Which limitation are you facing? CPU time or memory limit? – Matthias Studer Jul 04 '13 at 08:38
  • Thanks again. I'll look into the OMopt method. I already changed the time granularity (from monthly to annual data). Memory limit should not be an issue, but I have a CPU time limit of 7 days. Since 10,000 observations already take quite some time, and computing time seems to increase exponentially when adding more observations, I'm not sure if this is enough. But I will give it a try. – Flow Jul 04 '13 at 09:01
  • `seqdist` only computes distances between unique sequences. There are two factors that severely impact computation time: sequence length and the number of unique sequences. By reducing the time granularity, you affect both (see my edit of the answer below). Using trimesters may already have an impact. – Matthias Studer Jul 04 '13 at 09:15
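As a hedged illustration of the granularity change discussed in these comments: the TraMineRextras package provides a `seqgranularity()` helper; the sequence object `seqs`, the trimester span, and the aggregation method below are assumptions, not taken from the thread:

```r
library(TraMineR)
library(TraMineRextras)  # provides seqgranularity()

## Recode monthly sequences to trimesters, keeping the most frequent
## state within each 3-month span; sequences get shorter and more of
## them become identical, which speeds up seqdist.
seqs.trim <- seqgranularity(seqs, tspan = 3, method = "mostfreq")
```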

1 Answer


The internal `seqdist` function is written in C++ and has numerous optimizations. For this reason, if you want to parallelize `seqdist`, you need to do it in C++. The loop is located in the source file "distancefunctions.cpp"; look at the two loops around line 300 in the function "cstringdistance" (sorry, but all comments are in French). Unfortunately, the second important optimization is that memory is shared between all computations, and for this reason I think that parallelization would be very complicated.
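Since the C++ loop itself is hard to parallelize, one R-level workaround, in line with the subsampling idea from the comments, is to analyse several independent random subsamples in parallel. A sketch only; the data frame `mydata`, the column indices, and the core count are assumptions:

```r
library(parallel)
library(TraMineR)

## Compute OM distances for one random subsample of the data.
analyse_subsample <- function(seed, data, size = 10000) {
  set.seed(seed)
  sub  <- data[sample(nrow(data), size), ]
  seqs <- seqdef(sub, var = 5:16)  # hypothetical sequence columns
  seqdist(seqs, method = "OM", indel = 1,
          sm = seqsubm(seqs, method = "CONSTANT", cval = 2))
}

## One subsample per core; mclapply forks processes (Unix-alike
## systems), and mc.cores depends on the machine.
dist.list <- mclapply(1:8, analyse_subsample, data = mydata, mc.cores = 8)
```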

Apart from selecting a sample, you should consider the following optimizations (mentioned in the comments above):

  • Aggregate identical sequences, so that each distance is computed only once per pair of unique sequences (see the sketch below).
  • Use the optimized algorithm via `seqdist(method="OMopt")`.
  • Reduce the time granularity (see http://stats.stackexchange.com/questions/43601/modifying-the-time-granularity-of-a-state-sequence), which both shortens the sequences and yields more identical ones.
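A minimal sketch of the aggregation step, assuming the WeightedCluster package's `wcAggregateCases()` and hypothetical sequence columns 5:16 in a data frame `mydata`:

```r
library(TraMineR)
library(WeightedCluster)  # provides wcAggregateCases()

## Group identical cases and keep one representative per group.
agg <- wcAggregateCases(mydata[, 5:16])

## Build a weighted sequence object from the unique sequences only.
seqs.unique <- seqdef(mydata[agg$aggIndex, 5:16],
                      weights = agg$aggWeights)

## Distances are now computed between far fewer sequences.
dists <- seqdist(seqs.unique, method = "OM", indel = 1,
                 sm = seqsubm(seqs.unique, method = "CONSTANT", cval = 2))
```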

Matthias Studer
  • Thank you very much, that helps a lot! I'll just rely on subsampling techniques then. – Flow Jul 04 '13 at 08:34