
I am running a Python script on a Linux server. It is based on scikit-learn's CountVectorizer. Parts of scikit-learn are written in Cython, so C extensions are being used.

As long as the number of vectors is limited, everything works fine, but once the number increases, I get a segmentation fault. I think the code goes wrong around here:

    def train(bodies, y_train, analyzetype, ngrammax, table, dim, features):
        vectorizer = CountVectorizer(input='content',
                                     analyzer='char',
                                     tokenizer=tokenize,  # ignored when analyzer='char'
                                     ngram_range=(1, 4),
                                     lowercase=False)
        X_train = combine(vectorizer.fit_transform(bodies),
                          embeddings(bodies, table, dim),
                          features)
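Since the crash only shows up when the corpus grows, one mitigation worth trying (my suggestion, not something from the post) is to replace CountVectorizer with scikit-learn's HashingVectorizer. It is stateless, so it never builds the large in-memory vocabulary that CountVectorizer accumulates on big corpora; a minimal sketch with settings matching the snippet above:

```python
# Sketch: HashingVectorizer with the same analyzer settings as the
# CountVectorizer above. It hashes n-grams into a fixed number of
# columns instead of storing a vocabulary, capping memory use.
from sklearn.feature_extraction.text import HashingVectorizer

vectorizer = HashingVectorizer(input='content',
                               analyzer='char',
                               ngram_range=(1, 4),
                               lowercase=False,
                               n_features=2**18)  # fixed output width

X = vectorizer.transform(["some example body", "another body"])
print(X.shape)  # sparse matrix: (2, 262144), one row per document
```

The trade-off is that hashing is one-way: you lose the ability to map a column index back to the n-gram it represents, and distinct n-grams can collide in the same column.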

I already set the stack size to unlimited using

ulimit -s unlimited

This did not solve the issue.
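Note that `ulimit` only applies to the shell it was run in and the processes started from it. One way to verify the limit actually reached the Python process (a diagnostic sketch, not from the post) is the stdlib `resource` module:

```python
import resource

# Query the stack-size limit of the current process.
# resource.RLIM_INFINITY means "unlimited".
soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
print("stack soft limit:", "unlimited" if soft == resource.RLIM_INFINITY else soft)
print("stack hard limit:", "unlimited" if hard == resource.RLIM_INFINITY else hard)
```

If the soft limit printed here is not `unlimited`, the `ulimit` setting never made it to the process that segfaults.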

I also tried tracing the issue down to a specific line, but unfortunately I couldn't make this work.
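One way to get the Python-level location of a crash in a C extension (my suggestion, not something tried in the post) is the stdlib `faulthandler` module, which installs a handler that dumps the Python tracebacks of all threads to stderr when the process receives SIGSEGV. It shipped with Python 3.3+ and is available as a PyPI backport for Python 2:

```python
import faulthandler

# On SIGSEGV/SIGFPE/SIGABRT/SIGBUS, dump the current Python
# tracebacks to stderr before the process dies.
faulthandler.enable()

assert faulthandler.is_enabled()
```

It can also be enabled without editing the script, via `python -X faulthandler script.py` or `PYTHONFAULTHANDLER=1 python script.py`. For the C-level stack, running the script under `gdb --args python script.py` and issuing `bt` after the fault is the usual complement.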

    I don't think it's realistic to assume someone can determine the cause of a segmentation fault in your script if you "will not go in too much detail about what it does exactly". You're going to need to show some specific code in the area that you think it's failing. – lurker Dec 21 '13 at 15:03
  • Create a minimal testcase and report it as a bug. If python code that doesn't use things like ctypes segfaults it's pretty much never your fault. – ThiefMaster Dec 21 '13 at 15:23
  • I forgot to mention: Parts of Scikit learn are written in Cython, so C-extensions are being used. – Lucas1988 Dec 21 '13 at 15:30

0 Answers