I am running a Python script on a Linux server that uses scikit-learn's CountVectorizer. Parts of scikit-learn are written in Cython, so C extensions are involved.
As long as the number of vectors stays small everything works fine, but when it grows the script crashes with a segmentation fault. I believe the failure occurs around this part of the code:
def train(bodies, y_train, analyzetype, ngrammax, table, dim, features):
    vectorizer = CountVectorizer(input='content',
                                 analyzer='char',
                                 tokenizer=tokenize,  # custom tokenizer defined elsewhere
                                 ngram_range=(1, 4),
                                 lowercase=False)
    # combine() and embeddings() are helper functions defined elsewhere
    X_train = combine(vectorizer.fit_transform(bodies),
                      embeddings(bodies, table, dim),
                      features)
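For context, here is a minimal, self-contained sketch (the two-document corpus is a placeholder, not my real data) showing how quickly the character n-gram setting alone grows the feature space, which is where I suspect the memory pressure comes from:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Tiny placeholder corpus; the real `bodies` list is much larger.
docs = ["abc", "abd"]

# Same analyzer/ngram settings as in train() above.
vectorizer = CountVectorizer(analyzer='char', ngram_range=(1, 4), lowercase=False)
X = vectorizer.fit_transform(docs)

# Even two 3-character strings already yield 9 distinct 1-4 character n-grams.
print(X.shape)        # (2, 9)
print(X.data.nbytes)  # bytes held by the sparse matrix's value array
```

On the real corpus the vocabulary grows far faster than the document count, so the sparse matrix and the vectorizer's vocabulary dictionary can become very large.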
I already set the stack size to unlimited using
ulimit -s unlimited
This did not solve the issue.
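To double-check that the ulimit actually reaches the Python process (it only affects the shell it was run in and that shell's children), I can query the limit from inside the script with the standard-library resource module:

```python
import resource

# Soft/hard stack limits as seen by this process;
# resource.RLIM_INFINITY means "unlimited".
soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
print(soft == resource.RLIM_INFINITY, hard == resource.RLIM_INFINITY)
```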
I also tried to trace where the crash happens by logging the executed line numbers, but unfortunately I could not get this to work.
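One approach that might apply here: the standard-library faulthandler module can dump a Python-level traceback when the process receives a fatal signal such as SIGSEGV, even if the crash happens inside a C extension. A minimal sketch:

```python
import faulthandler

# Dump the Python traceback to stderr on SIGSEGV, SIGFPE, SIGABRT,
# SIGBUS or SIGILL (e.g. a crash inside a C extension).
faulthandler.enable()

# ... the rest of the script (imports, train(), etc.) runs as before ...
```

The same effect is available without editing the script by running `python -X faulthandler script.py` or setting the environment variable `PYTHONFAULTHANDLER=1`.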