Introduction
I am working on a project where a lot of textual data needs to be processed: many quite large (hundreds of MB) text files. Python is a requirement (don't ask why). I want to use C++ extensions to improve performance, and I decided to go with SWIG. I have a pattern-matching algorithm that is much faster than the usual Python "string".find("pattern"). I was surprised to see that it is much slower when used as a Python extension. That shouldn't happen. I think I am quite close to finding the reason, but I need your help.
Problem
I wrote a simple extension with a class containing a method that does NOTHING: it simply takes a string as a parameter and returns a numeric value, with no processing happening in the function:
nothing.h:
#ifndef NOTHING_H
#define NOTHING_H
#include <string.h>
#include <iostream>
using namespace std;
class nothing {
protected:
    int zm = 5;
public:
    virtual int do_nothing(const char *empty);
};
#endif
nothing.cpp:
#include "nothing.h"
int nothing::do_nothing(const char *empty) {
    return this->zm;
}
nothing.i:
%module nothing
%include <std_string.i>
using std::string;
using namespace std;
%{
#include "nothing.h"
%}
class nothing {
protected:
    int zm = 5;
public:
    virtual int do_nothing(const char *empty);
};
test.py:
import nothing
import time
data = ""
with open('../hugefile', 'rb') as myfile:
    data = myfile.read().decode(errors='replace')
n = len(data)
zm = nothing.nothing()
start = time.time()
res = zm.do_nothing(data)
end = time.time()
print("Nothing time: {}".format(end - start))
zm = nothing.nothing()
start = time.time()
res = data.find("asdasdasd")
end = time.time()
print("Find time : {}".format(end - start))
Compilation steps:
swig -c++ -py3 -extranative -python nothing.i
g++ -fpic -lstdc++ -O3 -std=c++11 -c nothing.cpp nothing_wrap.cxx -I/usr/include/python3.7m
g++ -shared nothing.o nothing_wrap.o -o _nothing.so
Output:
$ python3 test.py
Nothing time: 0.3149874210357666
Find time : 0.09926176071166992
As you can see, although do_nothing() should be much faster than find(), it is about three times slower!
Any idea whether this can be solved somehow? To me it looks like the data is being converted or copied.
Why do I think the whole data is copied? Because if I slightly change the function do_nothing() to the following (I am omitting the headers):
int nothing::do_nothing() { // removed the argument
    return this->zm;
}
Then the result is as expected:
$ python3 test.py
Nothing time: 4.291534423828125e-06
Find time : 0.10114812850952148
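One way to test the copy/conversion suspicion without involving SWIG at all is to time a full UTF-8 encoding of a large string in pure Python. This is only a rough sketch. It assumes, without my having verified it in the generated wrapper, that passing a Python 3 str where a const char * is expected forces the whole string to be encoded to UTF-8 first. The helper name time_utf8_encode and the synthetic sample string are made up for illustration and are not part of the extension.

```python
import time


def time_utf8_encode(text):
    """Return (seconds, encoded_size) for one full UTF-8 encoding of text."""
    start = time.perf_counter()
    # str.encode allocates a new bytes object, so this is a full pass
    # over the data plus a copy (the assumed cost of a str -> char* conversion).
    encoded = text.encode("utf-8")
    elapsed = time.perf_counter() - start
    return elapsed, len(encoded)


if __name__ == "__main__":
    # Synthetic stand-in for the huge file; it contains a non-ASCII
    # character so the string is not pure ASCII.
    sample = "some text with a non-ASCII character: \u00e9\n" * 1000000
    elapsed, size = time_utf8_encode(sample)
    print("Encode time: {} for {} bytes".format(elapsed, size))
```

If the encode time for the real file is of the same order as the "Nothing time" above, that would support the idea that the time is spent converting the argument rather than in do_nothing() itself.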