Introduction

I am working on a project where a lot of textual data needs to be processed: many fairly large (hundreds of MB) text files. Python is a requirement (don't ask why). I want to use C++ extensions to improve performance, and I decided to go with SWIG. I have a pattern-matching algorithm that is much faster than the usual Python "string".find("pattern"). I was surprised to see that it is much slower when used as a Python extension. That shouldn't happen. I think I am close to finding the reason, but I need your help.

Problem

I wrote a simple extension with a class containing a method that does NOTHING: it simply takes a string as a parameter and returns a numeric value, with no processing inside the function:

nothing.h:

#ifndef NOTHING_H
#define NOTHING_H

#include <string.h>
#include <iostream>

using namespace std;

    class nothing {
        protected:
            int zm = 5;
        public:
            virtual int do_nothing(const char *empty);
    };

#endif

nothing.cpp:

#include "nothing.h"

int nothing::do_nothing(const char *empty) {
    return this->zm;
}

nothing.i:

%module nothing
%include <std_string.i>

using std::string;
using namespace std;
%{
    #include "nothing.h"
%}


class nothing {
    protected:
        int zm = 5;
    public:
        virtual int do_nothing(const char *empty);
};

test.py:

import nothing
import time

data = ""
with open('../hugefile', 'rb') as myfile:
    data=myfile.read().decode(errors='replace')

n = len(data)

zm = nothing.nothing()
start = time.time()
res = zm.do_nothing(data)
end = time.time()
print("Nothing time: {}".format(end - start))


zm = nothing.nothing()
start = time.time()
res = data.find("asdasdasd")
end = time.time()
print("Find time   : {}".format(end - start))

Compilation steps:

swig -c++ -py3 -extranative -python nothing.i
g++ -fpic -lstdc++ -O3 -std=c++11 -c nothing.cpp nothing_wrap.cxx -I/usr/include/python3.7m
g++ -shared nothing.o nothing_wrap.o -o _nothing.so

Output:

$ python3 test.py
Nothing time: 0.3149874210357666
Find time   : 0.09926176071166992

As you can see, although do_nothing() should be much faster than find(), it is a lot slower!

Any idea whether this can be solved somehow? To me it looks like the data is being converted or copied.

Why do I think the whole data is copied? Because if I slightly change the function do_nothing() to the following (headers omitted):

int nothing::do_nothing() { // removed the argument
    return this->zm;
}

Then the result is as expected:

$ python3 test.py
Nothing time: 4.291534423828125e-06
Find time   : 0.10114812850952148
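The copy hypothesis can be checked roughly without the extension by timing the str-to-UTF-8 conversion on its own in pure Python. This is only a sketch: the synthetic `data` string and the `time_call` helper are assumptions introduced for illustration, and the exact conversion path taken by the SWIG wrapper may differ. CPython stores a str in its own internal representation, so producing UTF-8 bytes for text that contains non-ASCII characters (such as the U+FFFD inserted by errors='replace') involves a pass over the whole string.

```python
import time

def time_call(fn, *args):
    """Return (elapsed_seconds, result) for a single call of fn(*args)."""
    start = time.perf_counter()
    result = fn(*args)
    return time.perf_counter() - start, result

# Synthetic stand-in for the decoded file contents (assumption: mostly ASCII
# text with some U+FFFD replacement characters, as errors='replace' produces).
data = "some text line with a replaced byte \ufffd\n" * 200_000

# Time the full str -> UTF-8 conversion and the search separately.
encode_time, encoded = time_call(data.encode, "utf-8")
find_time, pos = time_call(data.find, "asdasdasd")

print("Encode time: {}".format(encode_time))
print("Find time  : {}".format(find_time))
print("chars: {}, UTF-8 bytes: {}".format(len(data), len(encoded)))
```

If the encode time is of the same order as the "Nothing time" measured above, that would be consistent with the argument being converted on every call, though it does not prove it.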
user2864740
nosbor
  • Python has to create an *unmanaged* object / `char *` / string (and yes, this means allocating and copying the data) before making the call. – user2864740 Dec 31 '18 at 18:53
  • I wonder if using a *non-Unicode* 'string' / byte-array (all Python 3 strings are Unicode, which is a change from Python 2.x) would allow SWIG a no-copy opportunity..? Alternatively, perhaps accept the Python [string] object itself without an implicit native transformation? – user2864740 Dec 31 '18 at 18:57
  • Are you looking for something like [this](https://github.com/pairinteraction/pairinteraction/blob/954f865f44bcd2c467c3077e2315c063a26cf6cc/libpairinteraction/Interface.i.cmakein#L73-L77)? – Henri Menke Dec 31 '18 at 21:39
  • I think you're looking for something more like this instead: https://stackoverflow.com/a/16998687/168175 – Flexo Jan 02 '19 at 17:19

1 Answer

You'll probably want to pass the filename to C and open and search the file there. You are reading bytes, converting those bytes to Unicode, and then converting back to bytes inside the timed portion. The documentation below explains the internals:

https://docs.python.org/3/c-api/unicode.html

If the file is UTF-8, then leave it as bytes by removing the decode, or just pass the filename and load it in C.
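A minimal sketch of the "leave it as bytes" option on the Python side, assuming the data is UTF-8 encoded. The `raw` buffer below is a synthetic stand-in for `myfile.read()` without the decode, and the pattern "needle" is hypothetical. Only the small pattern is encoded; the large buffer is never converted.

```python
# Synthetic stand-in for the raw file contents (assumption: UTF-8 encoded text).
raw = b"header\n" + b"x" * 1_000_000 + b"needle" + b"y" * 10

# Encode the small pattern once instead of decoding the large buffer.
pattern = "needle".encode("utf-8")
pos = raw.find(pattern)

# memoryview slices reference the same underlying buffer instead of copying it.
view = memoryview(raw)
window = view[pos:pos + len(pattern)]

print(pos)
print(bytes(window))
```

Whether a SWIG-generated wrapper accepts a bytes object for a `const char *` parameter without copying depends on the typemap in use, so that part would still need to be measured.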

MarkReedZ
  • Thanks, but I need to operate on strings, not files. I will do many more operations on those strings, so saving to and loading from disk each time is not a good option for my problem. I just need to pass a reference to the string to the C++ extension. – nosbor Jan 01 '19 at 02:15
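To illustrate what "passing a reference" to in-memory data can look like, here is a small sketch that uses the standard-library ctypes module instead of SWIG (a different binding mechanism, swapped in only because it can be shown without a build step). It assumes the data is kept as bytes; the `data` payload is hypothetical. In CPython, a `c_char_p` created from a bytes object points at that object's internal buffer, which the matching addresses below make visible.

```python
import ctypes

data = b"example payload " * 1000  # hypothetical in-memory data, kept as bytes

# Two c_char_p values built from the same bytes object (CPython behavior:
# each stores a pointer to the object's internal buffer, not to a copy).
p1 = ctypes.c_char_p(data)
p2 = ctypes.c_char_p(data)
addr1 = ctypes.cast(p1, ctypes.c_void_p).value
addr2 = ctypes.cast(p2, ctypes.c_void_p).value

print(addr1 == addr2)
# Read a few bytes back from that address to confirm it holds the payload.
print(ctypes.string_at(addr1, 7))
```

The same data held as a str would first need to be encoded, which is where the extra pass over the data comes from.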