Pybind11 slows my C++ code substantially

Greetings everyone,

Continuing with my particle simulation, I am trying to turn it into a Python package (I work in maths, and most people would never be able to use my code if it didn't come wrapped in either Python or Matlab lol).

I set everything up with pybind11; I hope some of you are familiar with it.

I essentially have two classes: model<size_t DIMENSION>, specifying the model (e.g. how many particles, how they interact, etc.), and a simulation class simulation<size_t DIMENSION>, which does all the computations. The simulation gets passed the model as a reference. The template parameter DIMENSION may be 1, 2, or 3.

The idea now is to expose the simulation class, mainly its run() function, to Python, so that a user just needs to do
import my_module

# specify some args

simu = Simulation(...)  # args from above
simu.run()


The problem is that run(), called like this from Python, is roughly 30 to 50% slower than when I set up the same simulation from a C++ main.cpp. All the heavy lifting happens inside run(), and there is no explicit Python/C++ interaction going on within it.

My pybind file looks like this:

#include <pybind11/pybind11.h>
#include "simulation.h"
#include "model.h"
#include <string>
#include <memory>
#include <variant>
#include <stdexcept>   // std::runtime_error

namespace py = pybind11;


// using variant because my model comes in 1D, 2D, and 3D versions.
using ModelVariant = std::variant<
    std::unique_ptr<model<1>>, 
    std::unique_ptr<model<2>>, 
    std::unique_ptr<model<3>>
>;

// same for simulation
using SimuVariant = std::variant<
    std::unique_ptr<simulation<1>>, 
    std::unique_ptr<simulation<2>>, 
    std::unique_ptr<simulation<3>>
>;


// Wrapper class exposed to Python.
class PythonWrapper {

    public:
        
        PythonWrapper(int dim, /* other args*/) 
        {
            if (dim == 1) {
                model_variant = std::make_unique<model<1>>(/* args */);
                simulation_variant = std::make_unique<simulation<1>>(
                    *std::get<std::unique_ptr<model<1>>>(model_variant), /* other args */);
            }
            else if (dim == 2) {
                model_variant = std::make_unique<model<2>>(/* args */);
                simulation_variant = std::make_unique<simulation<2>>(
                    *std::get<std::unique_ptr<model<2>>>(model_variant), /* other args */);
            }
            else if (dim == 3) {
                model_variant = std::make_unique<model<3>>(/* args */);
                simulation_variant = std::make_unique<simulation<3>>(
                    *std::get<std::unique_ptr<model<3>>>(model_variant), /* other args */);
            }
            else {
                throw std::runtime_error("Unsupported dimension. Only 1D, 2D, and 3D are allowed.");
            }
        }

        void run() {
            std::visit([](auto& simulation) { simulation->run(); }, simulation_variant);    // This will be much slower when run from Python for some reason.
        }

    private:

        ModelVariant model_variant;   
        SimuVariant simulation_variant;

};


// make wrapper class known to Python

PYBIND11_MODULE(my_module, m) {
    
    // Bind wrapper class.
    py::class_<PythonWrapper>(m, "Simulation")
        .def(py::init</* args */>())
        .def("run", &PythonWrapper::run);
}


Compilation happens via
g++ -O3 -fopenmp -shared -fPIC $(python3 -m pybind11 --includes) -o my_module.so src/pybind.cpp $(python3-config --ldflags)

The plain C++ version, with a main.cpp, is compiled via
g++ -O3 -fopenmp -o my_simu src/main.cpp
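For reference, the same two commands with the shared flags factored out into one variable, so nothing can silently differ between the two builds (-march=native is an extra assumption here, not part of my original commands):

```shell
# Shared optimization flags, applied identically to both builds.
CXXFLAGS="-O3 -fopenmp -march=native"

# Python module
g++ $CXXFLAGS -shared -fPIC $(python3 -m pybind11 --includes) \
    -o my_module.so src/pybind.cpp $(python3-config --ldflags)

# Plain C++ binary
g++ $CXXFLAGS -o my_simu src/main.cpp
```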

Does anyone have an idea whether I am doing something wrong?
It's not you, it's the Python.
To call it from Python, any of several things may be going on, including copying the data extra times, converting the data types to Python-compatible types, run-time type checking, being unable to run multi-threaded code as expected, and probably other stuff. Right off, check the number of threads it spawns versus what you told it in C++, to see if it's just that limitation.

One site suggested 'nanobind' for more complex projects.
Also, you may try rewriting your program as an executable that you call from another language, providing some way to get the output into the calling program (probably just a pipe, or binary file passing?). This would render language specifics irrelevant. There is a cost -- it's static -- but in the grand scheme your program will run as fast as it can, and the overhead is just the cost of firing up the executable and recovering the output, both of which should be fairly quick/cheap.
If you wanted to format the output for Matlab, which is pretty close to C++, you can look up what it needs and then either use that as the transfer format for everyone, or make it a switch (command-line arg) on your executable if it's inefficient for everyone else (I think it may be a little bloated). That is a crude method -- don't Unix shared objects work like a DLL on Windows, where any major language can just interface to the library file?!

I think you did everything right -- the main suggestion is to call one C++ megafunction that does everything and comes back, rather than mixing the languages and calling C++ bits inside Python loops etc.
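One pybind11-specific thing along those lines (an assumption on my part, not something confirmed here -- it only helps if the interpreter lock is actually interfering, and it is only safe because run() never touches Python objects): release the GIL for the duration of the megafunction. A fragment against the wrapper above:

```cpp
// Inside the PYBIND11_MODULE block: drop the GIL while run() executes,
// so C++ worker threads are not serialized against the interpreter.
py::class_<PythonWrapper>(m, "Simulation")
    .def(py::init</* args */>())
    .def("run", &PythonWrapper::run,
         py::call_guard<py::gil_scoped_release>());
```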

I have not had amazing luck mixing Python with C++ or anything else. Remember what Python IS. Python is the layman's language -- the new BASIC, in some ways -- and it's great for that purpose, but the things it is bad at (number crunching)... it is very, very bad at them.
You say it's slower but what time frame are we talking about? If it only runs for a few milliseconds then even a small fixed overhead would seem big, much bigger than if it ran for seconds or minutes.
@Peter87
We are talking seconds to minutes to hours, depending on the user's input (e.g. how many particles).

@jonnin
Thanks so much for the feedback! I am surprised that pybind11 is so slow. On their GitHub, I only found threads about how frequent interface crossings are expensive, but nothing on how a single routine can be so much slower.

One site suggested 'nanobind' for more complex projects.

I just looked at that, they claim it has 10 times lower overhead and similar syntax. Maybe worth a try.

Also, you may try rewriting your program as an executable that you call from another language, providing some way to get the output into the calling program (probably just a pipe, or binary file passing?). This would render language specifics irrelevant. There is a cost -- it's static -- but in the grand scheme your program will run as fast as it can, and the overhead is just the cost of firing up the executable and recovering the output, both of which should be fairly quick/cheap.

That would be an option: write a Python function that calls the executable and uses the system's pipe utilities to get the output. Or maybe just have the executable print to a file and have Python read from the file. But can that be done cross-platform? Can one turn that into a package?

Of course it can. You may need to hand-wave at byte ordering if you go full bore on cross-platform and use a file; I think the pipes just work regardless. You may need an #ifdef section if you need it to work on Windows without using g++ for Windows (e.g. if you want it to work with Visual Studio or whatnot). Or just tell them to use g++ on Windows :)

If you swap files between boxes and need to deal with byte order, one stupidly easy thing you can do is stuff a known integer value as the first thing in the file. Read it, and if it's backwards from what you expected, call a different file-reading routine that flips the offending values back to native.