Scientific computing - optimizing simulation code for speed

Greetings,

This is a longer post, so thank you in advance to anyone who takes the time. It may be that no one can help without seeing the whole GitHub repository (which I can't share at the moment), but I will still try to summarize the issue.

I need some help with optimizing my simulation code for speed.


Particles move in a cubic simulation box, with their positions and velocities updated in a loop. In each iteration a force is computed (the most expensive part of the simulation), which requires the distances between all particle pairs:
    for (int i = 0; i < model.N_particles; ++i) {
        for (int j = i + 1; j < model.N_particles; ++j) {
        
            // get distance between particle i and j
            
            // compute interaction between i and j 
        
        }
    }


I have implemented the whole simulation in an OO setup, where the force loop uses OpenMP multithreading (for which I had gotten help here: https://legacy.cplusplus.com/forum/beginner/285710/). I am quite happy with the implementation, except for one thing: I have two separate code versions, one for the one-dimensional case (particles move in 1D) and one for the two-dimensional case. I wanted to unify them into a single code base where the dimension is passed via `argv`. This turned out to be surprisingly difficult to do without losing a lot of speed (due to dynamic memory allocation).
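Schematically, the goal is a single executable like this (just a sketch; the names and parameters are placeholders, not my actual code):

#include <cstdlib>
#include <iostream>

int main(int argc, char* argv[]) {
    if (argc < 2) { std::cerr << "usage: ips <dimension> ...\n"; return 1; }
    const int dim = std::atoi(argv[1]);   // 1 or 2 (maybe 3 later)

    // set up the model and run the same simulation code for any dim
    // simulation sim(dim, /* other parameters */);
    // sim.run();
}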

More details:
In the fixed-dimension case, I work with a struct to store positions, velocities, forces, etc.:
struct coordinate{ 
    double x{0};
    double y{0};
};


This is the multithreaded force function:
inline void simulation::compute_force_par()    // part of the simulation class
{
    // forces_for_all_tasks holds one force vector per thread.
    for (auto& forces_for_specific_task : forces_for_all_tasks)
        std::fill(forces_for_specific_task.begin(), forces_for_specific_task.end(), coordinate{0, 0});

    // iterate over particle pairs
    #pragma omp parallel for schedule(dynamic) num_threads(THREADS)
    for (int i = 0; i < model.N_particles; ++i) {
        for (int j = i + 1; j < model.N_particles; ++j) {

            coordinate distance = model.get_distances_ij(i, j);             // get distance from model
            coordinate force_ij = (model.*(model.get_force_ij))(distance);  // compute interaction

            // accumulate into this thread's private force vector
            std::vector<coordinate>& forces_for_this_task = forces_for_all_tasks[omp_get_thread_num()];

            forces_for_this_task[i].x += force_ij.x;
            forces_for_this_task[i].y += force_ij.y;
            forces_for_this_task[j].x -= force_ij.x;  // Newton's third law: F_ji = -F_ij
            forces_for_this_task[j].y -= force_ij.y;
        }
    }

    // Sum all of the per-thread forces into the output
    std::fill(model.forces.begin(), model.forces.end(), coordinate{0, 0});
    for (auto const& forces_for_specific_task : forces_for_all_tasks)
        for (int i = 0; i < model.N_particles; ++i)
        {
            model.forces[i].x += forces_for_specific_task[i].x;
            model.forces[i].y += forces_for_specific_task[i].y;
        }
}

The distance is computed via
inline coordinate IPS_model::get_distances_ij(const size_t i, const size_t j){  // part of the IPS_model class

    double dx {positions[i].x - positions[j].x};   // positions is a member of IPS_model of
    double dy {positions[i].y - positions[j].y};   // type std::vector<coordinate>

    // take care of periodic boundaries...

    coordinate distances {dx, dy};

    return distances;
}

In particular, I get away with creating a `coordinate` object from scratch each time the function is run.
The ugly syntax for the interaction function pointer, `coordinate force_ij = (model.*(model.get_force_ij))(distance);`, is there because I have several possible interaction functions `get_force_ij` which are chosen at runtime (also passed via `argv`); their inner structure is similar to `get_distances_ij`.
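For anyone unfamiliar with the pointer-to-member syntax, here is a minimal standalone sketch of the pattern (the class and function names are illustrative, not my actual code):

#include <string>

struct coordinate { double x{0}, y{0}; };

class IPS_model_demo {
public:
    using force_fn = coordinate (IPS_model_demo::*)(coordinate) const;

    coordinate force_gauss(coordinate d) const { return {-d.x, -d.y}; } // placeholder math
    coordinate force_lj(coordinate d) const { return {2*d.x, 2*d.y}; }  // placeholder math

    force_fn get_force_ij = &IPS_model_demo::force_gauss; // default choice

    void select(const std::string& name) {   // e.g. chosen from argv at runtime
        get_force_ij = (name == "lj") ? &IPS_model_demo::force_lj
                                      : &IPS_model_demo::force_gauss;
    }
};

int main() {
    IPS_model_demo model;
    model.select("lj");
    coordinate f = (model.*(model.get_force_ij))({1.0, 2.0}); // same syntax as above
    (void)f;
}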

Now, the problem with creating a version that works for d dimensions is that the struct `coordinate` cannot be built like this, because the number of `double` members it needs to store is not known at compile time. Naively, in a first attempt, I got rid of the struct altogether and used std::vector<double> instead. This, however, led to code that was 10 times slower! The reason seemed to be two-fold:
1) Recreating a std::vector inside get_distances_ij, instead of the line `coordinate distances {dx, dy}`, is much more expensive. I think this is because the vector's memory is allocated dynamically, whereas the struct's size is known at compile time.
2) The `positions`, `velocities`, and `forces` vectors are no longer of type std::vector<coordinate> (each coordinate being a struct holding two `double`), but std::vector<std::vector<double>>. This layout is less cache friendly, as the inner vectors are not stored contiguously (as opposed to the double values of the struct).

I then created a version where no vectors are recreated in the tight double loop; they are only filled with elements. I also turned the 2D vectors into flattened 1D vectors to improve cache friendliness (see the sketch below). That led to a much faster version, but it was still twice as slow as the original one.
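To be concrete about the flattened layout: component d of particle i lives at index i*dimension + d, so everything sits in one contiguous allocation. A minimal sketch (the names are illustrative):

#include <cstddef>
#include <vector>

// Flattened storage: component d of particle i lives at index i*dimension + d,
// so all coordinates are contiguous in a single allocation.
struct flat_state {
    std::size_t dimension;
    std::vector<double> positions;   // size N_particles * dimension

    flat_state(std::size_t N, std::size_t dim)
        : dimension(dim), positions(N * dim, 0.0) {}

    double& pos(std::size_t i, std::size_t d) { return positions[i * dimension + d]; }
};

// Distance written into a caller-provided buffer: no allocation in the hot loop.
inline void get_distances_ij(const flat_state& s, std::size_t i, std::size_t j,
                             double* out) {
    for (std::size_t d = 0; d < s.dimension; ++d)
        out[d] = s.positions[i * s.dimension + d] - s.positions[j * s.dimension + d];
}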

(end of post for word length limit)
I asked ChatGPT for advice, and it recommended using a std::array-based structure, like this:
#include <array>
#include <stdexcept>

constexpr size_t MAXDIM = 3; // Maximum allowed dimension.

class coordinate {

    public:

        // Default constructor.
        coordinate()
            : dimension {0} {
                coords.fill(0.0); // Initialize all elements to 0.0.
        }

        coordinate(size_t dim)
            : dimension {dim} {
                coords.fill(0.0);
        }

        // Set the runtime dimension.
        void set_dimension(size_t dim) {
            if (dim > MAXDIM) {
                throw std::out_of_range("Dimension exceeds MAXDIM.");
            }
            dimension = dim;
        }

        // Access operators for reading/writing within the runtime dimension.
        double& operator[](size_t index) {
            // if (index >= dimension) {
            //     throw std::out_of_range("Index exceeds runtime dimension.");
            // }
            return coords[index];
        }

        const double& operator[](size_t index) const {
            // if (index >= dimension) {
            //     throw std::out_of_range("Index exceeds runtime dimension.");
            // }
            return coords[index];
        }

        size_t size() const { return dimension; } // Getter for the actual dimension.

    private:
        std::array<double, MAXDIM> coords; // Fixed-size array.
        size_t dimension;                  // Actual dimension.
};

The trick here is that while the actual dimension is passed at runtime, one always allocates memory for the maximum allowed dimension, so that the memory needed for the struct is known at compile time.
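Usage then looks roughly like this (a sketch; in my code `dim` comes from argv):

#include <cstddef>
#include <vector>

// (uses the coordinate class above)
void example(std::size_t dim /* from argv, <= MAXDIM */, std::size_t N) {
    // Every element occupies sizeof(double) * MAXDIM bytes regardless of dim,
    // so the vector is one contiguous block with a compile-time element size.
    std::vector<coordinate> positions(N, coordinate(dim));

    for (std::size_t d = 0; d < dim; ++d)
        positions[0][d] = 1.0;
}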

These are the resulting force routines:
inline void simulation::compute_force_par()
{
    #pragma omp parallel num_threads(THREADS)
    {
        thread_local coordinate distance(model.dimension);
        thread_local coordinate force_ij(model.dimension);

        #pragma omp for schedule(dynamic)
        for (int i = 0; i < model.N_particles; ++i) {
            for (int j = i + 1; j < model.N_particles; ++j) {

                model.get_distances_ij(i, j, distance);
                (model.*(model.get_force_ij))(distance, force_ij);

                std::vector<coordinate>& forces_for_this_task = forces_for_all_tasks[omp_get_thread_num()];

                for (size_t dim = 0; dim < model.dimension; ++dim){
                    forces_for_this_task[i][dim] += force_ij[dim];
                    forces_for_this_task[j][dim] -= force_ij[dim];
                }
            }
        }
    }

    // Sum all of the per-thread forces into the output.
    std::fill(model.forces.begin(), model.forces.end(), coordinate(model.dimension));
    for (auto const& forces_for_specific_task : forces_for_all_tasks)
        for (int i = 0; i < model.N_particles; ++i)
            for (size_t dim = 0; dim < model.dimension; ++dim){
                model.forces[i][dim] += forces_for_specific_task[i][dim];
            }
}


inline void IPS_model::get_distances_ij(const size_t i, const size_t j, coordinate& distances){

    const double two_L {2*L};

    for (size_t dim = 0; dim < dimension; ++dim){
        double dx = positions[i][dim] - positions[j][dim];
        // take care of periodic boundaries...
        distances[dim] = dx;
    }
}
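
(The elided part just applies periodic boundaries to each component, along the lines of the standard minimum-image wrap, which is what `two_L` is for; a sketch, assuming the box spans [-L, L]:)

// sketch: map each separation component into [-L, L) for a periodic box of side two_L
inline double minimum_image(double dx, const double L, const double two_L) {
    if (dx >= L)      dx -= two_L;
    else if (dx < -L) dx += two_L;
    return dx;
}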


ChatGPT promised that this would be the most efficient implementation possible. Yet the code is 2 to 3 times slower than the original version with the compile-time-known dimension.

I have absolutely no clue why there is such a huge difference in speed.


I compile with `g++ -O0` and use a single thread to run it.

Does anyone have an idea what's going on, or what the best way to handle this is?

Using `perf`, I verified that 98% of the work really does come from the force routine. These are the first lines of the `perf` report for the new version with flexible dimension:
+   99.46%     0.00%  IPS_stdarr.out  libgomp.so.1.0.0      [.] GOMP_parallel
+   99.46%     0.00%  IPS_stdarr.out  [unknown]             [.] 0x00000000000003e8
+   99.46%     0.00%  IPS_stdarr.out  [unknown]             [.] 0x0000000000000003
+   99.41%     0.00%  IPS_stdarr.out  [unknown]             [.] 0x0000560a677dc290
+   99.15%    10.41%  IPS_stdarr.out  IPS_stdarr.out        [.] simulation::compute_force_par() [clone ._omp_fn.0]
+   39.23%    12.42%  IPS_stdarr.out  IPS_stdarr.out        [.] coordinate::operator[](unsigned long)
+   31.86%    13.47%  IPS_stdarr.out  IPS_stdarr.out        [.] IPS_model::get_distances_ij(unsigned long, unsigned long, coordinate&)
+   31.31%     9.97%  IPS_stdarr.out  IPS_stdarr.out        [.] IPS_model::get_force_ij_gauss(coordinate const&, coordinate&)
+   26.26%    13.52%  IPS_stdarr.out  IPS_stdarr.out        [.] std::array<double, 3ul>::operator[](unsigned long)
+   18.38%    13.56%  IPS_stdarr.out  IPS_stdarr.out        [.] std::__array_traits<double, 3ul>::_S_ref(double const (&) [3], unsigned long)
+   16.19%     5.31%  IPS_stdarr.out  IPS_stdarr.out        [.] coordinate::operator[](unsigned long) const
+   11.72%     6.33%  IPS_stdarr.out  IPS_stdarr.out        [.] std::array<double, 3ul>::operator[](unsigned long) const
+    7.03%     4.76%  IPS_stdarr.out  IPS_stdarr.out        [.] std::vector<coordinate, std::allocator<coordinate> >::operator[](unsigned long)


Original 2D version:
+   98.82%     0.00%  IPS_2D_perf.o  [unknown]             [.] 0000000000000000
+   98.82%     0.00%  IPS_2D_perf.o  [unknown]             [.] 0x00000000000003e8
+   98.82%     0.00%  IPS_2D_perf.o  [unknown]             [.] 0x0000000000000003
+   98.82%     0.00%  IPS_2D_perf.o  libgomp.so.1.0.0      [.] GOMP_parallel
+   96.32%    20.63%  IPS_2D_perf.o  IPS_2D_perf.o         [.] simulation::compute_force_par() [clone ._omp_fn.0]
+   32.42%    23.42%  IPS_2D_perf.o  IPS_2D_perf.o         [.] IPS_model::get_distances_ij(unsigned long, unsigned long)
+   23.85%    16.33%  IPS_2D_perf.o  IPS_2D_perf.o         [.] std::vector<coordinate, std::allocator<coordinate> >::operator[](unsigned long)
+   19.33%    17.45%  IPS_2D_perf.o  IPS_2D_perf.o         [.] IPS_model::get_force_ij_gauss(coordinate)
+   10.22%     9.25%  IPS_2D_perf.o  libm.so.6             [.] __ieee754_exp_fma
+    7.65%     5.66%  IPS_2D_perf.o  libm.so.6             [.] exp@@GLIBC_2.29