C++ multithread algorithm creating several threads running on the same CPU thread

Question

I am new to multithreading and will try to be as clear as possible. I am writing a multithreaded algorithm in C++ using the standard library's std::thread. My code compiles and runs with no errors. I have included the code for two threads. Two threads are created with ids different from the main thread's (I checked with get_id() and the Threads window in Visual Studio). The problem is that I see no gain in time, and my CPU utilization percentage stays the same, so it seems that they run on the same CPU thread.

Does this mean that threading doesn't automatically mean running on different CPU cores and threads to gain time?

Do I need to add instructions or combine this code with a multithreading library? Or is there just a mistake? Any help would be really appreciated, thanks.

double internalenergy=0;
unsigned int NUM_THREADS = std::thread::hardware_concurrency();
NUM_THREADS=2;
 vtkIdType num_cells = referencemesh_->GetNumberOfCells();
 int start_node[8];
 int end_node[8];
 int intervalle=num_cells/NUM_THREADS;
for( unsigned int i=0; i < NUM_THREADS; ++i )
{
    start_node[i] = (float)num_cells*i/NUM_THREADS;       
    end_node[i]   = (float)num_cells*(i+1)/NUM_THREADS;
}
const double* pointeurpara;
pointeurpara=&params(0);

double value1=0;
double* P_value1;
P_value1=&value1;

double value2=0;
double* P_value2;
P_value2=&value2;

std::thread first(threadinternalenergy,activemesh,referencemesh_,start_node[0],P_value1,pointeurpara);
std::thread second (threadinternalenergy,activemesh,referencemesh_,start_node[1],P_value2,pointeurpara);
first.join(); 
second.join();
double displacementneighpoint=*P_value1+*P_value2;

Here is the source code of threadinternalenergy

void threadinternalenergy(vtkSmartPointer<vtkPolyData> activemesh,
                          vtkSmartPointer<vtkPolyData> referencemesh,
                          int startcell,
                          double* displacementneighpoint,
                          const double* p)
{
 unsigned long processnumber;
 processnumber=GetCurrentProcessorNumber();
 cout<<"process "<<processnumber<<endl;

unsigned int NUM_THREADS = std::thread::hardware_concurrency();
NUM_THREADS=2;
vtkIdType num_cells = referencemesh->GetNumberOfCells();
int intervalle=num_cells/NUM_THREADS;
int endcell=startcell+intervalle;
//cout<<"start cell "<<startcell<<endl;
//cout<<"end cell "<<endcell<<endl;
int numcell_th=endcell-startcell;  // number of cells handled by this thread

  vtkSmartPointer<vtkEdgeTable> vtk_edge_tableT = vtkSmartPointer<vtkEdgeTable>::New();
  vtk_edge_tableT->InitEdgeInsertion(numcell_th*3);

  // Initialize edge table
  vtkSmartPointer<vtkEdgeTable> vtk_edge_table = vtkSmartPointer<vtkEdgeTable>::New();
  vtk_edge_table->InitEdgeInsertion( numcell_th*3 );

  cout<<"start cell "<<startcell<<endl;

 for( vtkIdType i=startcell; i < endcell; ++i )
  {   
    vtkCell* cell = referencemesh->GetCell( i );
 // cout<<"cell"<<endl;
    // Traverse edges in cell -- assuming a linear cell (line, triangle; NOT rectangle)
    for( vtkIdType j=0; j < cell->GetNumberOfEdges(); ++j )
    {
        // cout<<"edge"<<endl;
      vtkCell* edge = cell->GetEdge( j );

      vtkIdType pt0 = edge->GetPointId(0);
      vtkIdType pt1 = edge->GetPointId(1);


    // consider the edge only if at least one of its two points has been displaced
        //if(params(pt0)!=0 || params(pt1)!=0){


          if( vtk_edge_table->IsEdge( pt0, pt1 ) == -1 ){     // If this edge is not in the edge table
                vtk_edge_table->InsertEdge( pt0, pt1 );
              // access point coordinates
             // vtkPoints* listpoints=edge->GetPoints();
              double p0 [3];
              double p1 [3];
              referencemesh->GetPoint(pt0,p0);
              referencemesh->GetPoint(pt1,p1);


                // second mesh (transformed mesh)
                 double p0T [3];
              double p1T [3];
              activemesh->GetPoint(pt0,p0T);
              activemesh->GetPoint(pt1,p1T);


                    // If this edge is not in the edge table
                vtk_edge_tableT->InsertEdge( pt0, pt1 );


                //find displacement difference for 2 points sharing an edge
                double squaredDistancep0 = 
                vtkMath::Distance2BetweenPoints(p0, p0T);
                double distancep0 = sqrt(squaredDistancep0);
                double squaredDistancep1 = 
                vtkMath::Distance2BetweenPoints(p1, p1T);
                double distancep1 = sqrt(squaredDistancep1);
                double difference = abs(distancep0 - distancep1);
                difference=((difference+1)*(difference+1))-1;
                //if(difference>0.25){
                    //cout<<"grosse difference "<<difference<<endl;
                //}
                *displacementneighpoint+=difference;
                //displacementpointneigh.push_back(difference);



             }

        //}

    } 
  }
 cout<<"End of thread "<<endl;
 }

You can see this question for more information [Can multithreading be implemented on a single processor system?](https://stackoverflow.com/questions/16116952/can-multithreading-be-implemented-on-a-single-processor-system) — Borislav Kostov, Jan 24 '18 at 20:01
Microsoft Visual Studio 2012, Windows; the CPU is an i7-4790 at 3.6 GHz — M-A BouBou, Jan 24 '18 at 20:05
You can check which processor a thread is on with [GetCurrentProcessorNumber](https://msdn.microsoft.com/en-us/library/windows/desktop/ms683181(v=vs.85).aspx) (Windows). Also, keep in mind, any race conditions in the thread proc can slow down performance. — 001, Jan 24 '18 at 20:15
You should not have to do anything special. You might get this because the logic within the threads depends on a shared resource. Do you use any explicit locking? Another, less likely, explanation would be that only one core is available for work. — 500 - Internal Server Error, Jan 24 '18 at 20:32
@rustyx Good point. If `first` doesn't get `end_node` it will most likely process the entire data set (guessing). `second` will do half the data. And `first.join()` will wait for `first` to complete so there is no speed improvement. — 001, Jan 24 '18 at 20:33
About end_node: that information was missing. I compute it in threadinternalenergy, so each thread processes half the data — M-A BouBou, Jan 24 '18 at 20:37
@M-ABouBou edit your question and add the source code of `threadinternalenergy`. And better, create a [mcve] — Anton Savin, Jan 24 '18 at 20:38
I used GetCurrentProcessorNumber(), and in a 4-thread example I get 3 different numbers: 4, 4, 7, 1... I also get two consecutive "process" outputs and two "start cell" outputs, as if it were running in parallel... I am not sure how to interpret that, but is it possible that it is running in parallel? If so, why do I not see my CPU utilization change or any gain in time? — M-A BouBou, Jan 24 '18 at 20:55
FYI - 1 data point: On my older Dell with dual cores, the (Ubuntu 15.10 version of) Posix Process Local Mode Semaphore completes both "::sem_wait()" AND "::sem_post()" in about 23 ns (when no context switch). The context switch (controlled by the PPLSem_t between 2 C++ std::threads) measures about 270 ns. — 2785528, Jan 25 '18 at 00:43
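
One way to check whether two threads actually overlap, as the comments above discuss, is to time the same CPU-bound work once serially and once split across threads. The sketch below is a minimal illustration and not the asker's code: `busy_sum`, `parallel_sum`, and `time_ms` are hypothetical names, and the arithmetic loop only stands in for the per-cell VTK work.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical CPU-bound stand-in for the per-cell work in the question.
std::uint64_t busy_sum(std::uint64_t begin, std::uint64_t end) {
    std::uint64_t acc = 0;
    for (std::uint64_t i = begin; i < end; ++i) {
        acc += (i * i) % 7;
    }
    return acc;
}

// Splits [0, n) into num_threads contiguous chunks, one std::thread per chunk,
// and adds up the per-thread partial results after all threads have joined.
std::uint64_t parallel_sum(std::uint64_t n, unsigned num_threads) {
    assert(num_threads > 0);
    std::vector<std::uint64_t> partial(num_threads, 0);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::uint64_t begin = n * t / num_threads;
        const std::uint64_t end = n * (t + 1) / num_threads;
        workers.emplace_back([&partial, t, begin, end] {
            partial[t] = busy_sum(begin, end);  // one write per thread
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    std::uint64_t total = 0;
    for (auto p : partial) {
        total += p;
    }
    return total;
}

// Wall-clock duration of a callable, in milliseconds.
template <typename F>
double time_ms(F&& f) {
    const auto t0 = std::chrono::steady_clock::now();
    f();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

For a large `n` in an optimized build, comparing `time_ms` around `busy_sum(0, n)` with `time_ms` around `parallel_sum(n, 2)` gives a rough indication (keep the returned value so the compiler cannot discard the loop). If this stand-in speeds up with two threads but the real worker does not, the cause is more likely in what the worker does than in how the threads are scheduled.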

Jive Dadson · Answer 1 · 2018-01-24T20:52:46.520

The threads probably run on different cores. That does not necessarily mean the program will be faster. It could possibly run much slower than with one thread. The threads must do enough work concurrently to justify the overhead of thread-creation and synchronization. There are lots of pitfalls. A likely problem is that the threads are sharing (or "false sharing") memory too often.
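
As a minimal sketch of the sharing pitfall mentioned above, the two workers below compute the same sum. The first writes through the output pointer on every iteration, the way the question's worker does with `*displacementneighpoint += difference`; the second accumulates in a local variable and stores once. All names here (`accumulate_through_pointer`, `accumulate_locally`, `run_two_threads`) are hypothetical, and whether the two output doubles really share a cache line depends on the compiler and platform.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Writes through the output pointer on every iteration. If the two threads'
// outputs sit on the same cache line, each store can disturb the other core.
void accumulate_through_pointer(const std::vector<double>& values,
                                std::size_t begin, std::size_t end,
                                double* out) {
    assert(out != nullptr);
    for (std::size_t i = begin; i < end; ++i) {
        *out += std::fabs(values[i]);
    }
}

// Keeps the running total in a local variable and stores it once at the end.
void accumulate_locally(const std::vector<double>& values,
                        std::size_t begin, std::size_t end,
                        double* out) {
    assert(out != nullptr);
    double local = 0.0;
    for (std::size_t i = begin; i < end; ++i) {
        local += std::fabs(values[i]);
    }
    *out += local;
}

// Hypothetical driver: runs the given worker on each half of the input in its
// own thread and returns the combined result.
template <typename Worker>
double run_two_threads(const std::vector<double>& values, Worker worker) {
    double results[2] = {0.0, 0.0};  // adjacent doubles, likely one cache line
    const std::size_t mid = values.size() / 2;
    std::thread first(worker, std::cref(values), std::size_t(0), mid,
                      &results[0]);
    std::thread second(worker, std::cref(values), mid, values.size(),
                       &results[1]);
    first.join();
    second.join();
    return results[0] + results[1];
}
```

Timing both variants on the real data would show whether this matters in practice; the local-accumulator form is cheap to try in any case.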

Yes, one main idea behind threading is to "do concurrency." But that can be done with a CPU that only has one core! Or it can fail to do it well with lots of cores.

I wrote a system for controlling a robot that handled computer wafers. It ran on a CPU with one core, yet it depended strongly on concurrency. The concurrency resulted from having devices other than the CPU running concurrently with it, e.g. a servo amplifier and a network card. One high-priority thread dealt exclusively with managing the servo via the network. Other threads used the slack time to plan trajectories, communicate with a host computer, etc. When the servo-control thread woke up because the network card had a response from the servo, it immediately took over the CPU and did its thing.

There was a talk this year at CppCon titled "When a microsecond is an eternity". The guy programs one of those sneaky stock-trade systems. All that matters to him is how fast he can deal with the network that connects him to the market. To get the throughput he needs, he uses a 4-core machine with three cores disabled. Why? Because he can't find a good enough single-core machine.

I also wrote a class that used multi-threading to do some calculations on a multi-core machine. It failed miserably. Some tasks are just not suited to divide-and-conquer.

Thanks. About the memory sharing: the threads do all work with the same data, activemesh. Does that create complications that increase the run time? Should I avoid it? As for the time, one call normally takes between 1 and 2 seconds (it is used in an optimization that runs thousands of iterations). Is 1 second too short to benefit from multithreading? — M-A BouBou, Jan 24 '18 at 21:14
When two threads access the same cache line with at least one of them writing to it, the hardware or OS has to scramble to get them into agreement. That can take 30 times as long as a simple cache access. I can't give a lesson here. Get a book written by someone who knows a lot more about it than I do. — Jive Dadson, Jan 24 '18 at 21:23
OK, the shared data is only read, but maybe something similar still happens... I will try to find more information — M-A BouBou, Jan 24 '18 at 21:29
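
To illustrate the distinction raised in these comments: threads that only read the same data do not invalidate each other's cache lines, whereas writes to neighbouring memory can. The sketch below is an illustration under stated assumptions rather than the asker's code. It shares one read-only input between two threads and gives each thread its own output slot aligned to an assumed 64-byte cache line. `PaddedSum`, `kAssumedCacheLine`, and `sum_of_squares_two_threads` are hypothetical names, and `alignas` needs a compiler that supports it (older compilers such as Visual Studio 2012 may need a vendor-specific alignment attribute instead).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Assumed cache-line size; 64 bytes is common on x86 but not guaranteed.
constexpr std::size_t kAssumedCacheLine = 64;

// One output slot per thread, aligned (and therefore padded) so that two
// slots in an array never fall on the same assumed cache line.
struct alignas(kAssumedCacheLine) PaddedSum {
    double value = 0.0;
};

double sum_of_squares_two_threads(const std::vector<double>& shared_input) {
    PaddedSum slots[2];
    // The slots are expected to start at least one assumed cache line apart.
    assert(reinterpret_cast<std::uintptr_t>(&slots[1]) -
               reinterpret_cast<std::uintptr_t>(&slots[0]) >=
           kAssumedCacheLine);

    const std::size_t mid = shared_input.size() / 2;

    // Both threads read shared_input (read-only); each writes only its slot.
    auto worker = [&shared_input](std::size_t begin, std::size_t end,
                                  PaddedSum* slot) {
        for (std::size_t i = begin; i < end; ++i) {
            slot->value += shared_input[i] * shared_input[i];
        }
    };

    std::thread first(worker, std::size_t(0), mid, &slots[0]);
    std::thread second(worker, mid, shared_input.size(), &slots[1]);
    first.join();
    second.join();
    return slots[0].value + slots[1].value;
}
```

Padding trades a little memory for independence between the writers; accumulating in a local variable and storing once at the end is an alternative that avoids the repeated writes altogether.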
