Some ramblings and ruminations, in the moment.

Goodbye Berkeley

The brief stretch of unemployment between graduation and the start date of my first full-time job out of college is basically purgatory. With the graduation ceremony not happening until the Spring after I finished, it feels even more so. There are plenty of loose ends, with research, club involvement, and connections being the biggest ties I still have to campus. But as the dust settles on this chaotic phase of my life, I am slowly realizing that the last few years have been about setting the default for continued success in both personal and professional matters.

Just a quick reflection on some major learnings and takeaways.

Academia vs Industry (aka the pursuit of impact)

I struggled to narrow down my true interests in terms of career path. It’s definitely a very naive and perhaps privileged way of thinking – that one can prioritize making a meaningful, positive impact. Those are just words, and while it’s possible, it’s important to think of goals in varying levels of granularity. Of course impact might be the eventual goal, but there are smaller and more immediate ways to achieve impact that may be somewhat collinear in trajectory.

Somewhere along the line, perhaps through idolization of professors and other academics, I found myself on the grad school track. Especially in the blockchain space I had so suddenly found myself immersed in, the community is stratified by product and technology, whose founders oftentimes reach demi-god-like status. My reverence for certain professors had reached similar heights. I wanted to make a big impact by focusing on big-picture topics in academic research – somewhat overlooking some of the prerequisite engineering rigor.

After seeking advice from others, I eventually sorted out my prioritization of applied versus pure sciences. While we need all types of thinkers, I came to understand that my strengths lie in work with more immediate, tangible impact. I will use my academic training, combined with engineering rigor, to solve real-world problems. Grad school will be on the roadmap eventually, but for now I'd rather see for myself the problems that need solving. If I later decide those problems can only be solved with the leverage of academia and graduate studies, then so be it – chase that impact.

Mindfulness and parallelization of the brain

Coming into college, I knew immediately that there would be a billion things to do and only time for a few of them. Like loading up too large a plate at Thanksgiving dinner, it's important to pace yourself. Frontloading all your responsibilities and then underperforming or underdelivering is worse than excelling at only a few things. A few more takeaways here:

  • Avoid context switching
  • Attempt tasks on a rolling basis

In other words, prioritize, order, and pipeline. In my life, the implications are that my daily TODO lists are not too long, ordered, and consider future days’ TODOs as well (e.g. a lower granularity weekly/monthly/yearly TODO). This also plays directly into my personal definition of practical mindfulness.

Others

  • Take nothing for granted
  • Don’t mistake motion for progress
  • Make the most of where you are

Time is fake

"Time has been invented in the universe so that everything would not happen at once."

Distributed systems

It all started with researching distributed systems safety and the tools used to test it, namely Jepsen and WANem. The concept of time in the coordination of systems has always interested me, especially in distributed systems and in its extension to the more philosophical.

The universe is thrashing

When your computer hits a bottleneck in compute, everything else seems to slow down. That’s because every process on your computer is sharing time.

When your computer hits a bottleneck in memory, everything also seems to slow down. That's due to constant paging in and out of memory - thrashing.

These two scenarios are bottlenecks in the handling of information. It's clear how memory (and the memory hierarchy) represents information. Compute can also represent information, just spread over time. And the common way to handle information overload is to slow down time: to keep the flow of new information steady as we encounter high-density pockets of it. For computer hardware, there's a physical bottleneck - an upper bound on the information throughput that is processable.

The universe also has this inherent bottleneck - a bottleneck on the rate of information processing - the speed of light.

As you get close to a black hole, the immense mass bends spacetime, and general relativity suggests that time slows down for you. Mass is information, at least according to Maxwell. In other words, in an attempt to process so much information, the universe is thrashing.

But this phenomenon occurs only for the observer, which begs the question: is the universe thrashing, or is the observer's observation (or the observer themselves) thrashing? Is there even such a distinction?

Exploring Space(time?)

It was a peaceful summer evening sitting with the family and listening to bossa nova on Spotify. But my mind once again drifted to thoughts of the essence of space and time. The music playing through the speakers has itself no understood notion of time. Yet the delayed playback of the music brought about its melodies and harmonies. Downloaded music is just a collection of bits, existing all at once in space. But by interpreting it as a data stream with a particular bitrate, imbuing it with an artificial notion of change and time, we’re able to enjoy the music fully as humans. Imagine being handed a CD or USB with music (or anything else) loaded on it and being able to fully understand it simply by glancing (or sensing?) it, perhaps akin to static analysis.

In any case, the encoding of data into a physical form separable from time reminded me of the work of the Roman philosopher Lucretius. He held that the universe consisted of atoms, and that a random atomic swerve randomized their otherwise deterministic movement. Movement is of course closely tied to the notion of time, for there is no motion without the passage of time. Perhaps, then, we could view motion as an artifact of the limited processing capacity of the human brain: needing to invent notions of time and change in order to process an otherwise static universe. Perhaps all atoms, as Lucretius suggests, exist eternally, and their movement, caused by natural trajectory and atomic swerve, is an artifact of human presence and intervention. The human brain's processing of the universe inadvertently invented time.

The Big Picture

I’ve been stuck in the weeds recently learning all these specific technologies, and while there’s merit in that, sometimes I need to take a breather and explore the big picture. Everything needs a clear direction and in that sense the big picture of systems and the applications and workloads they support is this guiding north star.

Scaling Distributed Systems

Design

Fixing a bug after the fact takes on the order of 10 to 100x longer than accounting for it in the design and working a solution around it.

You also have to determine at design time which dimensions you might need to scale along. There's also the question of whether you solve issues up front in the design or later as they arise, after hitting some scaling wall.

Microservices

Microservices is a buzzword, and there's no real cutoff between services and microservices in terms of compute, storage, or other metrics. Instead, it's more valuable to think about pushing down the scale of any service you write. Smaller services are easier to swap in and out: if the initial design was flawed or a better way to solve a problem is discovered, small services provide the modularity needed to swap out individual parts of a system.

There’s a bit of a paradox here:

  • Microservices increase scalability by modularity, which means loose coupling
  • Scalability can be achieved by tight coupling of individual components

Moral of the story is that things will need to be titrated in production.

What Picture? Tangent on how to make impact

Note: A bit of a tangent, and me jotting down a bunch of ideas/ranting:

Lamport implies at various points in his Turing Award paper “The Computer Science of Concurrency: The Early Years” that there's a huge gap between engineering solutions to concurrency issues and the study of concurrency in computer science. Dijkstra studied and published work on fault tolerance and self-stabilization in 1974, without having a direct application of his work in mind. Perhaps he (and others) anticipated fault tolerance becoming a major field of study within computer science and thus chose to pursue it, but it's hard to know for certain. It's more worth noting (personally) that early systems implemented solutions to concurrency and fault tolerance in distributed systems from the engineering perspective, without consulting the formalism of scientific work such as Dijkstra's.

As Lamport puts it, “A survey of fault tolerance published in 1978 does not mention a single algorithm, showing that fault tolerance was still the province of computer engineering, not of computer science.” [2]

That is to say that there’s an obvious tradeoff between engineering and science-ing solutions – in many aspects. One is driven by the need for a solution so as to solve an issue with an application or domain focus. The other is driven by the pure scientific pursuit. Of course there might be some driving application or foresight/insight, but science often abstracts away the usages of its findings. Papers that do not are often either whitepapers describing something that has already been built (in which case it might as well be a high level system spec) or pressured somehow (e.g. by source of funding) to append notes on future work or applications.

In terms of the cycle of impact-making: from my perspective, when problems first arise, a solution is hacked together first. Unless it's a very fundamental problem, it's often not worth paying the cost of all the abstraction overhead and designing from the top down again (scientifically). Problems are run into in industry or other engineering circumstances and should be dealt with first in that context. If the solution keeps breaking, or similar problems keep arising, then the best way to amortize the time spent fixing them is to subject the nature of the problem to more rigorous scientific study. In general, we work up the stack: assuming problems arise in engineering and not just in scientific ideation, impact is first made facing the application directly, and later escalated to pure scientific pursuit if necessary.

Adopting a science-only or engineering-only policy seems suboptimal, since you'll always be either upper- or lower-bounding the abstraction layers you work on. Instead, pick up the skill set to work in between. That raises the question of whether the workflow should be communication-bound or compute-bound: communication in terms of ferrying work and results between engineering and science workflows, and compute in terms of the actual time and energy spent on engineering and science themselves. It's definitely hard to translate scientific work to the real world. Academics often deal in theory, absolutes, and assumptions, and it's at implementation time that ideas meet reality. Perhaps a happy middle ground to focus on would be academically-focused engineering.

References

  1. Hidden Scaling Issues of Distributed Systems
  2. The Computer Science of Concurrency: The Early Years

Exploring Parallel Particle Simulation

Summary

The main problem is to simulate the interactions of particles within a finite space. Particles exert forces of repulsion on each other and can also bounce off the walls of the simulation grid.

nbody simulation with cutoff

Here, we present iterative improvements on a toy particle simulation. As with the previous post on parallelization, all results were computed on an Intel Xeon E5-2698 v3 2.30 GHz CPU, this time across multiple processors. Graphs show performance as per-processor speedup against $p$ processors, and also simulation time against the number of particles and the number of processors allocated.
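For reference, since both speedup and scaling efficiency come up below (these definitions are standard, not quoted from the assignment): speedup on $p$ processors is $S(p) = T_1 / T_p$ and parallel efficiency is $E(p) = S(p) / p$, where $T_1$ is the single-processor simulation time and $T_p$ the time on $p$ processors. Strong scaling holds the problem size fixed as $p$ grows, while weak scaling grows the problem size in proportion to $p$.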

We present three main algorithms:

  • Serial algorithm
  • Shared memory algorithm, OpenMP
  • Distributed memory algorithm, MPI

Serial Algorithm

We can reduce the running time complexity of our particle simulation to O(n) if for every particle we only consider the particles nearby. To formalize this, we split the simulation grid into bins and assign each particle to a bin based on its location in the grid. Then, for each particle in each bin, we consider the particles in its own bin as well as those in adjacent bins. So in total, there are 9 bins to consider. This reduces the standard O(n^2) running time complexity to O(n) due to the constant 9 bins.

To optimize the number of bins, we need to consider the effective area around each particle that applies the repulsive force. This is defined as cutoff. The dimensions of each bin should thus be a function of cutoff.
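To spell out the complexity argument (a back-of-the-envelope step, not in the original text): the benchmark keeps particle density constant, so a bin whose side is on the order of cutoff holds $O(1)$ particles on average. Each particle therefore interacts with at most $9 \cdot O(1) = O(1)$ candidates, and one time step costs $n \cdot O(1) = O(n)$ work instead of the $O(n^2)$ of the all-pairs approach.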

In our implementation, we define the datatype bin_t as follows, where a bin_t is a vector of particles:

typedef std::vector<particle_t> bin_t;

We choose to store a vector of particles rather than a vector of particle pointers due to concurrency issues that arise when we parallelize.

Before the main simulation loop, we place all particles into their respective bins based on their starting x and y coordinates with some simple geometry:

//
// Puts particles into vector of all bins
//
void putIntoBins (particle_t* particles, vector<bin_t>& allBins, int n) {
    gridSize = sqrt(n * DENSITY);
    numBins = int(gridSize / binSize)+1; 
    allBins.resize(numBins * numBins);
    for (int i = 0; i < n; i++) {
        int x = int(particles[i].x / binSize);
        int y = int(particles[i].y / binSize);
        allBins[x*numBins + y].push_back(particles[i]);
    }
}

When considering the bins at the edge of the simulation grid, we do not have to consider the full 9 bins, but rather only 4 or 6 bins, depending on whether the bin in question is in a corner or on an edge. There are a few approaches to handling this:

  1. Using conditionals to check whether the bin is a corner or edge and then performing the appropriate simulation steps
  2. Padding the simulation grid with one layer of empty bins and changing the bin iteration from [0,n) to [1,n-1), where n is the grid dimension

We chose to pursue the first approach so as to avoid the additional memory usage for large simulations, since the padding grows linearly with the dimensions of the simulation grid.

After simulating and applying forces to each particle, we move the particles to their new bins for the next simulation step, recycling the bins data structure rather than allocating a new one.
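The re-binning step itself is not shown in the post; below is a minimal sketch of what it might look like, reusing bin_t and the globals (binSize, numBins) from above (the actual code may differ):

// Move every particle to the bin matching its updated coordinates,
// recycling the existing allBins vector instead of allocating a new grid.
void rebinAll(std::vector<bin_t>& allBins) {
    bin_t moved;                      // temporary holding area for all particles
    for (bin_t& bin : allBins) {
        moved.insert(moved.end(), bin.begin(), bin.end());
        bin.clear();                  // reuse the same bins for the next step
    }
    for (particle_t& p : moved) {
        int x = int(p.x / binSize);
        int y = int(p.y / binSize);
        allBins[x * numBins + y].push_back(p);
    }
}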

Grid corner checking

In our main simulation loop, we consider all 8 surrounding bins around a single bin, but use conditionals to check for bounds. This is implemented as follows:

  int dx[3] = {-1, 0, 1};
  int dy[3] = {-1, 0, 1};

  for (int dx_index = 0; dx_index < 3; dx_index++) {
      for (int dy_index = 0; dy_index < 3; dy_index++) {
          if (i + dx[dx_index] >= 0 
              && i + dx[dx_index] < numBins 
              && j + dy[dy_index] >= 0 
              && j + dy[dy_index] < numBins) {
            ... // apply forces
          }
      }
  }

The arrays dx and dy represent the x and y coordinates of the surrounding bins, and we use simple bound checking with respect to the simulation grid in the innermost conditional.

Additional bin size considerations

We observed that bin size should be a function of the cutoff distance of each particle. Initially, we reasoned that if we make the dimensions of each bin cutoff x cutoff, then in the worst case each particle would only affect the 9 bins previously described. We tested various bin sizes:

serial_binsize

All simulation times are measured in seconds on the y-axis. We noted in implementation that a bin size of cutoff * 2 consistently performed better than the previously speculated size of simply cutoff. We hypothesize this is due to the cost of potentially checking 9 bins compared to the relatively lower cost of our conditional checks. If the bin size is cutoff * 2, there is a good chance that a particle's force field of radius cutoff will not cross into all 9 bins. A bin size of cutoff * 2 capitalizes on this while avoiding bins so large that memory performance suffers.

Shared Memory Algorithm, OpenMP

We used OpenMP to parallelize the particle simulation and mostly kept the algorithm the same as the serial one, with some minor adjustments for synchronization. First, we use the clause private(dmin) so that each thread has its own copy of the dmin variable. Similarly, for davg and navg, we use #pragma omp for reduction(+:navg,davg) in the simulation loop, noting that apply_force in common.cpp aggregates navg and davg using addition.
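To make the shape of those pragmas concrete, here is a rough sketch (names like NSTEPS, numBins, navg, davg, and dmin come from the assignment skeleton; the exact loop the reduction clause is attached to in our code may differ):

#pragma omp parallel private(dmin)
for (int step = 0; step < NSTEPS; step++) {
    // worksharing loop over bins; navg and davg are accumulated by addition,
    // so they are declared as a sum reduction
    #pragma omp for reduction(+:navg,davg)
    for (int b = 0; b < numBins * numBins; b++) {
        // apply forces for the particles in bin b against its own bin
        // and the neighboring bins (updates dmin, davg, navg)
    }
    #pragma omp barrier  // all forces applied before anyone moves particles
    // move particles and re-bin them (see the synchronization section below)
}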

Synchronization

We also introduce an additional data structure, bin_t moved_particles, to keep track of the particles that moved out of their bins. We first tried giving each thread its own bin_t moved_particles and collecting the results thread by thread at the end, rather than using a lock around a shared variable.

parallel_synch

The above graph shows the results we achieved when using a separate bin_t moved_particles per thread and combining them at the end.

In consideration of synchronization, we then made it such that only the master thread recalculates the bins for the moved particles, and we put a lock around the shared moved_particles variable. This gave a performance boost compared with the previous method. Also notable is the #pragma omp barrier used as an explicit barrier for all threads at each step of the main simulation loop.
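A rough sketch of that guarded collection (assumed names: local_moved is each thread's private list of particles that left its bins, and move_lock is an omp_lock_t initialized once at startup; a #pragma omp critical section would work equally well):

// Inside the parallel region, each thread appends its movers to the shared list.
omp_set_lock(&move_lock);                      // guard the shared vector
moved_particles.insert(moved_particles.end(),
                       local_moved.begin(), local_moved.end());
omp_unset_lock(&move_lock);

#pragma omp barrier                            // wait for every thread's movers

#pragma omp master                             // only the master re-bins them
{
    for (particle_t& p : moved_particles) {
        int x = int(p.x / binSize);
        int y = int(p.y / binSize);
        allBins[x * numBins + y].push_back(p);
    }
    moved_particles.clear();
}
#pragma omp barrier                            // all threads see the new bins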

Results

Different values of binsize, Parallel implementation

As with our serial implementation of the simulation algorithm, we considered various bin sizes.

parallel_binsize

The two plots above show the simulation times for increasing numbers of particles, for both the serial and parallel implementations. We see that in both implementations, a binsize equal to cutoff * 2 performs best, which is why we chose this value to be the dimensions of the bins in our implementations.

Serial vs Parallel Runtimes, log-log scale

serial_loglog

parallel_loglog

The runtimes of both the serial and parallel implementations appear as straight lines on the log-log scale, consistent with the linear O(n) growth we expect, since the binning algorithm lowers the computational complexity of the simulation from O(n^2) to O(n).

Num_threads vs Runtimes, Parallel

numThreads

Performance increases as we add more threads, with an optimum seemingly reached at num_threads = 32, which achieves a speedup factor of around 19.3.

speedup_plt

Above is the speedup plot, with linear speedup represented by the orange line. After around 20 threads, the observed speedup no longer follows the linear pattern and levels off, suggesting that at this point the overhead of parallelization (e.g. communication and synchronization) starts to offset the gains from adding threads.

Design choice note

Originally, in designing our parallel algorithm, we had envisioned using nested parallelism. Instead of iterating through all bins and, for each bin, checking the 9 bins around it, we would parallelize further: split the entire simulation grid into tiles (much like the block tiling from matrix multiply), each tile operated on by a thread, and then have each of those threads spawn a thread for each of the 9 surrounding bins, since applying the force from one particle to all others can be parallelized so long as we fork and join per particle per bin. However, we quickly discovered that this kind of nested parallelism is not recommended due to oversubscription: the number of threads would far exceed the number of cores, increasing the parallel overhead.

Distributed memory algorithm, MPI

The key observation in reducing the running time complexity from quadratic to linear was to leverage data locality and place nearby particles into bins. Each particle then only needs to apply forces to particles in the 8 surrounding bins and in the bin it belongs to. Under the shared memory paradigm, we used OpenMP to write a multithreaded program in which each thread applied forces on a subset of bins, aggregated all forces in a critical section, and then moved the particles according to Newton's second law. This was part 1.

To leverage multiple processors connected through some interconnection network, we need to redesign the particle simulation algorithm for distributed memory. Our main decision was to consider communication overhead in light of computational running time complexity.

NOTE: In this report, we use the terms root processor and master processor interchangeably; both refer to the main processor with rank == 0. Also used interchangeably are the terms processor and process.

Geometry & Row Abstraction

We consider simulation on a per-row basis, such that each processor is responsible for applying forces from all particles in a set number of rows of bins. We choose the row abstraction (e.g. as opposed to quadrants) because it is easy to identify the neighbors of a row. Because simulation of a force from a particle requires at most the 8 bins surrounding it and also the bin it is contained in, we need to consider all neighbors of a bin. By considering rows, we only have to worry about vertical neighbors, as horizontal neighbors are by definition included in the row.

If by formulation each processor receives a certain number of rows of bins, we must consider the particles that move into and out of the scope of the processor’s simulation.
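A minimal sketch of how each rank might derive its start and end row indices (the exact partitioning in the original code, e.g. how leftover rows are assigned, may differ):

// Partition numBins rows of bins across n_proc ranks as evenly as possible.
// Rank 'rank' owns rows [start, end); leftover rows go to the lower ranks.
int rowsPerProc = numBins / n_proc;
int leftover    = numBins % n_proc;
int start = rank * rowsPerProc + (rank < leftover ? rank : leftover);
int end   = start + rowsPerProc + (rank < leftover ? 1 : 0);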

Global & Local Scope of Calculations

Initially, all particles are broadcast to all processors, which then group the particles into bins deterministically. All processors start with a trivially global view of the simulation. However, each time step makes the processors' views diverge. Each processor simulates only the rows allocated to it, but also considers the particles in the rows directly above and below (i.e. its vertical neighbors), because those particles have the potential to move into or out of the rows in scope.


if (rank != 0) {
  for (int j = 0; j < numBins; j++) {
    bin_t& temp_bin = allBins[(start) * numBins + j];
    for (int i = 0; i < temp_bin.size(); i++)
      toMove.push_back(temp_bin[i]);
    temp_bin.clear();
  }
}

...

//Delete neighbours as we will get them in the new timestep
if (rank != 0) {
  for (int j = 0; j < numBins; j++) {
    bin_t& temp_bin = allBins[(start - 1) * numBins + j];
    temp_bin.clear();
  }
}


In the above code snippet, we clear out the bins in the row directly above the start row index. We also move all particles in the row at the start index itself, since they have the potential to move outside of the simulated rows, i.e. to another processor.

Distributed Memory Algorithm

We implement the following distributed memory algorithm using MPI functions in C++. All calls that distribute, gather, or scatter particles amongst processors are blocking, as discussed in the next section.

  • Broadcast all particles to all processors, bearing large communication overhead to save later
  • All processors place particles into bins
  • All processors calculate start and end row indices, to define the rows of bins that each processor will be simulating the particles of
  • Simulate for N timesteps:
    • All processors compute forces for all particles for all bins in their rows from start to end row indices
    • (davg, navg, and dmin must be reduced when flag is set)
    • All processors calculate intra and inter-processor particle movement (with respect to the rows of bins each processor simulates)
    • All processors perform intra-processor moves
    • All processors consider vertical neighbors (top and bottom rows) and empty them
    • All processors send inter-processor moves, as well as information about the vertical neighbors, and gather at the root process
    • Root process gathers all information about moved particles between processors, and scatters back to all processors
    • All processors receive particles from root processor about inter-process moves and re-bins these inbound particles
    • All processors have new local state of particles and can continue with next simulation step

NOTE: while each processor starts with a copy of all particles, the majority of each processor's view quickly becomes stale. Each processor only keeps up to date the narrow band of rows it simulates, plus its vertical neighbors. All other particles in other bins/rows are inconsistent and do not affect the state of the simulation. Only the root process, after gathering particle movements, gains a global view of the entire simulation.

MPI Usage

We use the standard MPI message passing interface to communicate between processors. Our general pattern is to handle the majority of the global reduction and aggregation logic in a master (root) process, and to distribute state to be worked on symmetrically by the other processors. We use the MPI datatype PARTICLE, defined as six consecutive MPI_DOUBLEs, since its serial analog consists of six C doubles representing position, velocity, and acceleration in both the x and y directions.


MPI_Datatype PARTICLE;
MPI_Type_contiguous( 6, MPI_DOUBLE, &PARTICLE );
MPI_Type_commit( &PARTICLE );


Initially, particles are initialized by the root process, since init_particles is not deterministic. The particles are then broadcast to all other processors using MPI_Bcast(particles, n, PARTICLE, 0, MPI_COMM_WORLD) with the communicator MPI_COMM_WORLD. Once all processors receive the broadcast, they place all particles into bins. This initial binning is deterministic and can be replicated identically on every processor:


if( rank == 0 )
    init_particles( n, particles );

// broadcast all particles from root and put them into bins
MPI_Bcast(particles, n, PARTICLE, 0, MPI_COMM_WORLD);
putIntoBins(particles, allBins, n);


In the main simulation loop, when calculating the davg, dmin, and navg statistics, it is important to reduce the variables. This is done using the following code. Calls to MPI_Reduce use MPI datatypes and operators and reduce into the root processor 0:


if( find_option( argc, argv, "-no" ) == -1 )
{
  MPI_Reduce(&davg,&rdavg,1,MPI_DOUBLE,MPI_SUM,0,MPI_COMM_WORLD);
  MPI_Reduce(&navg,&rnavg,1,MPI_INT,MPI_SUM,0,MPI_COMM_WORLD);
  MPI_Reduce(&dmin,&rdmin,1,MPI_DOUBLE,MPI_MIN,0,MPI_COMM_WORLD);
  if (rank == 0){
    //
    // Computing statistical data
    //
    if (rnavg) {
      absavg +=  rdavg/rnavg;
      nabsavg++;
    }
    if (rdmin < absmin) absmin = rdmin;
  }
}


The last major usage of MPI comes when considering the mechanism by which processes receive particles that move between bins. When particles move between bins, they can either move to a bin belonging to the same processor, or to a bin belonging to a neighboring processor.

Intra-processor movement can be handled locally without MPI. Inter-processor movement, however, is trickier, especially because the MPI collectives involved need to know the message sizes up front. Therefore, when communicating inter-process particle moves, we must first communicate the number of particles moving out of each process; only then does the receiving side know how much data to expect.


int sendSize = toMove.size();
int rcvSize[n_proc];

// master process collects sizes to determine offsets
MPI_Gather(&sendSize, 1, MPI_INT, rcvSize, 1, MPI_INT, 0, MPI_COMM_WORLD);  

The array rcvSize records the number of inter-process moves into the processor represented by the index of the array. In other words, processor i can expect rcvSize[i] inbound inter-process moved particles. Each processor records its own sendSize since only it knows the number of particles from its local simulation that move out of the current processor. We use a call to MPI_Gather to gather all processors’ values of sendSize into the array rcvSize in the root processor.

The goal is to scatter variable information amongst processors, depending on their simulation space, so we have to calculate offsets and sizes to pass into the scatter MPI call:

int offset[n_proc];
int totalSize = 0;
bin_t inboundParticles;

if (rank == 0) {
    offset[0] = 0;
    for (int i = 1; i < n_proc; ++i) {
        offset[i] = offset[i-1] + rcvSize[i-1];
        totalSize += rcvSize[i - 1];
    }
    totalSize += rcvSize[n_proc-1];

    inboundParticles.resize(totalSize);
}

Offsets are defined by the number of particles inbound before it, which is defined by the previous offset plus the previous size: offset[i] = offset[i-1] + rcvSize[i-1]. The counter totalSize exists to track the size of all received particles, which is why resize is called at the end. Also note that this subroutine is only run by the root process, which is rank 0. Afterwards, the root process calls MPI_Gatherv with the calculated arguments:


// after determining offsets, master process collects the particles to be moved
MPI_Gatherv(toMove.data(), sendSize, PARTICLE, 
    inboundParticles.data(), rcvSize, offset, PARTICLE, 
    0, MPI_COMM_WORLD);

After the root process gathers all the particles to be moved between processes, it must scatter that information back out. We do this the same way as before: scatter the sizes first, then the particle data. Before doing so, the root flattens all outbound particles into a single vector and computes offsets and sizes as before. After this flattening, we call MPI_Scatter and then MPI_Scatterv.
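A sketch of that flattening step on the root (outbound[i] is a hypothetical per-destination list holding the particles the root has determined belong to rank i; rcvSize and offset are reused as the counts and displacements for the scatter below):

// Root only: flatten per-destination vectors into one contiguous buffer
// so MPI_Scatterv can hand each rank its slice.
bin_t singleVector;
if (rank == 0) {
    for (int i = 0; i < n_proc; ++i) {
        rcvSize[i] = outbound[i].size();                        // count for rank i
        offset[i]  = (i == 0) ? 0 : offset[i - 1] + rcvSize[i - 1];
        singleVector.insert(singleVector.end(),
                            outbound[i].begin(), outbound[i].end());
    }
}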


MPI_Scatter(rcvSize, 1, MPI_INT, &sendSize, 1, MPI_INT, 0, MPI_COMM_WORLD);
...
//ready to scatter with offset sizes
MPI_Scatterv(singleVector.data(), rcvSize, offset, PARTICLE, 
    foreignParticles.data(), sendSize, PARTICLE, 0, MPI_COMM_WORLD);

Above, we scatter the per-process sizes, and in the subsequent call we scatter the particle data itself from the root process to all other processors. After these MPI calls, every processor has in its memory the particles that have moved into its rows (i.e. foreignParticles). Each then puts these particles into bins and continues to the next simulation step. This works because all the MPI scatter calls are blocking.

![normal](https://rustie.xyz/res/img/normal.png){: .center-image }

After implementing MPI distributed memory algorithm, we achieved simulation times as seen in the figure above. Results were run on Cori using 16 processors. In varying the number of processors, we saw the below speedup. We ran the benchmark with 100,000 particle simulation:

![num_proc_linegraph](https://rustie.xyz/res/img/num_proc_linegraph.png){: .center-image }

The simulation time is generally inversely proportional to the number of processors allocated to the distributed memory algorithm. There seems to be a point where computation overhead and communication overhead trade dominance, around 8 and 16 processors. There is a similar occurrence early on with smaller numbers of processors too. Anomalies in this data are explained in the next section.

Considerations

Communication overhead

One key factor in achieving speedup in the particle simulation is weighing the communication overhead against the computational complexity. Our initial assumption was that communication dominates computation: the more computation each process does before gathering back at the root, the less the communication overhead counts against the overall speedup.

The cost of communication is also considered with respect to the scope of each processor's simulation. It would be very costly to broadcast the updated state of all particles to all processors at every time step. In our implementation, we only broadcast all particles to all processors once, before the simulation starts. Afterwards, each process is only responsible for simulating and communicating particles from a narrow band of rows, minimizing the communication between processors.

The consideration of intra and inter-processor particle movement is one of the biggest factors affecting communication overhead, since particles are only communicated when they move between bins owned by different processors. Again, this only affects algorithmic performance when our assumption that communication overhead dominates compute overhead is true.

An optimization we tested, to trade computation against communication, was to assume that all particles in the edge rows were going to move out of the row. Originally, we had only scattered the particles that we had calculated to actually move to a bin outside a processor's rows, spending extra computation in order to reduce communication later on.

opt_vs_normal

Conclusion

After testing various MPI implementation methods and schemes, we were able to achieve a strong scaling efficiency of 0.26 and a weak scaling efficiency of 0.24. Additional efficiency gains can be observed when combining a multi-process scheme such as MPI with a multi-threaded scheme such as OpenMP, as seen in part 1 of our report.

Especially notable in this project: after weighing communication versus computation overheads and plotting the simulation results, we concluded that for small simulations, computational overhead dominates communication overhead. This could partly be the result of scheduling and traffic on Cori, as there was unusual difficulty getting jobs scheduled and run recently. It is not a general rule for all systems, but rather the outcome of testing many times on Cori.

Future work will explore further speedup methods such as using GPU to accelerate computation.

Exploring Parallel Matrix Multiply

A superset analysis of work from CS267, a graduate computer science course on parallel computing. Survey of DGEMM (Double Precision General Matrix Multiplication) speed-up boosts on the Cori supercomputer. All results were calculated on a single core, single thread of an Intel Xeon E5-2698 v3 2.30 GHz CPU. Graphs show performance as percentage of maximum CPU utilization versus matrix dimensions. This was the first assignment of its kind that I've done, this being my first graduate-level course. I find the course material extremely enjoyable: almost as enjoyable as seeing the code we write run faster and faster.

Naive

Throughout this study, we focused on maximizing the percentage of peak CPU utilization achieved by the standard 2n^3-operation matrix multiplication. Algorithms with a lower operation count (i.e. a different running time complexity) were not considered.

Blocking

We used the provided dgemm-blocked.c matrix multiply with a blocking implementation as a starting point. The first task was to find the optimal block size through testing.

The Intel Xeon E5-2698 v3 has 32 KiB of L1 cache per core, and a theoretical calculation of the block size is as follows:
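The calculation itself did not survive into the post as published; a plausible reconstruction, assuming three $b \times b$ blocks of 8-byte doubles (one each for A, B, and C) should fit in the 32 KiB L1 cache: $3 \cdot 8 \cdot b^2 \le 32768$, giving $b \le \sqrt{32768 / 24} \approx 36$, so square blocks in the low-to-mid 30s are the theoretical sweet spot.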

Note the usage of 3 or 4 matrices. We consider the upper and lower bounds of the memory requirements of storing the A, B, and C matrices (i.e. in C := C + A*B). In the upper bound case, the memory required to store the matrices themselves dominates the requirements of temporary variables and other data (O(n^2) space versus constant), so we only need to consider 3 matrices; this is the case for very large matrices. In the lower bound case, the memory needed to store the matrices is comparable to that of the temporary variables, such as when the matrices are very small.

The matrix size and various other run-time factors mean that the theoretically optimal block size is not necessarily optimal in practice. We therefore benchmarked various block sizes, starting at 8x8 and incrementing the square dimensions up to 64x64.

Block size on performance

We found that the block size did not make a significant difference in performance; only that the larger block sizes seemed to give a slight improvement. We decided to introduce AVX to look for more significant performance factors.

AVX Intrinsics

The Intel Xeon E5-2698 v3 2.30 GHz supports AVX2 with a 256-bit vector width, meaning we can fit 4 double precision floating point numbers in a vector. Initially we tried to vectorize the innermost loop of the blocked matrix multiply kernel. However, that method required a horizontal sum across the vector, which AVX2 does not support efficiently.

Instead, we vectorized the second loop so that we could aggregate the sum of doubles in a register over multiple passes.


void do_block_avx(int lda, int M, int N, int K, double* A, double* B, double* C)
{
  int i;
  for (i = 0; i < M - M % 4; i+=4) {
    for (int j = 0; j < N; ++j) {
      // Compute C(i,j)
      __m256d c = _mm256_load_pd(C+i+j*lda);
      for (int k = 0; k < K; ++k) {
        c = _mm256_add_pd(
          c, 
          _mm256_mul_pd(
            _mm256_load_pd(A+i+k*lda), 
            _mm256_broadcast_sd(B+k+j*lda)));
      }
      _mm256_store_pd(C+i+j*lda, c);
    }
  }
  ...
}


Because we work on 4 doubles at a time, we need to account for the leftover rows at the end of a matrix whose dimension is not a multiple of 4. We separate our code into “parallel” (SIMD) and “serial” operations depending on whether we can use AVX2 or are forced to fall back to the naive matrix multiply. The above code snippet shows our “parallel” loop, and below is our “serial” loop:


  for ( ; i < M; i++) {
    for (int j = 0; j < N; ++j) {
      // Compute C(i,j)
      double cij = C[i+j*lda];
      for (int k = 0; k < K; ++k) {
        cij += A[i+k*lda] * B[k+j*lda];
      }
      C[i+j*lda] = cij;
    }
  }

We saw that block sizes that are powers of two provided the best performance, with block sizes of either 32 or 64 yielding the highest CPU utilization. This makes sense: when chopping the original matrices into blocks, any block size that is not a power of two, or at least a multiple of 4, cannot be mapped cleanly onto AVX intrinsics, leaving leftover serial operations in each block that cannot be vectorized.

Loop Unrolling

In an attempt to decrease the effects of branching, we unroll some of our loops. We combined this technique with AVX intrinsics:


void do_block_avx_unrolled(int lda, int M, int N, int K, double* A, double* B, double* C)
{
  int i;
  for (int j = 0; j < N; ++j) {
    for (i = 0; i < M - M % (UNROLL*4); i+=UNROLL*4) {
      // Compute C(i,j)
      __m256d c[UNROLL];
      for (int x = 0; x < UNROLL; x++) {
        c[x] = _mm256_load_pd(C+i+x*4+j*lda);
      }
      for (int k = 0; k < K; ++k) {
        __m256d b = _mm256_broadcast_sd(B+k+j*lda);
        for (int x = 0; x < UNROLL; x++) {
          c[x] = _mm256_add_pd(
            c[x], 
            _mm256_mul_pd(
              b,
              _mm256_load_pd(A+lda*k+x*4+i)));
        }
      }
      for (int x = 0; x < UNROLL; x++) {
        _mm256_store_pd(C+i+x*4+j*lda, c[x]);
      }
    }
  }
  ...
}

Results from changing the block size were then much more apparent, as shown in the figure below.

block size on performance with AVX and loop unrolling

Performance for block sizes 32 and 64 seemed similar across tests, so we decided to pick the larger 64 so as to have fewer blocks as the matrix sizes increased in testing.

We tested various values for unrolling, and in general noted that an UNROLL value that is a power of 2 maximizes our CPU usage. This makes sense, since it lets AVX take full advantage of operating on four double precision floating point numbers at once. Unrolling 8 at a time proved optimal in our case, but only for matrices up to around 500x500. Unroll sizes other than 4 and 8 gave very poor results.

Unroll size on performance with AVX and blocking

Fused Multiply-Add

Testing FMA (Fused Multiply-Add) intrinsics yielded worse results than without FMA.


void do_block_avx_unrolled(int lda, int M, int N, int K, double* A, double* B, double* C)
{
  int i;
  for (int j = 0; j < N; ++j) {
    for (i = 0; i < M - M % (UNROLL*4); i+=UNROLL*4) {
      // Compute C(i,j)
      __m256d c[UNROLL];
      for (int x = 0; x < UNROLL; x++) {
        c[x] = _mm256_load_pd(C+i+x*4+j*lda);
      }
      for (int k = 0; k < K; ++k) {
        __m256d b = _mm256_broadcast_sd(B+k+j*lda);
        for (int x = 0; x < UNROLL; x++) {
          // c[x] = _mm256_add_pd(
          //   c[x], 
          //   _mm256_mul_pd(
          //     b,
          //     _mm256_load_pd(A+lda*k+x*4+i)));

          c[x] = _mm256_fmadd_pd(b, _mm256_load_pd(A+lda*k+x*4+i), c[x]);
        }
      }
      for (int x = 0; x < UNROLL; x++) {
        _mm256_store_pd(C+i+x*4+j*lda, c[x]);
      }
    }
  }
  ...
}

Choice of Compiler

Cori supports compiler modules for the Intel and GNU standard compilers. We ran tests with a standardized block size of 64 and an unroll size of 8, comparing ICC (Intel C Compiler) and GCC (GNU C Compiler). Results are shown below.

ICC vs GCC

Compiler Options

-O2 Optimize Option

The output of the -O2 compiler optimization option is shown below. It optimizes various loops in our dgemm-blocked code.



Analyzing loop at dgemm-blocked.c:154

Analyzing loop at dgemm-blocked.c:156

Analyzing loop at dgemm-blocked.c:159

Analyzing loop at dgemm-blocked.c:109

Analyzing loop at dgemm-blocked.c:110

Analyzing loop at dgemm-blocked.c:138

Analyzing loop at dgemm-blocked.c:126

Analyzing loop at dgemm-blocked.c:128

Analyzing loop at dgemm-blocked.c:113

Analyzing loop at dgemm-blocked.c:362

Analyzing loop at dgemm-blocked.c:363

Analyzing loop at dgemm-blocked.c:366


Shown in the figure below is a comparison of GCC and ICC using the -O2 optimize option flag.

O2 GCC v ICC

-O3 Optimize Option



Analyzing loop at dgemm-blocked.c:154

Analyzing loop at dgemm-blocked.c:156

Analyzing loop at dgemm-blocked.c:159

Analyzing loop at dgemm-blocked.c:109

Analyzing loop at dgemm-blocked.c:110

Analyzing loop at dgemm-blocked.c:126

Analyzing loop at dgemm-blocked.c:362

Analyzing loop at dgemm-blocked.c:363

Analyzing loop at dgemm-blocked.c:366


We found that the -O2 compiler flag analyzed more code than the -O3 compiler flag.

O3 GCC v ICC

We found that though -O2 analyzed more code than -O3, the resulting performance was very similar, and the differences were negligible. We chose to use -O2 and carried on with other compiler options.

O2 v O3

-Ofast

However, when comparing the performance of -O2 and -O3 against another optimization option, -Ofast, the differences were more pronounced, thus justifying our choice of -O2.

O2 v O3 v Ofast

Other options

After various rounds of testing, it seemed that the flags -ftree-vec-info-optimized -fopt-info-vec-optimized -mavx2 provided the best results.

Compiler vectorization

The compiler options -qopt-report=1 -qopt-report-phase=vec reveal the compiler's automatic vectorization decisions. The output for our dgemm-blocked.c file is shown below:



Begin optimization report for: do_block_avx_unrolled(int, int, int, int, double *, double *, double *)

    Report from: Vector optimizations [vec]


LOOP BEGIN at dgemm-blocked.c(109,3)
   remark #25460: No loop optimizations reported

   LOOP BEGIN at dgemm-blocked.c(110,5)
      remark #25460: No loop optimizations reported

      LOOP BEGIN at dgemm-blocked.c(126,7)
         remark #25460: No loop optimizations reported

         LOOP BEGIN at dgemm-blocked.c(128,9)
         LOOP END
      LOOP END

      LOOP BEGIN at dgemm-blocked.c(138,7)
      LOOP END

      LOOP BEGIN at dgemm-blocked.c(113,7)
      LOOP END
   LOOP END
LOOP END

LOOP BEGIN at dgemm-blocked.c(154,3)
   remark #25460: No loop optimizations reported

   LOOP BEGIN at dgemm-blocked.c(156,5)
      remark #25460: No loop optimizations reported

      LOOP BEGIN at dgemm-blocked.c(159,7)
      <Peeled loop for vectorization>
      LOOP END

      LOOP BEGIN at dgemm-blocked.c(159,7)
         remark #15300: LOOP WAS VECTORIZED
      LOOP END

      LOOP BEGIN at dgemm-blocked.c(159,7)
      <Remainder loop for vectorization>
      LOOP END
   LOOP END
LOOP END


The output revealed that no additional vectorization was done for our inner matrix multiply function, suggesting that our hand-written intrinsics had already vectorized the kernel as far as the compiler could take it.

Note

Transpose

The overhead of transposing the entire matrices before performing our standard matrix multiply was too great. Transposing and then multiplying a row of A against a column of B also brought the horizontal sum back into consideration, and since that operation has high overhead, we did not see good results from transposing.

Loop Reordering

When considering loop ordering, it is important to order the standard three nested loops so as to minimize cache misses. Since the innermost operations compute new indices from additions and multiplications of the loop variables, the indices into each matrix should not jump around too much at a time; in other words, we take advantage of spatial locality. In the end, we noted that loop reordering did not increase our score nearly as much as the following techniques, largely because AVX intrinsics and loop unrolling already increase the amount of work done per memory access.
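For illustration (a sketch, not taken from our dgemm-blocked.c), one cache-friendlier ordering for the column-major layout used here is j-k-i, which keeps the innermost loop walking straight down a column of A and of C, i.e. unit-stride through memory:

// Column-major storage: element (i, j) of an lda x lda matrix lives at X[i + j*lda].
for (int j = 0; j < N; ++j) {
    for (int k = 0; k < K; ++k) {
        double bkj = B[k + j*lda];              // reused across the whole column
        for (int i = 0; i < M; ++i) {
            C[i + j*lda] += A[i + k*lda] * bkj; // unit-stride through A and C
        }
    }
}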

Conditionals

We noticed that the performance of matrix multiply with our AVX2, blocking, and loop unrolling optimizations varied depending on the size of the input matrices. In particular, multiplying matrices smaller than 32x32 yielded very poor results (e.g. size 31 was often nearly 10x slower than 32x32).

However, as with the matrix transpose, we found that the overhead of a conditional checking the input matrix size was too great to provide any benefit, largely due to CPU branch misprediction. Perhaps this becomes a non-issue for very large matrices, where the conditional's overhead is dwarfed by any benefit of the adjusted settings.

Loop Unrolling Alternatives

As an alternative to loop unrolling with another for loop, we pasted the unrolled code 8 times:


      __m256d c[UNROLL];
      for (int x = 0; x < UNROLL; x++) {
        c[x] = _mm256_load_pd(C+i+x*4+j*lda);
      }

      // VS

      c[0] = _mm256_load_pd(C+i+0*4+j*lda);
      c[1] = _mm256_load_pd(C+i+1*4+j*lda);
      c[2] = _mm256_load_pd(C+i+2*4+j*lda);
      c[3] = _mm256_load_pd(C+i+3*4+j*lda);
      c[4] = _mm256_load_pd(C+i+4*4+j*lda);
      c[5] = _mm256_load_pd(C+i+5*4+j*lda);
      c[6] = _mm256_load_pd(C+i+6*4+j*lda);
      c[7] = _mm256_load_pd(C+i+7*4+j*lda);


This gave a worse performance result.

After all these tests, we continued to experiment with different block sizes. We explored different ways of blocking and found an optimal one through trial and error. We then tried to unroll all of the A, B, and C matrices manually, which gave a slight increase in performance. We then realized that we could further improve performance if we could replace the final residual calculation (if the matrix size is not a multiple of 4, we have extra residue to handle):

for (int j = 0; j < N; ++j) {
    i = ii;
    for ( ; i < M; i++) {
      // Compute C(i,j)
      double cij = C[i+j*lda];
      for (int k = 0; k < K; ++k) {
        cij += A[i+k*lda] * B[k+j*lda];
      }
      C[i+j*lda] = cij;
    }
  }
  

Hence we added some extra “safe” space to the matrices to make their dimension a multiple of 8. After padding the matrices to a multiple of 8, we ran the same calculations as before. This gave a significant performance boost. After the calculation, we simply removed the extra space from the result matrix.

  add_extra_space(A, origA, lda, extra_space);
  add_extra_space(B, origB, lda, extra_space);
  add_extra_space(C, origC, lda, extra_space);
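The helper add_extra_space itself is not shown in the post; below is a hypothetical sketch of what it might look like, assuming column-major storage and zero-filled padding (the real implementation may differ):

// Copy an lda x lda column-major matrix 'orig' into a zero-filled buffer
// 'padded' of dimension (lda + extra), so the padded dimension is a multiple
// of the unroll width.
void add_extra_space(double* padded, const double* orig, int lda, int extra) {
    int newLda = lda + extra;
    for (int j = 0; j < newLda; ++j) {
        for (int i = 0; i < newLda; ++i) {
            padded[i + j*newLda] =
                (i < lda && j < lda) ? orig[i + j*lda] : 0.0;
        }
    }
}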

Final results

We believe that if we can avoid duplication of matrices, we can get a slightly higher performance. The results can be seen below.

The for-loop version shown earlier was our original and gave the best performance; pasting its body 8 times yielded worse performance. We hypothesize this is because of the increased code size, showing that loop unrolling is fundamentally a space versus time tradeoff; having a for loop for the compiler to unroll already captured most of the benefit of loop unrolling.

Conclusion

We were able to optimize DGEMM performance on a single core, single thread of an Intel Xeon E5-2698 v3 2.30 GHz CPU to an average of 35% of peak on Cori. Various strategies, including loop unrolling, blocking, AVX intrinsics, and compiler flags, were explored in depth.

Optimal performance with our code can be reached using the following compiler options:

cc -O2 -funroll-loops -march=core-avx2 dgemm-blocked.c

Variation in testing environments could have had an effect on our results. Outliers in the data were observed when Cori was under load from other users. The code was also tested outside of Cori on other hardware, including an Intel i5-7360U CPU @ 2.30 GHz in a 2017 MacBook Pro, where it achieved a surprising average of 43% across our benchmarks, probably due to the lack of competing compute load on our local machines.

This final figure shows our final maximum performance (built off the square_dgemm function in dgemm-blocked.c) against the provided “naive blocking” function in the starter code, before block size tuning.

Final results
