Some ramblings and ruminations in the moment.

Decentralized Index of Knowledge

Background

The index of knowledge project holds a special place in my heart since it’s always been a B@B initiative parallel to my own. When I helped teach the DeCal in Fall 2017, we had a lot of interest in compiling a master list of resources that we at B@B considered good learning materials – an opinionated, curated list of resources.

It seems intuitive to have a central repository of resources serving as a baseline truth from which to write our course material, train our members, administer certifications, etc., but every time it was revived, the project died within a semester. Difficulty of maintenance, centralization, and inaccessibility all contributed to its downfall. From that point onwards it was colloquially deemed a “cursed” project that no one ever wanted to pick up – at least in my mind. Some more history of the project is in the linked powerpoint above.

Note

These are the observations of a “computer scientist” (actually I just hack projects together…). I use some terminology that might not be colloquial or accurate, but it makes some sense to me.

Analysis

Last semester, coinciding with proposed redesigns of the Blockchain Fundamentals DeCal, as well as the revamp of the Blockchain for Developers DeCal, I gained an interest in using computer science for education. I had pitched some ideas to Armando Fox of the ACE (Algorithms for Computing Education) research lab in an attempt to inspire interest in the area as a potential grad school research focus, but none of the existing PhD students wanted to pursue an idea different from their own research focus – only remarking that it was an interesting idea and that similar ideas had been explored before, in the areas of personal knowledge indices and of using ontologies to generate “wrong” answers to multiple choice questions. Basically no one wanted to work on something different, so I decided to investigate in my own free time.

Having no real formal training in education, but having been a student for many years and observed how I learn best and how good and bad teachers/professors do their jobs, I decided to take an algorithmic perspective. I subscribe to the modern belief that personalized education is the future, and it seems that, from a fundamental (simplistic if you will) perspective, students differ in their initial pursuit of new knowledge. This boils down particularly to how they explore the knowledge space given to them. A simple way to think about this (and perhaps a cool experiment to conduct) is: given a topic to investigate, a test to take, and the entirety of some linked data set (e.g. Wikipedia), how would they study? e.g. would they attempt to understand individual components of study first, or rather the big picture, a mix of both, etc.?

I remember reading some articles (or was it my mom and her friends on Line…) on the difference between Eastern and Western perspectives on academia, particularly mathematics. Of course it was a generalization, but the claim was that in the East, students learn from the bottom up, cramming through problem sets mechanically until they finally come up with the patterns or algorithms themselves, whereas in the West, teachers spend a great deal of time helping students understand “main concepts”, as is apparent from their constant pursuit of new ways of teaching math (à la Common Core).

If we arrange bits of knowledge in a directed graph, with edges representing prerequisites for learning, it seems that each student would benefit from identifying a unique traversal.

The idea was to present knowledge as an unopinionated graph and to represent learning as a traversal of that graph.

A parallel initiative last semester worked on writing a proof-of-concept blockchain textbook for a non-technical audience. It was, in comparison, highly opinionated about what information was presented, how, and in what order, but it was a good case study in my mind of pedagogical traversal.

It was more of a level-order traversal, in that it would consider the big picture of blockchain in a complete view but at low resolution, and then refine that big picture more and more. In other words, a BFS where each depth is a new abstraction layer lifted. The book, when completed, would still not traverse too deep, so as to appeal to the younger non-technical audience. However, it’s clear that if it were written for a more technical audience, the traversal would go deeper more quickly, especially considering students with more knowledge in certain areas, etc.
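To make the traversal idea concrete, here is a toy sketch (purely illustrative, not code from the IoK or the textbook project) of a knowledge graph stored as an adjacency list of prerequisites and explored with a BFS, so that each depth corresponds to one more abstraction layer being lifted:

#include <iostream>
#include <map>
#include <queue>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Toy knowledge graph: each topic points to the topics one layer "deeper".
int main() {
    std::map<std::string, std::vector<std::string>> deps = {
        {"blockchain", {"distributed systems", "cryptography"}},
        {"cryptography", {"hash functions", "public-key crypto"}},
        {"distributed systems", {"consensus"}},
    };

    std::queue<std::pair<std::string, int>> frontier;  // (topic, abstraction depth)
    std::set<std::string> seen;
    frontier.push({"blockchain", 0});
    seen.insert("blockchain");

    // Level-order traversal: everything at depth d is visited before depth d+1.
    while (!frontier.empty()) {
        auto [topic, depth] = frontier.front();
        frontier.pop();
        std::cout << std::string(2 * depth, ' ') << topic << "\n";
        for (const auto& dep : deps[topic])
            if (seen.insert(dep).second)
                frontier.push({dep, depth + 1});
    }
    return 0;
}

A more technical audience would simply let the traversal run more levels, or bias it toward depth in areas the student already knows.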

Method

The first reboot of the IoK I authored was a graphical PoC. It featured rich graphics via the Cytoscape graph library, and users could add/edit nodes and edges with Google authentication. It had minimal integration with Google Docs, and could also generate markdown files, which could be rendered into pseudo-awesome-lists. The underlying “data store” was a JSON file on GitHub.

I had plans of using GitHub’s public sharing and PR system as a “decentralization” layer, such that interested contributors could fork the repo, run a script to add nodes/edges to the IoK (and generate metadata), and make a PR of standard format. Verified collaborators on the main IoK project could then accept/decline the PRs as a Proof-of-Authority. Users would of course always be welcome to fork the IoK and run their own.

Everything was clunky though, and the use of GitHub, running scripts, etc., seemed like poor UX, especially given my ideas of making the IoK a more generalized digital pedagogical tool.

Decentralize it

Previous iterations of the Index of Knowledge used an editing system that amounted to a Proof of Authority. While it worked in concept, by the time it came to getting people to adopt it, it was difficult to maintain, especially since everything had to be audited by a core group of people. Like moving from testnet to prod, we needed to scale by decentralizing. GitHub forks gave this to us for free, but I was looking for something less hacky and more decentralized – after all, hosting everything on GitHub is kind of cheating…

I had learned about and played with Blockstack at HackDavis, where we used it for decentralized identity management and data streaming (via personal data lockers), and it seemed like a good fit for the IoK. It also didn’t hurt that I was freshly graduated (and unemployed) with nothing too exciting going on, aka every day was like a hackathon.

The idea was to move the graph data for the IoK from GitHub to individuals’ own data lockers on Blockstack. Users could then use these to define their own IoKs, view each other’s IoKs, and fork them to make their own edits.

I’ve also written a small web scraper that visits an IoK (e.g. my own), downloads the graph data, and linearizes it into an awesome-list, which is committed back to GitHub as a README.md. Currently, this is automated in a Travis build job to save myself money, but ideally we could write some webhooks for this, or run it on an interval with cron.

Currently, we can fork IoKs, but can’t merge them back together. I’m writing a small opinionated webserver that performs simple graph join operations. Say another user has an IoK with a subgraph you really want to include in your own (e.g. a whole mini-IoK about MakerDAO). With this, you could borrow that subgraph and merge it into your own IoK, as in the sketch below.
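As a rough illustration of what such a join might look like (hypothetical, not the webserver’s actual code), think of an IoK as a map from each topic to its set of dependencies; merging in someone else’s subgraph is then just a union of nodes and edge sets:

#include <map>
#include <set>
#include <string>

// An IoK as adjacency sets: topic -> set of prerequisite/related topics.
using Iok = std::map<std::string, std::set<std::string>>;

// Simple graph join: union of nodes and of each node's edge set.
Iok merge(const Iok& mine, const Iok& theirs) {
    Iok out = mine;
    for (const auto& [topic, edges] : theirs)
        out[topic].insert(edges.begin(), edges.end());
    return out;
}

In practice the server would also have to reconcile the metadata and resource links attached to each node, but the core operation is this kind of set union.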

The roadmap

I’m currently working on this project in my spare time, but am always looking for help! It’s by far the biggest personal project I’ve started and maintained for this long, and I’d like to see it go somewhere.

The IoK can be found on GitHub. It’s still very hacky and in indefinite beta. Currently some B@B education members are contributing toward their internal project requirement, and I think it’s a good way for them to get some hands-on programming experience.

Alternative architectures

I have a tendency to switch between architectures on a whim, esp. with this being a personal project that started off with little documentation and just my brain going wild. While this project still has very little documentation, here are some notes off the top of my head:

GitHub entirely vs blockchain-based systems:

  • Mainly for hype, hackathon, and most importantly: UI/UX
  • We could have built the IoK on top of GitHub to begin with
  • Seems harder though, since GitHub doesn’t treat storage as a first-class design concern…
  • Had some initial ideas about writing a GitHub storage plugin for Blockstack, using it as a pseudo-cloud data locker, but that seems like a big project for another time.

UI: graph first vs text first:

  • Initial IoK projects contained a lot of dialogue amongst team members regarding the UI: e.g. should it be a big list of resources, or something cooler like a graph? – I thought, why not both!?
  • A cool idea for the current IoK would be to design a Wikipedia-like article for each topic, linking to resources and other topics within the IoK. We could then walk this text, explore the links, and generate the graph as a second-class citizen. Seems kind of backwards…
    • Maybe we could have something more relaxed: like in addition to a description, a separate field accompanying each dependency containing a sentence about why it’s a dependency.
    • For example: Bitcoin depends on cryptography. Cryptography supports authentication, integrity, and non-repudiation of identity and transactions…
  • Not sure how much original text we would have, so the wiki-style page would inevitably converge to a minimal design like we have, but still generated from the graph itself. Seems a bit backwards to me esp in the data storage realm.

Firebase suite vs Rust on cloud deploy/etc:

  • Firebase is easy and we’ve done it before
  • Edu project as a means to learn to teach and teach to learn
    • Here “teaching” is making the IoK as an educational tool
    • This maps to actually making the IoK, and picking new skills/tools to build the project, respectively
  • Rust is just cool and I want more excuses to write Rust code. Fight me.

Goodbye Berkeley

The brief period of unemployment between graduation and the start date of my first full-time job out of college is basically purgatory. With the graduation ceremony only happening in the Spring following my graduation, it feels even more so. There are many loose ends, with research, club involvement, and connections being the biggest contributors to my continued connection with campus. But as the dust settles in this chaotic phase of my life, I am only slowly realizing that the last few years have been about setting the baseline for my continued success in both personal and professional matters.

Just a quick reflection on some major learnings and takeaways.

Academia vs Industry (aka the pursuit of impact)

I struggled to narrow down my true interests in terms of career path. It’s definitely a very naive and perhaps privileged way of thinking – that one can prioritize making a meaningful, positive impact. But those are just words, and while it’s possible, it’s important to think of goals at varying levels of granularity. Of course impact might be the eventual goal, but there are smaller and more immediate ways to achieve impact that may be roughly collinear in trajectory.

Somewhere along the line, perhaps through idolization of professors and other academics, I found myself on the grad school track. Especially in the blockchain space I had so suddenly found myself immersed in, the community is stratified by product/technology, whose founders oftentimes reach demi-god-like status. My reverence for certain professors had reached similar heights. I wanted to make a big impact by focusing on big-picture topics in academic research – somewhat overlooking some of the prerequisite engineering rigor.

After seeking advice from others, I eventually sorted out my prioritization of applied vs pure sciences. While we need all types of thinkers, I came to understand that my strength lies in working where the impact is more immediate. I will use my academic training combined with engineering rigor to solve real-world problems. Grad school will be on the roadmap eventually, but for now, I’d rather see for myself the problems that need solving. If I then deem it only possible to solve these problems with the leverage of academia and graduate studies, so be it – chase that impact.

Mindfulness and parallelization of the brain

Coming into college, I knew immediately that there would be a billion things to do and only the time for a few of them. Like piling on too large a plate during Thanksgiving dinner, it’s important to pace yourself. If you frontload all your responsibilities and end up underperforming or underdelivering, that’s worse than excelling at only a few things. A few more takeaways here:

  • Avoid context switching
  • Attempt tasks on a rolling basis

In other words, prioritize, order, and pipeline. In my life, the implication is that my daily TODO lists are not too long, are ordered, and consider future days’ TODOs as well (e.g. a coarser-granularity weekly/monthly/yearly TODO). This also plays directly into my personal definition of practical mindfulness.

Others

  • Take nothing for granted
  • Don’t mistake motion for progress
  • Make the most of where you are

Time is fake

"Time has been invented in the universe so that everything would not happen at once."

Distributed systems

It all started with researching distributed systems safety, and the tools used to test it: namely Jepsen and WANem. The concept of time in the coordination of systems has always interested me – especially that of distributed systems, and its extension to the more philosophical.

The universe is thrashing

When your computer hits a bottleneck in compute, everything else seems to slow down. That’s because every process on your computer is sharing time.

When your computer hits a bottleneck in memory, everything also seems to slow down. That’s due to constant paging in and out of memory - thrashing.

These two scenarios are bottlenecks in the handling of information. It’s clear how memory (and the memory hierarchy) represents information. Compute can also represent information, just over time. And the common way to handle information overload is to slow down time, keeping the flow of new information consistent as we encounter high-density pockets of information. For computer hardware, there’s a physical bottleneck - an upper bound on the information throughput that is processable.

The universe also has this inherent bottleneck - a bottleneck on the rate of information processing - the speed of light.

As you get close to a black hole, the immense mass bends spacetime, and general relativity suggests that time slows down for you. Mass is information, at least according to Maxwell. In other words, in an attempt to process so much information, the universe is thrashing.

But this phenomenon only occurs for the observer, which begs the question: is the universe thrashing, or is the observer’s observation (or the observer themselves) thrashing? Is there such a distinction?

Exploring Space(time?)

It was a peaceful summer evening, sitting with the family and listening to bossa nova on Spotify. But my mind once again drifted to thoughts of the essence of space and time. The music playing through the speakers has itself no understood notion of time. Yet the delayed playback of the music brought about its melodies and harmonies. Downloaded music is just a collection of bits, existing all at once in space. But by interpreting it as a data stream with a particular bitrate, imbuing it with an artificial notion of change and time, we’re able to enjoy the music fully as humans. Imagine being handed a CD or USB with music (or anything else) loaded on it and being able to fully understand it simply by glancing at (or sensing?) it, perhaps akin to static analysis.

In any case, the encoding of data into a physical form separable from time reminded me of the work of the Roman philosopher Lucretius. He held that the universe consisted of atoms, and that a (random) atomic swerve randomized their otherwise deterministic movement. Movement is of course closely tied to the notion of time, for there is no motion without the passage of time. Perhaps then we could view motion as an artifact of the limited processing capacity of the human brain: needing to invent notions of time and change to process an otherwise static universe. Perhaps all atoms, as Lucretius suggests, exist eternally, and their movement, caused by natural trajectory and atomic swerve, is an artifact of both human presence and intervention. The human brain’s processing of the universe inadvertently invented time.

The Big Picture

I’ve been stuck in the weeds recently learning all these specific technologies, and while there’s merit in that, sometimes I need to take a breather and explore the big picture. Everything needs a clear direction, and in that sense the big picture of systems, and the applications and workloads they support, is the guiding north star.

Scaling Distributed Systems

Design

Fixing bugs takes on the order of 10 to 100x longer than accounting for the bug in design and working a solution around it.

You also have to determine at design time which dimensions you might need to scale along. There’s also the question of whether you solve issues up front in design or later as they arise, after hitting some scaling wall.

Microservices

“Microservices” is a buzzword, and there’s no real cutoff between services and microservices in terms of compute, storage, or other metrics. Instead, it’s more valuable to think about pushing down the scale of any service you write. Smaller services are easier to swap in and out. For example, if the initial design was flawed or a new, better way to solve a problem is discovered, small-scale services provide the modularity needed to easily swap out individual parts of a service.

There’s a bit of a paradox here:

  • Microservices increase scalability by modularity, which means loose coupling
  • Scalability can be achieved by tight coupling of individual components

Moral of the story is that things will need to be titrated in production.

What Picture? Tangent on how to make impact

Note: A bit of a tangent, and me jotting down a bunch of ideas/ranting:

Lamport implies at various points in his Turing Award paper “The Computer Science of Concurrency: The Early Years” that there’s a huge gap between engineering solutions to concurrency issues and the study of concurrency in computer science. Dijkstra studied and published on fault tolerance and self-stabilization in 1974, without having a direct application of his work in mind. Perhaps he (and others) anticipated fault tolerance becoming a major field of study within computer science and thus chose to pursue it, but it’s hard to know for certain. It’s more worth noting (personally) that early systems implemented solutions to concurrency and fault tolerance in distributed systems from the engineering perspective, without consulting the formalism of scientific work such as Dijkstra’s.

As Lamport puts it, “A survey of fault tolerance published in 1978 does not mention a single algorithm, showing that fault tolerance was still the province of computer engineering, not of computer science.” [2]

That is to say there’s an obvious tradeoff between engineering and science-ing solutions – in many aspects. One is driven by the need for a solution to an issue with an application or domain focus. The other is driven by pure scientific pursuit. Of course there might be some driving application or foresight/insight, but science often abstracts away the usages of its findings. Papers that do not are often either whitepapers describing something that has already been built (in which case they might as well be high-level system specs) or pressured somehow (e.g. by a source of funding) to append notes on future work or applications.

In terms of the cycle of impact-making: it seems from my perspective that when problems first arise, a solution is hacked together first. Unless it’s a very fundamental problem, it’s often not worth paying the cost of all the abstraction overhead and designing from the top down again (scientifically). Problems are run into in industry or other engineering circumstances, and should be dealt with first in that context. If there seem to be recurring problems with the solution, or similar problems keep arising, then the best way to amortize future time spent fixing problems is to subject the nature of the problem in question to more rigorous scientific study. In general, we work up the stack: assuming problems arise in engineering and not just scientific ideation, impact is first made directly facing the application, and then later escalated to pure scientific pursuit if necessary.

Adopting a science-only or engineering-only policy seems suboptimal, since you’ll always be either upper- or lower-bounding the abstraction layers you work on. Instead, pick up the skillset to work in between. This then brings up the question of whether the workflow should be communication bound or compute bound – communication in terms of ferrying work and results between engineering and science workflows, and compute in terms of actual time and energy spent on engineering and science. It’s definitely hard to translate scientific works to the real world. Academics often deal in theory, absolutes, and assumptions, and when it comes to implementation, that’s when ideas meet reality. Perhaps a happy middle ground to focus on would be academically-focused engineering.

References

  1. Hidden Scaling Issues of Distributed Systems
  2. The Computer Science of Concurrency: The Early Years

Exploring Parallel Particle Simulation

Summary

The main problem is to simulate the interactions of particles within a finite space. Particles exert forces of repulsion on each other and can also bounce off the walls of the simulation grid.

nbody simulation with cutoff

Here, we present iterative improvements on a toy particle simulation. As with the previous post on parallelization, all results were calculated on an Intel Xeon E5-2698 v3 2.30 GHz CPU, this time across multiple processors. Graphs show performance as speedup against the number of processors $p$, and also simulation time against the number of particles and the number of processors allocated.

We present three main algorithms:

  • Serial algorithm
  • Shared memory algorithm, OpenMP
  • Distributed memory algorithm, MPI

Serial Algorithm

We can reduce the running time complexity of our particle simulation to O(n) if for every particle we only consider the particles nearby. To formalize this, we split the simulation grid into bins and assign each particle to a bin based on its location in the grid. Then, for each particle in each bin, we consider the particles in its own bin as well as those in the adjacent bins, so in total there are at most 9 bins to consider. This reduces the standard O(n^2) running time to O(n), since each particle interacts with only a constant number of bins, each holding a bounded number of particles on average.

To optimize the number of bins, we need to consider the effective area around each particle within which the repulsive force applies. This radius is defined as cutoff. The dimensions of each bin should thus be a function of cutoff.

In implementation, we define the datatype bin_t as follows, where a bin_t is a vector of particles:

typedef std::vector<particle_t> bin_t;

We chose to store a vector of particles rather than a vector of particle pointers due to concurrency issues that arise when we parallelize.

Before the main simulation loop, we place all particles into their respective bins based on their starting x and y coordinates with some simple geometry:

//
// Puts particles into vector of all bins
//
void putIntoBins (particle_t* particles, vector<bin_t>& allBins, int n) {
    // gridSize, numBins, and binSize are globals shared with the rest of the simulation
    gridSize = sqrt(n * DENSITY);              // the simulation grid is gridSize x gridSize
    numBins = int(gridSize / binSize) + 1;     // bins per grid dimension
    allBins.resize(numBins * numBins);         // flattened 2D array of bins
    for (int i = 0; i < n; i++) {
        // map each particle's (x, y) position to its bin coordinates
        int x = int(particles[i].x / binSize);
        int y = int(particles[i].y / binSize);
        allBins[x*numBins + y].push_back(particles[i]);
    }
}

When considering the bins at the edge of the simulation grid, we do not have to consider a full 9 bins, but rather only 4 or 6 bins, depending on whether the bin in consideration is on a corner or an edge. There are a few approaches to handling this:

  1. Using conditionals to check whether the bin is a corner or edge and then performing the appropriate simulation steps
  2. Padding the simulation grid with one layer of empty bins and changing the bin iteration from [0,n) to [1,n-1), where n is the grid dimension

We chose to pursue the first approach so as to avoid additional memory usage for simulations of large size, since padding grows linearly with the dimensions of the simulation grid.

After computing and applying forces to each particle, we then move the particles to their new bins for the next simulation step. We chose to recycle the bins data structure, as sketched below.
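A simplified sketch of this re-binning step (assuming the same globals and bin_t definition as in putIntoBins above; not the verbatim project code):

// After moving particles, re-home any particle whose new position maps to a different bin.
void rebin(vector<bin_t>& allBins) {
    bin_t moved;                                     // particles that left their old bin
    for (int b = 0; b < (int) allBins.size(); b++) {
        bin_t& bin = allBins[b];
        for (int k = (int) bin.size() - 1; k >= 0; k--) {
            int x = int(bin[k].x / binSize);
            int y = int(bin[k].y / binSize);
            if (x * numBins + y != b) {              // particle crossed a bin boundary
                moved.push_back(bin[k]);
                bin.erase(bin.begin() + k);          // recycle the existing bins in place
            }
        }
    }
    for (int i = 0; i < (int) moved.size(); i++) {   // reinsert into the new home bins
        int x = int(moved[i].x / binSize);
        int y = int(moved[i].y / binSize);
        allBins[x * numBins + y].push_back(moved[i]);
    }
}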

Grid corner checking

In our main simulation loop, we consider all 8 surrounding bins around a single bin, but use conditionals to check for bounds. This is implemented as follows:

  int dx[3] = {-1, 0, 1};
  int dy[3] = {-1, 0, 1};

  for (int dx_index = 0; dx_index < 3; dx_index++) {
      for (int dy_index = 0; dy_index < 3; dy_index++) {
          if (i + dx[dx_index] >= 0 
              && i + dx[dx_index] < numBins 
              && j + dy[dy_index] >= 0 
              && j + dy[dy_index] < numBins) {
            ... // apply forces
          }
      }
  }

The arrays dx and dy represent the x and y coordinates of the surrounding bins, and we use simple bound checking with respect to the simulation grid in the innermost conditional.

Additional bin size considerations

We observed that bin size should be a function of the cutoff distance of each particle. Initially, we reasoned that if we make the dimensions of each bin cutoff x cutoff, then in the worst case each particle would only affect the 9 bins previously specified. We tested various bin sizes:

serial_binsize

All simulation times are measured in seconds on the y-axis. We noted that in implementation, a bin size of cutoff * 2 consistently provided higher performance than the previously speculated size of simply cutoff. We hypothesize this is due to the cost of potentially checking 9 bins compared to the relatively lower cost of our conditional checking. If the bin size is cutoff * 2, there is a good chance that a particle’s force field of cutoff radius will not cross into all 9 bins. A bin size of cutoff * 2 capitalizes on this, while avoiding making the bins so big as to negatively affect memory performance.

Shared Memory Algorithm, OpenMP

We used OpenMP to parallelize the particle simulation, and mostly kept the algorithm the same as the serial one, with some minor adjustments for synchronization. Firstly, we use the parallelization option private(dmin) to give each thread a private copy of the dmin variable. Similarly, for davg and navg, we use #pragma omp for reduction(+:navg,davg) in the simulation loop when moving particles, after noting that apply_force in common.cpp aggregates navg and davg using addition.
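A minimal sketch of how those annotations fit together (illustrative only, not the verbatim project code; navg, davg, dmin, n, and the particles array are assumed to be declared in the enclosing scope as in the driver from common.cpp):

#pragma omp parallel private(dmin)
{
    // dmin is private: each thread tracks its own minimum distance.
    // navg and davg get per-thread copies that are summed at the end of the loop,
    // matching apply_force's additive accumulation.
    #pragma omp for reduction(+:navg, davg)
    for (int i = 0; i < n; i++) {
        // apply_force(particles[i], particles[j], &dmin, &davg, &navg); for nearby j
    }
}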

Synchronization

We also introduce an additional data structure, bin_t moved_particles, to keep track of the moved particles for each thread. We considered having a different bin_t moved_particles for each thread and collecting the results thread by thread at the end, rather than using a lock around a shared variable.

parallel_synch

The above graph shows the results we achieved when using different bin_t moved_particles for different threads and combining them at the end.

With synchronization in mind, we then made it such that only the master thread recalculates the bins for each of the moved particles, and we put a lock around the moved_particles variable. This gave a performance boost in comparison with the previous method. Also notable is the #pragma omp barrier used as an explicit barrier for all threads at each step of the main simulation loop. A simplified sketch of this scheme follows.
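Putting it together, the per-step synchronization looks roughly like this (a sketch under the same assumptions as above, using an omp critical section to stand in for the lock; not the exact project code):

#pragma omp parallel
{
    bin_t local_moved;                       // particles this thread dislodged

    // ... apply forces and move particles, filling local_moved ...

    #pragma omp critical
    moved_particles.insert(moved_particles.end(),
                           local_moved.begin(), local_moved.end());

    #pragma omp barrier                      // all threads finish moving first

    #pragma omp master
    {
        // only the master thread re-bins the particles in moved_particles
    }

    #pragma omp barrier                      // bins are consistent before the next step
}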

Results

Different values of binsize, Parallel implementation

As with our serial implementation of the simulation algorithm, we considered various bin sizes.

parallel_binsize

The two plots above show the simulation times for increasing numbers of particles, for both the serial and parallel implementations. We see that in both implementations, a binsize equal to cutoff * 2 performs best, which is why we chose this value to be the dimensions of the bins in our implementations.

Serial vs Parallel Runtimes, log-log scale

serial_loglog

parallel_loglog

The runtimes of both the serial and parallel implementations grow linearly in n, appearing as straight lines on the log-log plots above. This is consistent with the expected behavior, as the algorithm we used lowers the computational complexity of the simulation from O(n^2) to O(n).

Num_threads vs Runtimes, Parallel

numThreads

Performance increases as we add more threads, with an optimum seemingly reached at num_threads = 32, which achieves a speedup factor of around 19.3.

speedup_plt

Above is the speedup plot, with linear speedup represented by the orange line. After around 20 threads, the observed speedup no longer follows the linear pattern and levels off, suggesting that at this point the overhead of parallelization (e.g. communication and synchronization) begins to outweigh the gains from additional threads.

Design choice note

Originally, in designing our parallel algorithm, we had envisioned using nested parallelism. Instead of having each thread iterate through all bins and, for each bin, check the 9 bins around it, we would parallelize further. We would split the entire simulation grid into tiles (much like the block tiling from matrix multiply), each tile being operated upon by a thread. Then, for each tile, the thread would spawn a thread for each of the 9 surrounding bins, since applying the force from one particle upon all others can be parallelized so long as we fork and join for each particle in each bin. However, we quickly discovered that this type of nested parallelism is not recommended due to oversubscription: the number of threads would far exceed the number of cores, increasing the parallel overhead.

Distributed memory algorithm, MPI

The key observation in reducing the running time complexity from quadratic to linear was to leverage data locality and place nearby particles into bins. Each particle then only needs to apply force to particles in the 8 surrounding bins and the bin it belongs to. Under the shared memory paradigm, we used OpenMP to write a multithreaded program such that each thread applied forces on a subset of bins, and in a critical section aggregated all forces and then moved the particles according to Newton’s second law. This was part 1.

To leverage multiple processors connected through some interconnection network, we need to redesign the particle simulation algorithm for distributed memory. Our main decision was to consider communication overhead in light of computational running time complexity.

NOTE: In this paper, we use the terms root processor and master processor interchangeably. They are the main processor with rank == 0. Also used interchangeably are the terms processor and process.

Geometry & Row Abstraction

We consider the simulation on a per-row basis, such that each processor is responsible for applying forces from all particles in a set number of rows of bins. We choose the row abstraction (as opposed to, e.g., quadrants) because it is easy to identify the neighbors of a row. Because simulating the force on a particle requires at most the 8 bins surrounding it plus the bin it is contained in, we need to consider all neighbors of a bin. By considering rows, we only have to worry about vertical neighbors, as horizontal neighbors are by definition included in the row.

Since each processor receives a certain number of rows of bins, we must consider the particles that move into and out of the scope of that processor’s simulation. A sketch of one possible row assignment follows.
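One way each processor could derive its band of rows from its rank (a hypothetical sketch, not the exact project code; rank, n_proc, and numBins are assumed to be available as in the rest of the code):

int rowsPerProc = (numBins + n_proc - 1) / n_proc;   // ceiling division
int start = rank * rowsPerProc;                       // first row of bins this rank owns
int end = start + rowsPerProc;                        // one past the last owned row
if (end > numBins) end = numBins;                     // the last rank may own fewer rows
// Rows [start, end) are simulated locally; rows start-1 and end are the vertical
// neighbors whose particles must also be tracked.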

Global & Local Scope of Calculations

Initially, all particles are broadcast to all processors, which then group the particles into bins deterministically. All processors start with a trivially global view of the simulation. However, each time step makes each processor’s global view diverge. Processors simulate only the rows allocated to them, but also consider the particles in the rows directly above and below (i.e. their vertical neighbors), because those particles have the potential to move into and out of the rows in scope.


// Move particles in the start row into toMove: they may leave this processor's rows
if (rank != 0) {
  for (int j = 0; j < numBins; j++) {
    bin_t& temp_bin = allBins[(start) * numBins + j];
    for (int i = 0; i < temp_bin.size(); i++)
      toMove.push_back(temp_bin[i]);
    temp_bin.clear();
  }
}

...

//Delete neighbours as we will get them in the new timestep
if (rank != 0) {
  for (int j = 0; j < numBins; j++) {
    bin_t& temp_bin = allBins[(start - 1) * numBins + j];
    temp_bin.clear();
  }
}


In the above code snippet, we are clearing out the bins in the row directly above the start row index. We are also moving all particles in the row specified by the start index, since they have the potential to move outside of the simulated rows and move inter-processor.

Distributed Memory Algorithm

We implement the following distributed memory algorithm using MPI functions in C++. All calls distributing, gathering, and scattering particles amongst processors are blocking, as discussed in the next section.

  • Broadcast all particles to all processors, bearing large communication overhead to save later
  • All processors place particles into bins
  • All processors calculate start and end row indices, to define the rows of bins that each processor will be simulating the particles of
  • Simulate for N timesteps:
    • All processors compute forces for all particles for all bins in their rows from start to end row indices
    • (davg, navg, and dmin must be reduced when the flag is set)
    • All processors calculate intra and inter-processor particle movement (with respect to the rows of bins each processor simulates)
    • All processors perform intra-processor moves
    • All processors consider vertical neighbors (top and bottom rows) and empty them
    • All processors send inter-processor moves, as well as information about the vertical neighbors, and gather at the root process
    • Root process gathers all information about moved particles between processors, and scatters back to all processors
    • All processors receive particles from root processor about inter-process moves and re-bins these inbound particles
    • All processors have new local state of particles and can continue with next simulation step

NOTE: while each processor starts with a copy of all particles, the majority of each processor’s view quickly becomes stale. Each processor is only aware of the narrow band of rows it simulates, plus its vertical neighbors. All other particles in other bins/rows are inconsistent and do not affect the state of the simulation. Only the root process, after gathering particle movements, regains a global view of the entire simulation.

MPI Usage

We use the standard MPI message passing interface to communicate between processors. Our general pattern is to handle the majority of the global reduction and aggregation logic in a master (root) process, and to distribute state to be worked on symmetrically by the other processors. We use the MPI datatype PARTICLE, defined as six contiguous MPI_DOUBLEs, since the serial analog consists of six C doubles representing position, velocity, and acceleration in the x and y directions.


MPI_Datatype PARTICLE;
MPI_Type_contiguous( 6, MPI_DOUBLE, &PARTICLE );
MPI_Type_commit( &PARTICLE );


Initially, particles are initialized by the root process, since init_particles is not deterministic. The particles are then broadcast to all other processors using MPI_Bcast(particles, n, PARTICLE, 0, MPI_COMM_WORLD) with the communicator MPI_COMM_WORLD. Once all processors receive the broadcast of all particles, they place all particles into bins. The initial binning process is deterministic and can be replicated on all processors:


if( rank == 0 )
    init_particles( n, particles );

// broadcast all particles from root and put them into bins
MPI_Bcast(particles, n, PARTICLE, 0, MPI_COMM_WORLD);
putIntoBins(particles, allBins, n);


In the main simulation loop, when calculating the davg, dmin, and navg statistics, it is important to reduce the variables. This is done using the following code; the calls to MPI_Reduce use MPI datatypes and operators and reduce into the root processor 0:


if( find_option( argc, argv, "-no" ) == -1 )
{
  MPI_Reduce(&davg,&rdavg,1,MPI_DOUBLE,MPI_SUM,0,MPI_COMM_WORLD);
  MPI_Reduce(&navg,&rnavg,1,MPI_INT,MPI_SUM,0,MPI_COMM_WORLD);
  MPI_Reduce(&dmin,&rdmin,1,MPI_DOUBLE,MPI_MIN,0,MPI_COMM_WORLD);
  if (rank == 0){
    //
    // Computing statistical data
    //
    if (rnavg) {
      absavg +=  rdavg/rnavg;
      nabsavg++;
    }
    if (rdmin < absmin) absmin = rdmin;
  }
}


The last major usage of MPI comes when considering the mechanism by which processes receive particles that move between bins. When particles move between bins, they can either move to a bin belonging to the same processor, or to a bin belonging to a neighboring processor.

Intra-processor movement can be handled locally without MPI. However, inter-processor movement is trickier, especially since the MPI collective calls need to know data sizes up front. Therefore, when communicating inter-process moved particles, we must first communicate the number of particles that move out of each process; only then can the receiving processor know the size of data to receive.


int sendSize = toMove.size();
int rcvSize[n_proc];

// master process collects sizes to determine offsets
MPI_Gather(&sendSize, 1, MPI_INT, rcvSize, 1, MPI_INT, 0, MPI_COMM_WORLD);  

The array rcvSize records the number of inter-process moves into the processor represented by the index of the array. In other words, processor i can expect rcvSize[i] inbound inter-process moved particles. Each processor records its own sendSize since only it knows the number of particles from its local simulation that move out of the current processor. We use a call to MPI_Gather to gather all processors’ values of sendSize into the array rcvSize in the root processor.

The goal is to scatter variable information amongst processors, depending on their simulation space, so we have to calculate offsets and sizes to pass into the scatter MPI call:

int offset[n_proc];
int totalSize = 0;
bin_t inboundParticles;

if (rank == 0) {
    offset[0] = 0;
    for (int i = 1; i < n_proc; ++i) {
        offset[i] = offset[i-1] + rcvSize[i-1];
        totalSize += rcvSize[i - 1];
    }
    totalSize += rcvSize[n_proc-1];

    inboundParticles.resize(totalSize);
}

Offsets are defined by the number of particles inbound before a given processor, which is the previous offset plus the previous size: offset[i] = offset[i-1] + rcvSize[i-1]. The counter totalSize tracks the total number of received particles, which is why resize is called at the end. Also note that this subroutine is only run by the root process (rank 0). Afterwards, MPI_Gatherv is called with the calculated arguments:


// after determining offsets, master process collects the particles to be moved
MPI_Gatherv(toMove.data(), sendSize, PARTICLE, 
    inboundParticles.data(), rcvSize, offset, PARTICLE, 
    0, MPI_COMM_WORLD);

After the root process gathers all the particles to be moved between processes, it must scatter that information back out. We do this the same way as before: scatter the sizes first, then the data. Before doing so, it’s easy to flatten all particles into a single vector and specify the offsets and sizes as before; a hypothetical sketch of this flattening is below. After extracting all particles into a vector, we call MPI_Scatter and MPI_Scatterv.
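For illustration, the flattening on the root might look something like this (a hypothetical sketch assuming the rows-per-processor assignment above and the binning globals; not the exact project code). Each gathered particle is grouped by the rank that owns its destination row, then the groups are laid out contiguously while rcvSize and offset are rebuilt for the scatter:

vector<bin_t> perDest(n_proc);                   // particles grouped by destination rank
for (int i = 0; i < (int) inboundParticles.size(); i++) {
    int row = int(inboundParticles[i].x / binSize);          // destination row of bins
    int dest = row / rowsPerProc;
    if (dest >= n_proc) dest = n_proc - 1;
    perDest[dest].push_back(inboundParticles[i]);
}

bin_t singleVector;                              // flattened buffer for MPI_Scatterv
for (int i = 0; i < n_proc; i++) {
    offset[i] = singleVector.size();
    rcvSize[i] = perDest[i].size();
    singleVector.insert(singleVector.end(), perDest[i].begin(), perDest[i].end());
}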


MPI_Scatter(rcvSize, 1, MPI_INT, &sendSize, 1, MPI_INT, 0, MPI_COMM_WORLD);
...
//ready to scatter with offset sizes
MPI_Scatterv(singleVector.data(), rcvSize, offset, PARTICLE, 
    foreignParticles.data(), sendSize, PARTICLE, 0, MPI_COMM_WORLD);

Above, we scatter the sizes of the particle batches to be received by each process, and in the subsequent call, we scatter the vectors of particles from the root process to all other processors. After these MPI calls, all processors have in their memory the particles that have moved into their rows (i.e. foreignParticles). They then put these particles into bins and continue on to the next step of the simulation. This is possible because all of the scatter calls are blocking.

![normal](https://rustie.xyz/res/img/normal.png){: .center-image }

After implementing the MPI distributed memory algorithm, we achieved the simulation times seen in the figure above. Results were run on Cori using 16 processors. Varying the number of processors, we saw the speedup below; the benchmark was run as a 100,000-particle simulation:

![num_proc_linegraph](https://rustie.xyz/res/img/num_proc_linegraph.png){: .center-image }

The simulation time is generally inversely proportional to the number of processors allocated to the distributed memory algorithm. There seems to be a point where computation overhead and communication overhead trade dominance around 8 to 16 processors. There is a similar occurrence early on with smaller numbers of processors too. Anomalies in this data are explained in the next section.

Considerations

Communication overhead

One key factor in achieving speedup in the particle simulation is the communication overhead relative to computational complexity. Our initial assumption was that communication dominates computation: the more computation each process does before gathering back at the root process, the less the communication overhead counts against the overall speedup.

The cost of communication is also considered in regard to the scope of each processor’s simulation. It is assumed to be very costly to broadcast the updated state of all particles to all processors at each time step of the simulation. In our implementation, we only broadcast all particles to all processors once, before the simulation begins. Afterwards, processes are only responsible for simulating and communicating particles from a narrow band of rows, thereby minimizing the communication between processors.

The handling of intra- versus inter-processor particle movement is one of the biggest factors affecting communication overhead, since particles are only communicated when they move between bins owned by different processors. Again, this only helps performance when our assumption that communication overhead dominates compute overhead holds.

An optimization we tested to compare communication and computation overhead was to assume that all particles on the edge rows were going to move out of their row. Originally, we had only scattered the particles that we calculated would move to a bin outside of a processor’s rows, spending extra computation in order to reduce communication later on.

opt_vs_normal

Conclusion

After testing various MPI implementation methods and schemes, we were able to achieve a strong scaling efficiency of 0.26 and a weak scaling efficiency of 0.24. Additional efficiency gains can be observed when combining multi-process schemes such as MPI with multithreaded schemes such as OpenMP, as seen in part 1 of our report.

Especially notable in this project was that, when considering the communication versus computation overheads and plotting the simulation results, we concluded that for small simulations, computational overhead dominates communication overhead. This could be the result of scheduling and traffic on Cori, as there was unusual difficulty getting jobs scheduled and run recently. It is not a general rule for all systems, but the result of testing many times on Cori.

Future work will explore further speedup methods, such as using GPUs to accelerate computation.