Some ramblings and ruminations in the moment.

Exploring Parallel Matrix Multiply

An extended analysis of the first assignment from CS267, the graduate computer science course on parallel computing: a survey of DGEMM (Double-Precision General Matrix Multiplication) speed-ups on the Cori supercomputer. All results were measured on a single core, single thread of an Intel Xeon E5-2698 v3 2.30 GHz CPU. Graphs show performance as a percentage of maximum CPU utilization versus matrix dimension. This was the first assignment of its kind that I've done, this being my first graduate-level course. I find the course material extremely enjoyable: almost as enjoyable as seeing the code we write run faster and faster.

Naive

Throughout this study, we focused on maximizing the percentage of peak CPU utilization achieved by the standard 2n^3 matrix multiplication. Algorithms with a lower operation count (i.e. a different asymptotic complexity, such as Strassen-style methods) were not considered.
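
For reference, a minimal sketch of this naive kernel, assuming column-major storage with leading dimension lda as in the rest of our code (the function name here is just illustrative; the benchmark harness calls square_dgemm):

// Naive 2n^3 matrix multiply: C := C + A * B, with n-by-n column-major
// matrices and leading dimension lda == n.
void square_dgemm_naive(int lda, double* A, double* B, double* C)
{
  for (int i = 0; i < lda; ++i) {
    for (int j = 0; j < lda; ++j) {
      double cij = C[i + j*lda];
      for (int k = 0; k < lda; ++k) {
        cij += A[i + k*lda] * B[k + j*lda];
      }
      C[i + j*lda] = cij;
    }
  }
}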

Blocking

We used the provided dgemm-blocked.c matrix multiply with its blocking implementation as a starting point. The first task was to find the optimal block size through testing.
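
For context, a rough sketch of what that blocked driver looks like (the function name and BLOCK_SIZE value are illustrative; do_block stands in for the inner kernel, i.e. the do_block_avx variants shown later):

#define BLOCK_SIZE 32 // tunable block dimension; this is what we benchmark below
#define min(a, b) (((a) < (b)) ? (a) : (b))

// Inner kernel prototype; the actual kernels appear later in this write-up.
void do_block(int lda, int M, int N, int K, double* A, double* B, double* C);

// Partition the lda-by-lda problem into BLOCK_SIZE-sized sub-problems and
// hand each one to the inner kernel. Edge blocks may be smaller.
void square_dgemm_blocked(int lda, double* A, double* B, double* C)
{
  for (int i = 0; i < lda; i += BLOCK_SIZE)
    for (int j = 0; j < lda; j += BLOCK_SIZE)
      for (int k = 0; k < lda; k += BLOCK_SIZE) {
        int M = min(BLOCK_SIZE, lda - i);
        int N = min(BLOCK_SIZE, lda - j);
        int K = min(BLOCK_SIZE, lda - k);
        do_block(lda, M, N, K, A + i + k*lda, B + k + j*lda, C + i + j*lda);
      }
}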

The Intel Xeon E5-2698 v3 has 32 KiB of L1 data cache per core, so a theoretical calculation of the block size is as follows:

\[b^2 \le 32~\text{KiB} \cdot \frac{1}{\{3,4\}~\text{matrices}} \cdot \frac{1~\text{double}}{8~\text{B}}\]

Note the use of 3 or 4 matrices: we consider upper and lower bounds on the memory required to store the A, B, and C matrices (as in C := C + A*B). In the upper-bound case, the memory required for the matrices themselves dominates that of temporary variables and other data (O(n^2) space versus constant), so we only need to account for 3 matrices; this is the case for very large matrices. In the lower-bound case, the memory for the matrices is comparable to that of the temporaries, such as when the matrices are very small, so we budget for the equivalent of a fourth matrix.
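
Plugging in 8 bytes per double, the two bounds on the block dimension \(b\) come out to roughly:

\[b \le \sqrt{\frac{32{,}768~\text{B}}{3 \cdot 8~\text{B}}} \approx 36 \quad (3~\text{matrices}), \qquad b \le \sqrt{\frac{32{,}768~\text{B}}{4 \cdot 8~\text{B}}} = 32 \quad (4~\text{matrices})\]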

The matrix size and various other run-time factors mean that the theoretically optimal block size is not necessarily optimal in practice. Therefore, we ran a benchmark over various block sizes, starting at 8x8 and incrementing the square dimensions up to 64x64.

Block size on performance

We found that the block size did not make a significant difference, though larger block sizes did seem to give a slight improvement. We decided to introduce AVX to look for more significant performance gains.

AVX Intrinsics

The Intel Xeon E5-2698 v3 2.30 GHz supports AVX2 with a 256-bit vector width, meaning that we can fit up to 4 double-precision floating-point numbers in a vector. Initially we tried to vectorize the innermost loop of the blocked matrix multiply kernel. However, that method required a horizontal sum, which AVX2 does not provide as a single instruction.
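
To illustrate the issue: summing the four doubles inside a 256-bit register down to a scalar takes a short sequence of extracts, shuffles, and adds rather than one instruction. A minimal sketch of such a reduction (this helper is illustrative and not part of our kernel):

#include <immintrin.h>

// Sum the four doubles in a 256-bit vector: split into two 128-bit halves,
// add them pairwise, then add the remaining pair.
static inline double hsum256_pd(__m256d v)
{
  __m128d lo = _mm256_castpd256_pd128(v);   // lower two doubles
  __m128d hi = _mm256_extractf128_pd(v, 1); // upper two doubles
  __m128d sum2 = _mm_add_pd(lo, hi);        // pairwise sum
  __m128d swapped = _mm_unpackhi_pd(sum2, sum2);
  return _mm_cvtsd_f64(_mm_add_sd(sum2, swapped));
}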

Instead, we vectorized over the row index i, so that four partial sums of a column of C accumulate in a register across the k loop:


#include <immintrin.h> // AVX/AVX2 intrinsics

void do_block_avx(int lda, int M, int N, int K, double* A, double* B, double* C)
{
  int i;
  for (i = 0; i < M - M % 4; i+=4) {
    for (int j = 0; j < N; ++j) {
      // Compute C(i,j)
      __m256d c = _mm256_load_pd(C+i+j*lda);
      for (int k = 0; k < K; ++k) {
        c = _mm256_add_pd(
          c, 
          _mm256_mul_pd(
            _mm256_load_pd(A+i+k*lda), 
            _mm256_broadcast_sd(B+k+j*lda)));
      }
      _mm256_store_pd(C+i+j*lda, c);
    }
  }
  ...
}


Because we work on 4 doubles at a time, we need to handle the leftover rows when a block's dimension is not a multiple of 4. We separate our code into “parallel” (SIMD) and “serial” operations depending on whether we can use AVX2 or are forced to fall back to the naive matrix multiply. The code snippet above shows our “parallel” loop, and below is our “serial” loop:


  for ( ; i < M; i++) {
    for (int j = 0; j < N; ++j) {
      // Compute C(i,j)
      double cij = C[i+j*lda];
      for (int k = 0; k < K; ++k) {
        cij += A[i+k*lda] * B[k+j*lda];
      }
      C[i+j*lda] = cij;
    }
  }

We saw that power-of-two block sizes provided the best performance, with a block size of either 32 or 64 yielding the highest percentage of CPU utilization. This makes sense: when chopping the original matrices into blocks, any block size that was not a power of two, or at least a multiple of 4, could not be mapped cleanly onto AVX intrinsics, leaving serial operations in every block that could not be vectorized (for example, a block dimension of 30 leaves 30 mod 4 = 2 rows per block for the serial loop).

Loop Unrolling

In an attempt to decrease the effects of branching, we unrolled some of our loops, combining this technique with AVX intrinsics:


void do_block_avx_unrolled(int lda, int M, int N, int K, double* A, double* B, double* C)
{
  int i;
  for (int j = 0; j < N; ++j) {
    for (i = 0; i < M - M % (UNROLL*4); i+=UNROLL*4) {
      // Compute C(i,j)
      __m256d c[UNROLL];
      for (int x = 0; x < UNROLL; x++) {
        c[x] = _mm256_load_pd(C+i+x*4+j*lda);
      }
      for (int k = 0; k < K; ++k) {
        __m256d b = _mm256_broadcast_sd(B+k+j*lda);
        for (int x = 0; x < UNROLL; x++) {
          c[x] = _mm256_add_pd(
            c[x], 
            _mm256_mul_pd(
              b,
              _mm256_load_pd(A+lda*k+x*4+i)));
        }
      }
      for (int x = 0; x < UNROLL; x++) {
        _mm256_store_pd(C+i+x*4+j*lda, c[x]);
      }
    }
  }
  ...
}

Results from changing the block size were then much more apparent, as shown in the figure below.

block size on performance with AVX and loop unrolling

Performance for block sizes 32 and 64 seemed similar across tests, so we picked the larger 64 so as to have fewer blocks as the matrix sizes increased in testing.

We tested various values for unrolling and, in general, noted that an UNROLL value that is a power of 2 maximizes our CPU usage. This makes sense, since each unrolled iteration then maps cleanly onto AVX2's four-double vectors (e.g. with UNROLL = 8, each iteration of the unrolled loop covers 8 x 4 = 32 rows of C). Unrolling 8 at a time proved optimal in our case, but only for matrices up to around 500x500. Unroll sizes other than 4 and 8 gave very poor results.

Unroll size on performance with AVX and blocking

Fused Multiply-Add

Testing FMA (Fused Multiply-Add) intrinsics yielded worse results than the version without FMA.


void do_block_avx_unrolled(int lda, int M, int N, int K, double* A, double* B, double* C)
{
  int i;
  for (int j = 0; j < N; ++j) {
    for (i = 0; i < M - M % (UNROLL*4); i+=UNROLL*4) {
      // Compute C(i,j)
      __m256d c[UNROLL];
      for (int x = 0; x < UNROLL; x++) {
        c[x] = _mm256_load_pd(C+i+x*4+j*lda);
      }
      for (int k = 0; k < K; ++k) {
        __m256d b = _mm256_broadcast_sd(B+k+j*lda);
        for (int x = 0; x < UNROLL; x++) {
          // c[x] = _mm256_add_pd(
          //   c[x], 
          //   _mm256_mul_pd(
          //     b,
          //     _mm256_load_pd(A+lda*k+x*4+i)));

          c[x] = _mm256_fmadd_pd(b, _mm256_load_pd(A+lda*k+x*4+i), c[x]);
        }
      }
      for (int x = 0; x < UNROLL; x++) {
        _mm256_store_pd(C+i+x*4+j*lda, c[x]);
      }
    }
  }
  ...
}

Choice of Compiler

Cori provides compiler modules for the Intel and GNU compilers. We ran tests with a standardized block size of 64 and unroll size of 8, comparing ICC (Intel C Compiler) against GCC (GNU Compiler Collection). Results are shown below.

ICC vs GCC

Compiler Options

-O2 Optimize Option

The loop-analysis output from compiling with the -O2 optimization flag is shown below; it lists the loops in dgemm-blocked.c that the optimizer analyzed.

Analyzing loop at dgemm-blocked.c:154
Analyzing loop at dgemm-blocked.c:156
Analyzing loop at dgemm-blocked.c:159
Analyzing loop at dgemm-blocked.c:109
Analyzing loop at dgemm-blocked.c:110
Analyzing loop at dgemm-blocked.c:138
Analyzing loop at dgemm-blocked.c:126
Analyzing loop at dgemm-blocked.c:128
Analyzing loop at dgemm-blocked.c:113
Analyzing loop at dgemm-blocked.c:362
Analyzing loop at dgemm-blocked.c:363
Analyzing loop at dgemm-blocked.c:366

Shown in the figure below is a comparison of GCC and ICC using the -O2 optimization flag.

O2 GCC v ICC

-O3 Optimize Option

Analyzing loop at dgemm-blocked.c:154
Analyzing loop at dgemm-blocked.c:156
Analyzing loop at dgemm-blocked.c:159
Analyzing loop at dgemm-blocked.c:109
Analyzing loop at dgemm-blocked.c:110
Analyzing loop at dgemm-blocked.c:126
Analyzing loop at dgemm-blocked.c:362
Analyzing loop at dgemm-blocked.c:363
Analyzing loop at dgemm-blocked.c:366

We found that the compiler reported analyzing more loops under -O2 than under -O3.

O3 GCC v ICC

We found that though -O2 analyzed more code than -O3, the resulting performance was very similar and the differences were negligible. We chose to use -O2 and carried on with other compiler options.

O2 v O3

-Ofast

However, when comparing the performance of -O2 and -O3 against another optimization option, -Ofast, the differences were more pronounced, further justifying our choice of -O2.

O2 v O3 v Ofast

Other options

After various rounds of testing, it seemed that the flags -ftree-vec-info-optimized -fopt-info-vec-optimized -mavx2 provided the best results.

Compiler vectorization

The ICC compiler options -qopt-report=1 -qopt-report-phase=vec reveal the compiler’s automatic vectorization decisions. The output for our dgemm-blocked.c file is shown below:



Begin optimization report for: do_block_avx_unrolled(int, int, int, int, double *, double *, double *)

    Report from: Vector optimizations [vec]


LOOP BEGIN at dgemm-blocked.c(109,3)
   remark #25460: No loop optimizations reported

   LOOP BEGIN at dgemm-blocked.c(110,5)
      remark #25460: No loop optimizations reported

      LOOP BEGIN at dgemm-blocked.c(126,7)
         remark #25460: No loop optimizations reported

         LOOP BEGIN at dgemm-blocked.c(128,9)
         LOOP END
      LOOP END

      LOOP BEGIN at dgemm-blocked.c(138,7)
      LOOP END

      LOOP BEGIN at dgemm-blocked.c(113,7)
      LOOP END
   LOOP END
LOOP END

LOOP BEGIN at dgemm-blocked.c(154,3)
   remark #25460: No loop optimizations reported

   LOOP BEGIN at dgemm-blocked.c(156,5)
      remark #25460: No loop optimizations reported

      LOOP BEGIN at dgemm-blocked.c(159,7)
      <Peeled loop for vectorization>
      LOOP END

      LOOP BEGIN at dgemm-blocked.c(159,7)
         remark #15300: LOOP WAS VECTORIZED
      LOOP END

      LOOP BEGIN at dgemm-blocked.c(159,7)
      <Remainder loop for vectorization>
      LOOP END
   LOOP END
LOOP END


The output revealed that the compiler performed no additional vectorization inside our hand-vectorized inner kernel, suggesting that our intrinsics already covered the vectorizable work there; the only loop the compiler vectorized on its own was the one at line 159, along with its peel and remainder loops.

Note

Transpose

The overhead of transposing the entire matrices before performing our standard matrix multiply was too great. Transposing so as to multiply a row of A against a column of B also brought the horizontal sum back into play, and since that operation is expensive, we did not see good results from transposing.

Loop Reordering

Considering loop ordering, it is important to order the standard three nested loops so as to minimize cache misses. Since the innermost operations compute their indices from additions and multiplications of the loop variables, the indices into each matrix should not jump around too much at a time; in other words, we want to take advantage of spatial locality. In the end, we noted that loop reordering did not increase our score nearly as much as the techniques above, especially once AVX intrinsics and loop unrolling, which increase the amount of work done per memory access, are in play. A sketch of a cache-friendly ordering is shown below.
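
For column-major storage, one cache-friendly ordering keeps the row index innermost so that A and C are walked with unit stride. This is only a sketch with an illustrative name (matching the do_block signature), not the exact structure of our final kernel, which is dominated by the AVX and unrolling layout above:

// j-k-i ordering: B[k + j*lda] is loop-invariant in the inner loop, while
// A[i + k*lda] and C[i + j*lda] are accessed with unit stride.
static void do_block_jki(int lda, int M, int N, int K,
                         double* A, double* B, double* C)
{
  for (int j = 0; j < N; ++j) {
    for (int k = 0; k < K; ++k) {
      double bkj = B[k + j*lda];
      for (int i = 0; i < M; ++i) {
        C[i + j*lda] += A[i + k*lda] * bkj;
      }
    }
  }
}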

Conditionals

We noticed that the performance of matrix multiply with our AVX2, blocking, loop unrolling, and other optimizations varied depending on the size of the input matrices. In particular, matrices smaller than 32x32 yielded very poor results (e.g. size 31 was often nearly 10x slower than 32x32).

However, as with the matrix transpose, we found that the overhead of a conditional checking the input matrix size was too great to provide any benefit, likely due to CPU branch misprediction. Perhaps this would be a non-issue for very large matrices, where the benefit of size-specific settings dominates the overhead.
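
For concreteness, the dispatch we experimented with looked roughly like the following sketch; the kernel names here are illustrative, not the exact functions in our dgemm-blocked.c:

// Hypothetical kernels referenced by this sketch.
void square_dgemm_naive(int lda, double* A, double* B, double* C);
void square_dgemm_blocked_avx(int lda, double* A, double* B, double* C);

// Hypothetical size-based dispatch: tiny matrices go to the simple kernel,
// everything else to the blocked AVX path.
void square_dgemm_dispatch(int lda, double* A, double* B, double* C)
{
  if (lda < 32) {
    square_dgemm_naive(lda, A, B, C);       // small case: plain triple loop
  } else {
    square_dgemm_blocked_avx(lda, A, B, C); // blocking + AVX + unrolling
  }
}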

Loop Unrolling Alternatives

As an alternative to unrolling with an inner for loop, we pasted the unrolled body 8 times:


      __m256d c[UNROLL];
      for (int x = 0; x < UNROLL; x++) {
        c[x] = _mm256_load_pd(C+i+x*4+j*lda);
      }

      // VS

      c[0] = _mm256_load_pd(C+i+0*4+j*lda);
      c[1] = _mm256_load_pd(C+i+1*4+j*lda);
      c[2] = _mm256_load_pd(C+i+2*4+j*lda);
      c[3] = _mm256_load_pd(C+i+3*4+j*lda);
      c[4] = _mm256_load_pd(C+i+4*4+j*lda);
      c[5] = _mm256_load_pd(C+i+5*4+j*lda);
      c[6] = _mm256_load_pd(C+i+6*4+j*lda);
      c[7] = _mm256_load_pd(C+i+7*4+j*lda);


This gave worse performance.

After all these tests, we continued to experiment with different block sizes, exploring different ways of blocking and settling on one through trial and error. We then tried to unroll the accesses to the A, B, and C matrices manually, which gave a slight increase in performance. Then we realised that we could further improve performance if we could replace the final residual computation (if the matrix size is not a multiple of 4, there are leftover rows to handle serially):

  for (int j = 0; j < N; ++j) {
    i = ii;
    for ( ; i < M; i++) {
      // Compute C(i,j)
      double cij = C[i+j*lda];
      for (int k = 0; k < K; ++k) {
        cij += A[i+k*lda] * B[k+j*lda];
      }
      C[i+j*lda] = cij;
    }
  }

Hence we padded the matrices with some extra “safe” space to make their dimension a multiple of 8. After copying the matrices into these padded buffers, we ran the same calculations as before, which gave a significant performance boost. After the computation, we simply copied the result back out, dropping the extra space.

  add_extra_space(A, origA, lda, extra_space);
  add_extra_space(B, origB, lda, extra_space);
  add_extra_space(C, origC, lda, extra_space);
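
We omit the body of add_extra_space above; a minimal sketch of such a padding copy, assuming column-major storage, a padded leading dimension of lda + extra_space, and zero fill (the real helper may differ in details), might look like:

#include <string.h>

// Copy an lda-by-lda column-major matrix `orig` into a zero-filled
// (lda + extra_space)-square buffer `padded`, so the kernel can safely
// process whole multiples of the unroll width.
void add_extra_space(double* padded, const double* orig, int lda, int extra_space)
{
  int new_lda = lda + extra_space;
  memset(padded, 0, (size_t)new_lda * new_lda * sizeof(double));
  for (int j = 0; j < lda; ++j) {
    memcpy(padded + (size_t)j * new_lda, orig + (size_t)j * lda,
           (size_t)lda * sizeof(double));
  }
}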

Final results

We believe that if we could avoid this duplication of the matrices, we could get slightly higher performance still. The results can be seen below.

Returning to the loop-unrolling comparison above: the version that kept the for loop gave the best performance, while pasting the body 8 times yielded worse results. We hypothesize that this is because of the increased code size, showing that loop unrolling is fundamentally a space-versus-time tradeoff; keeping the for loop also lets the compiler apply its own unrolling heuristics.

Conclusion

We were able to optimize DGEMM performance on a single core, single thread of an Intel Xeon E5-2698 v3 2.30 GHz CPU to an average of 35% of peak CPU utilization on Cori. Various strategies, including loop unrolling, blocking, AVX intrinsics, and compiler flags, were explored in depth.

Optimal performance with our code can be reached using the following compiler options:

cc -O2 -funroll-loops -march=core-avx2 dgemm-blocked.c

Variation in the testing environment could have affected our results: outliers were observed when Cori was under load from other users. The code was also tested outside of Cori on other hardware, including the Intel i5-7360U CPU @ 2.30 GHz in a 2017 MacBook Pro, where it achieved a surprising average of 43% across our benchmarks, probably because our local machines were not running other intense compute workloads.

This final figure shows our final maximum performance (built off of the square_dgemm function in dgemm-blocked.c) against the “naive blocking” function in the provided code, before block-size tuning.

Final results

References

  1. https://sites.google.com/lbl.gov/cs267-spr2019/hw-1
  2. http://www.nersc.gov/users/computational-systems/cori/
  3. http://www.nersc.gov/users/computational-systems/edison/programming/vectorization/
  4. http://www.nersc.gov/users/computational-systems/edison/configuration/
  5. https://gcc.gnu.org/onlinedocs/
  6. https://gcc.gnu.org/onlinedocs/gcc/C-Extensions.html
  7. https://software.intel.com/en-us/isa-extensions
  8. http://theeirons.org/Nadav/pubs/MatrixMult.pdf
  9. http://spiral.ece.cmu.edu:8080/pub-spiral/abstract.jsp?id=100

Updates & Index of Knowledge

I’ve been working on an interactive blockchain education dependency graph, historically called an “Index of Knowledge.” Current features:

  • Highlight all dependencies
  • Cycle detection
  • Only has modules from edX courses

The goal would be to have more granularity and breadth of topics, and for each node to contain:

  • Pop up textbox with brief description (from textbook)
  • Curated links to learn more

More to come following course redesign.

Here’s the link.

Making Blockchain Accessible

More blog posts are available on the Blockchain at Berkeley Medium blog. The Blockchain at Berkeley edX initiative is also featured in Gloria's blog post here.

Education should be accessible. This should especially be the case in the blockchain space, which takes pride in its core value of decentralization. Nothing says that the decentralization we value in the blockchain space is limited to products or communities, so why not extend it to blockchain education as well?

Since Fall 2016, Blockchain at Berkeley has been offering a survey course on the fundamental concepts of cryptocurrencies and blockchain; you can read more about it on the course website. Each of the lectures we taught was carefully designed to fit into the overall top-down narrative of the blockchain space we were trying to weave with each iteration. The goal was to create a survey of the blockchain space that was accessible to anyone, no matter their background.

Our core curriculum, Blockchain Fundamentals, is split into two main “epochs,” which roughly models the gradual maturing and general sentiment about cryptocurrency and then later blockchain technologies. Our narrative starts with Bitcoin and Cryptocurrencies, explaining cryptocurrencies as the first use case for blockchain, and Bitcoin as the original inspiration. Our aim for this first epoch is to explore both the technological and social aspects of Bitcoin, and then introduce Ethereum towards the end, so as to dramatically reveal to students the amount of intuition that carries over between understanding different blockchain platforms at a high level. While the name of the epoch is “Bitcoin and Cryptocurrencies,” the primary motivation outside of explaining what the title says is to decouple the ideas of Bitcoin and cryptocurrencies from that of blockchain.

With the realization that their intuition carries from Bitcoin, a cryptocurrency, to Ethereum, a more generalized blockchain platform, students then transition into the curriculum’s second epoch, Blockchain Technology. The aim for this second epoch is to further expand the student’s mental model of what blockchain is. At this point, students are confident in their understanding of cryptocurrencies and blockchain, so in this epoch, we start by diving into related technical topics. The first lecture in the second epoch explains that blockchain thinking has origins earlier than that of 2008’s Bitcoin whitepaper. At its core, blockchain and distributed ledger technologies of the modern day are based in historical academic research in distributed systems and consensus, and that’s where we take students. Consensus has existed for millennia, at least since humans started interacting socially, and the formal academic study of consensus in computer science and mathematics started decades ago, at least by when individuals in the aerospace industry decided they wanted to subject computers to adverse environments. From there, we explore how traditional literature has affected newer forms of distributed consensus. Having understood those fundamentals, we then invite students to seek understanding from a higher abstraction level – from that of (crypto)economics. This then ties naturally into various classifications of blockchains based on their access level (e.g. enterprise blockchains) and the various challenges to adoption they currently face, such as scalability and anonymity.

We end our main curriculum there. Having surveyed much of the blockchain space, we have presented enough information for students to have an intelligent conversation about blockchain. What the space needs is to facilitate more intelligent conversation. Regrettably, talent in the space is currently very dilute. With more accessible blockchain education, we hope to nurture the space into one that is revered by academia, industry, and the general public.

Thoughts on Time

Note: this was written very late at night and with no revision, so this is pretty much my pure thought process, very quickly and roughly sketched into markdown.

I began the summer eager to start coding again after having spent the last year at school taking only courses on computer science theory. My intended focus was blockchain development, and I wanted to build some projects with Cosmos and Tendermint.

However, after some contemplation, I decided that it would be more impactful to approach blockchain development from an academic standpoint. This was especially the case after watching a particular talk by Leslie Lamport titled “Thinking Above the Code.” Being in a university just one or two hops away from literal distributed systems gods, I changed my focus to learning about distributed systems from the very beginning – a full academically-grounded survey.

I’ve always appreciated distributed systems; my favorite lecture to teach for the Blockchain Fundamentals DeCal was our lecture on alternative consensus. It’s one of our more technical lectures: starting from formal definitions of distributed systems and their properties and fundamental tradeoffs, we extend that knowledge into the blockchain space. My personal flair when I taught it was to wrap it in a narrative: first explaining what it means to be distributed and to come to consensus (socially), then the origins of distributed systems and the historical context for their study.

My study this summer began with a history lesson on the specific problems and research labs associated with the birth of distributed systems. The works of Leslie Lamport and colleagues, and the research leading up to SIFT, were where I started.

The formal study of distributed systems in computer science began roughly around the time the aerospace industry decided they wanted to put computers on aircraft. Computers could help with data collection about fuel levels, altitude, speed, etc. and later, advancements in digital avionics would lead to innovations in autopilot and fly-by-wire technologies. The fundamental challenge faced here though is that of computing in extremely harsh conditions, namely in the face of heat and cold and of potential solar radiation. There was a need to distribute the points of failure across multiple computers, just in case one failed in execution. Coordination of these separate computers proved to be a challenge though.

Through the lens of a computer scientist, a fundamental challenge to face here is that of event ordering. If we have multiple computers on an airplane logging data from various sensors onboard, we need to maintain a consistently ordered log across all machines.

Amongst other early distributed systems papers, “Time, Clocks, and the Ordering of Events in a Distributed System”, written by Leslie Lamport, stood out to me for being incredibly intuitive. In it, Lamport explains what it means for an event to precede (or happen before) another. A majority of the paper is spent formalizing the definition of \(a \rightarrow b\), read as “\(a\) happened before \(b\).”

The key insight is that, at the most basic level, \(a \rightarrow b\) if either \(a\) and \(b\) occur in the same process and \(a\) happens at an earlier time than \(b\), or \(a\) is the sending of a message by one process and \(b\) is the receipt of that message by another. The \(\rightarrow\) relation is also transitive, so these basic cases compose.
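
As a concrete illustration, the same paper introduces logical clocks: counters whose timestamps respect the happened-before relation. Here is a minimal sketch in C (all names are mine and purely illustrative):

#include <stdio.h>

// A Lamport logical clock: ticks on every local event (including sends),
// and jumps past the sender's timestamp on receive.
typedef struct { unsigned long time; } lamport_clock;

static unsigned long tick(lamport_clock* c)   // local event or send
{
  return ++c->time;
}

static unsigned long receive(lamport_clock* c, unsigned long msg_time)
{
  c->time = (msg_time > c->time ? msg_time : c->time) + 1;
  return c->time;
}

int main(void)
{
  lamport_clock p = {0}, q = {0};
  unsigned long p1 = tick(&p);        // p1: p sends a message
  unsigned long q1 = tick(&q);        // q1: concurrent with p1
  unsigned long q2 = receive(&q, p1); // q2: q receives p's message, so p1 -> q2
  printf("p1=%lu q1=%lu q2=%lu\n", p1, q1, q2);
  return 0;
}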

Space-time diagram of events in processes P and Q

Above, \(p_1\) and \(q_1\) are concurrent, while \(p_1 \rightarrow q_7\) and \(q_1 \rightarrow q_7\).

Machines may see events in different physical orders, but they must agree on a partial ordering of events so that consensus can occur. In his paper, Lamport explains that this is not so much a problem of computer science as one of physics, and he draws a parallel between the ordering of events and special relativity.

With a very limited understanding of special relativity, trying to piece together this meaning was a challenge for me, but I believe I’ve built up a very solid intuition for this. In special relativity, different observers may see events happening in different order. Same goes for distributed systems. The slight difference is that in special relativity, we consider all events that could potentially happen when constructing the partial ordering, whereas in distributed systems, we only consider the events that have actually occurred.

In that case then, it only makes sense to order events that have a causal relation. I believe this to have a relation to that of the age old “If a tree falls in the woods with no sound, did it actually fall”-type questions. You as the observer may have existed at the same time the event of the tree falling supposedly occurred, but since there’s no causal relationship between you existing and the tree existing (standing/falling), whatever actually happened doesn’t matter to you. My understanding is that in physics, we consider the tree to be both standing and falling – in a superposition of all its possible states – since we have not observed it. Upon observation, there is a transfer of information, and so the superposition collapses into a single event – the tree having fallen or not.

We can see the same in distributed systems as well, and the fact that these two, special relativity and distributed systems, are so analogous represents to me the generality of the concept of time and how fundamental the idea of modeling a system in a distributed fashion is. Say that Alice and Bob are friends who don’t get to see each other that often in person; perhaps they live in different cities. On a particularly nice day, Alice calls Bob to tell him that she plans on baking a cake the day after. Bob acknowledges, and the day ends. Two days pass, and Bob hasn’t heard from Alice. So Bob wonders: has Alice baked the cake? The event of Bob wondering about Alice’s cake and Alice’s baking or not baking of the cake are concurrent. In physics, we might say that Alice both baked and didn’t bake the cake, but in distributed systems terms, we simply don’t know. (Again, in distributed systems we care about the events that actually happen, as opposed to the set of all events that could possibly happen.) Only after Alice texts Bob an image of the amazing cake that she baked does Bob gain context and realize that Alice did in fact bake the cake.

This then got me thinking about concurrent processes in general. One of my good friends studies neurobiology, and explained to me how the highly parallel human brain handles concurrency. I’ve also stumbled across academic papers applying \(\pi\)-calculus and ambient calculus (calculi of concurrent processes traditionally used for formal modeling of distributed systems) to intracellular routines (amongst organelles and RNA transfer.)

Time is a natural thing, and our understanding of distributed systems comes from applying insight from the physical world to the logical world. I’ve come to appreciate the art of reasoning about such scenarios at a fundamental level. Having built up an understanding of distributed systems formalisms (and their elegance), I plan on re-evaluating blockchain with an academic eye.

Blockchain Fundamentals live on edX

After a grueling semester of balancing heavy courseload, side projects, and working to bring the world’s first university-accredited undergraduate course on blockchain technology to the global community, I’m happy to announce that Blockchain Fundamentals is now available on edX for anyone to take for free!

Currently 10,100+ self-motivated students from around the world have signed up for our first course, which explains the key innovations of blockchain through its first use case: Bitcoin.

Huge shoutout to my team for staying dedicated through the entire semester <3