Rustie Lin UC Berkeley CS, Distributed Systems, Blockchain

Exploring Parallel Matrix Multiply

A write-up and extended analysis from CS267, UC Berkeley's graduate course on parallel computing: a survey of DGEMM (Double Precision General Matrix Multiplication) speed-ups on the Cori supercomputer. All results were measured on a single core, single thread of an Intel Xeon E5-2698 v3 2.30 GHz CPU. Graphs show performance as a percentage of maximum CPU utilization versus matrix dimension. This was the first assignment of its kind that I've done, this being my first graduate-level course. I find the course material extremely enjoyable: almost as enjoyable as seeing the code we write run faster and faster.

Naive

Throughout this study, we focused on maximizing the percentage of peak CPU utilization achieved by the standard 2n^3-flop matrix multiplication. Algorithms that perform fewer operations (i.e. with a lower asymptotic operation count, such as Strassen's algorithm) were not considered.
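
For reference, the baseline is the plain triple loop (a sketch in the column-major layout used throughout this write-up, not necessarily the exact starter code):


/* Naive 2n^3-flop multiply, C := C + A*B, for column-major n x n matrices
 * with leading dimension n. Baseline sketch only. */
void square_dgemm_naive(int n, double* A, double* B, double* C)
{
  for (int i = 0; i < n; ++i) {
    for (int j = 0; j < n; ++j) {
      double cij = C[i + j*n];
      for (int k = 0; k < n; ++k) {
        cij += A[i + k*n] * B[k + j*n];
      }
      C[i + j*n] = cij;
    }
  }
}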

Blocking

We used the provided dgemm-blocked.c blocked matrix multiply implementation as a starting point; its overall structure is sketched below. The first task was to find the optimal block size through testing.
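
The blocked driver has roughly the following shape (a sketch; BLOCK_SIZE, min, and do_block here are illustrative names, with do_block standing in for the per-block kernels discussed later):


#define BLOCK_SIZE 32
#define min(a, b) (((a) < (b)) ? (a) : (b))

/* Plain per-block kernel: triple loop over one M x N x K tile. */
static void do_block(int lda, int M, int N, int K, double* A, double* B, double* C)
{
  for (int i = 0; i < M; ++i)
    for (int j = 0; j < N; ++j) {
      double cij = C[i + j*lda];
      for (int k = 0; k < K; ++k)
        cij += A[i + k*lda] * B[k + j*lda];
      C[i + j*lda] = cij;
    }
}

/* Walk the matrices in BLOCK_SIZE x BLOCK_SIZE tiles and hand each tile
 * triple to the per-block kernel; min() trims the edge tiles. */
void square_dgemm(int lda, double* A, double* B, double* C)
{
  for (int i = 0; i < lda; i += BLOCK_SIZE)
    for (int j = 0; j < lda; j += BLOCK_SIZE)
      for (int k = 0; k < lda; k += BLOCK_SIZE) {
        int M = min(BLOCK_SIZE, lda - i);
        int N = min(BLOCK_SIZE, lda - j);
        int K = min(BLOCK_SIZE, lda - k);
        do_block(lda, M, N, K, A + i + k*lda, B + k + j*lda, C + i + j*lda);
      }
}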

The Intel Xeon E5-2698 v3 has 32 KiB of L1 data cache per core, from which a theoretical block size can be estimated as follows.
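
A sketch of the calculation, assuming square b x b blocks of 8-byte doubles and requiring either 3 or 4 such blocks to fit in the 32 KiB (32768-byte) cache at once:

  3 * b^2 * 8 <= 32768  =>  b <= sqrt(32768/24) ≈ 36
  4 * b^2 * 8 <= 32768  =>  b <= sqrt(32768/32) = 32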

Note the use of 3 versus 4 matrices in the calculation above. We consider upper and lower bounds on the memory required to store the A, B, and C matrices (as in C := C + A*B). In the upper-bound case, the memory required for the matrices themselves dominates that required for temporary variables and other data (O(n^2) space versus constant), so we only need to account for the 3 matrices; this is the case for very large matrices. In the lower-bound case, the memory for the matrices is comparable to that for temporaries, as happens when the matrices are very small, so we budget for roughly one extra matrix's worth of space.

The matrix size and various other run-time factors mean that the theoretically optimal block size is not necessarily optimal in practice. We therefore benchmarked a range of block sizes, starting at 8x8 and incrementing the square dimensions up to 64x64.

Block size on performance

We found that block size did not make a significant difference in performance, though the larger block sizes did seem to give a slight improvement. We decided to introduce AVX to look for more significant performance gains.

AVX Intrinsics

The Intel Xeon E5-2698 v3 2.30 GHz supports AVX2 with a 256-bit vector width, meaning we can fit up to 4 double-precision floating-point numbers in a vector. Initially we tried to vectorize the innermost loop of the blocked matrix multiply kernel. However, that method required a horizontal sum, which AVX2 does not provide as a single instruction.
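
For reference, reducing a 256-bit vector of doubles to a single scalar has to be pieced together from several instructions; a minimal sketch of such a horizontal sum (not taken from our kernel):


#include <immintrin.h>

/* Horizontal sum of the four doubles in v. AVX2 has no single instruction
 * for this, so it is assembled from extract/shuffle/add steps. */
static double hsum256_pd(__m256d v)
{
  __m128d lo  = _mm256_castpd256_pd128(v);     /* lower two doubles     */
  __m128d hi  = _mm256_extractf128_pd(v, 1);   /* upper two doubles     */
  __m128d sum = _mm_add_pd(lo, hi);            /* pairwise add          */
  __m128d swp = _mm_unpackhi_pd(sum, sum);     /* move high lane to low */
  return _mm_cvtsd_f64(_mm_add_sd(sum, swp));  /* final scalar add      */
}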

Instead, we vectorized the second loop, accumulating partial sums of four doubles at a time in a register across multiple passes.


void do_block_avx(int lda, int M, int N, int K, double* A, double* B, double* C)
{
  int i;
  for (i = 0; i < M - M % 4; i+=4) {
    for (int j = 0; j < N; ++j) {
      // Compute C(i:i+3, j): four contiguous doubles of column j of C
      __m256d c = _mm256_load_pd(C+i+j*lda);
      for (int k = 0; k < K; ++k) {
        c = _mm256_add_pd(
          c, 
          _mm256_mul_pd(
            _mm256_load_pd(A+i+k*lda), 
            _mm256_broadcast_sd(B+k+j*lda)));
      }
      _mm256_store_pd(C+i+j*lda, c);
    }
  }
  ...
}


Because we work on 4 doubles at a time, we need to account for the leftover rows when a dimension is not a multiple of 4. We separate our code into “parallel” (SIMD) and “serial” operations, depending on whether we can use AVX2 or are forced to fall back to the naive multiply. The code snippet above shows our “parallel” loop, and below is our “serial” loop:


  for ( ; i < M; i++) {
    for (int j = 0; j < N; ++j) {
      // Compute C(i,j)
      double cij = C[i+j*lda];
      for (int k = 0; k < K; ++k) {
        cij += A[i+k*lda] * B[k+j*lda];
      }
      C[i+j*lda] = cij;
    }
  }

We saw that power-of-two block sizes provided the best performance, with a block size of either 32 or 64 yielding the highest CPU utilization. This makes sense: when chopping the original matrices into blocks, any block size that is not a power of two, or at least a multiple of 4, cannot be mapped cleanly onto the AVX intrinsics – each block leaves leftover serial operations that cannot be parallelized.

Loop Unrolling

In an attempt to reduce the effects of branching, we unrolled some of our loops, combining this technique with the AVX intrinsics:


void do_block_avx_unrolled(int lda, int M, int N, int K, double* A, double* B, double* C)
{
  int i;
  for (int j = 0; j < N; ++j) {
    for (i = 0; i < M - M % (UNROLL*4); i+=UNROLL*4) {
      // Compute an UNROLL*4-row strip of column j of C
      __m256d c[UNROLL];
      for (int x = 0; x < UNROLL; x++) {
        c[x] = _mm256_load_pd(C+i+x*4+j*lda);
      }
      for (int k = 0; k < K; ++k) {
        __m256d b = _mm256_broadcast_sd(B+k+j*lda);
        for (int x = 0; x < UNROLL; x++) {
          c[x] = _mm256_add_pd(
            c[x], 
            _mm256_mul_pd(
              b,
              _mm256_load_pd(A+lda*k+x*4+i)));
        }
      }
      for (int x = 0; x < UNROLL; x++) {
        _mm256_store_pd(C+i+x*4+j*lda, c[x]);
      }
    }
  }
  ...
}

Results from changing the block size were then much more apparent, as shown in the figure below.

block size on performance with AVX and loop unrolling

Performance for block sizes 32 and 64 was similar across tests, so we picked the larger size of 64 so as to have fewer blocks as the matrix sizes increased during testing.

We tested various values for UNROLL, and in general a power-of-two value maximized our CPU usage. This makes sense, since it keeps the AVX units fully fed: AVX2 here operates on four double-precision floating-point numbers at once. Unrolling 8 at a time proved optimal in our case, but only for matrices up to around 500x500. Unroll sizes other than 4 and 8 gave very poor results.

Unroll size on performance with AVX and blocking

Fused Multiply Add

Testing FMA (Fused Multiply-Add) intrinsics yielded worse results than the version without FMA.


void do_block_avx_unrolled(int lda, int M, int N, int K, double* A, double* B, double* C)
{
  int i;
  for (int j = 0; j < N; ++j) {
    for (i = 0; i < M - M % (UNROLL*4); i+=UNROLL*4) {
      // Compute an UNROLL*4-row strip of column j of C
      __m256d c[UNROLL];
      for (int x = 0; x < UNROLL; x++) {
        c[x] = _mm256_load_pd(C+i+x*4+j*lda);
      }
      for (int k = 0; k < K; ++k) {
        __m256d b = _mm256_broadcast_sd(B+k+j*lda);
        for (int x = 0; x < UNROLL; x++) {
          // c[x] = _mm256_add_pd(
          //   c[x], 
          //   _mm256_mul_pd(
          //     b,
          //     _mm256_load_pd(A+lda*k+x*4+i)));

          c[x] = _mm256_fmadd_pd(b, _mm256_load_pd(A+lda*k+x*4+i), c[x]);
        }
      }
      for (int x = 0; x < UNROLL; x++) {
        _mm256_store_pd(C+i+x*4+j*lda, c[x]);
      }
    }
  }
  ...
}

Choice of Compiler

Cori provides compiler modules for the standard Intel and GNU compilers. We ran tests with a standardized block size of 64 and unroll factor of 8 under both ICC (Intel C Compiler) and GCC (GNU C Compiler). Results are shown below.

ICC vs GCC

Compiler Options

-O2 Optimize Option

The loop-analysis output produced when compiling with the -O2 optimization flag is shown below; the compiler analyzes and optimizes various loops in dgemm-blocked.c.



Analyzing loop at dgemm-blocked.c:154
Analyzing loop at dgemm-blocked.c:156
Analyzing loop at dgemm-blocked.c:159
Analyzing loop at dgemm-blocked.c:109
Analyzing loop at dgemm-blocked.c:110
Analyzing loop at dgemm-blocked.c:138
Analyzing loop at dgemm-blocked.c:126
Analyzing loop at dgemm-blocked.c:128
Analyzing loop at dgemm-blocked.c:113
Analyzing loop at dgemm-blocked.c:362
Analyzing loop at dgemm-blocked.c:363
Analyzing loop at dgemm-blocked.c:366


Shown in the figure below is a comparison of GCC and ICC using the -O2 optimization flag.

O2 GCC v ICC

-O3 Optimize Option



The corresponding loop-analysis output with the -O3 optimization flag is shown below.


Analyzing loop at dgemm-blocked.c:154
Analyzing loop at dgemm-blocked.c:156
Analyzing loop at dgemm-blocked.c:159
Analyzing loop at dgemm-blocked.c:109
Analyzing loop at dgemm-blocked.c:110
Analyzing loop at dgemm-blocked.c:126
Analyzing loop at dgemm-blocked.c:362
Analyzing loop at dgemm-blocked.c:363
Analyzing loop at dgemm-blocked.c:366


We found that the -O2 flag reported analysis of more loops than the -O3 flag.

O3 GCC v ICC

We found that although -O2 analyzed more code than -O3, the resulting performance was very similar and the differences were negligible. We chose to use -O2 and moved on to other compiler options.

O2 v O3

-Ofast

However, comparing -O2 and -O3 against another optimization option, -Ofast, the differences were more pronounced, further justifying our choice of -O2.

O2 v O3 v Ofast

Other options

After various rounds of testing, the flags -ftree-vectorize -fopt-info-vec-optimized -mavx2 provided the best results.

Compiler vectorization

The ICC options -qopt-report=1 -qopt-report-phase=vec reveal the compiler's automatic vectorization decisions. The output for our dgemm-blocked.c file is shown below:



Begin optimization report for: do_block_avx_unrolled(int, int, int, int, double *, double *, double *)

    Report from: Vector optimizations [vec]


LOOP BEGIN at dgemm-blocked.c(109,3)
   remark #25460: No loop optimizations reported

   LOOP BEGIN at dgemm-blocked.c(110,5)
      remark #25460: No loop optimizations reported

      LOOP BEGIN at dgemm-blocked.c(126,7)
         remark #25460: No loop optimizations reported

         LOOP BEGIN at dgemm-blocked.c(128,9)
         LOOP END
      LOOP END

      LOOP BEGIN at dgemm-blocked.c(138,7)
      LOOP END

      LOOP BEGIN at dgemm-blocked.c(113,7)
      LOOP END
   LOOP END
LOOP END

LOOP BEGIN at dgemm-blocked.c(154,3)
   remark #25460: No loop optimizations reported

   LOOP BEGIN at dgemm-blocked.c(156,5)
      remark #25460: No loop optimizations reported

      LOOP BEGIN at dgemm-blocked.c(159,7)
      <Peeled loop for vectorization>
      LOOP END

      LOOP BEGIN at dgemm-blocked.c(159,7)
         remark #15300: LOOP WAS VECTORIZED
      LOOP END

      LOOP BEGIN at dgemm-blocked.c(159,7)
      <Remainder loop for vectorization>
      LOOP END
   LOOP END
LOOP END


The output revealed that no vectorization was performed inside our inner matrix-multiply function, indicating that the kernel was already vectorized by hand and left nothing for the compiler to auto-vectorize.

Note

Transpose

The overhead of transposing the matrices in their entirety before performing our standard matrix multiply was too great. Transposing and then multiplying a row of A against a column of B also brought the horizontal sum back into consideration, and since that operation has high overhead, we did not see good results from transposing.

Loop Reordering

Regarding loop ordering, it is important to arrange the standard three nested loops so as to minimize cache misses. Since the innermost statements compute the indices they touch from the loop variables, the index into each matrix should change as little as possible between iterations; in other words, we want to take advantage of spatial locality. In the end, we noted that loop reordering did not raise our score nearly as much as the other techniques described here, largely because AVX intrinsics and loop unrolling already increase the amount of work done per memory access.
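
As an illustration (a sketch, not our final kernel), the ordering that best matches the column-major layout used here keeps the innermost loop streaming down a column:


/* j-k-i ordering: for column-major storage, the innermost i loop walks
 * contiguously down column k of A and column j of C while B[k + j*lda]
 * stays fixed, so each cache line fetched is fully used. */
static void dgemm_jki(int lda, int n, double* A, double* B, double* C)
{
  for (int j = 0; j < n; ++j)
    for (int k = 0; k < n; ++k) {
      double bkj = B[k + j*lda];
      for (int i = 0; i < n; ++i)
        C[i + j*lda] += A[i + k*lda] * bkj;
    }
}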

Conditionals

We noticed that the performance of our matrix multiply with the AVX2, blocking, loop-unrolling, and other optimizations varied with the size of the input matrices. In particular, matrices smaller than 32x32 yielded very poor results (e.g. size 31 was often nearly 10x slower than 32x32).

However, as with the matrix transpose, we found that the overhead of a conditional checking the input matrix size was too great to provide any benefit, due to CPU branch prediction. Perhaps this is a non-issue for very large matrices, where the overhead would be dominated by the benefit of the adjusted settings.
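
The kind of dispatch we tried looked roughly like this (the function names are illustrative, not the ones in our source):


void square_dgemm_naive(int lda, double* A, double* B, double* C);
void square_dgemm_blocked_avx(int lda, double* A, double* B, double* C);

/* Hypothetical size check: fall back to a plain serial kernel for tiny
 * matrices, where blocking/AVX setup overhead dominates, and use the
 * optimized path otherwise. */
void square_dgemm(int lda, double* A, double* B, double* C)
{
  if (lda < 32)
    square_dgemm_naive(lda, A, B, C);
  else
    square_dgemm_blocked_avx(lda, A, B, C);
}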

Loop Unrolling Alternatives

As an alternative to loop unrolling with another for loop, we pasted the unrolled body 8 times:


      __m256d c[UNROLL];
      for (int x = 0; x < UNROLL; x++) {
        c[x] = _mm256_load_pd(C+i+x*4+j*lda);
      }

      // VS

      c[0] = _mm256_load_pd(C+i+0*4+j*lda);
      c[1] = _mm256_load_pd(C+i+1*4+j*lda);
      c[2] = _mm256_load_pd(C+i+2*4+j*lda);
      c[3] = _mm256_load_pd(C+i+3*4+j*lda);
      c[4] = _mm256_load_pd(C+i+4*4+j*lda);
      c[5] = _mm256_load_pd(C+i+5*4+j*lda);
      c[6] = _mm256_load_pd(C+i+6*4+j*lda);
      c[7] = _mm256_load_pd(C+i+7*4+j*lda);


This gave worse performance. We hypothesize that this is because of the increased code size, showing that loop unrolling is fundamentally a space-versus-time tradeoff; the explicit for loop already let the compiler realize the full benefit of unrolling.

After all these tests, we continued to experiment with different block sizes, exploring different ways of blocking and settling on a good configuration through trial and error. We then tried unrolling over the A, B, and C matrices manually, which gave a slight increase in performance. We then realized we could further improve performance by replacing the final residual calculation (when the matrix size is not a multiple of 4, there are leftover rows to handle):

  // Residual ("serial") loop: ii is the first row not covered by the
  // vectorized, unrolled code above.
  for (int j = 0; j < N; ++j) {
    for (i = ii; i < M; i++) {
      // Compute C(i,j)
      double cij = C[i+j*lda];
      for (int k = 0; k < K; ++k) {
        cij += A[i+k*lda] * B[k+j*lda];
      }
      C[i+j*lda] = cij;
    }
  }
  

Hence we added some extra “safe” space to pad each matrix up to a multiple of 8. After copying the matrices into these padded buffers, we ran the same calculations as before, which gave a significant performance boost. After the calculation, we simply dropped the extra space when copying out the result matrix.

  add_extra_space(A, origA, lda, extra_space);
  add_extra_space(B, origB, lda, extra_space);
  add_extra_space(C, origC, lda, extra_space);
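
The helper itself is not shown above; a plausible sketch of what add_extra_space does, with the signature and behavior inferred from how it is called (so treat the details as assumptions):


/* Hypothetical sketch: copy an lda x lda column-major matrix into a
 * zero-filled buffer whose dimension is padded by extra_space, so the
 * padded leading dimension is a multiple of 8. */
static void add_extra_space(double* padded, const double* orig, int lda, int extra_space)
{
  int new_lda = lda + extra_space;
  for (int j = 0; j < new_lda; ++j)
    for (int i = 0; i < new_lda; ++i)
      padded[i + j*new_lda] = (i < lda && j < lda) ? orig[i + j*lda] : 0.0;
}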

Final results

We believe that avoiding the duplication of the matrices entirely could give slightly higher performance still. The results can be seen below.

Conclusion

We were able to optimize DGEMM performance on a single core, single thread of an Intel Xeon E5-2698 v3 2.30 GHz CPU to an average of 35% of maximum CPU utilization on Cori. Various strategies including blocking, loop unrolling, AVX intrinsics, and compiler flags were explored in depth.

Optimal performance with our code can be reached using the following compiler options:

cc -O2 -funroll-loops -march=core-avx2 dgemm-blocked.c

Variation in the testing environment could have affected our results; outliers were observed when Cori was under load from other users. We also tested the code outside of Cori on other hardware, including an Intel i5-7360U CPU @ 2.30 GHz in a 2017 MacBook Pro, and achieved a surprising average of 43% across our benchmarks, probably because of the lack of competing compute load on our local machines.

This final figure shows our maximum performance (built on the square_dgemm function in dgemm-blocked.c) against the “naive blocking” function from the provided starter code, before block-size tuning.

Final results

References

  1. https://sites.google.com/lbl.gov/cs267-spr2019/hw-1
  2. http://www.nersc.gov/users/computational-systems/cori/
  3. http://www.nersc.gov/users/computational-systems/edison/programming/vectorization/
  4. http://www.nersc.gov/users/computational-systems/edison/configuration/
  5. https://gcc.gnu.org/onlinedocs/
  6. https://gcc.gnu.org/onlinedocs/gcc/C-Extensions.html
  7. https://software.intel.com/en-us/isa-extensions
  8. http://theeirons.org/Nadav/pubs/MatrixMult.pdf
  9. http://spiral.ece.cmu.edu:8080/pub-spiral/abstract.jsp?id=100