Commit 307e3d4d by Philippe Helluy

### up

parent 48dc98ee
 ... ... @@ -417,690 +417,73 @@ I would like the authors to give a more detailed explanation of how they We enriched the computational part of the paper with a more precise evaluation of the performance of the code: memory bandwidth and computational intensity. This analysis confirms the excellent efficiency of the implementation. \begin_inset Newline newline \end_inset In order to measure the efficiency of the implementation we perform a memory bandwidth test for a \begin_inset Formula $512\times512$ \end_inset grid. One time-step of the method requires one read access to global memory for the set of fields of the previous time-step. The local computations are done in registers. Then there is one write access to global memory for storing the data of the next time-step. The memory size in gigabytes of one set of fields is \begin_inset Formula $n_{\text{GB}}=\frac{\texttt{Nx}\times\texttt{Ny}\times prec\times4\times m}{1024^{3}},$ \end_inset This analysis confirms the good efficiency of the implementation. \end_layout where \begin_inset Formula $prec$ \begin_layout Enumerate If my arithmetic is correct, the largest 7000 2 simulation uses more than 10 GB of memory for one set of fields in float64 format (...) \begin_inset Quotes erd \end_inset is the number of bytes for storing one floating point number ( \begin_inset Formula $prec=4$ \end_inset for single precision and \begin_inset Formula $prec=8$ \begin_inset Newline newline \end_inset for double precision). We then perform a given number of time iterations \begin_inset Formula $n_{\text{iter}}$ \end_inset The largest computations were done on an NVIDIA Quadro P6000 with 24 GB of memory. The two sets of fields thus fit in the GPU memory. 
\end_layout and measure the elapsed time \begin_inset Formula $t_{\text{elapsed}}$ \begin_layout Enumerate The authors should say precisely how they computed the discrete divergence and discrete curl of the magnetic field to create these plots. \begin_inset Newline newline \end_inset in the OpenCL kernels. We perform two kinds of experiments. In the first experiment, we deactivate the numerical computations and only perform the shift operations. The memory bandwidth of the shift algorithm is then given by This information is now given. \begin_inset Formula $b=\frac{2\times n_{\text{GB}}\times n_{\text{iter}}}{t_{\text{elapsed}}}. \nabla\cdot\mathbf{B}\simeq\frac{B_{1}(x+\Delta x,y)-B_{1}(x-\Delta x,y)}{2\Delta x}+\frac{B_{2}(x,y+\Delta y)-B_{2}(x,y-\Delta y)}{2\Delta y}.$ \end_inset In the second experiment, we reactivate the computations and measure how the bandwidth is reduced. This allows us to evaluate how the elapsed time is shared between memory transfers and computations. The results are given in Table \begin_inset CommandInset ref LatexCommand ref reference "tab:bandwidth" plural "false" caps "false" noprefix "false" \end_inset . We observe a good efficiency of the shift algorithm in the shift-only case: the transfer rates are not very far from the maximal bandwidth of the device, at least for the GPU accelerators. From these results we also observe that the LBM algorithm is clearly memory bound. When the single precision computations are activated on the GPU devices (GTX, Quadro, V100), the elapsed time of the shift-and-relaxation test is not very different from that of the shift-only test. For the double precision computations, we observe that the V100 device outperforms all the other GPUs. \begin_inset Newline newline \end_inset \begin_inset Float table wide false sideways false status open \begin_layout Plain Layout \begin_inset Tabular \begin_inset Text \begin_layout Plain Layout \end_layout \end_inset \begin_inset Text \begin_layout Plain Layout prec. 
\end_layout \end_inset \begin_inset Text \begin_layout Plain Layout \begin_inset Formula $b$ \end_inset (GB/s, only shift) \end_layout \end_inset \begin_inset Text \begin_layout Plain Layout \begin_inset Formula $b$ \end_inset (GB/s, full LBM) \end_layout \end_inset \begin_inset Text \begin_layout Plain Layout max. theoretical \begin_inset Formula $b$ \end_inset (GB/s) \end_layout \end_inset \begin_inset Text \begin_layout Plain Layout Intel \end_layout \end_inset \begin_inset Text \begin_layout Plain Layout float32 \end_layout \end_inset \begin_inset Text \begin_layout Plain Layout 17.58 \end_layout \end_inset \begin_inset Text \begin_layout Plain Layout 13.38 \end_layout \end_inset \begin_inset Text \begin_layout Plain Layout 60 \end_layout \end_inset \begin_inset Text \begin_layout Plain Layout Intel \end_layout \end_inset \begin_inset Text \begin_layout Plain Layout float64 \end_layout \end_inset \begin_inset Text \begin_layout Plain Layout 19.12 \end_layout \end_inset \begin_inset Text \begin_layout Plain Layout 17.48 \end_layout \end_inset \begin_inset Text \begin_layout Plain Layout 60 \end_layout \end_inset \begin_inset Text \begin_layout Plain Layout Iris 640 \end_layout \end_inset \begin_inset Text \begin_layout Plain Layout float32 \end_layout \end_inset \begin_inset Text \begin_layout Plain Layout 26.20 \end_layout \end_inset \begin_inset Text \begin_layout Plain Layout 24.98 \end_layout \end_inset \begin_inset Text \begin_layout Plain Layout 34 \end_layout \end_inset \begin_inset Text \begin_layout Plain Layout Iris 640 \end_layout \end_inset \begin_inset Text \begin_layout Plain Layout float64 \end_layout \end_inset \begin_inset Text \begin_layout Plain Layout 20.08 \end_layout \end_inset \begin_inset Text \begin_layout Plain Layout 3.78 \end_layout \end_inset \begin_inset Text \begin_layout Plain Layout 34 \end_layout \end_inset \begin_inset Text \begin_layout Plain Layout GTX \end_layout \end_inset \begin_inset Text \begin_layout Plain Layout 
float32 \end_layout \end_inset \begin_inset Text \begin_layout Plain Layout 147.54 \end_layout \end_inset \begin_inset Text \begin_layout Plain Layout 146.94 \end_layout \end_inset \begin_inset Text \begin_layout Plain Layout 192 \end_layout \end_inset \begin_inset Text \begin_layout Plain Layout GTX \end_layout \end_inset \begin_inset Text \begin_layout Plain Layout float64 \end_layout \end_inset \begin_inset Text \begin_layout Plain Layout 148.76 \end_layout \end_inset \begin_inset Text \begin_layout Plain Layout 49.72 \end_layout \end_inset \begin_inset Text \begin_layout Plain Layout 192 \end_layout \end_inset \begin_inset Text \begin_layout Plain Layout Quadro \end_layout \end_inset \begin_inset Text \begin_layout Plain Layout float32 \end_layout \end_inset \begin_inset Text \begin_layout Plain Layout 336.45 \end_layout \end_inset \begin_inset Text \begin_layout Plain Layout 329.06 \end_layout \end_inset \begin_inset Text \begin_layout Plain Layout 432 \end_layout \end_inset \begin_inset Text \begin_layout Plain Layout Quadro \end_layout \end_inset \begin_inset Text \begin_layout Plain Layout float64 \end_layout \end_inset \begin_inset Text \begin_layout Plain Layout 344.50 \end_layout \end_inset \begin_inset Text \begin_layout Plain Layout 127.21 \end_layout \end_inset \begin_inset Text \begin_layout Plain Layout 432 \end_layout \end_inset \begin_inset Text \begin_layout Plain Layout V100 \end_layout \end_inset \begin_inset Text \begin_layout Plain Layout float32 \end_layout \end_inset \begin_inset Text \begin_layout Plain Layout 692.31 \end_layout \end_inset \begin_inset Text \begin_layout Plain Layout 676.44 \end_layout \end_inset \begin_inset Text \begin_layout Plain Layout 900 \end_layout \end_inset \begin_inset Text \begin_layout Plain Layout V100 \end_layout \end_inset \begin_inset Text \begin_layout Plain Layout float64 \end_layout \end_inset \begin_inset Text \begin_layout Plain Layout 705.88 \end_layout \end_inset \begin_inset Text \begin_layout 
Plain Layout 610.17 \end_layout \end_inset \begin_inset Text \begin_layout Plain Layout 900 \end_layout \end_inset \end_inset \end_layout \begin_layout Plain Layout \begin_inset Caption Standard