... ...
@@ -816,6 +816,9 @@
\hline
Intel & 2 x Intel Xeon CPU E5-2609 v4 & CPU & 1.7 GHz & 63 GB & 32 kB & 16 & 16\tabularnewline
\hline
Iris 640 & Intel Iris Graphics 640 & GPU & 1.0 GHz & 4 GB & 64 kB & 48 & 192\tabularnewline
\hline
\end{tabular}
\par\end{centering}
\caption{Characteristics of the OpenCL devices tested in this paper. CU stands for ``Compute Units'' and PE for ``Processing Elements''.\label{tab:OpenCL-devices-for}}
... ...
@@ -861,20 +864,77 @@
We remark that thanks to the chosen organization into memory, we do not have to use the local memory for accelerating the algorithm.
When the program is run on NVIDIA cards, monitoring tools, such as \texttt{nvtop}\footnote{\url{https://github.com/Syllo/nvtop}} indicates that the GPU occupation is of $99\%$. This indicates a quasi-optimal implementation.
In order to measure the efficiency of the implementation we perform a memory bandwidth test on several grid sizes. One time step of the method implies the read access in the global memory to the set of fields of the previous time step. The local computations are done in registers. Then there is another write access to global memory for storing the data of the next time step. The size in memory in Gigabyte of one set of fields is
$$n_{\text{GB}}=\frac{\texttt{\texttt{Nx}\ensuremath{\times}\texttt{Ny}\ensuremath{\times b}}\times4\times m}{1024^{3}} ,$$
where $b$ is the number of bytes for storing one floating point number ($b=4$ for single precision and $b=8$ for double precision),
% When the program is run on NVIDIA cards, monitoring tools, such as
% \texttt{nvtop}\footnote{\url{https://github.com/Syllo/nvtop}} indicates
% that the GPU occupation is of $99\%$. This indicates a quasi-optimal
% implementation.
\revB{In order to measure the efficiency of the implementation we perform a memory bandwidth test for a $512\times512$ grid. One time-step of the method implies the read access in the global memory of the set of fields of the previous time-step.
The local computations are done in registers. Then there is another write access to global memory for storing the data of the next time-step. The memory size in gigabytes of one set of fields is
$n_{\text{GB}}=\frac{\texttt{Nx}\times\texttt{Ny}\times prec\times4\times m}{1024^{3}},$
where $prec$ is the number of bytes used to store one floating-point number ($prec=4$ for single precision and $prec=8$ for double precision). We then perform a given number of time iterations $n_{\text{iter}}$ and measure the elapsed time $t_{\text{elapsed}}$ in the OpenCL kernels. We perform two kinds of experiments. In the first experiment, we deactivate the numerical computations and only perform the shift operations. The memory bandwidth of the shift algorithm is then given by
$b=\frac{2\times n_{\text{GB}}\times n_{\text{iter}}}{t_{\text{elapsed}}}.$
In the second experiment, we reactivate the computations and measure how the bandwidth is reduced. This makes it possible to evaluate how the elapsed time is shared between memory transfers and computations. The results are given in Table \ref{tab:bandwidth}. We observe a good efficiency of the shift algorithm in the shift-only case: for the GTX, Quadro and V100 devices, the transfer rates reach about $77\%$ to $80\%$ of the maximal bandwidth of the device. From these results we also observe that the LBM algorithm is clearly memory bound. When the single precision computations are activated on the GPU devices (GTX, Quadro, V100), the bandwidth of the shift-and-relaxation test stays within $3\%$ of that of the shift-only test. For the double precision computations, we observe that the V100 device outperforms all the other GPUs: it retains 610.17 GB/s, whereas the GTX and Quadro rates fall to roughly a third of their shift-only values.\\}
\begin{table}\revB{
\begin{tabular}{|c|c|c|c|c|}
\hline
 & prec. & $b$ (GB/s, shift-only) & $b$ (GB/s, shift-relax) & max. 
theoretical $b$ (GB/s)\tabularnewline
\hline
\hline
Intel & float32 & 17.58 & 13.38 & 60\tabularnewline
\hline
Intel & float64 & 19.12 & 17.48 & 60\tabularnewline
\hline
Iris 640 & float32 & 26.20 & 24.98 & 34\tabularnewline
\hline
Iris 640 & float64 & 20.08 & 3.78 & 34\tabularnewline
\hline
GTX & float32 & 147.54 & 146.94 & 192\tabularnewline
\hline
GTX & float64 & 148.76 & 49.72 & 192\tabularnewline
\hline
Quadro & float32 & 336.45 & 329.06 & 432\tabularnewline
\hline
Quadro & float64 & 344.50 & 127.21 & 432\tabularnewline
\hline
V100 & float32 & 692.31 & 676.44 & 900\tabularnewline
\hline
V100 & float64 & 705.88 & 610.17 & 900\tabularnewline
\hline
\end{tabular}
\caption{Bandwidth efficiency of the LBM algorithm. Comparison of the data transfer rates of the shift-only algorithm and of the shift-and-relaxation algorithm. The resulting bandwidth is compared with the maximal memory bandwidth advertised by the vendors of the hardware devices.\label{tab:bandwidth}}
}
\end{table}
\section{Numerical applications to MHD}
\subsection{Smooth vortex (performance test)}
... ...
 compute = False
compute = True
print("Compute:")
print(compute)
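
The bandwidth estimate used in the memory test, $n_{\text{GB}}$ and $b=2\,n_{\text{GB}}\,n_{\text{iter}}/t_{\text{elapsed}}$, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the values of \texttt{m}, \texttt{n\_iter} and \texttt{t\_elapsed} below are made-up inputs, not measurements from the paper.

```python
def fields_size_gb(nx, ny, prec, m):
    """Size in GB of one set of fields: Nx * Ny * prec * 4 * m / 1024**3."""
    return nx * ny * prec * 4 * m / 1024**3


def bandwidth_gb_per_s(n_gb, n_iter, t_elapsed):
    """b = 2 * n_GB * n_iter / t_elapsed.

    The factor 2 counts one global-memory read and one global-memory
    write of the whole set of fields per time-step.
    """
    return 2 * n_gb * n_iter / t_elapsed


# Hypothetical inputs, chosen only to illustrate the arithmetic.
nx = ny = 512      # grid size used in the bandwidth test
prec = 4           # bytes per number: 4 for float32, 8 for float64
m = 9              # assumed value of m, not taken from the paper
n_iter = 1000      # assumed number of time iterations
t_elapsed = 0.5    # assumed elapsed time in the kernels, in seconds

n_gb = fields_size_gb(nx, ny, prec, m)
b = bandwidth_gb_per_s(n_gb, n_iter, t_elapsed)
print(f"n_GB = {n_gb:.6f} GB, b = {b:.2f} GB/s")
```

Dividing the resulting \texttt{b} by the vendor-advertised maximal bandwidth of a device gives the fraction of the peak reached, which is how the shift-only and shift-relax columns of the bandwidth table can be compared with the last column.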
