### up

parent f92a3699
 ... ... @@ -418,6 +418,21 @@ I would like the authors to give a more detailed explanation of how they We enriched the computational part of the paper with a more precise evaluation of the performance of the code: memory bandwidth and computational intensity. This analysis confirms the excellent efficiency of the implementation. \begin_inset Newline newline \end_inset In order to measure the efficiency of the implementation we perform a memory bandwidth test for a \begin_inset Formula $512\times512$ \end_inset grid. One time-step of the method implies a read access in global memory to the set of fields of the previous time-step. The local computations are done in registers. Then there is another write access to global memory for storing the data of the next time-step. The memory size in Gigabytes of one set of fields is \begin_inset Formula $n_{\text{GB}}=\frac{\texttt{Nx}\times\texttt{Ny}\times prec\times4\times m}{1024^{3}}, ... ... @@ -438,19 +453,22 @@ where \end_inset for double precision). We then perform a given number of time iterations niter and measure the elapsed time We then perform a given number of time iterations \begin_inset Formula $n_{\text{iter}}$ \end_inset and measure the elapsed time \begin_inset Formula $t_{\text{elapsed}}$ \end_inset (with specific features of the OpenCL library). in the OpenCL kernels. We perform two kinds of experiments. In the first experiment, we deactivate the numerical computations and only perform the shift operations. The memory bandwidth of the shift algorithm is then given by \begin_inset Formula \[ b=\frac{2\times n_{\text{GB}}\times niter}{t_{\text{elapsed}}}. b=\frac{2\times n_{\text{GB}}\times n_{\text{iter}}}{t_{\text{elapsed}}}. \end_inset ... ... @@ -459,13 +477,39 @@ In the second experiment, we reactivate the computations and measure how the bandwidth is reduced. This allows evaluating how the elapsed time is shared between memory transfers and computations. 
The results are given in Table xxx The results are given in Table \begin_inset CommandInset ref LatexCommand ref reference "tab:bandwidth" plural "false" caps "false" noprefix "false" \end_inset . We observe a good efficiency of the shift algorithm in the shift-only case: the transfer rates are not very far from the maximal bandwidth of the device, at least for the GPU accelerators. From these results we also observe that the LBM algorithm is clearly memory bound. When the single precision computations are activated on the GPU devices (GTX, Quadro, V100), the elapsed time of the shift-and-relaxation test is not very different from the shift-only test. For the double precision computations, we observe that the V100 device outperforms all the other GPUs. \begin_inset Newline newline \end_inset \begin_inset Float table wide false sideways false status open \begin_layout Plain Layout \begin_inset Tabular ... ... @@ -520,7 +564,11 @@ prec. \begin_layout Plain Layout max. bandwidth (GB/s) theoretical \begin_inset Formula $b$ \end_inset (GB/s) \end_layout \end_inset ... ... @@ -531,7 +579,7 @@ max. \begin_inset Text \begin_layout Plain Layout AMD Intel \end_layout \end_inset ... ... @@ -549,7 +597,7 @@ float32 \begin_inset Text \begin_layout Plain Layout 17.58 \end_layout \end_inset ... ... @@ -558,7 +606,7 @@ float32 \begin_inset Text \begin_layout Plain Layout 13.38 \end_layout \end_inset ... ... @@ -567,7 +615,7 @@ float32 \begin_inset Text \begin_layout Plain Layout 60 \end_layout \end_inset ... ... @@ -578,7 +626,7 @@ float32 \begin_inset Text \begin_layout Plain Layout AMD Intel \end_layout \end_inset ... ... @@ -587,7 +635,7 @@ AMD \begin_inset Text \begin_layout Plain Layout float32 float64 \end_layout \end_inset ... ... @@ -596,7 +644,7 @@ float32 \begin_inset Text \begin_layout Plain Layout 19.12 \end_layout \end_inset ... ... @@ -605,7 +653,7 @@ float32 \begin_inset Text \begin_layout Plain Layout 17.48 \end_layout \end_inset ... ... 
@@ -614,7 +662,7 @@ float32 \begin_inset Text \begin_layout Plain Layout 60 \end_layout \end_inset ... ... @@ -661,7 +709,7 @@ float32 \begin_inset Text \begin_layout Plain Layout 34 \end_layout \end_inset ... ... @@ -690,7 +738,7 @@ float64 \begin_inset Text \begin_layout Plain Layout 20.08 \end_layout \end_inset ... ... @@ -699,7 +747,7 @@ float64 \begin_inset Text \begin_layout Plain Layout 3.78 \end_layout \end_inset ... ... @@ -708,7 +756,7 @@ float64 \begin_inset Text \begin_layout Plain Layout 34 \end_layout \end_inset ... ... @@ -737,7 +785,45 @@ float32 \begin_inset Text \begin_layout Plain Layout 147.54 \end_layout \end_inset \begin_inset Text \begin_layout Plain Layout 146.94 \end_layout \end_inset \begin_inset Text \begin_layout Plain Layout 192 \end_layout \end_inset \begin_inset Text \begin_layout Plain Layout GTX \end_layout \end_inset \begin_inset Text \begin_layout Plain Layout float64 \end_layout \end_inset ... ... @@ -746,7 +832,16 @@ float32 \begin_inset Text \begin_layout Plain Layout 148.76 \end_layout \end_inset \begin_inset Text \begin_layout Plain Layout 49.72 \end_layout \end_inset ... ... @@ -755,7 +850,7 @@ float32 \begin_inset Text \begin_layout Plain Layout 192 \end_layout \end_inset ... ... @@ -784,7 +879,7 @@ float32 \begin_inset Text \begin_layout Plain Layout 336.45 \end_layout \end_inset ... ... @@ -793,7 +888,7 @@ float32 \begin_inset Text \begin_layout Plain Layout 329.06 \end_layout \end_inset ... ... @@ -802,7 +897,7 @@ float32 \begin_inset Text \begin_layout Plain Layout 432 \end_layout \end_inset ... ... @@ -831,7 +926,7 @@ float64 \begin_inset Text \begin_layout Plain Layout 344.50 \end_layout \end_inset ... ... @@ -840,7 +935,7 @@ float64 \begin_inset Text \begin_layout Plain Layout 127.21 \end_layout \end_inset ... ... @@ -849,7 +944,7 @@ float64 \begin_inset Text \begin_layout Plain Layout 432 \end_layout \end_inset ... ... 
@@ -878,7 +973,7 @@ float32 \begin_inset Text \begin_layout Plain Layout 692.31 \end_layout \end_inset ... ... @@ -887,7 +982,7 @@ float32 \begin_inset Text \begin_layout Plain Layout 676.44 \end_layout \end_inset ... ... @@ -896,7 +991,7 @@ float32 \begin_inset Text \begin_layout Plain Layout 900 \end_layout \end_inset ... ... @@ -925,7 +1020,7 @@ float64 \begin_inset Text \begin_layout Plain Layout 705.88 \end_layout \end_inset ... ... @@ -934,7 +1029,7 @@ float64 \begin_inset Text \begin_layout Plain Layout 610.17 \end_layout \end_inset ... ... @@ -943,7 +1038,7 @@ float64 \begin_inset Text \begin_layout Plain Layout 900 \end_layout \end_inset ... ... @@ -954,6 +1049,34 @@ float64 \end_inset \end_layout \begin_layout Plain Layout \begin_inset Caption Standard \begin_layout Plain Layout Bandwidth efficiency of the LBM algorithm. Comparison of the data transfer rates of the shift-only algorithm and of the shift-and-relaxation algorithm. The resulting bandwidth is compared with the maximal memory bandwidth advertised by the vendors of the hardware devices \begin_inset CommandInset label LatexCommand label name "tab:bandwidth" \end_inset . \end_layout \end_inset \end_layout \end_inset \end_layout \begin_layout Enumerate ... ...
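The bandwidth model introduced in this revision can be sanity-checked with a short script. This is an illustrative sketch, not part of the patch: the formulas $n_{\text{GB}}=\texttt{Nx}\times\texttt{Ny}\times prec\times4\times m/1024^{3}$ and $b=2\times n_{\text{GB}}\times n_{\text{iter}}/t_{\text{elapsed}}$ come from the revised text, while the field count `m`, iteration count, and elapsed time below are assumed values, not measurements.

```python
# Sketch (not part of the patch) of the bandwidth estimate defined above.
# prec is the number of bytes per floating point number (4 or 8); the
# factors follow the formulas in the revised text: one read plus one
# write of a full field set per time-step gives the factor 2.

def n_gb(nx, ny, prec, m):
    """Memory size in GB of one set of fields: Nx*Ny*prec*4*m / 1024**3."""
    return nx * ny * prec * 4 * m / 1024**3

def bandwidth(nx, ny, prec, m, n_iter, t_elapsed):
    """b = 2 * n_GB * n_iter / t_elapsed, in GB/s."""
    return 2 * n_gb(nx, ny, prec, m) * n_iter / t_elapsed

# Hypothetical run: 512x512 grid, double precision (prec=8), m=9 fields,
# 1000 iterations taking 0.5 s -- all assumed values for illustration.
print(f"{bandwidth(512, 512, 8, 9, 1000, 0.5):.2f} GB/s")  # -> 281.25 GB/s
```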
 ... ... @@ -816,6 +816,9 @@$\hline Intel&2 x Intel Xeon CPU E5-2609 v4 & CPU & 1.7 GHz & 63 GB & 32 kB & 16 & 16\tabularnewline \hline Iris 640 & Intel Iris Graphics 640 & GPU & 1.0 GHz & 4 GB & 64 kB & 48 & 192\tabularnewline \hline \end{tabular} \par\end{centering} \caption{Characteristics of the OpenCL devices tested in this paper. CU stands for "Compute Units" and "PE" for "Processing Elements".\label{tab:OpenCL-devices-for}} ... ... @@ -861,20 +864,77 @@$ We remark that thanks to the chosen organization in memory, we do not have to use the local memory for accelerating the algorithm. When the program is run on NVIDIA cards, monitoring tools, such as \texttt{nvtop}\footnote{\url{https://github.com/Syllo/nvtop}}, indicate that the GPU occupancy is $99\%$. This indicates a quasi-optimal implementation. In order to measure the efficiency of the implementation we perform a memory bandwidth test on several grid sizes. One time step of the method implies a read access in global memory to the set of fields of the previous time step. The local computations are done in registers. Then there is another write access to global memory for storing the data of the next time step. The memory size in Gigabytes of one set of fields is $$n_{\text{GB}}=\frac{\texttt{Nx}\times\texttt{Ny}\times b\times4\times m}{1024^{3}},$$ where $b$ is the number of bytes for storing one floating point number ($b=4$ for single precision and $b=8$ for double precision), % When the program is run on NVIDIA cards, monitoring tools, such as % \texttt{nvtop}\footnote{\url{https://github.com/Syllo/nvtop}}, indicate % that the GPU occupancy is $99\%$. This indicates a quasi-optimal % implementation. \revB{In order to measure the efficiency of the implementation we perform a memory bandwidth test for a $512\times512$ grid. One time-step of the method implies a read access in global memory to the set of fields of the previous time-step. 
The local computations are done in registers. Then there is another write access to global memory for storing the data of the next time-step. The memory size in Gigabytes of one set of fields is $n_{\text{GB}}=\frac{\texttt{Nx}\times\texttt{Ny}\times prec\times4\times m}{1024^{3}},$ where $prec$ is the number of bytes for storing one floating point number ($prec=4$ for single precision and $prec=8$ for double precision). We then perform a given number of time iterations $n_{\text{iter}}$ and measure the elapsed time $t_{\text{elapsed}}$ in the OpenCL kernels. We perform two kinds of experiments. In the first experiment, we deactivate the numerical computations and only perform the shift operations. The memory bandwidth of the shift algorithm is then given by $b=\frac{2\times n_{\text{GB}}\times n_{\text{iter}}}{t_{\text{elapsed}}}.$ In the second experiment, we reactivate the computations and measure how the bandwidth is reduced. This allows evaluating how the elapsed time is shared between memory transfers and computations. The results are given in Table \ref{tab:bandwidth}. We observe a good efficiency of the shift algorithm in the shift-only case: the transfer rates are not very far from the maximal bandwidth of the device, at least for the GPU accelerators. From these results we also observe that the LBM algorithm is clearly memory bound. When the single precision computations are activated on the GPU devices (GTX, Quadro, V100), the elapsed time of the shift-and-relaxation test is not very different from the shift-only test. For the double precision computations, we observe that the V100 device outperforms all the other GPUs.\\} \begin{table}\revB{ \begin{tabular}{|c|c|c|c|c|} \hline & prec. & $b$ (GB/s, shift-only) & $b$ (GB/s, shift-relax) & max. 
theoretical $b$ (GB/s)\tabularnewline \hline \hline Intel & float32 & 17.58 & 13.38 & 60\tabularnewline \hline Intel & float64 & 19.12 & 17.48 & 60\tabularnewline \hline Iris 640 & float32 & 26.20 & 24.98 & 34\tabularnewline \hline Iris 640 & float64 & 20.08 & 3.78 & 34\tabularnewline \hline GTX & float32 & 147.54 & 146.94 & 192\tabularnewline \hline GTX & float64 & 148.76 & 49.72 & 192\tabularnewline \hline Quadro & float32 & 336.45 & 329.06 & 432\tabularnewline \hline Quadro & float64 & 344.50 & 127.21 & 432\tabularnewline \hline V100 & float32 & 692.31 & 676.44 & 900\tabularnewline \hline V100 & float64 & 705.88 & 610.17 & 900\tabularnewline \hline \end{tabular} \caption{Bandwidth efficiency of the LBM algorithm. Comparison of the data transfer rates of the shift-only algorithm and of the shift-and-relaxation algorithm. The resulting bandwidth is compared with the maximal memory bandwidth advertised by the vendors of the hardware devices\label{tab:bandwidth}.} } \end{table} \section{Numerical applications to MHD} \subsection{Smooth vortex (performance test)} ... ...
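The efficiency claims can be cross-checked by dividing the measured rates by the vendor peaks. The sketch below (not part of the patch) simply replays the values of Table \ref{tab:bandwidth}:

```python
# (shift-only, shift-and-relaxation, vendor peak) bandwidths in GB/s,
# copied from Table \ref{tab:bandwidth}.
table = {
    ("Intel", "float32"): (17.58, 13.38, 60),
    ("Intel", "float64"): (19.12, 17.48, 60),
    ("Iris 640", "float32"): (26.20, 24.98, 34),
    ("Iris 640", "float64"): (20.08, 3.78, 34),
    ("GTX", "float32"): (147.54, 146.94, 192),
    ("GTX", "float64"): (148.76, 49.72, 192),
    ("Quadro", "float32"): (336.45, 329.06, 432),
    ("Quadro", "float64"): (344.50, 127.21, 432),
    ("V100", "float32"): (692.31, 676.44, 900),
    ("V100", "float64"): (705.88, 610.17, 900),
}

# Fraction of the advertised peak reached in each experiment.
for (device, prec), (shift, relax, peak) in table.items():
    print(f"{device:8s} {prec}: shift-only {shift / peak:5.1%}, "
          f"shift-relax {relax / peak:5.1%} of peak")
```

For the discrete GPUs the shift-only rates come out at roughly 77--78% of the advertised peak, which is what the text means by "not very far from the maximal bandwidth"; the Iris 640 float64 shift-and-relaxation case drops to about 11%, showing where the computations, not the transfers, dominate.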
 ... ... @@ -166,7 +166,7 @@ def solve_ocl(m=_m, n=_n, nx=_nx, ny=_ny, Lx=_Lx, Ly=_Ly, Tmax=_Tmax, levels=np.linspace(_minplot, _maxplot, 16)) #fig = Figure(title=plot_title) compute = False compute = True print("Compute:") print(compute) ... ...