Commit 48dc98ee authored by Philippe Helluy's avatar Philippe Helluy

up

parent f92a3699
......@@ -418,6 +418,21 @@ I would like the authors to give a more detailed explanation of how they
We enriched the computational part of the part with a more precise evaluation
of the performance of the code: memory bandwidth and computational intensity.
This analysis confirms the excellent efficiency of the implementation.
\begin_inset Newline newline
\end_inset
In order to measure the efficiency of the implementation we perform a memory
bandwidth test for a
\begin_inset Formula $512\times512$
\end_inset
grid.
One time-step of the method implies the read access in the global memory
of the set of fields of the previous time-step.
The local computations are done in registers.
Then there is another write access to global memory for storing the data
of the next time-step.
The memory size in Gigabyte of one set of fields is
\begin_inset Formula
\[
n_{\text{GB}}=\frac{\texttt{Nx}\times\texttt{Ny}\times prec\times4\times m}{1024^{3}},
......@@ -438,19 +453,22 @@ where
\end_inset
for double precision).
We then perform a given number of time iterations niter and measure the
elapsed time
We then perform a given number of time iterations
\begin_inset Formula $n_{\text{iter}}$
\end_inset
and measure the elapsed time
\begin_inset Formula $t_{\text{elapsed}}$
\end_inset
(with a specific features of the OpenCL library).
in the OpenCL kernels.
We perform two kinds of experiments.
In the first experiment, we deactivate the numerical computations and only
perform the shift operations.
The memory bandwidth of the shift algorithm is then given by
\begin_inset Formula
\[
b=\frac{2\times n_{\text{GB}}\times niter}{t_{\text{elapsed}}}.
b=\frac{2\times n_{\text{GB}}\times n_{\text{iter}}}{t_{\text{elapsed}}}.
\]
\end_inset
......@@ -459,13 +477,39 @@ In the second experiment, we reactivate the computations and measure how
the bandwidth is reduced.
This allows us to evaluate how the elapsed time is shared between memory
transfers and computations.
The results are given in Table xxx
The results are given in Table
\begin_inset CommandInset ref
LatexCommand ref
reference "tab:bandwidth"
plural "false"
caps "false"
noprefix "false"
\end_inset
.
We observe a good efficiency of the shift algorithm in the shift-only case:
the transfer rates are not very far from the maximal bandwidth of the device,
at least for the GPU accelerators.
From these results we also observe that the LBM algorithm is clearly memory
bound.
When the single precision computations are activated on the GPU devices
(GTX, Quadro, V100), the elapsed time of the shift-and-relaxation test
is not very different from that of the shift-only test.
For the double precision computations, we observe that the V100 device
outperforms all the other GPUs.
\begin_inset Newline newline
\end_inset
\begin_inset Float table
wide false
sideways false
status open
\begin_layout Plain Layout
\begin_inset Tabular
<lyxtabular version="3" rows="10" columns="5">
<lyxtabular version="3" rows="11" columns="5">
<features tabularvalignment="middle">
<column alignment="center" valignment="top">
<column alignment="center" valignment="top">
......@@ -520,7 +564,11 @@ prec.
\begin_layout Plain Layout
max.
bandwidth (GB/s)
theoretical
\begin_inset Formula $b$
\end_inset
(GB/s)
\end_layout
\end_inset
......@@ -531,7 +579,7 @@ max.
\begin_inset Text
\begin_layout Plain Layout
AMD
Intel
\end_layout
\end_inset
......@@ -549,7 +597,7 @@ float32
\begin_inset Text
\begin_layout Plain Layout
17.58
\end_layout
\end_inset
......@@ -558,7 +606,7 @@ float32
\begin_inset Text
\begin_layout Plain Layout
13.38
\end_layout
\end_inset
......@@ -567,7 +615,7 @@ float32
\begin_inset Text
\begin_layout Plain Layout
60
\end_layout
\end_inset
......@@ -578,7 +626,7 @@ float32
\begin_inset Text
\begin_layout Plain Layout
AMD
Intel
\end_layout
\end_inset
......@@ -587,7 +635,7 @@ AMD
\begin_inset Text
\begin_layout Plain Layout
float32
float64
\end_layout
\end_inset
......@@ -596,7 +644,7 @@ float32
\begin_inset Text
\begin_layout Plain Layout
19.12
\end_layout
\end_inset
......@@ -605,7 +653,7 @@ float32
\begin_inset Text
\begin_layout Plain Layout
17.48
\end_layout
\end_inset
......@@ -614,7 +662,7 @@ float32
\begin_inset Text
\begin_layout Plain Layout
60
\end_layout
\end_inset
......@@ -661,7 +709,7 @@ float32
\begin_inset Text
\begin_layout Plain Layout
34
\end_layout
\end_inset
......@@ -690,7 +738,7 @@ float64
\begin_inset Text
\begin_layout Plain Layout
20.08
\end_layout
\end_inset
......@@ -699,7 +747,7 @@ float64
\begin_inset Text
\begin_layout Plain Layout
3.78
\end_layout
\end_inset
......@@ -708,7 +756,7 @@ float64
\begin_inset Text
\begin_layout Plain Layout
34
\end_layout
\end_inset
......@@ -737,7 +785,45 @@ float32
\begin_inset Text
\begin_layout Plain Layout
147.54
\end_layout
\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text
\begin_layout Plain Layout
146.94
\end_layout
\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text
\begin_layout Plain Layout
192
\end_layout
\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text
\begin_layout Plain Layout
GTX
\end_layout
\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text
\begin_layout Plain Layout
float64
\end_layout
\end_inset
......@@ -746,7 +832,16 @@ float32
\begin_inset Text
\begin_layout Plain Layout
148.76
\end_layout
\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text
\begin_layout Plain Layout
49.72
\end_layout
\end_inset
......@@ -755,7 +850,7 @@ float32
\begin_inset Text
\begin_layout Plain Layout
192
\end_layout
\end_inset
......@@ -784,7 +879,7 @@ float32
\begin_inset Text
\begin_layout Plain Layout
336.45
\end_layout
\end_inset
......@@ -793,7 +888,7 @@ float32
\begin_inset Text
\begin_layout Plain Layout
329.06
\end_layout
\end_inset
......@@ -802,7 +897,7 @@ float32
\begin_inset Text
\begin_layout Plain Layout
432
\end_layout
\end_inset
......@@ -831,7 +926,7 @@ float64
\begin_inset Text
\begin_layout Plain Layout
344.50
\end_layout
\end_inset
......@@ -840,7 +935,7 @@ float64
\begin_inset Text
\begin_layout Plain Layout
127.21
\end_layout
\end_inset
......@@ -849,7 +944,7 @@ float64
\begin_inset Text
\begin_layout Plain Layout
432
\end_layout
\end_inset
......@@ -878,7 +973,7 @@ float32
\begin_inset Text
\begin_layout Plain Layout
692.31
\end_layout
\end_inset
......@@ -887,7 +982,7 @@ float32
\begin_inset Text
\begin_layout Plain Layout
676.44
\end_layout
\end_inset
......@@ -896,7 +991,7 @@ float32
\begin_inset Text
\begin_layout Plain Layout
900
\end_layout
\end_inset
......@@ -925,7 +1020,7 @@ float64
\begin_inset Text
\begin_layout Plain Layout
705.88
\end_layout
\end_inset
......@@ -934,7 +1029,7 @@ float64
\begin_inset Text
\begin_layout Plain Layout
610.17
\end_layout
\end_inset
......@@ -943,7 +1038,7 @@ float64
\begin_inset Text
\begin_layout Plain Layout
900
\end_layout
\end_inset
......@@ -954,6 +1049,34 @@ float64
\end_inset
\end_layout
\begin_layout Plain Layout
\begin_inset Caption Standard
\begin_layout Plain Layout
Bandwidth efficiency of the LBM algorithm.
Comparison of the data transfer rates of the shift-only algorithm and of
the shift-and-relaxation algorithm.
The resulting bandwidth is compared with the maximal memory bandwidth advertised
by the vendors of the hardware devices
\begin_inset CommandInset label
LatexCommand label
name "tab:bandwidth"
\end_inset
.
\end_layout
\end_inset
\end_layout
\end_inset
\end_layout
\begin_layout Enumerate
......
......@@ -816,6 +816,9 @@ $
\hline
Intel&2 x Intel Xeon CPU E5-2609 v4 & CPU & 1.7 GHz & 63 GB & 32 kB & 16 & 16\tabularnewline
\hline
Iris 640 &
Intel Iris Graphics 640 & GPU & 1.0 GHz & 4 GB & 64 kB & 48 & 192\tabularnewline
\hline
\end{tabular}
\par\end{centering}
\caption{Characteristics of the OpenCL devices tested in this paper. CU stands for "Compute Units" and "PE" for "Processing Elements".\label{tab:OpenCL-devices-for}}
......@@ -861,20 +864,77 @@ $
We remark that thanks to the chosen organization into memory, we do
not have to use the local memory for accelerating the algorithm.
When the program is run on NVIDIA cards, monitoring tools, such as
\texttt{nvtop}\footnote{\url{https://github.com/Syllo/nvtop}} indicates
that the GPU occupation is of $99\%$. This indicates a quasi-optimal
implementation.
In order to measure the efficiency of the implementation we perform a memory bandwidth test
on several grid sizes. One time step of the method implies the read access in the global memory
to the set of fields of the previous time step. The local computations are done in registers. Then
there is another write access to global memory for storing the data of the next time step.
The size in memory in Gigabyte of one set of fields is
$$
n_{\text{GB}}=\frac{\texttt{\texttt{Nx}\ensuremath{\times}\texttt{Ny}\ensuremath{\times b}}\times4\times m}{1024^{3}}
,$$
where $b$ is the number of bytes for storing one floating point number ($b=4$ for single precision and $b=8$ for double precision),
% When the program is run on NVIDIA cards, monitoring tools, such as
% \texttt{nvtop}\footnote{\url{https://github.com/Syllo/nvtop}} indicates
% that the GPU occupation is of $99\%$. This indicates a quasi-optimal
% implementation.
\revB{In order to measure the efficiency of the implementation we perform
a memory bandwidth test for a $512\times512$ grid. One time-step
of the method implies the read access in the global memory of the
set of fields of the previous time-step. The local computations are
done in registers. Then there is another write access to global memory
for storing the data of the next time-step. The memory size in Gigabyte
of one set of fields is
\[
n_{\text{GB}}=\frac{\texttt{Nx}\times\texttt{Ny}\times prec\times4\times m}{1024^{3}},
\]
where $prec$ is the number of bytes for storing one floating point
number ($prec=4$ for single precision and $prec=8$ for double precision).
We then perform a given number of time iterations $n_{\text{iter}}$
and measure the elapsed time $t_{\text{elapsed}}$ in the OpenCL kernels.
We perform two kinds of experiments. In the first experiment, we deactivate
the numerical computations and only perform the shift operations.
The memory bandwidth of the shift algorithm is then given by
\[
b=\frac{2\times n_{\text{GB}}\times n_{\text{iter}}}{t_{\text{elapsed}}}.
\]
In the second experiment, we reactivate the computations and measure
how the bandwidth is reduced. This allows us to evaluate how the elapsed
time is shared between memory transfers and computations. The results
are given in Table \ref{tab:bandwidth}. We observe a good efficiency
of the shift algorithm in the shift-only case: the transfer rates
are not very far from the maximal bandwidth of the device, at least
for the GPU accelerators. From these results we also observe that the
LBM algorithm is clearly memory bound. When the single precision computations
are activated on the GPU devices (GTX, Quadro, V100), the elapsed
time of the shift-and-relaxation test is not very different from that
of the shift-only test. For the double precision computations, we observe
that the V100 device outperforms all the other GPUs.\\}
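As a sanity check of the two formulas above, they can be transcribed directly into Python (helper names are hypothetical, not from the repository):

```python
def field_set_size_gb(nx, ny, prec, m):
    """Memory size in GB of one set of fields:
    n_GB = Nx * Ny * prec * 4 * m / 1024**3,
    with prec the number of bytes per floating point number
    (4 for single precision, 8 for double precision)."""
    return nx * ny * prec * 4 * m / 1024**3

def bandwidth_gb_s(n_gb, n_iter, t_elapsed):
    """Effective bandwidth of the shift algorithm: each time step
    reads one field set and writes one, hence the factor 2."""
    return 2 * n_gb * n_iter / t_elapsed
```

The factor 2 accounts for the one read and one write of the full field set per time step, as described in the text.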
\begin{table}\revB{
\begin{tabular}{|c|c|c|c|c|}
\hline
& prec. & $b$ (GB/s, shift-only) & $b$ (GB/s, shift-relax) & max. theoretical $b$ (GB/s)\tabularnewline
\hline
\hline
Intel & float32 & 17.58 & 13.38 & 60\tabularnewline
\hline
Intel & float64 & 19.12 & 17.48 & 60\tabularnewline
\hline
Iris 640 & float32 & 26.20 & 24.98 & 34\tabularnewline
\hline
Iris 640 & float64 & 20.08 & 3.78 & 34\tabularnewline
\hline
GTX & float32 & 147.54 & 146.94 & 192\tabularnewline
\hline
GTX & float64 & 148.76 & 49.72 & 192\tabularnewline
\hline
Quadro & float32 & 336.45 & 329.06 & 432\tabularnewline
\hline
Quadro & float64 & 344.50 & 127.21 & 432\tabularnewline
\hline
V100 & float32 & 692.31 & 676.44 & 900\tabularnewline
\hline
V100 & float64 & 705.88 & 610.17 & 900\tabularnewline
\hline
\end{tabular}
\caption{Bandwidth efficiency of the LBM algorithm. Comparison of the data
transfer rates of the shift-only algorithm and of the shift-and-relaxation
algorithm. The resulting bandwidth is compared with the maximal memory
bandwidth advertised by the vendors of the hardware devices\label{tab:bandwidth}.}
}
\end{table}
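The claim that the shift-only transfer rates are "not very far from the maximal bandwidth" can be quantified from the single-precision GPU entries of Table \ref{tab:bandwidth} (values copied from the table; the percentage computation is ours):

```python
# Shift-only float32 bandwidth vs. vendor peak, from Table "tab:bandwidth".
peak = {"GTX": 192, "Quadro": 432, "V100": 900}           # GB/s, advertised
shift_only_f32 = {"GTX": 147.54, "Quadro": 336.45, "V100": 692.31}  # GB/s, measured

for dev, b in shift_only_f32.items():
    # Fraction of the advertised peak reached by the shift-only test.
    print(f"{dev}: {100 * b / peak[dev]:.0f}% of peak")
# GTX: 77%, Quadro: 78%, V100: 77%
```

All three GPUs thus sustain close to 80% of their advertised peak, which supports the memory-bound reading of the LBM algorithm.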
\section{Numerical applications to MHD}
\subsection{Smooth vortex (performance test)}
......
......@@ -166,7 +166,7 @@ def solve_ocl(m=_m, n=_n, nx=_nx, ny=_ny, Lx=_Lx, Ly=_Ly, Tmax=_Tmax,
levels=np.linspace(_minplot, _maxplot, 16))
#fig = Figure(title=plot_title)
compute = False
compute = True
print("Compute:")
print(compute)
......