An Introduction to Heterogeneous Computing and to Compiling and Running CUDA Programs

Zhang Wenshuai (张文帅, wszhang@ustc.edu.cn)
Supercomputing Center, University of Science and Technology of China
November 26, 2015

Outline

1 Heterogeneous many-core computing
    Current status and advantages of heterogeneous computing
    How heterogeneous computing is implemented
2 CUDA computing, compilation, and execution
    Introduction to CUDA computing
    Setting up the CUDA build environment
    Compiling CUDA programs

1 Heterogeneous many-core computing

Current status of heterogeneous computing

Many-core heterogeneous parallel computing is developing rapidly. On the current Top 500 list, 104 supercomputers use many-core accelerator cards:
• 66 Nvidia GPU (CUDA)
• 29 Intel Xeon Phi (MIC)
• 4 Nvidia & Xeon Phi
• 3 ATI Radeon
• 2 PEZY-SC (1024 cores)
Source: Top500, http://top500.org

Tianhe-2 has 16000 nodes, each with 2 Ivy Bridge chips and 3 Xeon Phi chips.

Classification of computer architectures

• Single Instruction Single Data (SISD)
• Multiple Instruction Single Data (MISD)
• Single Instruction Multiple Data (SIMD)
• Multiple Instruction Multiple Data (MIMD)
[https://computing.llnl.gov/tutorials/parallel_comp]

Heterogeneous computing is a good fit for MIMD. In CUDA programming, a warp (32 threads) is the basic scheduling unit, and all of its threads execute the same instruction, which is SIMD. At the same time, different blocks can run different work on different streaming multiprocessors (SMs), which together forms MIMD.

Classification of parallel computing (by memory organization)

By how memory is organized, mainstream parallel computing frameworks fall into three groups:
• Shared memory, with a single address space (physical or logical)
• Distributed memory
• Hybrid distributed shared memory (the mainstream choice)
[Figure: GPU memory model]

Classification of parallel computing (by hardware architecture)

By the hardware architecture of the computer system:
• Parallelism on general-purpose architectures
    • Homogeneous multi-core parallelism (CPU)
    • Heterogeneous many-core parallelism (CPU+GPU / CPU+MIC)
• Parallelism on special-purpose architectures
    • CPU+FPGA heterogeneous systems

Parallel design in heterogeneous computing (CUDA as an example)

[Figure: parallel design of a heterogeneous CUDA program]

Execution flow of heterogeneous computing

The part of a heterogeneous computation that runs on the accelerator card is usually divided into four stages, as in the execution flow of a CUDA program.
[Figure: execution flow for CUDA and for MIC]

For MIC computations that simply use compiler directives, the data copies in offload mode are hidden from the programmer but still take place, so memory bandwidth is often the main factor limiting speed. High-speed memory shared with the host is one of the solutions under development. MIC programs also have a native execution mode, in which you log in to the Linux system running on the card and execute the program there, which makes MIC easier to use and more general.

2 CUDA computing, compilation, and execution

GPU applications

GPUs are used in many fields, such as education and research, film and video production, and machine learning. Quantum chemistry, the field that receives the most attention at this university, is also gaining GPU acceleration step by step, for example in the mainstream packages Abinit, Quantum ESPRESSO, and VASP (Now 2.5 to 4X Faster on Tesla K80).

GPU hardware microarchitecture (Kepler GK110)

[Figure: Kepler GK110 block diagram]

GPU SMX

Each SMX has 192 SP and 64 DP units, 32 special function units, and 32 load/store units.

CUDA installation prerequisites

• Make sure you have root privileges.
• Confirm that the gcc compiler is installed.
• Confirm that the hardware supports CUDA: https://developer.nvidia.com/cuda-gpus
• Confirm that the operating system supports CUDA: at https://developer.nvidia.com/cuda-toolkit-archive, choose the CUDA version you want to install and download the package for your system, or download the generic package with the .run suffix.

CUDA installation

1 Disable the graphical display
    • # service lightdm stop
      or
    • # init 3
2 Run the installer
    • # sh cuda_7.0.28_linux.run
3 Follow the prompts to select the components to install and set the installation path
    • You can answer yes to everything and keep the default path.

CUDA environment variables

1 Configure PATH
    • # export PATH=$PATH:/path-to-cuda/bin
2 Configure LD_LIBRARY_PATH
    • # export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/path-to-cuda/lib64
    • # export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/path-to-cuda/lib64/stubs (contains the library used by -lcuda)

CUDA components

If you choose the generic package and run

    ./cuda_7.0.28_linux.run -extract=~/Path

three CUDA installation components are extracted:
• Device driver: NVIDIA-Linux-x86_64-346.46.run
• CUDA Toolkit development package: cuda-linux64-rel-7.0.28-19326674.run
• Sample programs package: cuda-samples-linux-7.0.28-19326674.run

CUDA Toolkit components

• Development Tools
    • NVCC, PTXAS, cuobjdump, Nsight Eclipse
• Libraries
    • cuBLAS, cuFFT, cuRAND, cuSPARSE, NPP, and others
• Tools
    • CUDA-GDB, CUDA-MEMCHECK, Visual Profiler, NVIDIA-SMI, NVML

nvidia-smi

This command shows the GPU driver and the current running state. For detailed output, run: nvidia-smi -a

Occupancy Calculator

The CUDA Occupancy Calculator computes how fully a given CUDA kernel occupies a multiprocessor of the GPU, that is, the ratio of the number of active warps to the maximum number of warps the GPU supports.

IDE: Nsight

[Figure: the Nsight integrated development environment]

nvcc, the CUDA compiler: input files

    Input file suffix     Description
    .cu                   CUDA source file, containing host code and device functions
    .c                    C source file
    .cc, .cxx, .cpp       C++ source file
    .gpu                  GPU intermediate file
    .ptx                  PTX intermediate assembly file
    .o, .obj              Object file
    .a, .lib              Library file
    .so                   Shared object file
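The occupancy ratio described under the Occupancy Calculator (active warps divided by the maximum number of warps a multiprocessor supports) can be sketched in a few lines of host-side C++. This is only an illustration of the ratio, not the calculator's actual model: the function names `warpsPerBlock` and `occupancy` are hypothetical, and the number of resident blocks and the per-multiprocessor warp limit are assumed to be known inputs. The real calculator derives them from register, shared memory, and block-size limits.

```cpp
#include <cassert>

// Warps needed by one thread block: the thread count rounded up to whole warps.
// A warp size of 32 is taken from the slides; the parameter is kept for clarity.
int warpsPerBlock(int threadsPerBlock, int warpSize = 32) {
    assert(threadsPerBlock > 0 && warpSize > 0);
    return (threadsPerBlock + warpSize - 1) / warpSize;
}

// Occupancy = active warps on one multiprocessor / maximum warps it supports.
// ASSUMPTION: residentBlocks and maxWarpsPerSM are supplied by the caller;
// this sketch does not model register or shared-memory limits.
double occupancy(int residentBlocks, int threadsPerBlock, int maxWarpsPerSM) {
    assert(residentBlocks >= 0 && maxWarpsPerSM > 0);
    int activeWarps = residentBlocks * warpsPerBlock(threadsPerBlock);
    if (activeWarps > maxWarpsPerSM) {
        activeWarps = maxWarpsPerSM;  // cannot exceed the hardware limit
    }
    return static_cast<double>(activeWarps) / maxWarpsPerSM;
}
```

For instance, `occupancy(4, 256, 64)` evaluates to 0.5: four resident blocks of 256 threads give 32 active warps out of an assumed limit of 64 warps per multiprocessor. Check the limit for your own device before relying on the number.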
nvcc compilation options

    Phase                                              Option (short form)   Default output file name
    CUDA compilation to C/C++ source file              -cuda                 .cpp.ii appended to source file name, as in x.cu.cpp.ii
    C/C++ preprocessing                                -E                    standard output
    C/C++ compilation to object file                   -c                    Source file name with suffix replaced by o on Linux/Mac, obj on Windows
    Cubin from CUDA source files                       -cubin                Source file name with suffix replaced by cubin
    Cubin from .gpu intermediate files                 -cubin                Source file name with suffix replaced by cubin
    Cubin from PTX intermediate files                  -cubin                Source file name with suffix replaced by cubin
    PTX from CUDA source files                         -ptx                  Source file name with suffix replaced by ptx
    PTX from .gpu intermediate files                   -ptx                  Source file name with suffix replaced by ptx
    Fatbinary from source, PTX or cubin files          -fatbin               Source file name with suffix replaced by fatbin
    GPU C code from CUDA source files                  -gpu                  Source file name with suffix replaced by gpu
    Linking relocatable device code                    -dlink                a_dlink.obj on Windows or a_dlink.o on other platforms
    Cubin from linked relocatable device code          -dlink -cubin         a_dlink.cubin
    Fatbinary from linked relocatable device code      -dlink -fatbin        a_dlink.fatbin
    Linking an executable                              (no phase option)     a.exe on Windows or a.out on other platforms
    Constructing an object file archive, or library    -lib                  a.lib on Windows or a.a on other platforms
    make dependency generation                         -M                    standard output
    Running an executable                              -run

Two-step CUDA compilation

For the sake of program compatibility, CUDA compilation is designed as two steps. The first step compiles the source for a virtual architecture into a PTX intermediate file, which resembles assembly code. The second step compiles the PTX into the final executable code. PTX can also be recompiled dynamically at run time and then executed.

CUDA virtual (intermediate) and real architectures

    Architecture                                   Feature

    Real architectures
    sm_20                                          Basic features + Fermi support
    sm_30 and sm_32                                + Kepler support + Unified memory programming
    sm_35                                          + Dynamic parallelism support
    sm_50, sm_52, and sm_53                        + Maxwell support

    Virtual architectures
    compute_20                                     Basic features + Fermi support
    compute_30 and compute_32                      + Kepler support + Unified memory programming
    compute_35                                     + Dynamic parallelism support
    compute_50, compute_52, and compute_53         + Maxwell support

• --gpu-architecture arch (may be abbreviated to -arch)
    • Specify the name of the class of NVIDIA virtual GPU architecture for which the
      CUDA input files must be compiled.
• --gpu-code code (may be abbreviated to -code)
    • Specify the name of the NVIDIA GPU to assemble and optimize PTX for.

CUDA and C/C++ code can be compiled separately

[Figure: separate compilation of CUDA and C/C++ sources]

Common ways to compile CUDA programs

Typical compilation. The virtual architecture given first must not be newer than the real architecture given second:

    nvcc x.cu --gpu-architecture=compute_20 --gpu-code=compute_20

Compiling for several architectures:

    nvcc x.cu --gpu-architecture=compute_30 --gpu-code=compute_30,sm_30,sm_35

The shorthand forms

    nvcc x.cu --gpu-architecture=sm_35
    nvcc x.cu --gpu-architecture=compute_30

are equivalent, respectively, to

    nvcc x.cu --gpu-architecture=compute_35 --gpu-code=sm_35,compute_35
    nvcc x.cu --gpu-architecture=compute_30 --gpu-code=compute_30

Compiling CUDA and C/C++ separately and then linking (<path> stands for the directory that holds the CUDA runtime library):

    nvcc --gpu-architecture=sm_20 --device-c a.cu b.cu
    nvcc --gpu-architecture=sm_20 --device-link a.o b.o --output-file link.o
    g++ a.o b.o link.o --library-path=<path> --library=cudart

Compiling and running a CUDA program (ThreadBlock.cu)

    nvcc -gencode=arch=compute_30,code=sm_30 ThreadBlock.cu -O2 -o ThreadBlock.o

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    // Element-wise vector add. The index formula decides which thread handles
    // which element; the commented-out lines are alternatives to experiment with.
    __global__ void vadd(float *a, float *b, float *c, int nn) {
        int ii = 3 * blockIdx.x + threadIdx.x;
        // int ii = blockDim.x * blockIdx.x + threadIdx.x;
        // int ii = threadIdx.x;
        if (ii < nn) {              // guard against out-of-range indices
            c[ii] = a[ii] + b[ii];
        }
    }

    int
    main(void) {
        cudaError_t err = cudaSuccess;

        int nn = 6;
        size_t size = nn * sizeof(float);

        // Host buffers
        float *h_a = (float *)malloc(size);
        float *h_b = (float *)malloc(size);
        float *h_c = (float *)malloc(size);

        for (int i = 0; i < nn; ++i) {
            h_a[i] = rand() / (float)RAND_MAX;
            h_b[i] = rand() / (float)RAND_MAX;
        }

        // Device buffers
        float *d_a = NULL;
        err = cudaMalloc((void **)&d_a, size);
        float *d_b = NULL;
        err = cudaMalloc((void **)&d_b, size);
        float *d_c = NULL;
        err = cudaMalloc((void **)&d_c, size);

        err = cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
        if (err != cudaSuccess) {
            fprintf(stderr, "no. 1122311\n");
        }

        err = cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);
        if (err != cudaSuccess) {
            fprintf(stderr, "no. 1122344\n");
        }

        // 2 blocks of 3 threads each cover the 6 elements
        vadd<<<2, 3>>>(d_a, d_b, d_c, nn);

        err = cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);

        // Verify each element against the host result
        for (int i = 0; i < nn; ++i) {
            if (fabs(h_a[i] + h_b[i] - h_c[i]) > 1e-4) {
                fprintf(stderr, "failed at element %d!\n", i);
            } else {
                fprintf(stderr, "succeeded at element %d!\n", i);
            }
        }

        err = cudaFree(d_a);
        err = cudaFree(d_b);
        err = cudaFree(d_c);

        free(h_a);
        free(h_b);
        free(h_c);

        return 0;
    }

Calling CUDA libraries from a Fortran program

Files: fortran.c, sgemm_speed.f90

    gcc -O3 -DCUBLAS_USE_THUNKING -I/opt/cuda-7.5/include -c fortran.c
    ifort -o sgemm_speed_cublas -O3 -fpp -DCUBLAS sgemm_speed.f90 fortran.o -L/opt/cuda-7.5/lib64 -lcublas

    !
    ! Simple Fortran90 program that multiplies 2 square matrices calling Sgemm
    ! C = alpha A*B + beta C
    !
    program matrix_multiply

    implicit none

    ! Define the floating point kind to be single_precision
    integer, parameter :: fp_kind = kind(0.0)

    ! Define
    real (fp_kind), dimension(:,:), allocatable :: A, B, C
    real :: time_start, time_end
    real (fp_kind) :: alpha=1._fp_kind, beta=1._fp_kind, c_right
    integer :: i, j, m1, m2

    do m1=512,10240,512

      allocate(A(m1,m1))
      allocate(B(m1,m1))
      allocate(C(m1,m1))

      ! Initialize the matrices A, B and C
      A=1._fp_kind
      B=2._fp_kind
      C=3._fp_kind

      ! With the prescribed inputs, each element of the C matrix should be equal to c_right
      c_right= 2._fp_kind*m1+3._fp_kind

      ! Compute the matrix product computation
      call cpu_time(time_start)

    #ifdef CUBLAS
      call cublas_SGEMM ('n','n',m1,m1,m1,alpha,A,m1,B,m1,beta,C,m1)
    #else
      call SGEMM ('n','n',m1,m1,m1,alpha,A,m1,B,m1,beta,C,m1)
    #endif

      call cpu_time(time_end)

      ! Print timing information
      print "(i5,1x,a,1x,f8.4,2x,a,f12.4)", m1, " time =",time_end-time_start, " MFLOPS=",1.e-6*2._fp_kind*m1*m1*m1/(time_end-time_start)

      ! check the result
      do j=1,m1
        do i=1,m1
          if ( abs(c(i,j)- c_right ) .gt. 1.d-8 ) then
            print *, "sgemm failed", i, j, abs(c(i,j)- c_right )
            exit
          end if
        end do
      end do

      deallocate(A,B,C)
    end do

    end program matrix_multiply

Submitting a job (multi-GPU program)

    bsub -m nodename -q k40 -e %J.err -o %J.log ./simpleMultiGPU

    /*
     * Copyright 1993-2015 NVIDIA Corporation.  All rights reserved.
     *
     * Please refer to the NVIDIA end user license agreement (EULA) associated
     * with this source code for terms and conditions that govern your use of
     * this software. Any use, reproduction, disclosure, or distribution of
     * this software and related documentation outside the terms of the EULA
     * is strictly prohibited.
     *
     */

    /*
     * This application demonstrates how to use the CUDA API to use multiple GPUs,
     * with an emphasis on simple illustration of the techniques (not on performance).
     *
     * Note that in order to detect multiple GPUs in your system you have to disable
     * SLI in the nvidia control panel. Otherwise only one GPU is visible to the
     * application. On the other side, you can still extend your desktop to screens
     * attached to both GPUs.
     */

    // System includes
    #include <stdio.h>
    #include <assert.h>

    // CUDA runtime
    #include <cuda_runtime.h>

    // helper functions and utilities to work with CUDA
    #include <helper_functions.h>
    #include <helper_cuda.h>
    #include <timer.h>

    #ifndef MAX
    #define MAX(a, b) (a > b ? a : b)
    #endif

    #include "simpleMultiGPU.h"

    ////////////////////////////////////////////////////////////////////////////////
    // Data configuration
    ////////////////////////////////////////////////////////////////////////////////
    const int MAX_GPU_COUNT = 32;
    const int DATA_N = 1048576 * 32;

    ////////////////////////////////////////////////////////////////////////////////
    // Simple reduction kernel.
    // Refer to the 'reduction' CUDA Sample describing
    // reduction optimization strategies
    ////////////////////////////////////////////////////////////////////////////////
    __global__ static void reduceKernel(float *d_Result, float *d_Input, int N)
    {
        const int tid = blockIdx.x * blockDim.x + threadIdx.x;
        const int threadN = gridDim.x * blockDim.x;
        float sum = 0;

        for (int pos = tid; pos < N; pos += threadN)
            sum += d_Input[pos];

        d_Result[tid] = sum;
    }

    ////////////////////////////////////////////////////////////////////////////////
    // Program main
    ////////////////////////////////////////////////////////////////////////////////
    int main(int argc, char **argv)
    {
        // Solver config
        TGPUplan plan[MAX_GPU_COUNT];

        // GPU reduction results
        float h_SumGPU[MAX_GPU_COUNT];

        float sumGPU;
        double sumCPU, diff;

        int i, j, gpuBase, GPU_N;

        const int BLOCK_N = 32;
        const int THREAD_N = 256;
        const int ACCUM_N = BLOCK_N * THREAD_N;

        printf("Starting simpleMultiGPU\n");
        checkCudaErrors(cudaGetDeviceCount(&GPU_N));

        if (GPU_N > MAX_GPU_COUNT)
        {
            GPU_N = MAX_GPU_COUNT;
        }

        printf("CUDA-capable device count: %i\n", GPU_N);

        printf("Generating input data...\n\n");

        // Subdividing input data across GPUs
        // Get data sizes for each GPU
        for (i = 0; i < GPU_N; i++)
        {
            plan[i].dataN = DATA_N / GPU_N;
        }

        // Take into account "odd" data sizes
        for (i = 0; i < DATA_N % GPU_N; i++)
        {
            plan[i].dataN++;
        }

        // Assign data ranges to GPUs
        gpuBase = 0;

        for (i = 0; i < GPU_N; i++)
        {
            plan[i].h_Sum = h_SumGPU + i;
            gpuBase += plan[i].dataN;
        }

        // Create streams for issuing GPU command asynchronously and allocate memory (GPU and System page-locked)
        for (i = 0; i < GPU_N; i++)
        {
            checkCudaErrors(cudaSetDevice(i));
            checkCudaErrors(cudaStreamCreate(&plan[i].stream));
            // Allocate memory
            checkCudaErrors(cudaMalloc((void **)&plan[i].d_Data, plan[i].dataN * sizeof(float)));
            checkCudaErrors(cudaMalloc((void **)&plan[i].d_Sum, ACCUM_N * sizeof(float)));
            checkCudaErrors(cudaMallocHost((void **)&plan[i].h_Sum_from_device, ACCUM_N * sizeof(float)));
            checkCudaErrors(cudaMallocHost((void **)&plan[i].h_Data, plan[i].dataN * sizeof(float)));

            for (j = 0; j < plan[i].dataN; j++)
            {
                plan[i].h_Data[j] = (float)rand() / (float)RAND_MAX;
            }
        }

        // Start timing and compute on GPU(s)
        printf("Computing with %d GPUs...\n", GPU_N);
        StartTimer();

        // Copy data to GPU, launch the kernel and copy data back. All asynchronously
        for (i = 0; i < GPU_N; i++)
        {
            // Set device
            checkCudaErrors(cudaSetDevice(i));

            // Copy input data from CPU
            checkCudaErrors(cudaMemcpyAsync(plan[i].d_Data, plan[i].h_Data, plan[i].dataN * sizeof(float), cudaMemcpyHostToDevice, plan[i].stream));

            // Perform GPU computations
            reduceKernel<<<BLOCK_N, THREAD_N, 0, plan[i].stream>>>(plan[i].d_Sum, plan[i].d_Data, plan[i].dataN);
            getLastCudaError("reduceKernel() execution failed.\n");

            // Read back GPU results
            checkCudaErrors(cudaMemcpyAsync(plan[i].h_Sum_from_device, plan[i].d_Sum, ACCUM_N * sizeof(float), cudaMemcpyDeviceToHost, plan[i].stream));
        }

        // Process GPU results
        for (i = 0; i < GPU_N; i++)
        {
            float sum;

            // Set device
            checkCudaErrors(cudaSetDevice(i));

            // Wait for all operations to finish
            cudaStreamSynchronize(plan[i].stream);

            // Finalize GPU reduction for current subvector
            sum = 0;

            for (j = 0; j < ACCUM_N; j++)
            {
                sum += plan[i].h_Sum_from_device[j];
            }

            *(plan[i].h_Sum) = (float)sum;

            // Shut down this GPU
            checkCudaErrors(cudaFreeHost(plan[i].h_Sum_from_device));
            checkCudaErrors(cudaFree(plan[i].d_Sum));
            checkCudaErrors(cudaFree(plan[i].d_Data));
            checkCudaErrors(cudaStreamDestroy(plan[i].stream));
        }

        sumGPU = 0;

        for (i = 0; i < GPU_N; i++)
        {
            sumGPU += h_SumGPU[i];
        }

        printf("  GPU Processing time: %f (ms)\n\n", GetTimer());

        // Compute on Host CPU
        printf("Computing with Host CPU...\n\n");

        sumCPU = 0;

        for (i = 0; i < GPU_N; i++)
        {
            for (j = 0; j < plan[i].dataN; j++)
            {
                sumCPU += plan[i].h_Data[j];
            }
        }

        // Compare GPU and CPU results
        printf("Comparing GPU and Host CPU results...\n");
        diff = fabs(sumCPU - sumGPU) / fabs(sumCPU);
        printf("  GPU sum: %f\n  CPU sum: %f\n", sumGPU, sumCPU);
        printf("  Relative difference: %E \n\n", diff);

        // Cleanup and shutdown
        for (i = 0; i < GPU_N; i++)
        {
            checkCudaErrors(cudaSetDevice(i));
            checkCudaErrors(cudaFreeHost(plan[i].h_Data));

            // cudaDeviceReset causes the driver to clean up all state. While
            // not mandatory in normal operation, it is good practice.  It is also
            // needed to ensure correct operation when the application is being
            // profiled. Calling cudaDeviceReset causes all profile data to be
            // flushed before the application exits
            cudaDeviceReset();
        }

        exit((diff < 1e-5) ? EXIT_SUCCESS : EXIT_FAILURE);
    }

Job status output

    Sender:
    Subject: LSF System Job 27734: <./simpleMultiGPU> Done

    Job <./simpleMultiGPU> was submitted from host by user in cluster .
    Job was executed on host(s) , in queue , as user in cluster .
     was used as the home directory.
     was used as the working directory.
    Started at Wed Nov 25 21:03:01 2015
    Results reported at Wed Nov 25 21:03:15 2015

    Your job looked like:

    ------------------------------------------------------------
    # LSBATCH: User input
    ./simpleMultiGPU
    ------------------------------------------------------------

    Successfully completed.

    Resource usage summary:

        CPU time      :      4.56 sec.
        Max Memory    :         1 MB
        Max Swap      :        32 MB

        Max Processes :         1
        Max Threads   :         1

    The output (if any) follows:

    Starting simpleMultiGPU
    CUDA-capable device count: 2
    Generating input data...

    Computing with 2 GPUs...
      GPU Processing time: 10.902000 (ms)

    Computing with Host CPU...

    Comparing GPU and Host CPU results...
      GPU sum: 16777280.000000
      CPU sum: 16777294.395033
      Relative difference: 8.580068E-07

    PS:
    Read file <27734.err> for stderr output of this job.
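Two small pieces of host-side arithmetic in the multi-GPU sample are easy to check in isolation: how DATA_N elements are divided among the GPUs, and how the final relative difference in the job output is obtained. The following is a minimal sketch in plain C++ under stated assumptions; the function names `partitionSizes` and `relativeDifference` are hypothetical helpers introduced here for illustration and are not part of the sample.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Split dataN elements across gpuN devices in the same way as the sample's
// two loops: every device gets dataN / gpuN elements, and the first
// dataN % gpuN devices each get one extra element.
std::vector<int> partitionSizes(int dataN, int gpuN) {
    assert(gpuN > 0 && dataN >= 0);
    std::vector<int> sizes(gpuN, dataN / gpuN);
    for (int i = 0; i < dataN % gpuN; ++i) {
        ++sizes[i];
    }
    return sizes;
}

// Relative difference between the CPU reference sum and the GPU sum,
// the quantity the sample compares against 1e-5 to decide success.
double relativeDifference(double sumCPU, double sumGPU) {
    return std::fabs(sumCPU - sumGPU) / std::fabs(sumCPU);
}
```

With DATA_N = 1048576 * 32 and 2 GPUs, `partitionSizes` gives 16777216 elements per GPU. Feeding the two sums from the job output into `relativeDifference(16777294.395033, 16777280.0)` gives about 8.58e-07, which is below the 1e-5 threshold, so the job exits with EXIT_SUCCESS. The gap comes from the GPU path accumulating in single precision while the CPU reference uses double.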

Contact information

• Supercomputing Center, University of Science and Technology of China
    Office: Room 126, east side of the first floor, New Library, USTC East Campus
    Phone: 0551-63602248
    Email: sccadmin@ustc.edu.cn
    Homepage: http://scc.ustc.edu.cn