DGX Performance Testing


## Changing the Docker registry mirror

By default Docker pulls images from registries abroad, which is slow and prone to timeouts, so switch to a domestic mirror. Edit the daemon config:

```bash
sudo vim /etc/docker/daemon.json
```
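A minimal sketch of what goes into `daemon.json`, assuming all you need is a mirror entry; the mirror URL below is only an example, substitute whichever registry mirror you actually use:

```bash
# Write a registry-mirrors entry to the Docker daemon config
# (example mirror URL; replace with your own).
sudo tee /etc/docker/daemon.json <<'EOF'
{
  "registry-mirrors": ["https://registry.docker-cn.com"]
}
EOF
```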

Restart the service:

```bash
sudo service docker restart
```
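You can then confirm the mirror took effect; `docker info` lists any configured registry mirrors in its output:

```bash
docker info
```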

## Pulling Docker images

Search on https://hub.docker.com, or run `docker search XXX` directly.
Then pull the image with:

```bash
nvidia-docker pull name:tag
```
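For example, to pull the NGC TensorFlow image used in the tests below:

```bash
nvidia-docker pull nvcr.io/nvidia/tensorflow:18.03-py3
```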

Remove a container:

```bash
nvidia-docker rm c8a4bf012268
```

Stop a container:

```bash
docker stop a004f2b5888a
```

Start it again after stopping:

```bash
docker start a004f2b5888a
```

Reattach after exiting:

```bash
docker attach a004f2b5888a
```
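The hex strings above are container IDs; if you have lost track of them, list all containers first:

```bash
# -a also shows stopped containers
docker ps -a
```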

## Using Docker

```bash
nvidia-docker run -it -v /raid/chenshi:/chen_data --name chen_keras_tf nvcr.io/nvidia/tensorflow:18.03-py3 bash
```

Breaking the command down:

- `nvidia-docker run`
- `-it`: interactive terminal
- `-v /raid/chenshi:/chen_data`: map a host directory into the container (`host_path:container_path`)
- `--name chen_keras_tf`: container name
- `nvcr.io/nvidia/tensorflow:18.03-py3`: image name:tag
- `bash`: the command to run inside the container

If you need to isolate GPUs, prefix the `run` command with `NV_GPU` (full example below):

```bash
NV_GPU=0,1,4 nvidia-docker run ...
```
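Putting it together with the `run` command from above (all names and paths reused from this post):

```bash
# only GPUs 0, 1 and 4 are visible inside the container
NV_GPU=0,1,4 nvidia-docker run -it -v /raid/chenshi:/chen_data --name chen_keras_tf nvcr.io/nvidia/tensorflow:18.03-py3 bash
```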

When using the machine, please set a distinguishing container name. If you have changed the environment, consider committing it as an image for your own later use, as follows:

```bash
nvidia-docker commit 2b1a57a3cb4c chen_keras:v1.0
```

Here `2b1a57a3cb4c` is the container ID, `chen_keras` is the image name, and `v1.0` is the tag.

Then export it as an archive with `save`:

```bash
nvidia-docker save chen_keras:v1.0 > chen_keras.tar
```
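To restore the archive later, or on another machine, `load` is the standard counterpart of `save`:

```bash
docker load < chen_keras.tar
```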

## Notes
1. Before `run`, decide whether you need GPU isolation.
2. Mind the disk mapping: create your own folder under the shared directory and map it as a whole.
3. Use container names to tell whose container is whose, to avoid confusion.

## Tests

### Test 1: plant_seedlings (Keras)
K80: 216 s per epoch, 2 s/step
DGX: 36 s per epoch, 285 ms/step
Speedup: 216/36 = 6

### Test 2: neural-style-master
Iterating with imagenet-vgg-verydeep-19.mat
https://github.com/anishathalye/neural-style
1000 iterations on the same image:
DGX: 208 s
K80: 1451 s
Speedup: 1451/208 ≈ 6.98
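The benchmark drives that repository's CLI; a sketch of the invocation (flag names follow the repo's README as I recall it, and the input file names are hypothetical):

```bash
python neural_style.py --content content.jpg --styles style.jpg --output out.jpg --iterations 1000
```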

### Test 3: mnist_mlp
DGX:
1 GPU: 2 s, 26 us/step
2 GPUs: 2 s, 38 us/step
4 GPUs: 3 s, 55 us/step
8 GPUs: 6 s, 104 us/step

K80:
1 GPU: 2 s

The task is too small, so there is almost no speedup.

### Test 4: Dog_breed_identification

DGX:
InceptionV3: 321.28376555
Xception: 315.984331607
InceptionResNetV2: 344.929951

K80:
InceptionV3: 737.23624
Xception: 1092.263
InceptionResNetV2: 1374.97

Speedup: roughly 2-3x

### Test 5: cifar-10
K80:
1 GPU: step 180, loss = 3.86 (4249.7 examples/sec; 0.030 sec/batch)
2 GPUs: step 15560, loss = 0.72 (8375.4 examples/sec; 0.015 sec/batch)
4 GPUs: step 180, loss = 3.72 (17076.5 examples/sec; 0.007 sec/batch)

DGX:
1 GPU: step 880, loss = 2.52 (11222.9 examples/sec; 0.011 sec/batch)
2 GPUs: step 1490, loss = 1.80 (15099.7 examples/sec; 0.008 sec/batch)
4 GPUs: step 310, loss = 3.59 (15553.4 examples/sec; 0.008 sec/batch)
8 GPUs: step 2760, loss = 1.17 (14002.3 examples/sec; 0.009 sec/batch)

Multi-GPU speedup is not significant; a single DGX GPU is roughly 3x a single K80.

### Test 6: reinforcement learning task
K80: 6 ms/step
DGX: 2 ms/step

A single GPU is about 3x as fast.

### Summary
1. This comparison was against a K80 server. From the results, single-GPU performance far exceeds the K80; on some tasks it reaches roughly 7x the K80.
2. In multi-GPU comparisons the K80 server scales close to linearly, whereas on the DGX multi-GPU runs barely improve throughput. Two guesses: (1) a single GPU is already fast enough that other latencies become the bottleneck; (2) the Docker setup may not be sufficiently optimized.
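One way to probe the first guess is to watch per-GPU utilization during a multi-GPU run; utilization staying low across all cards would point at a non-GPU bottleneck:

```bash
# refresh GPU utilization once per second while training
nvidia-smi -l 1
```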

## Appendix: ESN test code (MATLAB)

The script below predicts the Mackey-Glass time series with an echo state network whose reservoir is a fixed ring with regular jumps, trained with a ridge-regression readout.

```matlab
% load the data
trainLen = 3000;
testLen = 1000;
initLen = 100;

data = load('MackeyGlass-t17.txt');

% plot some of it
% figure(10);
% plot(data(1:1000));
% title('A sample of data');

% generate the ESN reservoir
inSize = 1; outSize = 1;
resSize = 500;
a = 0.3; % leaking rate

%rand( 'seed', 42 );
% Win = (rand(resSize,1+inSize)-0.5) .* 1;
% W = rand(resSize,resSize)-0.5;
% Option 1 - direct scaling (quick&dirty, reservoir-specific):
% W = W .* 0.13;
% Option 2 - normalizing and setting spectral radius (correct, slower):
% disp 'Computing spectral radius...';
% opt.disp = 0;
% rhoW = abs(eigs(W,1,'LM',opt));
% disp 'done.'
% W = W .* ( 1.25 /rhoW);

%%% the fourth question: cycle reservoir with jumps %%%
r_i = .5;      % input weight magnitude
r_c = .8;      % ring (cycle) weight
r_j = .7;      % jump weight
jump_l = 10;   % jump length
% signv.mat is assumed to hold a resSize x 2 matrix of +/-1 signs named
% 'sign'; after load it shadows MATLAB's built-in sign() where indexed below
load('signv.mat');

Win = ones(resSize,1+inSize) .* r_i;  % constant-magnitude input weights
W = zeros(resSize, resSize);

% ring: each unit feeds the next one, and the last feeds the first
W(1, resSize) = r_c;
for i = 1 : resSize
    if(i + 1 <= resSize)
        W(i + 1, i) = r_c;
    end
end

% bidirectional jump connections every jump_l units
for i = 1 : jump_l : resSize
    if(i + jump_l < resSize)
        W(i, i + jump_l) = r_j;
        W(i + jump_l, i) = r_j;
    end
end


% allocate memory for the design (collected states) matrix
X = zeros(1+inSize+resSize,trainLen-initLen);
% set the corresponding target matrix directly
Yt = data(initLen+2:trainLen+1)';

% run the reservoir with the data and collect X
x = zeros(resSize,1);
for t = 1:trainLen
    u = data(t);
    % sign(:,1:2) applies the +/-1 signs from signv.mat to the input weights
    x = (1-a)*x + a*tanh( Win.*sign(:,1:2)*[1;u] + W*x );
    if t > initLen
        X(:,t-initLen) = [1;u;x];
    end
end

% train the output
reg = 1e-8; % regularization coefficient
X_T = X';
% Wout = Yt*X_T * inv(X*X_T + reg*eye(1+inSize+resSize));
Wout = Yt*X_T / (X*X_T + reg*eye(1+inSize+resSize));
% Wout = Yt*pinv(X);

% run the trained ESN in a generative mode. no need to initialize here,
% because x is initialized with training data and we continue from there.
Y = zeros(outSize,testLen);
u = data(trainLen+1);
for t = 1:testLen
    x = (1-a)*x + a*tanh( Win.*sign(:,1:2)*[1;u] + W*x );
    y = Wout*[1;u;x];
    Y(:,t) = y;
    % generative mode:
    u = y;
    % this would be a predictive mode:
    %u = data(trainLen+t+1);
end

errorLen = 1000;
mse = sum((data(trainLen+2:trainLen+errorLen+1)'-Y(1,1:errorLen)).^2)./errorLen;
disp( ['MSE = ', num2str( mse )] );
```
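To run it from a shell (assuming the script is saved as, say, esn_crj.m next to MackeyGlass-t17.txt and signv.mat; the script name is hypothetical):

```bash
# -nodisplay suppresses the GUI; the script prints "MSE = ..." when done
matlab -nodisplay -r "esn_crj; exit"
```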