DGX Performance Testing


## Changing the Docker registry mirror

By default Docker pulls images from registries abroad, which is slow and prone to timeouts, so switch to a domestic mirror. Edit the daemon config:

```bash
sudo vim /etc/docker/daemon.json
```
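A minimal sketch of what goes into `daemon.json`, assuming all you need is a mirror entry; the mirror URL below is only an example, substitute whichever registry mirror you actually use:

```bash
# Write a registry-mirrors entry to the Docker daemon config
# (example mirror URL; replace with your own).
sudo tee /etc/docker/daemon.json <<'EOF'
{
  "registry-mirrors": ["https://registry.docker-cn.com"]
}
EOF
```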

Restart the service:

```bash
sudo service docker restart
```
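You can then confirm the mirror took effect; `docker info` lists any configured registry mirrors in its output:

```bash
docker info
```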

## Pulling Docker images

Search on https://hub.docker.com, or run `docker search XXX` directly.
Then pull the image with:

```bash
nvidia-docker pull name:tag
```
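For example, to pull the NGC TensorFlow image used in the tests below:

```bash
nvidia-docker pull nvcr.io/nvidia/tensorflow:18.03-py3
```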

Remove a container:

```bash
nvidia-docker rm c8a4bf012268
```

Stop a container:

```bash
docker stop a004f2b5888a
```

Start it again after stopping:

```bash
docker start a004f2b5888a
```

Reattach after exiting:

```bash
docker attach a004f2b5888a
```
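The hex strings above are container IDs; if you have lost track of them, list all containers first:

```bash
# -a also shows stopped containers
docker ps -a
```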

## Using Docker

```bash
nvidia-docker run -it -v /raid/chenshi:/chen_data --name chen_keras_tf nvcr.io/nvidia/tensorflow:18.03-py3 bash
```

Breaking the command down:

- `nvidia-docker run`
- `-it`: interactive terminal
- `-v /raid/chenshi:/chen_data`: map a host directory into the container (`host_path:container_path`)
- `--name chen_keras_tf`: container name
- `nvcr.io/nvidia/tensorflow:18.03-py3`: image name:tag
- `bash`: the command to run inside the container

If you need to isolate GPUs, prefix the `run` command with `NV_GPU` (full example below):

```bash
NV_GPU=0,1,4 nvidia-docker run ...
```
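Putting it together with the `run` command from above (all names and paths reused from this post):

```bash
# only GPUs 0, 1 and 4 are visible inside the container
NV_GPU=0,1,4 nvidia-docker run -it -v /raid/chenshi:/chen_data --name chen_keras_tf nvcr.io/nvidia/tensorflow:18.03-py3 bash
```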

When using the machine, please set a distinguishing container name. If you have changed the environment, consider committing it as an image for your own later use, as follows:

```bash
nvidia-docker commit 2b1a57a3cb4c chen_keras:v1.0
```

Here `2b1a57a3cb4c` is the container ID, `chen_keras` is the image name, and `v1.0` is the tag.

Then export it as an archive with `save`:

```bash
nvidia-docker save chen_keras:v1.0 > chen_keras.tar
```
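To restore the archive later, or on another machine, `load` is the standard counterpart of `save`:

```bash
docker load < chen_keras.tar
```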

## Notes
1. Before `run`, decide whether you need GPU isolation.
2. Mind the disk mapping: create your own folder under the shared directory and map it as a whole.
3. Use container names to tell whose container is whose, to avoid confusion.

## Tests

### Test 1: plant_seedlings (Keras)
K80: 216 s per epoch, 2 s/step
DGX: 36 s per epoch, 285 ms/step
Speedup: 216/36 = 6

### Test 2: neural-style-master
Iterating with imagenet-vgg-verydeep-19.mat
https://github.com/anishathalye/neural-style
1000 iterations on the same image:
DGX: 208 s
K80: 1451 s
Speedup: 1451/208 ≈ 6.98
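The benchmark drives that repository's CLI; a sketch of the invocation (flag names follow the repo's README as I recall it, and the input file names are hypothetical):

```bash
python neural_style.py --content content.jpg --styles style.jpg --output out.jpg --iterations 1000
```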

### Test 3: mnist_mlp
DGX:
1 GPU: 2 s, 26 us/step
2 GPUs: 2 s, 38 us/step
4 GPUs: 3 s, 55 us/step
8 GPUs: 6 s, 104 us/step

K80:
1 GPU: 2 s

The task is too small, so there is almost no speedup.

### Test 4: Dog_breed_identification

DGX:
InceptionV3: 321.28376555
Xception: 315.984331607
InceptionResNetV2: 344.929951

K80:
InceptionV3: 737.23624
Xception: 1092.263
InceptionResNetV2: 1374.97

Speedup: roughly 2-3x

### Test 5: cifar-10
K80:
1 GPU: step 180, loss = 3.86 (4249.7 examples/sec; 0.030 sec/batch)
2 GPUs: step 15560, loss = 0.72 (8375.4 examples/sec; 0.015 sec/batch)
4 GPUs: step 180, loss = 3.72 (17076.5 examples/sec; 0.007 sec/batch)

DGX:
1 GPU: step 880, loss = 2.52 (11222.9 examples/sec; 0.011 sec/batch)
2 GPUs: step 1490, loss = 1.80 (15099.7 examples/sec; 0.008 sec/batch)
4 GPUs: step 310, loss = 3.59 (15553.4 examples/sec; 0.008 sec/batch)
8 GPUs: step 2760, loss = 1.17 (14002.3 examples/sec; 0.009 sec/batch)

Multi-GPU speedup is not significant; a single DGX GPU is roughly 3x a single K80.

### Test 6: reinforcement learning task
K80: 6 ms/step
DGX: 2 ms/step

A single GPU is about 3x as fast.

### Summary
1. This comparison was against a K80 server. From the results, single-GPU performance far exceeds the K80; on some tasks it reaches roughly 7x the K80.
2. In multi-GPU comparisons the K80 server scales close to linearly, whereas on the DGX multi-GPU runs barely improve throughput. Two guesses: (1) a single GPU is already fast enough that other latencies become the bottleneck; (2) the Docker setup may not be sufficiently optimized.
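One way to probe the first guess is to watch per-GPU utilization during a multi-GPU run; utilization staying low across all cards would point at a non-GPU bottleneck:

```bash
# refresh GPU utilization once per second while training
nvidia-smi -l 1
```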

## Appendix: ESN test code (MATLAB)

The script below predicts the Mackey-Glass time series with an echo state network whose reservoir is a fixed ring with regular jumps, trained with a ridge-regression readout.

```matlab
% load the data
trainLen = 3000;
testLen = 1000;
initLen = 100;

data = load('MackeyGlass-t17.txt');

% plot some of it
% figure(10);
% plot(data(1:1000));
% title('A sample of data');

% generate the ESN reservoir
inSize = 1; outSize = 1;
resSize = 500;
a = 0.3; % leaking rate

%rand( 'seed', 42 );
% Win = (rand(resSize,1+inSize)-0.5) .* 1;
% W = rand(resSize,resSize)-0.5;
% Option 1 - direct scaling (quick&dirty, reservoir-specific):
% W = W .* 0.13;
% Option 2 - normalizing and setting spectral radius (correct, slower):
% disp 'Computing spectral radius...';
% opt.disp = 0;
% rhoW = abs(eigs(W,1,'LM',opt));
% disp 'done.'
% W = W .* ( 1.25 /rhoW);

%%% the fourth question: cycle reservoir with jumps %%%
r_i = .5;      % input weight magnitude
r_c = .8;      % ring (cycle) weight
r_j = .7;      % jump weight
jump_l = 10;   % jump length
% signv.mat is assumed to hold a resSize x 2 matrix of +/-1 signs named
% 'sign'; after load it shadows MATLAB's built-in sign() where indexed below
load('signv.mat');

Win = ones(resSize,1+inSize) .* r_i;  % constant-magnitude input weights
W = zeros(resSize, resSize);

% ring: each unit feeds the next one, and the last feeds the first
W(1, resSize) = r_c;
for i = 1 : resSize
    if(i + 1 <= resSize)
        W(i + 1, i) = r_c;
    end
end

% bidirectional jump connections every jump_l units
for i = 1 : jump_l : resSize
    if(i + jump_l < resSize)
        W(i, i + jump_l) = r_j;
        W(i + jump_l, i) = r_j;
    end
end


% allocate memory for the design (collected states) matrix
X = zeros(1+inSize+resSize,trainLen-initLen);
% set the corresponding target matrix directly
Yt = data(initLen+2:trainLen+1)';

% run the reservoir with the data and collect X
x = zeros(resSize,1);
for t = 1:trainLen
    u = data(t);
    % sign(:,1:2) applies the +/-1 signs from signv.mat to the input weights
    x = (1-a)*x + a*tanh( Win.*sign(:,1:2)*[1;u] + W*x );
    if t > initLen
        X(:,t-initLen) = [1;u;x];
    end
end

% train the output
reg = 1e-8; % regularization coefficient
X_T = X';
% Wout = Yt*X_T * inv(X*X_T + reg*eye(1+inSize+resSize));
Wout = Yt*X_T / (X*X_T + reg*eye(1+inSize+resSize));
% Wout = Yt*pinv(X);

% run the trained ESN in a generative mode. no need to initialize here,
% because x is initialized with training data and we continue from there.
Y = zeros(outSize,testLen);
u = data(trainLen+1);
for t = 1:testLen
    x = (1-a)*x + a*tanh( Win.*sign(:,1:2)*[1;u] + W*x );
    y = Wout*[1;u;x];
    Y(:,t) = y;
    % generative mode:
    u = y;
    % this would be a predictive mode:
    %u = data(trainLen+t+1);
end

errorLen = 1000;
mse = sum((data(trainLen+2:trainLen+errorLen+1)'-Y(1,1:errorLen)).^2)./errorLen;
disp( ['MSE = ', num2str( mse )] );
```
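To run it from a shell (assuming the script is saved as, say, esn_crj.m next to MackeyGlass-t17.txt and signv.mat; the script name is hypothetical):

```bash
# -nodisplay suppresses the GUI; the script prints "MSE = ..." when done
matlab -nodisplay -r "esn_crj; exit"
```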