A multi-step delayed parameter update parallel optimization method for deep neural networks
Affiliation:

(1. School of Electronic and Information Engineering, Lanzhou Jiaotong University, Lanzhou 730070, China; 2. School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an 710049, China)

CLC Number: TP391


    Abstract:

    To address the high communication overhead caused by global gradient-based parameter updates at aggregation nodes in distributed data-parallel training of deep neural networks (DNNs), a parallel optimization method based on multi-step delayed parameter updates is proposed. First, an adaptive strategy for selecting the multi-step update interval is designed: each node completes multiple local iterative parameter updates before the node gradients are aggregated to update the global model parameters, which reduces the communication overhead caused by frequent gradient aggregation. To prevent the local models from deviating from the global model after several local updates, a parameter correction strategy is also proposed to preserve training accuracy. Second, during gradient aggregation, the gradient tensor is split into several sub-tensors; combined with sub-tensor priority scheduling, this overlaps communication with computation as much as possible and further accelerates model training. Finally, the proposed method is compared with the SSGD and Local SGD training methods on the CIFAR-100 and ImageNet-mini datasets. The results show that the proposed method significantly reduces the communication overhead of parameter updates while maintaining model training accuracy, increases the overlap of communication and computation, and makes fuller use of computing resources to speed up parallel training. These results offer a new approach to reducing communication costs in the distributed training of deep neural networks.
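The two mechanisms described in the abstract can be illustrated with a toy sketch. It makes several assumptions that are not taken from the paper: the update interval is fixed rather than adaptive, aggregation is plain averaging of local weights, no parameter correction is applied, and sub-tensors are prioritized by layer index (lower index first). The names `local_sgd` and `schedule_subtensors` are hypothetical.

```python
import heapq


def local_sgd(shards, lr=0.5, interval=4, rounds=10):
    """Fit y = w * x; workers sync by averaging once every `interval` local steps."""
    w_global, syncs = 0.0, 0
    for r in range(rounds):
        local_weights = []
        for shard in shards:                      # each shard stands in for one worker
            w = w_global                          # start from the current global model
            for step in range(interval):          # local updates, no communication
                x, y = shard[(r * interval + step) % len(shard)]
                w -= lr * 2.0 * (w * x - y) * x   # gradient step on (w*x - y)**2
            local_weights.append(w)
        w_global = sum(local_weights) / len(local_weights)  # one aggregation per round
        syncs += 1
    return w_global, syncs


def schedule_subtensors(layer_grads, chunk_size):
    """Split each layer's flat gradient into chunks and order them for sending.

    Assumed priority rule (not from the paper): lower layer index goes first.
    """
    queue = []
    for layer, grad in enumerate(layer_grads):
        for start in range(0, len(grad), chunk_size):
            heapq.heappush(queue, (layer, start, grad[start:start + chunk_size]))
    order = []
    while queue:
        order.append(heapq.heappop(queue))
    return order
```

With `interval=4` and `rounds=10`, each worker takes 40 local steps but takes part in only 10 aggregations, a quarter of what per-step synchronization would need. A real implementation would send each chunk asynchronously while computation continues, which this sketch does not model.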

History
  • Received: July 17, 2024
  • Online: September 15, 2025