Accelerating Deep Neural Networks

Aumit Leon
Aug 10, 2019 · 6 min read

A retrospective on my undergraduate thesis work in Computer Science at Middlebury College.

As part of my undergraduate thesis work, I explored how deep neural networks (DNNs), particularly convolutional neural networks, could be accelerated via parallelization. The full text of my thesis can be found here. Writing this thesis allowed me to focus on an area of machine learning that I find incredibly interesting:

How can machine learning methods such as DNNs scale in a way that facilitates training increasingly complex models on less powerful hardware?

DNNs are wildly popular, to the point of becoming a buzzword. Deep learning methods have been successfully applied to a range of tasks, including speech recognition, superhuman game playing, and visual recognition tasks. Scaling DNNs to work on less powerful hardware would yield several important benefits.

For starters, only a small subset of the people interested in problems surrounding DNNs has access to large compute clusters of GPUs to train their models, like the one used to train AlphaGo. Training these models on less powerful hardware becomes feasible if we can increase the efficiency with which they use their compute devices. By making better use of less powerful hardware, DNNs can be democratized in a way that allows a wider range of engineers, data scientists, researchers, and machine learning enthusiasts to develop more complex and powerful models that can challenge the status quo and redefine the state-of-the-art. A diversity of perspectives is necessary in machine learning, especially as the field comes under fire for perpetuating biases in various applications. The wider software industry continues to see the benefits of this form of democratization: software engineering was once restricted to those who had access to powerful computers, but as computers became cheaper and more accessible and more people were able to contribute to the field, the quality of our software increased.

The ability to train DNN models on less powerful hardware also implies that researchers can build larger, more complex models than ever before. While adding more layers or using more features may not yield more than an incremental improvement on existing methods, optimizing the use of compute resources such as GPUs and specialized hardware such as TPUs and FPGAs while training DNNs allows researchers to explore a wider range of model architectures. Optimizing the use of our tools is still an open problem in computer science and information systems. In 1987, Robert Solow won the Nobel Prize in Economics for describing the relationship between technological innovation and economic growth. Solow would tell the New York Times, "you can see the computer age everywhere but in the productivity statistics."

In the 1980s, technological innovation was soaring but workplace productivity was stagnating. Solow described this productivity paradox as a function of companies failing to effectively adopt and optimize their use of emerging technologies. The same principles can be applied to the nascent field of deep learning: we have to make better use of hardware innovations (particularly around GPUs) in order to increase the effectiveness of DNNs and related methods.

How, then, do we make better use of hardware innovations?

For the 10 weeks that I had to work on my thesis, I explored methods that accelerated the training of DNNs via parallelism. Optimizing the use of compute devices such as GPUs during training would naturally lead to shorter training times.

I chose to focus my research on image recognition tasks. It’s easy to get lost in the breadth of fields where deep learning methods are increasingly defining the state-of-the-art, but I wanted to dive into the depths of a particular application. Image recognition was a natural choice primarily because AlexNet’s performance on the Large Scale Visual Recognition Challenge was one of the first successful, breakthrough applications of deep learning.

The Large Scale Visual Recognition Challenge was a competition built around ImageNet, a dataset of over 14 million labeled images. As the winner of the competition in 2012, AlexNet was a serious display of the power of deep learning methods. My research focused on the ways in which the convolutional neural network architecture of AlexNet could be parallelized during training. I focused on data parallelism, model parallelism, a hybrid method designed specifically for AlexNet known as the One Weird Trick (OWT) method, and a generalizable framework called FlexFlow.

Parallelism Methods

Data parallelism splits equal-sized subsets of the data across the available compute devices. After the training algorithm is applied to each subset on each device, the results are aggregated between training epochs.

Data Parallelism

Thus, if you have 300,000 data points and 3 GPUs, each device would run a full replica of the model architecture on 100,000 of the data points before aggregating the results.
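
To make the idea concrete, here is a minimal sketch of data parallelism in PyTorch. This is my own illustration rather than the code from the thesis: the toy model, batch size, and device checks are all assumptions for the example.

```python
# Minimal sketch of data parallelism (illustration only, not the thesis code):
# every GPU runs a full replica of the model on an equal slice of each batch,
# and the gradients are combined before the weights are updated.
import torch
import torch.nn as nn

# A toy stand-in for a convolutional network; the thesis experiments used
# AlexNet-style architectures instead.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(16, 10),
)

if torch.cuda.is_available() and torch.cuda.device_count() > 1:
    # nn.DataParallel replicates the model on every visible GPU and
    # scatters each input batch across the replicas.
    model = nn.DataParallel(model).cuda()
elif torch.cuda.is_available():
    model = model.cuda()

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

# One training step: the batch is split across devices, each replica runs
# its forward/backward pass, and the gradients are aggregated.
inputs = torch.randn(30, 3, 224, 224)     # stand-in for real image data
labels = torch.randint(0, 10, (30,))
if torch.cuda.is_available():
    inputs, labels = inputs.cuda(), labels.cuda()

optimizer.zero_grad()
loss = criterion(model(inputs), labels)
loss.backward()
optimizer.step()
```

The same pattern scales to the 300,000-point example above: each replica sees a third of the data, and only the aggregated gradients are shared between devices.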

In contrast, model parallelism splits the model architecture itself across each available compute device, and runs the full dataset through each device.

Model Parallelism

By splitting the model architecture across the compute devices, each device is responsible for an equal-sized subset of the total weights matrix. If the weights matrix is 900 x 900 and there are 3 GPUs, each GPU is responsible for a 900 x 300 slice of the matrix. As before, all the intermediate results are aggregated.
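
Here is a minimal sketch of that column-wise split for a single linear layer, again using PyTorch. The class name and device handling are my own assumptions for the example, not code from the thesis.

```python
# Minimal sketch of model parallelism (illustration only): a single
# 900 x 900 linear layer is split column-wise so each device owns one
# slice of the weight matrix; with 3 GPUs, that is a 900 x 300 slice each.
import torch
import torch.nn as nn

class ColumnParallelLinear(nn.Module):
    def __init__(self, in_features, out_features, devices):
        super().__init__()
        self.devices = devices
        # Each shard holds out_features / len(devices) output columns
        # (this assumes the device count divides out_features evenly).
        shard = out_features // len(devices)
        self.shards = nn.ModuleList(
            [nn.Linear(in_features, shard).to(dev) for dev in devices]
        )

    def forward(self, x):
        # Send the full input to every device, compute the partial outputs,
        # then gather the pieces back onto the first device.
        parts = [fc(x.to(dev)) for fc, dev in zip(self.shards, self.devices)]
        return torch.cat([p.to(self.devices[0]) for p in parts], dim=1)

devices = (
    [f"cuda:{i}" for i in range(torch.cuda.device_count())]
    if torch.cuda.is_available()
    else ["cpu"]
)
layer = ColumnParallelLinear(900, 900, devices)
out = layer(torch.randn(32, 900).to(devices[0]))
print(out.shape)  # torch.Size([32, 900]) when the device count divides 900
```

Unlike data parallelism, every device sees every data point here; what gets partitioned is the model itself, and the partial outputs have to be gathered back together.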

Alex Krizhevsky’s OWT is a hybrid method that combines model and data parallelism specifically for the AlexNet architecture. In particular, OWT uses data parallelism for the convolutional layers and model parallelism for the fully connected layers, since the convolutional layers account for most of the computation but relatively few of the weights, while the fully connected layers hold most of the weights.
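
A conceptual sketch of that hybrid split is shown below. This is my own simplified illustration of the idea rather than Krizhevsky's implementation: the layer sizes are AlexNet-like but arbitrary, and a real OWT implementation avoids gathering all activations onto one device between the two stages.

```python
# Conceptual sketch of an OWT-style hybrid (illustration only): the
# convolutional layers are replicated with data parallelism, while the
# fully connected classifier is split column-wise across devices.
import torch
import torch.nn as nn

class OWTStyleNet(nn.Module):
    def __init__(self, devices):
        super().__init__()
        self.devices = devices
        # Convolutional feature extractor: one replica per GPU, each replica
        # processing a different slice of the batch (data parallelism).
        features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(6),
            nn.Flatten(),
        )
        self.features = (
            nn.DataParallel(features.to(devices[0]))
            if len(devices) > 1 else features.to(devices[0])
        )
        # Fully connected classifier: one column shard per device
        # (model parallelism), assuming the device count divides 1000.
        shard = 1000 // len(devices)
        self.fc_shards = nn.ModuleList(
            [nn.Linear(64 * 6 * 6, shard).to(dev) for dev in devices]
        )

    def forward(self, x):
        # Simplification: activations are gathered on one device here;
        # real OWT implementations exchange them more carefully.
        feats = self.features(x.to(self.devices[0]))
        parts = [fc(feats.to(dev)) for fc, dev in zip(self.fc_shards, self.devices)]
        return torch.cat([p.to(self.devices[0]) for p in parts], dim=1)

devices = (
    [f"cuda:{i}" for i in range(torch.cuda.device_count())]
    if torch.cuda.is_available()
    else ["cpu"]
)
net = OWTStyleNet(devices)
logits = net(torch.randn(8, 3, 224, 224))
```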

Data and model parallelism are generalizable methods that can be applied to a wide range of neural network architectures, but OWT was designed specifically for AlexNet. FlexFlow is the model-agnostic, generalizable parallelization framework that I explored; in the literature, FlexFlow has been shown to outperform data parallelism, model parallelism, and OWT when training AlexNet on the ImageNet corpus.

Parallelism and Deep Learning

Among the methods I implemented on my own, I found that the OWT method exhibited stronger performance when the input data more closely matched ImageNet’s images, particularly in size. The nuances of why one method performs better than another are still an open question for me, and something I plan on spending more time exploring. To read the details of my experiments and my analysis, refer to my full thesis.

In the spirit of democratizing machine learning methods, I believe that generalizable frameworks are the future of the field. Not only do they outperform general methods like model and data parallelism, they have also been shown to outperform model-specific, expert-designed methods such as OWT. These frameworks are extensible and take away the overhead of deducing a parallelization strategy that optimizes the use of compute resources. With more research and open source contributions, frameworks like FlexFlow can help build the next generation of state-of-the-art deep learning methods.

Conclusions

Parallelization methods are a powerfully effective way to optimize overall device usage during model training. By parallelizing computation during training, DNN training times are reduced, which enables the development of larger, more powerful models. While various parallelism methods exist, generalizable frameworks are the most effective and efficient tools available to researchers right out of the box.

Parallelism should be baked into the APIs we use to build machine learning models. No matter the API, be it TensorFlow, PyTorch, Keras, or the new flavor of the month, these tools should be built with parallelism in mind. Most APIs do a decent job of supporting general use cases out of the box, but truly efficient device usage via parallelism requires additional overhead. In the same way that programming languages and constructs lowered the barrier to entry into software engineering and improved the overall quality of software in the process, parallelism-enabled tools and efficient device usage will improve the quality of deep learning methods while making machine learning research more accessible.
