The use case
A typical deep learning advance starts out as an idea, which you test on a small problem. At this stage, you want to run many ad-hoc experiments quickly. Ideally, you can just SSH into a machine, run a script in screen, and get a result in less than an hour.
Making the model really work usually requires seeing it fail in every conceivable way and finding ways to fix those limitations. (This is similar to building any new software system, where you’ll run your code many times to build an intuition for how it behaves.)
So deep learning infrastructure must allow users to flexibly introspect models, and it’s not enough to just expose summary statistics.
Once the model shows sufficient promise, you’ll scale it up to larger datasets and more GPUs. This requires long jobs that consume many cycles and last for multiple days. You’ll need careful experiment management, and to be extremely thoughtful about your chosen range of hyperparameters.
The early research process is unstructured and rapid; the latter is methodical and somewhat painful, but it’s all absolutely necessary to get a great result.
The paper Improved Techniques for Training GANs began with Tim Salimans devising several ideas for improving Generative Adversarial Network training. We’ll describe the simplest of these ideas (which happened to produce the best-looking samples, though not the best semi-supervised learning).
GANs consist of a generator and a discriminator network. The generator tries to fool the discriminator, and the discriminator tries to distinguish between generated data and real data. Intuitively, a generator which can fool every discriminator is quite good. But there is a hard-to-fix failure mode: the generator can “collapse” by always outputting exactly the same (likely realistic-looking!) sample.
Tim had the idea to give discriminator an entire minibatch of samples as input, rather than just one sample. Thus the discriminator can tell whether the generator just constantly produces a single image. With the collapse discovered, gradients will be sent to the generator to correct the problem.
The next step was to prototype the idea on MNIST and CIFAR-10. This required prototyping a small model as quickly as possible, running it on real data, and inspecting the result. After some rapid iteration, Tim got very encouraging CIFAR-10 samples=—pretty much the best samples we’d seen on this dataset.
However, deep learning (and AI algorithms in general) must be scaled to be truly impressive—a small neural network is a proof of concept, but a big neural network actually solves the problem and is useful. So Ian Goodfellow dug into scaling the model up to work on ImageNet.
With a larger model and dataset, Ian needed to parallelize the model across multiple GPUs. Each job would push multiple machines to 90% CPU and GPU utilization, but even then the model took many days to train. In this regime, every experiment became precious, and he would meticulously log the results of each experiment.
Ultimately, while the results were good, they were not as good as we hoped. We’ve tested many hypotheses as to why, but still haven’t cracked it. Such is the nature of science.
The vast majority of our research code is written in Python, as reflected in our open-source projects. We mostly use TensorFlow (or Theano in special cases) for GPU computing; for CPU we use those or Numpy. Researchers also sometimes use higher-level frameworks like Keras on top of TensorFlow.
Like much of the deep learning community, we use Python 2.7. We generally use Anaconda, which has convenient packaging for otherwise difficult packages such as OpenCV and performance optimizations for some scientific libraries.
For an ideal batch job, doubling the number of nodes in your cluster will halve the job’s runtime. Unfortunately, in deep learning, people usually see very sublinear speedups from many GPUs. Top performance thus requires top-of-the-line GPUs. We also use quite a lot of CPU for simulators, reinforcement learning environments, or small-scale models (which run no faster on a GPU).
AWS generously agreed to donate a large amount of compute to us. We’re using them for CPU instances and for horizontally scaling up GPU jobs. We also run our own physical servers, primarily running Titan X GPUs. We expect to have a hybrid cloud for the long haul: it’s valuable to experiment with different GPUs, interconnects, and other techniques which may become important for the future of deep learning.
We approach infrastructure like many companies treat product: it must present a simple interface, and usability is as important as functionality. We use a consistent set of tools to manage all of our servers and configure them as identically as possible.
We use Terraform to set up our AWS cloud resources (instances, network routes, DNS records, etc). Our cloud and physical nodes run Ubuntu and are configured with Chef. For faster spinup times, we pre-bake our cluster AMIs using Packer. All our clusters use non-overlapping IP ranges and are interconnected over the public internet with OpenVPN on user laptops, and strongSwan on physical nodes (which act as AWS Customer Gateways).
Scalable infrastructure often ends up making the simple cases harder. We put equal effort into our infrastructure for small- and large-scale jobs, and we’re actively solidifying our toolkit for making distributed use-cases as accessible as local ones.
We provide a cluster of SSH nodes (both with and without GPUs) for ad-hoc experimentation, and run Kubernetes as our cluster scheduler for physical and AWS nodes. Our cluster spans 3 AWS regions—our jobs are bursty enough that we’ll sometimes hit capacity on individual regions.
Kubernetes requires each job to be a Docker container, which gives us dependency isolation and code snapshotting. However, building a new Docker container can add precious extra seconds to a researcher’s iteration cycle, so we also provide tooling to transparently ship code from a researcher’s laptop into a standard image.