Introduction
TensorFlow™ is an open source software library for numerical computation using data flow graphs.
TensorFlow is for everyone. It's for students, researchers, hobbyists, hackers, engineers, developers, inventors, and innovators, and is open sourced under the Apache 2.0 license.
Distributed version
On April 13, 2016, the distributed version of TensorFlow was released with the announcement:
Announcing TensorFlow 0.8 – now with distributed computing support!
The post was also shared on Google+ by Jeff Dean, a Google Senior Fellow in the Systems Infrastructure Group.
Installation
I tried the distributed version on 3 virtual machines running on OpenStack, a well-known open source cloud operating system.
Environment
- 1 master node
- 2 worker nodes
- Configuration: 4 GB RAM, 2 vCPUs, 40 GB disk
- Operating system: ubuntu-14.04-server-cloudimg-amd64
- Python version: 2.7
- TensorFlow version: r0.8
- CPU only (no GPU)
Install pip
- Download the get-pip.py file and run it.
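Roughly, assuming wget is available on the VM:

```bash
# Fetch the official pip bootstrap script and install pip
wget https://bootstrap.pypa.io/get-pip.py
sudo python get-pip.py
```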
- Install pip behind a proxy.
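If the VMs sit behind an HTTP proxy, pip's --proxy option can be passed to get-pip.py; the proxy address below is only a placeholder:

```bash
# Replace the placeholder with your actual proxy host and port
sudo python get-pip.py --proxy="http://proxy.example.com:8080"
```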
- Upgrade pip.
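Upgrading is a single command:

```bash
sudo pip install --upgrade pip
```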
Install TensorFlow by pip
- Ubuntu/Linux 64-bit, GPU enabled. Requires CUDA toolkit 7.5 and CuDNN v4.
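A sketch of the install command; the wheel URL below follows the r0.8 install instructions for Python 2.7 and may differ for other setups:

```bash
# GPU-enabled TensorFlow 0.8.0 wheel for Ubuntu/Linux 64-bit, Python 2.7
sudo pip install --upgrade \
  https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.8.0-cp27-none-linux_x86_64.whl
```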
- Ubuntu/Linux 64-bit, CPU only:
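The CPU-only wheel, which matches the vCPU-only environment above; again the URL follows the r0.8 install instructions:

```bash
# CPU-only TensorFlow 0.8.0 wheel for Ubuntu/Linux 64-bit, Python 2.7
sudo pip install --upgrade \
  https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.8.0-cp27-none-linux_x86_64.whl
```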
Demo of Distributed TensorFlow
Now the preparation for the demo is done. The demo follows the distributed TensorFlow tutorial:
https://www.tensorflow.org/versions/r0.8/how_tos/distributed/index.html
Description of a cluster
A TensorFlow “cluster” is a set of “tasks” that participate in the distributed execution of a TensorFlow graph. Each task is associated with a TensorFlow “server”, which contains a “master” that can be used to create sessions, and a “worker” that executes operations in the graph. A cluster can also be divided into one or more “jobs”, where each job contains one or more tasks.
Two important instructions
- Create a tf.train.ClusterSpec that describes all of the tasks in the cluster. This should be the same for each task.
- Create a tf.train.Server, passing the tf.train.ClusterSpec to the constructor, and identifying the local task with a job name and task index.
Specify the cluster
We define 2 worker nodes and 1 parameter server node as follows:
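A minimal sketch of the cluster definition, assuming the host names and port used in the steps below (the ps task on master0, the two worker tasks on worker0 and worker1):

```python
import tensorflow as tf

# The same ClusterSpec is used on every node: one ps task and two worker tasks
cluster = tf.train.ClusterSpec({
    "ps": ["master0:2222"],
    "worker": ["worker0:2222", "worker1:2222"]
})
```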
master node
- Write ps.py for the parameter server as follows:
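A minimal sketch of what ps.py can look like, reusing the ClusterSpec above:

```python
# ps.py -- parameter server task, run on master0
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps": ["master0:2222"],
    "worker": ["worker0:2222", "worker1:2222"]
})

# job_name and task_index identify this process within the cluster
server = tf.train.Server(cluster, job_name="ps", task_index=0)
server.join()  # block forever and serve variables to the workers
```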
- Run the script (python ps.py). It will listen on port 2222 of the master0 server.
worker node
On the two worker nodes, do the following steps:
- Write worker0.py as follows:
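A sketch of worker0.py; compared with ps.py only the job name and task index change:

```python
# worker0.py -- worker task 0, run on worker0
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps": ["master0:2222"],
    "worker": ["worker0:2222", "worker1:2222"]
})

server = tf.train.Server(cluster, job_name="worker", task_index=0)
server.join()  # keep serving so sessions can place operations on this worker
```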
- Run the script (python worker0.py). It will listen on port 2222 of the worker0 server.
- Write worker1.py as follows:
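worker1.py is identical except for the task index:

```python
# worker1.py -- worker task 1, run on worker1
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps": ["master0:2222"],
    "worker": ["worker0:2222", "worker1:2222"]
})

server = tf.train.Server(cluster, job_name="worker", task_index=1)
server.join()
```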
- Run the script (python worker1.py). It will listen on port 2222 of the worker1 server.
Test Job Task
- The test job script is the following:
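A minimal sketch of such a test script, matching the description below: the random data is placed on the ps task, a simple matmul stands in for training on worker task 1, and the graph is run through the master's gRPC target with device placement logging turned on:

```python
# test.py -- simple test job, run from any node that can reach master0
import tensorflow as tf

# Generate normal random variables on the parameter server task
with tf.device("/job:ps/task:0"):
    data = tf.Variable(tf.random_normal([100, 100]))

# Do the "training" computation on worker task 1
with tf.device("/job:worker/task:1"):
    result = tf.matmul(data, tf.transpose(data))

# Connect to the master's server; log_device_placement shows where ops run
with tf.Session("grpc://master0:2222",
                config=tf.ConfigProto(log_device_placement=True)) as sess:
    sess.run(tf.initialize_all_variables())
    print(sess.run(result))
```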
- The ps node generates normal random variables.
- The worker node trains the model.
Results
When the script runs, it generates the data on device /job:ps/task:0 and trains it on device /job:worker/task:1, which can be seen in the device placement log.
Further Work
This blog is only a first try of distributed TensorFlow; there is still much work to do, such as how to combine it with GPUs, how to benchmark it, and so on.
Reference
https://github.com/tensorflow/tensorflow
http://googleresearch.blogspot.com/2016/04/announcing-tensorflow-08-now-with.html