Introduction
TensorFlow™ is an open source software library for numerical computation using data flow graphs.
TensorFlow is for everyone. It's for students, researchers, hobbyists, hackers, engineers, developers, inventors, and innovators, and is open sourced under the Apache 2.0 license.
Distributed version
On April 13, 2016, the distributed version of TensorFlow was released with the announcement:
Announcing TensorFlow 0.8 – now with distributed computing support!
The post was also shared on Google+ by Jeff Dean, a Google Senior Fellow in the Systems Infrastructure Group.
Installation
I tried the distributed version on 3 virtual machines running on OpenStack, a well-known open source cloud operating system.
Environment
- 1 master node
- 2 worker nodes
- Configuration: 4 GB RAM, 2 vCPUs, 40 GB disk
- Operating system: ubuntu-14.04-server-cloudimg-amd64
- Python version: 2.7
- TensorFlow version: r0.8
- CPU only (no GPU)
Install pip
- Download the get-pip.py file and run it.
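Roughly, assuming wget is available on the VM:

```bash
# Fetch the official pip bootstrap script and install pip
wget https://bootstrap.pypa.io/get-pip.py
sudo python get-pip.py
```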
- Install pip behind a proxy.
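If the VMs sit behind an HTTP proxy, pip's --proxy option can be passed to get-pip.py; the proxy address below is only a placeholder:

```bash
# Replace the placeholder with your actual proxy host and port
sudo python get-pip.py --proxy="http://proxy.example.com:8080"
```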
- Upgrade pip.
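Upgrading is a single command:

```bash
sudo pip install --upgrade pip
```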
Install TensorFlow by pip
- Ubuntu/Linux 64-bit, GPU enabled. Requires CUDA toolkit 7.5 and CuDNN v4.
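A sketch of the install command; the wheel URL below follows the r0.8 install instructions for Python 2.7 and may differ for other setups:

```bash
# GPU-enabled TensorFlow 0.8.0 wheel for Ubuntu/Linux 64-bit, Python 2.7
sudo pip install --upgrade \
  https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.8.0-cp27-none-linux_x86_64.whl
```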
- Ubuntu/Linux 64-bit, CPU only:
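The CPU-only wheel, which matches the vCPU-only environment above; again the URL follows the r0.8 install instructions:

```bash
# CPU-only TensorFlow 0.8.0 wheel for Ubuntu/Linux 64-bit, Python 2.7
sudo pip install --upgrade \
  https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.8.0-cp27-none-linux_x86_64.whl
```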
Demo of Distributed TensorFlow
Now the preparation for the demo is done. The demo follows the distributed TensorFlow tutorial:
https://www.tensorflow.org/versions/r0.8/how_tos/distributed/index.html
Description of a cluster
A TensorFlow “cluster” is a set of “tasks” that participate in the distributed execution of a TensorFlow graph. Each task is associated with a TensorFlow “server”, which contains a “master” that can be used to create sessions, and a “worker” that executes operations in the graph. A cluster can also be divided into one or more “jobs”, where each job contains one or more tasks.
Two important instructions
- Create a tf.train.ClusterSpec that describes all of the tasks in the cluster. This should be the same for each task.
- Create a tf.train.Server, passing the tf.train.ClusterSpec to the constructor, and identifying the local task with a job name and task index.
Specify the cluster
We define 2 worker nodes and 1 parameter server node as follows:
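A minimal sketch of the cluster definition, assuming the host names and port used in the steps below (the ps task on master0, the two worker tasks on worker0 and worker1):

```python
import tensorflow as tf

# The same ClusterSpec is used on every node: one ps task and two worker tasks
cluster = tf.train.ClusterSpec({
    "ps": ["master0:2222"],
    "worker": ["worker0:2222", "worker1:2222"]
})
```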
master node
- Write ps.py for the parameter server as follows:
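A minimal sketch of what ps.py can look like, reusing the ClusterSpec above:

```python
# ps.py -- parameter server task, run on master0
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps": ["master0:2222"],
    "worker": ["worker0:2222", "worker1:2222"]
})

# job_name and task_index identify this process within the cluster
server = tf.train.Server(cluster, job_name="ps", task_index=0)
server.join()  # block forever and serve variables to the workers
```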
- Run the script (python ps.py). It will listen on port 2222 of the master0 server.
worker node
On the two worker nodes, do the following steps:
- Write worker0.py as follows:
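A sketch of worker0.py; compared with ps.py only the job name and task index change:

```python
# worker0.py -- worker task 0, run on worker0
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps": ["master0:2222"],
    "worker": ["worker0:2222", "worker1:2222"]
})

server = tf.train.Server(cluster, job_name="worker", task_index=0)
server.join()  # keep serving so sessions can place operations on this worker
```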
- Run the script (python worker0.py). It will listen on port 2222 of the worker0 server.
- Write worker1.py as follows:
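worker1.py is identical except for the task index:

```python
# worker1.py -- worker task 1, run on worker1
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps": ["master0:2222"],
    "worker": ["worker0:2222", "worker1:2222"]
})

server = tf.train.Server(cluster, job_name="worker", task_index=1)
server.join()
```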
- Run the script (python worker1.py). It will listen on port 2222 of the worker1 server.
Test Job Task
- The test job script is the following:
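A minimal sketch of such a test script, matching the description below: the random data is placed on the ps task, a simple matmul stands in for training on worker task 1, and the graph is run through the master's gRPC target with device placement logging turned on:

```python
# test.py -- simple test job, run from any node that can reach master0
import tensorflow as tf

# Generate normal random variables on the parameter server task
with tf.device("/job:ps/task:0"):
    data = tf.Variable(tf.random_normal([100, 100]))

# Do the "training" computation on worker task 1
with tf.device("/job:worker/task:1"):
    result = tf.matmul(data, tf.transpose(data))

# Connect to the master's server; log_device_placement shows where ops run
with tf.Session("grpc://master0:2222",
                config=tf.ConfigProto(log_device_placement=True)) as sess:
    sess.run(tf.initialize_all_variables())
    print(sess.run(result))
```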
- The ps node generates normal random variables.
- The worker node trains the model.
Results
When the script runs, it generates the data on device /job:ps/task:0 and trains it on device /job:worker/task:1, which can be seen in the device placement log.
Further Work
This blog is only a first try of distributed TensorFlow; there is still much work to do, such as how to combine it with GPUs, how to benchmark it, and so on.
Reference
https://github.com/tensorflow/tensorflow
http://googleresearch.blogspot.com/2016/04/announcing-tensorflow-08-now-with.html