
520 : PyData :: TensorFrames : Apache Spark

  • sfungphd
  • Jan 18, 2018
  • 2 min read

Can you guess why this analogy (kind of) works??

BRIDGES! Groan if you must and get it out of your system. Then read on. :)

Last night, I ventured out to the glittering environs of Bellevue (yes, via the 520 bridge) to attend my first PyData meetup. The evening's topic? An intro to TensorFlow, its application in deep learning, and how to use TensorFrames so TensorFlow can play more nicely with Apache Spark. The speakers were seasoned professionals in the field and moved through the topics at a refreshingly quick pace.

What I love about artificial neural networks (ANNs) is the beautiful analogy between the computational models and the biological units they are based on. The simplest networks are fully connected between layers (the multilayer perceptron, MLP), and of course they get more complex from there. But at the core, you have neurons that receive input (analogous to input received at dendrites). Each neuron aggregates its inputs via a weighted sum (analogous to inputs being aggregated in the cell body, aka the soma), and an activation function (tanh, sigmoid, or ReLU) determines what gets passed forward to the next neuron (analogous to checking whether the aggregated input crosses the threshold that lets the axon "fire," sending an electrical signal down to the synapse). This process continues to the output layer, where a classification is made, text output is generated, etc. Deep learning refers to ANNs that have many hidden layers (the ones between the input and output layers).
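To make that concrete, here's a tiny sketch (mine, not the speakers') of one layer of that computation in plain NumPy: each neuron takes a weighted sum of its inputs plus a bias, and an activation function decides what gets passed forward.

```python
import numpy as np

def relu(z):
    # ReLU activation: positive values pass forward, the rest are zeroed out
    return np.maximum(0, z)

def layer_forward(inputs, weights, biases):
    # Weighted sum of the inputs (the "soma" aggregating its dendrites)...
    z = weights @ inputs + biases
    # ...then the activation decides what "fires" on to the next layer
    return relu(z)

# 3 inputs feeding a layer of 2 neurons (made-up numbers, just for illustration)
inputs = np.array([0.5, -1.2, 3.0])
weights = np.array([[0.2, -0.4, 0.1],
                    [0.7,  0.3, -0.5]])
biases = np.array([0.1, -0.2])

print(layer_forward(inputs, weights, biases))
```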

The motivation behind TensorFrames is a communication issue. See where we need some...bridges?

Briefly, the Tungsten binary format (tbf) is meant to address performance issues. A quick search finds that the goal of Apache Spark's Project Tungsten was to bring Spark closer to "bare metal" (the hardware). Putting it together: data is formatted in a way that pushes performance as close as possible to the limits of your computational power (CPUs and GPUs). In the diagram above, you can see that there are a number of steps involving tbf, Java, Python, and the underlying C++ of TensorFlow.

Bring in TensorFrames!

I think the picture would be more complete if there were a Spark box sandwiching the other end of the TensorFlow box (a bridge carries two-way traffic, right?). Essentially, the idea is that Spark hands a DataFrame in, TensorFrames acts as a helper that feeds it into TensorFlow and serves the output back out as a DataFrame, removing the need for Python pickling.
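For a sense of what that looks like in code, here's a minimal sketch along the lines of the example in the TensorFrames README (assuming you already have a Spark sqlContext set up; the function names here are from my reading of the docs, so check the repo for the current API):

```python
import tensorflow as tf
import tensorframes as tfs
from pyspark.sql import Row

# A one-column Spark DataFrame of doubles
df = sqlContext.createDataFrame([Row(x=float(x)) for x in range(10)])

with tf.Graph().as_default():
    # A TensorFlow placeholder backed by the DataFrame column 'x';
    # TensorFrames infers its shape from the DataFrame
    x = tfs.block(df, "x")
    # Some TensorFlow computation on that column
    z = tf.add(x, 3, name="z")
    # Run the graph over the DataFrame and get a new DataFrame back,
    # with no Python pickling of rows in between
    df2 = tfs.map_blocks(z, df)

df2.show()
```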

The rest of the evening was devoted to a short tutorial on what deep learning is capable of, using the MNIST training set, as well as a live demo visualizing what individual neurons of a deep learning model have learned to respond to (the Deep Visualization Toolbox). This repo might be useful for getting the toolbox up and running, though I have not personally tried it yet. Fun and flashy, but not integral to the rest of the talk.
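If you haven't played with MNIST before, the canonical TensorFlow "hello world" for it boils down to something like this (TF 1.x style; my own sketch with a single hidden layer, not the speakers' notebook):

```python
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

x = tf.placeholder(tf.float32, [None, 784])   # flattened 28x28 images
y_ = tf.placeholder(tf.float32, [None, 10])   # one-hot digit labels

# One hidden layer of 128 ReLU neurons ("deep" learning in miniature)
W1 = tf.Variable(tf.truncated_normal([784, 128], stddev=0.1))
b1 = tf.Variable(tf.zeros([128]))
hidden = tf.nn.relu(tf.matmul(x, W1) + b1)

# Output layer: one logit per digit class
W2 = tf.Variable(tf.truncated_normal([128, 10], stddev=0.1))
b2 = tf.Variable(tf.zeros([10]))
logits = tf.matmul(hidden, W2) + b2

loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=logits))
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

accuracy = tf.reduce_mean(tf.cast(
    tf.equal(tf.argmax(logits, 1), tf.argmax(y_, 1)), tf.float32))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(1000):
        batch_xs, batch_ys = mnist.train.next_batch(100)
        sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
    print(sess.run(accuracy, feed_dict={x: mnist.test.images,
                                        y_: mnist.test.labels}))
```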

In this age of open source material and the inclusivity of the Python community, the slides and IPython notebooks that go along with the presentation (including the slides I've referenced above) are on GitHub here. (Is Python the answer to combating the dreaded Seattle Freeze? I say, maybe!)
