For discrete input attributes, an input neuron typically represents a single state from the input attribute. This includes missing values, if the training data contains nulls for that attribute. A discrete input attribute that has more than two states generates one input neuron for each state, and one input neuron for a missing state, if there are any nulls in the training data. A continuous input attribute generates two input neurons: one neuron for a missing state, and one neuron for the value of the continuous attribute itself. Input neurons provide inputs to one or more hidden neurons.
Hidden neurons receive inputs from input neurons and provide outputs to output neurons.
Output neurons represent predictable attribute values for the data mining model. For discrete input attributes, an output neuron typically represents a single predicted state for a predictable attribute, including missing values. For example, a binary predictable attribute produces one output node that describes a missing or existing state, to indicate whether a value exists for that attribute. A Boolean column that is used as a predictable attribute generates three output neurons: one neuron for a true value, one neuron for a false value, and one neuron for a missing or existing state.
A neuron receives input from other neurons, or from other data, depending on which layer of the network it is in. An input neuron receives inputs from the original data. Hidden neurons and output neurons receive inputs from the output of other neurons in the neural network. Inputs establish relationships between neurons, and the relationships serve as a path of analysis for a specific set of cases.
Each input has a value assigned to it, called the weight, which describes the relevance or importance of that particular input to the hidden neuron or the output neuron. The greater the weight that is assigned to an input, the more relevant or important the value of that input. Weights can be negative, which implies that the input can inhibit, rather than activate, a specific neuron. The value of each input is multiplied by the weight to emphasize the importance of an input for a specific neuron. For negative weights, the effect of multiplying the value by the weight is to deemphasize the importance.
Each neuron has a simple non-linear function assigned to it, called the activation function, which describes the relevance or importance of a particular neuron to that layer of a neural network. Hidden neurons use a hyperbolic tangent function (tanh) for their activation function, whereas output neurons use a sigmoid function for activation. Both functions are nonlinear, continuous functions that allow the neural network to model nonlinear relationships between input and output neurons.
Map Reducing The Algorithm
Now here is where the world of real time systems and distributed systems collide. Suppose we have a epoch and we have an error propagated back and we need to update the weights? We need to map reduce this functionality to fully gain benefit from the map reduction – hadoop-it-izing magic. As a specific note the performance that results depends intimately on the design choices underlying the MapReduce implementation, and how well those choices support the data processing pattern of the respective Machine Learning algorithm.
However, not only does Hadoop not support static reference to shared content across map tasks, as far as I know the implementation also prevented us from adding this feature. In Hadoop, each map task runs in its own Java Virtual Machine (JVM), leaving no access to a shared memory space across multiple map tasks running on the same node. Hadoop does provide some minimal support for sharing parameters in that its streaming mode allows the distribution of a file to all compute nodes (once per machine) that can be referenced by a map task. In this way, data can be shared across map tasks without incurring redundant transfer costs. However, this work-around requires that the map task be written as a separate applications, furthermore, it still requires that each map task load a distinct copy of its parameters into the heap of its JVM. That said almost all ML learning algorithms and particularly ANN are commonly done with stochastic or direct gradient descent/ascent (depending on sign), which poses a challenge to parallelization.
The problem is that in every step of gradient ascent, the algorithm updates a common set of parameters (e.g. the update matrix for the weights in the BPANN case). When one gradient ascent step (involving one training sample) is updating W , it has to lock down this matrix, read it, compute the gradient, update W , and finally release the lock. This “lock-release” block creates a bottleneck for parallelization; thus, instead of stochastic gradient ascent many methods use a batch method which greatly slows the process down and takes away from the real time aspect. Work arounds that I have personally seen used and help create are memory mapped shared functions (think signal processing and CPU shared memory modules) which allows a pseudo persistence to occur and the weights to be updated at and paralleled at “epoch time”. As we reduce the latencies of the networked systems we are starting to get back to some of the very problems that plagued real time parallel signal processing systems – the disc i/o and disc read-write heads are starting to become the bottlneck. So you move to solid state devices or massive in memory systems. That does not solve the map reduction problem of machine learning algorithms. That said over the past years the Apache Foundation and more specifically the Mahout project has been busy adding distributed algorithms to its arsenal. Of particular interest is the Singular value decomposition which is the basis kernel for processes such as Latent Semantic Indexing and the like. Even given the breadth of additions fundamental computational blocks such as Kohonen networks and the algorithm mentioned here – Backpropagation are not in the arsenal. Some good stuff such as an experimental Hidden Markov Model are there.
That said I really do believe we will start to see more of a signal processing architecture on the web. Just maybe “cycle-stealing” via true grids will make comeback?
Go Big Or Go Home!