Considerations for NoSQL and Map Reducing Machine Learning Algorithms

The Machine Is Learning

The past couple of weeks have been rather tumultuous for me and several others.  I won’t go into details and “kiss and tell,” but suffice to say it has led me to the land of “Free Agency”.

As such, over the past couple of weeks I have met several people and have been discussing items such as Big Data, Semantics, Hadoop, NoSQL, Data Science <<insert favorite bingo buzzword here>>. One issue that I see over and over concerns the usage of different distributed compute frameworks. Saying “Hadoop” is not a panacea for your Big Data problems.  One must understand the problem you are trying to solve.  Many people fall prey to the “new shiny thing” paradigm.  On the issue of Big Data concerns, if you want to scale horizontally, go with a NoSQL solution – if it is necessary.  The NoSQL database is implemented as a data grid for processing (MapReduce, queries, CRUD, etc.).  Big websites and players that have moved towards non-relational datastores include LinkedIn, Amazon, Digg and Twitter. One of the main reasons for using NoSQL solutions is that relational databases place computation on reads, which is considered wrong for large-scale web applications such as Digg.

Aspects of this behavior are:
• The serial nature of applications, often waiting for I/O from the data store, does nothing for scalability or low response times.
• Huge amounts of data and a high growth factor led Twitter towards Cassandra, which is designed to operate with large-scale data.
• Furthermore, the operational costs of running and maintaining systems like Twitter escalate. Web applications of this size therefore “need a system that can grow in a more automated fashion and be highly available.”

Amazon’s James Hamilton agrees:

“Scale-first applications are those that absolutely must scale without bound and being able to do this without restriction is much more important than more features. These applications are exemplified by very high scale web sites such as Facebook, MySpace, Gmail, Yahoo, and Amazon.com. Some of these sites actually do make use of relational databases but many do not. The common theme across all of these services is that scale is more important than features and none of them could possibly run on a single RDBMS.”

OK, so let’s assume you have your high-availability horizontal scale framework (HAHSF) in place and you are harvesting, ingressing and egressing data.  Now what?  You must figure out the data science strategy and what to do with the data.  Otherwise it’s like hoarding – you’re just harvesting and storing it, which by definition means that you need to do some type of statistics, machine learning or data mining.  Here is where I believe most people get slightly sideways.  Map reduction is not Hadoop.  Map reducing machine learning algorithms is at the forefront of what is occurring within scale applications.

Machine Learning Review

For a refresher I would suggest breaking out that grey book called Machine Learning by Professor Mitchell, or Haykin’s neural network tome.  So let’s start with something we all probably learned in an AI or machine learning class.  Remember the back-propagation network?  If not, here is a quick refresher.

Essentially the system is a linear combination of weights passed through a non-linear activation function that provides a threshold; the error is then fed back and the weights are updated.  There are three types of neurons in a neural network created by the BP ANN algorithm:
  • Input neurons

For discrete input attributes, an input neuron typically represents a single state from the input attribute. This includes missing values, if the training data contains nulls for that attribute. A discrete input attribute that has more than two states generates one input neuron for each state, and one input neuron for a missing state, if there are any nulls in the training data. A continuous input attribute generates two input neurons: one neuron for a missing state, and one neuron for the value of the continuous attribute itself. Input neurons provide inputs to one or more hidden neurons.

  • Hidden neurons

Hidden neurons receive inputs from input neurons and provide outputs to output neurons.

  • Output neurons

Output neurons represent predictable attribute values for the data mining model. For discrete input attributes, an output neuron typically represents a single predicted state for a predictable attribute, including missing values. For example, a binary predictable attribute produces one output node that describes a missing or existing state, to indicate whether a value exists for that attribute. A Boolean column that is used as a predictable attribute generates three output neurons: one neuron for a true value, one neuron for a false value, and one neuron for a missing or existing state.

A neuron receives input from other neurons, or from other data, depending on which layer of the network it is in. An input neuron receives inputs from the original data. Hidden neurons and output neurons receive inputs from the output of other neurons in the neural network. Inputs establish relationships between neurons, and the relationships serve as a path of analysis for a specific set of cases.

Each input has a value assigned to it, called the weight, which describes the relevance or importance of that particular input to the hidden neuron or the output neuron. The greater the weight that is assigned to an input, the more relevant or important the value of that input. Weights can be negative, which implies that the input can inhibit, rather than activate, a specific neuron. The value of each input is multiplied by the weight to emphasize the importance of an input for a specific neuron. For negative weights, the effect of multiplying the value by the weight is to deemphasize the importance.

Each neuron has a simple non-linear function assigned to it, called the activation function, which describes the relevance or importance of a particular neuron to that layer of a neural network. Hidden neurons use a hyperbolic tangent function (tanh) for their activation function, whereas output neurons use a sigmoid function for activation. Both functions are nonlinear, continuous functions that allow the neural network to model nonlinear relationships between input and output neurons.
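To tie the pieces together, here is a minimal sketch of the mechanics just described (tanh hidden layer, sigmoid output, error fed back to adjust the weights), written with NumPy. The layer sizes, learning rate, and toy XOR data are illustrative assumptions, not anything prescribed above.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class BackPropNet:
    """Single-hidden-layer network: tanh hidden units, sigmoid output,
    trained with plain gradient descent on squared error."""

    def __init__(self, n_in, n_hidden, n_out, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(scale=0.1, size=(n_in, n_hidden))
        self.W2 = rng.normal(scale=0.1, size=(n_hidden, n_out))
        self.lr = lr

    def forward(self, x):
        self.h = np.tanh(x @ self.W1)          # hidden activations (tanh)
        self.y = sigmoid(self.h @ self.W2)     # output activations (sigmoid)
        return self.y

    def backward(self, x, target):
        # Error at the output layer, propagated back through the weights.
        err = self.y - target
        delta_out = err * self.y * (1.0 - self.y)                  # sigmoid derivative
        delta_hid = (delta_out @ self.W2.T) * (1.0 - self.h ** 2)  # tanh derivative
        # Feed the error back and adjust the weights.
        self.W2 -= self.lr * np.outer(self.h, delta_out)
        self.W1 -= self.lr * np.outer(x, delta_hid)

# One epoch over a toy XOR-style data set.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)
net = BackPropNet(n_in=2, n_hidden=4, n_out=1)
for x, t in zip(X, T):
    net.forward(x)
    net.backward(x, t)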

Map Reducing The Algorithm

Now here is where the worlds of real-time systems and distributed systems collide.  Suppose we have an epoch, an error is propagated back, and we need to update the weights.  We need to map reduce this functionality to fully gain benefit from the map reduction – the hadoop-it-izing magic.  As a specific note, the performance that results depends intimately on the design choices underlying the MapReduce implementation, and how well those choices support the data processing pattern of the respective machine learning algorithm.
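When the gradient is a sum over training examples (the pattern described in Chu et al.’s “Map-Reduce for Machine Learning on Multicore”), the fit is clean: mappers compute partial gradients over their data shards, the reducer sums them, and the driver applies one weight update per epoch. A framework-free toy version, where a linear model with squared error stands in for the per-example gradient of whatever model (such as the BP network above) is actually being trained:

import numpy as np
from functools import reduce

def map_partial_gradient(shard, w):
    """Mapper: sum of per-example gradients over one data shard."""
    X, t = shard
    err = X @ w - t
    return X.T @ err                      # partial gradient for this shard

def reduce_sum(g1, g2):
    """Reducer: gradients are additive, so reduction is just a sum."""
    return g1 + g2

def epoch(shards, w, lr=0.01):
    partials = [map_partial_gradient(s, w) for s in shards]   # map phase
    grad = reduce(reduce_sum, partials)                       # reduce phase
    return w - lr * grad                                      # single update per epoch

# Fake data split across three "nodes".
rng = np.random.default_rng(0)
shards = [(rng.normal(size=(100, 5)), rng.normal(size=100)) for _ in range(3)]
w = np.zeros(5)
for _ in range(20):
    w = epoch(shards, w)

The point of the sketch is the shape of the computation, not the model: anything whose gradient decomposes into a sum over records map-reduces this way, and anything that does not is where the trouble starts.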

However, not only does Hadoop not support static reference to shared content across map tasks, as far as I know the implementation also prevents us from adding this feature. In Hadoop, each map task runs in its own Java Virtual Machine (JVM), leaving no access to a shared memory space across multiple map tasks running on the same node. Hadoop does provide some minimal support for sharing parameters in that its streaming mode allows the distribution of a file to all compute nodes (once per machine) that can be referenced by a map task. In this way, data can be shared across map tasks without incurring redundant transfer costs. However, this work-around requires that the map task be written as a separate application; furthermore, it still requires that each map task load a distinct copy of its parameters into the heap of its JVM.  That said, almost all ML algorithms, and ANNs in particular, are commonly trained with stochastic or direct gradient descent/ascent (depending on sign), which poses a challenge to parallelization.
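To make the streaming work-around concrete, here is a rough sketch of a streaming mapper that reads shared parameters from a side file shipped with the job. The file name, record layout, and linear-model gradient are illustrative assumptions, and the exact distribution flags (e.g. -files) vary by Hadoop version, so treat this as a sketch rather than a recipe.

#!/usr/bin/env python
# streaming_mapper.py -- a Hadoop streaming mapper that loads shared model
# parameters from a side file distributed to every compute node, then emits
# a partial gradient contribution per input record.
import sys
import numpy as np

# 'weights.npy' is shipped with the job and shows up in the task's working
# directory; note that each map task still loads its own copy of the
# parameters into its own process, exactly as discussed above.
w = np.load("weights.npy")

for line in sys.stdin:
    fields = [float(v) for v in line.split()]
    x, t = np.array(fields[:-1]), fields[-1]
    grad = (x @ w - t) * x                 # per-record gradient (linear stand-in)
    # Emit a single key so one reducer can sum the partial gradients.
    print("grad\t" + ",".join(str(g) for g in grad))

A run might then look something like hadoop jar hadoop-streaming.jar -files weights.npy,streaming_mapper.py -mapper streaming_mapper.py -reducer sum_reducer.py -input data -output grads, with a hypothetical sum_reducer.py that simply sums the emitted vectors per key; again, the flags and file names are illustrative.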

The problem is that in every step of gradient ascent, the algorithm updates a common set of parameters (e.g. the update matrix for the weights, W, in the BP ANN case). When one gradient ascent step (involving one training sample) is updating W, it has to lock down this matrix, read it, compute the gradient, update W, and finally release the lock. This “lock-release” block creates a bottleneck for parallelization; thus, instead of stochastic gradient ascent many methods use a batch method, which greatly slows the process down and takes away from the real-time aspect.  Workarounds that I have personally seen used, and helped create, are memory-mapped shared functions (think signal processing and CPU shared memory modules), which allow a pseudo-persistence to occur and the weights to be updated in parallel at “epoch time” (a rough sketch of this pattern follows below).

As we reduce the latencies of the networked systems we are starting to get back to some of the very problems that plagued real-time parallel signal processing systems – disk I/O and disk read-write heads are starting to become the bottleneck.  So you move to solid state devices or massive in-memory systems.  That does not solve the map reduction problem of machine learning algorithms.  That said, over the past years the Apache Foundation, and more specifically the Mahout project, has been busy adding distributed algorithms to its arsenal.  Of particular interest is the singular value decomposition, which is the basis kernel for processes such as Latent Semantic Indexing and the like.  Even given the breadth of additions, fundamental computational blocks such as Kohonen networks and the algorithm mentioned here – backpropagation – are not in the arsenal.  Some good stuff, such as an experimental Hidden Markov Model, is there.
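Here is the promised sketch of the memory-mapped workaround: the weights live in a memory-mapped file on the node, workers read them lock-free while computing partial gradients, and a single writer folds the updates in once per epoch. NumPy’s memmap and a process pool stand in for the CPU shared-memory modules; the file name, sizes, and linear-model gradient are illustrative assumptions.

import numpy as np
from multiprocessing import Pool

WEIGHT_FILE = "weights.dat"
N_FEATURES = 5

def partial_gradient(shard):
    """Worker: read-only view of the shared weights, so no lock is needed."""
    X, t = shard
    w = np.memmap(WEIGHT_FILE, dtype=np.float64, mode="r", shape=(N_FEATURES,))
    return X.T @ (X @ w - t)

def run_epoch(shards, lr=0.01):
    with Pool(len(shards)) as pool:
        partials = pool.map(partial_gradient, shards)          # parallel map phase
    w = np.memmap(WEIGHT_FILE, dtype=np.float64, mode="r+", shape=(N_FEATURES,))
    w -= lr * sum(partials)        # single writer at the epoch boundary
    w.flush()                      # pseudo-persistence: changes hit the backing file

if __name__ == "__main__":
    # Initialize the shared weight file, then run a few epochs over fake shards.
    np.memmap(WEIGHT_FILE, dtype=np.float64, mode="w+", shape=(N_FEATURES,))[:] = 0.0
    rng = np.random.default_rng(1)
    shards = [(rng.normal(size=(50, N_FEATURES)), rng.normal(size=50)) for _ in range(4)]
    for _ in range(10):
        run_epoch(shards)

The design choice is the same one argued for in the text: give up per-sample stochastic updates in exchange for lock-free parallel reads, and pay the synchronization cost only once per epoch.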

That said, I really do believe we will start to see more of a signal processing architecture on the web.  Just maybe “cycle-stealing” via true grids will make a comeback?

Go Big Or Go Home!

@jaxsoncreole

Probabilistic Databases For Predictive Content

The Other PDF To You

The Probability Something Will Happen

 

Digital Remote for Your Life

Well folks, we are going to shift gears here a little and get back to some hardcore Ideas2Bank discussions concerning technologies.  As of late I have been interested once again in Finding-Not-Searching types of behaviors.  Affinity-based systems are once again on the rise.  I will go so far as to say the Age of Affinity is here.  TechCrunch did a writeup recently concerning relevance.  At the end it turned into a pitch for Quora.  However, it did have some good ideas concerning the continuum of personalization functionality, from complete serendipity to exact, personalized, context-aware information constructs based on geo-location.  I have always been a fan of “lean back” technologies: technologies that essentially enable a digital remote for your life.  These types of systems have two common themes: 1) ease of use and 2) the probability of usage.

Predictive Content and Probability

In today’s world we are trying to create predictive, context-aware systems based on the wrong models.  For today’s database architecture: 1) an item either is in the database or is not; 2) a tuple either is in the query answer or is not.   This applies to all state-of-the-art data models across the board.  In a probabilistic database we have a different construct altogether, which is a better fit for the flow of content. For a content prediction event driven system we can assume the events are precise and have no ambiguity. Instead, it is the future event stream that is unknown, and the matching of a pattern at some point in the future is predicted with some probability.   When, where and how are the operatives for this type of predictive event, f(Wh,Wr,H) if you will.  Also note I mentioned the word stream.   I believe, given the current and future infrastructures for processing, we are bringing back some of the same analogies from large array signal processing frameworks.  The probabilistic database models set up extremely well for these types of event processing mechanics.

For a probabilistic database we have:

  • “An item belongs to the database” is a probabilistic event.
  • “A tuple is an answer to a query” is a probabilistic event.
  • This can be extended to all data models; here we discuss only probabilistic relational data.
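To make the second bullet concrete, here is a toy sketch over a tuple-independent relation: each base tuple carries a marginal probability, and under an independence assumption the probability that a joined answer tuple exists is just a product. The tables, names, and probabilities are made up purely for illustration.

# Tuple-independent toy tables: each tuple exists with its own probability.
# "purchased" links users to items; "in_stock" says an item is available.
purchased = {("alice", "tablet"): 0.9, ("bob", "tablet"): 0.4}
in_stock = {"tablet": 0.7}

def answer_probability(user, item):
    """P(user is in the answer of: users who purchased an in-stock item)."""
    p_purchase = purchased.get((user, item), 0.0)
    p_stock = in_stock.get(item, 0.0)
    return p_purchase * p_stock        # independent events: probabilities multiply

print(answer_probability("alice", "tablet"))   # roughly 0.63
print(answer_probability("bob", "tablet"))     # roughly 0.28

The same numbers can be read as a distribution over possible worlds: each combination of tuples being present or absent is one classical database, and the query answer probability is the total weight of the worlds in which the tuple appears.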

Probabilistic databases distinguish between the logical data model and the physical representation of the data, much like relational databases do in the ANSI-SPARC architecture. In probabilistic databases this is even more crucial, since such databases have to represent very large numbers of possible worlds, often exponential in the size of one world (a classical database).  In complex event processing systems, events from the environment are correlated and aggregated to form higher-level events. Uncertainty in the events may be due to a variety of factors, including imprecision in the event sensors or generators (e.g. streams) and corruption of the communication channel possibly dropping events, which can be measured with entropy metrics.  These attributes lend themselves well to fusion systems and social stream architectures.   Given we are looking at heterogeneous data sources that set up for collisions and data source integrity issues, these types of databases hold great promise.    In addition, many of these database architectures build upon finite state machine mechanics for event processing, as in operating systems.  Of further interest, the data is usually imprecise.
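A sketch of that FSM flavor of event processing, where each low-level event arrives with a confidence (sensor imprecision, a lossy channel) and the derived higher-level event inherits a probability from the events that matched the pattern. Event names and confidences here are invented for illustration, and independence is assumed.

# A simple state machine looks for the pattern login -> add_to_cart -> checkout
# and scores the derived higher-level event by multiplying the confidences of
# the events that advanced the machine (assuming the events are independent).

PATTERN = ["login", "add_to_cart", "checkout"]

def match_pattern(stream):
    state, confidence = 0, 1.0
    for event_type, event_conf in stream:
        if event_type == PATTERN[state]:
            state += 1
            confidence *= event_conf
            if state == len(PATTERN):
                return confidence          # higher-level event detected
    return 0.0                             # pattern never completed

stream = [("login", 0.95), ("browse", 0.8), ("add_to_cart", 0.9), ("checkout", 0.85)]
print(match_pattern(stream))               # roughly 0.73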

Probabilistic databases address types of imprecision where:

  • Data is precise, query answers are imprecise
  • User has limited understanding of the data
  • User has limited understanding of the schema
  • User has personal preferences

Notice a “trend” here?  This sets up very well for content flow predictions.  In addition, these types of systems hold up well for principled semantics over complex queries, providing context for queries where the data is usually imprecise.  Data integration and data hygiene are paramount in social stream systems.  Where data accuracy is important, most companies spend 85% of their workload cleansing data.  We could use probabilistic information to reason about soundness, completeness, and overlap of sources (think linked data here).  I have listed some of the main sources of research in probabilistic databases herewith.  As far as I know there are no publicly available commercial applications of this technology as of yet.  My bet is we will see some very soon, integrated with some of the other NoSQL-like technologies.
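As a toy illustration of reasoning probabilistically about overlapping sources: if several partially reliable sources independently assert the same fact, a noisy-OR combination estimates the probability the fact holds. The source reliabilities below are invented, and the independence assumption is the big caveat.

# If independent sources S1..Sn each assert a fact with reliability p_i,
# noisy-OR gives P(fact) = 1 - prod(1 - p_i).

def noisy_or(reliabilities):
    p_false = 1.0
    for p in reliabilities:
        p_false *= (1.0 - p)
    return 1.0 - p_false

# Three linked-data sources agree that two records refer to the same entity.
print(noisy_or([0.6, 0.7, 0.5]))   # roughly 0.94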

For a list of current research projects see:

Until Then,

Go Big Or Go Home!

@jaxsoncreole