Probabilistic Databases for Predictive Content

The Probability Something Will Happen

Digital Remote for Your Life

Well folks, we are going to shift gears a little and get back to some hardcore Ideas2Bank discussions about technology. Lately I have once again been interested in Finding-Not-Searching types of behaviors. Affinity-based systems are on the rise again, and I will go so far as to say the Age of Affinity is here. TechCrunch recently did a writeup on relevance. By the end it had turned into a pitch for Quora, but it did have some good ideas about the continuum of personalization functionality, from complete serendipity to exactly personalized, context-aware information constructs based on geo-location. I have always been a fan of “lean back” technologies: technologies that essentially enable a digital remote for your life. These types of systems have two common themes: 1) ease of use and 2) the probability of usage.

Predictive Content and Probability

In today’s world we are trying to create predictive, context-aware systems based on the wrong models. In today’s database architecture: 1) an item either is in the database or is not, and 2) a tuple either is in the query answer or is not. This applies to state-of-the-art data models across the board. In a probabilistic database we have a different construct altogether, one that is a better fit for the flow of content. For an event-driven content prediction system we can assume the events that have already been observed are precise and have no ambiguity. It is the future event stream that is unknown, so the matching of a pattern at some point in the future is predicted with some probability. When, Where and How are the operatives for this type of predictive event: f(Wh,Wr,H), if you will. Also note that I mentioned the word stream. I believe that, given current and future processing infrastructures, we are bringing back some of the same analogies used in large-array signal processing frameworks. Probabilistic database models set up extremely well for these types of event processing mechanics.
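To make the idea of predicting a future pattern match concrete, here is a minimal sketch. It assumes, purely for illustration, that future events are independent draws and that the pattern is "a first event followed at some later point by a second event"; the function name and the probabilities are hypothetical, not taken from any particular system.

```python
def match_probability(p_first, p_second, horizon):
    """Probability that 'first, then later second' is matched
    within the next `horizon` events.

    Assumes independent events, where p_first and p_second are the
    per-event chances of seeing the first and second event types
    (two distinct event types).
    """
    # state[0]: nothing seen yet, state[1]: first seen, state[2]: pattern matched
    state = [1.0, 0.0, 0.0]
    for _ in range(horizon):
        waiting, half, done = state
        state = [
            waiting * (1.0 - p_first),
            waiting * p_first + half * (1.0 - p_second),
            done + half * p_second,
        ]
    return state[2]


if __name__ == "__main__":
    for n in (1, 2, 10, 50):
        print(n, round(match_probability(0.3, 0.5, n), 4))
```

The pattern here is a tiny state machine, and the prediction is just the probability mass that has reached the final state after a given number of future events. A real system would likely estimate the per-event probabilities from the observed stream rather than fix them by hand.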

For a probabilistic database we have:

“An item belongs to the database” is a probabilistic event.
“A tuple is an answer to a query” is a probabilistic event.
This can be extended to all data models; here we discuss only probabilistic relational data.
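The two statements above can be sketched in a few lines of Python. This is a toy illustration, not how any real probabilistic database engine works: it assumes each tuple is present independently with its own probability (a "tuple-independent" relation), and the `likes` data, `possible_worlds` and `query_probability` names are all made up for the example.

```python
from itertools import product

# Hypothetical relation: (user, content item, probability the tuple exists)
likes = [
    ("ann", "jazz", 0.9),
    ("ann", "news", 0.4),
    ("bob", "jazz", 0.5),
]


def possible_worlds(relation):
    """Yield (world, probability) for every subset of the tuples."""
    for present in product([True, False], repeat=len(relation)):
        prob = 1.0
        world = []
        for (user, item, p), keep in zip(relation, present):
            prob *= p if keep else 1.0 - p
            if keep:
                world.append((user, item))
        yield world, prob


def query_probability(relation, predicate):
    """Probability that at least one tuple satisfying `predicate` is present."""
    return sum(
        prob
        for world, prob in possible_worlds(relation)
        if any(predicate(t) for t in world)
    )


# "Does anybody like jazz?" is now a probabilistic event, not a yes/no answer.
p_jazz = query_probability(likes, lambda t: t[1] == "jazz")
print(round(p_jazz, 4))
```

Each possible world is an ordinary, classical database, and the answer to the query is the total probability of the worlds in which the query holds. Enumerating every world only works for toy data, since three tuples already give eight worlds.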

Probabilistic databases distinguish between the logical data model and the physical representation of the data, much like relational databases do in the ANSI-SPARC Architecture. In probabilistic databases this is even more crucial, since such databases have to represent very large numbers of possible worlds, often exponential in the size of one world (a classical database). In complex event processing systems, events from the environment are correlated and aggregated to form higher-level events. Uncertainty in the events may be due to a variety of factors, including imprecision in the event sensors or generators (e.g. streams) and corruption of the communication channel, possibly dropping events, which can be measured with entropy metrics. These attributes lend themselves well to fusion systems and social stream architectures. Given that we are looking at heterogeneous data sources that are prone to collisions and data source integrity problems, these types of databases hold great promise. In addition, many of these database architectures build upon the Finite State Machine mechanics used for event processing in operating systems. Of further interest, the data is usually imprecise.
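As a rough illustration of what an entropy metric looks like here, the sketch below computes the Shannon entropy of a distribution over event outcomes. The outcome labels and probabilities are invented for the example, and treating a lossy channel as a simple three-way distribution is an assumption made only to keep the sketch short.

```python
from math import log2


def shannon_entropy(probabilities):
    """Shannon entropy in bits of a discrete distribution.

    Zero-probability outcomes contribute nothing to the sum.
    """
    return -sum(p * log2(p) for p in probabilities if p > 0.0)


# Hypothetical channel: an event arrives intact, arrives corrupted, or is dropped.
clean_channel = [0.98, 0.01, 0.01]
noisy_channel = [0.60, 0.25, 0.15]

print(round(shannon_entropy(clean_channel), 3))
print(round(shannon_entropy(noisy_channel), 3))
```

The higher the entropy, the less certain we are about what actually happened to any given event, which is exactly the kind of uncertainty a probabilistic data model is meant to carry along rather than throw away.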

Probabilistic databases address types of imprecision where:

Data is precise, query answers are imprecise
User has limited understanding of the data
User has limited understanding of the schema
User has personal preferences

Notice a “trend” here? This sets up very well for content flow predictions. In addition, these types of systems provide principled semantics for complex queries, which gives the queries context where the data is usually imprecise. Data integration and data hygiene are paramount in social stream systems. Where data accuracy is important, most companies spend 85% of their workload cleansing data. We could use probabilistic information to reason about the soundness, completeness, and overlap of sources (think linked data here). I have listed some of the main sources of research in probabilistic databases below. As far as I know there are no publicly available commercial applications of this technology yet. My bet is that we will very soon see some integrated with other NoSQL-like technologies.

For a list of current research projects see: