What is your data worth? Do you know? I bet that if there were a privacy breach, you would find out at that moment. Otherwise, what does the data actually do? Storing information for information’s sake is just copying; it is also hoarding. Information assembled in an intelligent, high-affinity fashion is worth a lot these days.
I will say it is worth more than the software that makes it so.
By definition, what does that mean for the software that accesses that data? Is it as important as the actions taken on the data? Is it really about the application itself or the results of the information that the application creates across multiple data stores?
To take this a step further, I believe it is no longer about the data sets in situ but how multiple data sets can be fused together with context, geo-spatial and behavioral information. It is no longer about the software but what the software does with the data – and lots of it.
Unless you live under a rock, you know that distributed MapReduce systems are now the norm for any massive-scale, data-intensive compute service. However, there is a catch here. Digital lifestyles are one massive data collection opportunity, and these digital breadcrumbs are the stuff of congressional hearings on privacy. The problem is how we can place a recurring revenue stream on this data. We inherently know it is the new cash. Distributed data that describes itself is the DataSpace. The DataSpace is the vault, or vaults, that creates useful information. This fluidity of data is a very important concept in the proper functioning of Data as a Service (DaaS).
Knowledge Fusion and Assumptions
Knowledge Fusion (KF) is a term that historically draws on many dataspace concepts and techniques developed in other areas such as artificial intelligence (AI), machine discovery, machine learning, search, knowledge representation, semantics, and statistics. It is starkly different from other decision-support technologies inasmuch as it is not purely retrospective in nature. For example, language-based ad hoc queries and reporting are used to analyze what has happened in the past, answering very specific business questions such as, ‘How many widgets did we sell last week?’ When using these tools, the user already has a question or hypothesis that requires answering or validation. Knowledge Fusion is forward-thinking: it aims to predict future events, discover unknown patterns, and subsequently build models. These models are then used to support predictive ‘what-if’ analysis, such as, ‘How many widgets are we likely to sell next week?’, based on the context and meaning of the data as it is utilized. KF also allows this context and meaning to infer further linkages from the initial query.
How is this knowledge fusion processed and accessed? Why is it needed? Let’s start with some technical assumptions from the BOOM paper:
1. Distributed systems benefit substantially from a data-centric design style that focuses the programmer’s attention on carefully capturing all the critical state of the system as a family of collections (sets, relations, streams, etc.). Given such a model, the state of the system can be distributed naturally and flexibly across nodes via familiar mechanisms like partitioning and replication.
2. The key behaviors of such systems can be naturally implemented using declarative programming languages that manipulate these collections, abstracting the programmer from both the physical layout of the data and the fine-grained orchestration of data manipulation.
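To make these two assumptions a little more concrete, here is a minimal Python sketch. It is not code from the BOOM paper: the relation, the node names, and the helper functions are all hypothetical. It only shows system state held as a set of tuples, hash-partitioned on one column, and queried with a rule that says what we want rather than how to walk the data.

```python
from collections import Counter, defaultdict
from zlib import crc32

# Hypothetical system state as a relation: (file_name, chunk_id, node) facts.
CHUNK_LOCATIONS = {
    ("sales.csv", 0, "node-a"),
    ("sales.csv", 0, "node-b"),
    ("sales.csv", 1, "node-a"),
    ("clicks.log", 0, "node-b"),
    ("clicks.log", 0, "node-c"),
}


def partition_by_key(relation, key_index, num_partitions):
    """Hash-partition a relation on one column, using a stable hash."""
    partitions = defaultdict(set)
    for row in relation:
        digest = crc32(str(row[key_index]).encode("utf-8"))
        partitions[digest % num_partitions].add(row)
    return dict(partitions)


def under_replicated(relation, min_copies=2):
    """Rule: chunks that are stored on fewer than min_copies nodes."""
    copies = Counter((name, chunk) for name, chunk, _node in relation)
    return {chunk for chunk, count in copies.items() if count < min_copies}


# Rows for the same file always land in the same partition.
partitions = partition_by_key(CHUNK_LOCATIONS, key_index=0, num_partitions=2)
print(under_replicated(CHUNK_LOCATIONS))
```

Because the state is just a collection, the same rule could be run per partition or over the whole relation without changing how it is written.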
Based on these assumptions, I would like to add some observations from my experience:
3. The more data we have, the easier the job of the machine learning and data mining algorithms becomes.
4. Clean data is paramount. In my experience, 85% of the time spent creating value out of data goes to cleansing, creating views, and modeling.
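The retrospective-versus-predictive distinction drawn in the Knowledge Fusion discussion can be sketched with the widget example. The sales figures below are invented, and a straight-line trend is a deliberately crude stand-in for real predictive modeling; treat it as an illustration of the two kinds of question, not as a knowledge fusion technique.

```python
from statistics import mean

# Invented weekly widget sales, oldest week first (illustrative numbers only).
WEEKLY_SALES = [100, 110, 120, 130]


def sold_last_week(sales):
    """Retrospective question: how many widgets did we sell last week?"""
    return sales[-1]


def forecast_next_week(sales):
    """Predictive question: fit a least-squares line and extrapolate one week.

    Needs at least two weeks of data, otherwise the slope is undefined.
    """
    weeks = range(len(sales))
    week_mean, sales_mean = mean(weeks), mean(sales)
    covariance = sum((w - week_mean) * (s - sales_mean) for w, s in zip(weeks, sales))
    variance = sum((w - week_mean) ** 2 for w in weeks)
    slope = covariance / variance
    intercept = sales_mean - slope * week_mean
    return slope * len(sales) + intercept


print(sold_last_week(WEEKLY_SALES))
print(forecast_next_week(WEEKLY_SALES))
```

The first function only reads back what already happened; the second builds a (very small) model and asks it about a week that has not happened yet.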
Another platform that I believe will see wide adoption is BigData, from the folks at Systap, LLC. Check it out here. These guys are really looking at the future of fully distributed linked data and massive Resource Description Framework (RDF) data. They are also taking into account concerns such as high availability, transactional processing, B+ trees, and sharding. I hope to see this trend in the enterprise.
Revenue Models for DataSpaces
A caveat emptor at this juncture: One of my biggest issues with semantic intelligence, knowledge fusion, knowledge discovery, machine learning and data mining is that people believe it is a MAKE IT SO button! Click here and all your dreams will come true. People believe it will give them business strategy answers or generate the next big thing. Folks, y’all still have to think.
While this is all well and good, what do we base the importance of data upon?
- Revenue Per Employee (RPE)?
- Keywords Per Transaction (KPT)?
- Bulk Rate DataLoad (BRD)?
- Business Lift Based Query (BLQ)?
- Affinity Per URI (APU)?
- Insert your favorite monetization acronym here…
Securing The Query
An interesting trend is finally happening in the industry. Folks are realizing that you must secure those mixed and mashed queries. At the recent Semantic Technology 2010 Conference, there was a panel discussion on security in the semantic world that specifically dealt with dynamically applying access controls. These technologies allow people to slice and dice the DataSpaces based on proper access control and the meaning of the data streams. For a great white paper and overview, please visit this link.
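As a rough illustration of applying access controls at query time, here is a small Python sketch. It does not represent any technology from the panel: the `Record` type, the sensitivity labels, and the `secured_query` helper are all hypothetical. The idea is only that the same fused query returns a different slice of the DataSpace depending on what the caller is cleared to see.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    customer: str
    field: str
    value: str
    label: str  # hypothetical sensitivity label attached to each fact


# Invented facts drawn from two imaginary data sets (purchases and locations).
DATASPACE = [
    Record("c1", "purchase", "widgets", "public"),
    Record("c1", "location", "region-east", "restricted"),
    Record("c2", "purchase", "gadgets", "public"),
    Record("c2", "location", "region-west", "restricted"),
]


def secured_query(records, clearances):
    """Fuse facts per customer, keeping only labels the caller is cleared for."""
    fused = {}
    for record in records:
        if record.label in clearances:
            fused.setdefault(record.customer, {})[record.field] = record.value
    return fused


print(secured_query(DATASPACE, {"public"}))
print(secured_query(DATASPACE, {"public", "restricted"}))
```

The filter runs inside the query rather than on the stored data, so one DataSpace can serve callers with different clearances.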
The ability to secure specific person-meaning queries in and of itself will usher in completely new monetization models for data. I like to call this the Reese’s Peanut Butter Cup model. You have chocolate. You have peanut butter. Put them together, and you have something really good. Remember, data does not have calories; it just consumes bandwidth, disk space, and compute resources. Depending on usage, your mileage may vary.
Go Big Or Go Home!