A Twinkle in the Crystal Ball – Part 1
The Great Slowdown in AI Algorithm Development and a Glimmer of Hope

Innovation has slowed to a halt in one of the most critical areas of technology – predictive AI models. This has happened through a breakdown of market dynamics and a dearth of collaborative data science infrastructure.

However, the emergence of cryptoeconomic systems and the nascent open data economy provides a potential breeding ground for a new spurt of growth.


This is the first post in a 2-part series by guest contributor Oaksprout. It describes the issues we face, what solving them could unlock, and some relevant recent changes to the technological and economic landscape. Part 2 outlines a solution to the problem.

The Machinery of Data Innovation has Ground to a Halt

AI seems to be developing quickly. In reality, it's grinding to a halt under the weight of the very machinery which created it.

Web2 business models have been responsible for profound breakthroughs in machine learning techniques and applications. The model is familiar to most by now:

  1. Centralised platform attracts users with superior features
  2. Platform captures user data, uses it to train a proprietary ML model
  3. Platform deploys ML model, providing even better features
  4. Better features attract more users, repeating process and building market dominance

Examples include Google Search, the Facebook news feed algorithm, Amazon product recommendations, countless Apple features, and so on.

These companies have done incredible work and delivered legitimately new functionality to the world. However, this model is not conducive to major ongoing developments in the application of machine learning. This is because the model closes off access, dampens natural market dynamics and limits collaboration between a diverse set of data scientists.

What's gone wrong?

Data Fragmentation

Organisations are incentivised to hold user data. Oura's heart rate variability data is kept separate from Netflix's watch history. This makes it difficult for machine learning developers to build models on disparate sets of data, which is where innovative models are more likely to come from.

Data Stagnation

The world's efforts to pool data over the last 20 years have been extraordinary. Absurd quantities of photos, videos and other content have been generated. Yet the vast majority of it sits stagnant in proprietary databases. That data could be put to work many, many orders of magnitude more effectively.

Data wants to work. Data wants to be free and connected.

Weak Demand Signalling

App developers and businesses regularly stumble across domains which would benefit from reliable predictive models. However, they can't afford to bring data scientists in-house. Nor do they individually have enough data to train a high-accuracy predictive model. This leads to a fundamental constraint on the features and applications available on the market.

Individuals & Organisations Don't Think of Data as an Asset

The issues with the current data innovation model have hit mainstream consciousness, as evidenced by Netflix documentaries like The Great Hack and The Social Dilemma. However, very few people have taken on board the idea that their personal data is an economic asset.

That problem has been discussed, but I would also argue that many organisations are yet to recognise anything like the full economic value that their amassed data holds.

Both issues are caused by a current lack of data monetisation and value-extraction infrastructure. Without a mechanism for extracting value from their data, it is understandable that holders undervalue it.

As a result, the infrastructure for creating data is severely underdeveloped. Critically, hardware sensors and IoT networks have seen significant underinvestment, as a lack of monetisable end demand stifles their advancement.

Model Building is Inaccessible

Building predictive models is limited to those who have built moats around vast datasets. Most data scientists are cut out from accessing the data required to build competitive or novel models.

Even organisations that would like to share their data openly for training are unable to do so because of privacy concerns or legal fees. And even when they do provide their data, discovery is difficult.

Model Building is Wasteful

In the current system, multiple organisations are incentivised to build their own models in parallel. Google, Microsoft, DuckDuckGo, Baidu and others all spend billions duplicating the same effort. This applies equally to any major predictive algorithm – predictive typing, traffic scheduling, logistics planning – the list goes on.

Despite the large number of players rebuilding the same models, these models are likely worse than they could be. As sophisticated as Google's machine learning appears, what if it were exposed to competition from multiple teams building with the same substrate – the same data? What's more, what if those teams could be competing to contribute to the same meta-model? More on this later.

The World's Predictive Capacity is Closed and Monopolised

Despite the incredible innovation the world has seen in its predictive capacity, most of these innovations are closed and cannot be built upon by the world's developers and businesses. There is no permissionless platform for data scientists to build new models based on existing data, and no permissionless platform for app developers to source powerful predictions from.

In a Nutshell, There's A Dearth of Cooperative, Equitable Infrastructure

Even if the Big Techs of the world were to decide to commit hara-kiri and make their entire datasets available for free, the degree to which the ecosystem of developers and data scientists could collaborate would still be heavily limited:

  • there is no easy discovery mechanism by which model builders can find relevant datasets
  • there is no infrastructure for data scientists to pool and combine their models – no technical infrastructure, let alone infrastructure for sharing profits in proportion to varying contributions
  • model builders can't find other builders to combine models with

The Economic & Technical Infrastructure for Building Critical AI Models Simply Doesn't Work Anymore

At this point I believe that Big Tech's contribution is still a large net positive. However, to address the world's most significant problems and upgrade our institutions and the world's living standards, the problems I raise above need to be solved. What's more, if the evolution of machine learning remains in the hands of relatively few individuals and corporations, then it will likely optimise along a trajectory out of sync with the interests of the rest of the world.

Own a Share of Google Search; Build Whatever You Want on it

We've documented a system groaning under its own weight. So what – even if we fixed it, where would we be? Here are some ideas.

Permissionless Predictions-as-a-Service

By breaking down the constraints to innovation described in the previous section, we open up production and deployment of major existing algorithms which are currently closed off.

This means the algorithms that power the internet's most used services, but that are strictly controlled by major tech companies, can be improved and woven into anyone's apps, anywhere. Take, for example:

  • content search
  • product recommendations
  • predictive typing
  • natural-language processing
  • translation
  • computational photography
  • traffic routing
  • delivery/logistics routing

In principle, solving the infrastructure issues described in the previous section allows the internet to take any centrepiece algorithm of any major tech company, recreate it from scratch, and have it delivered by any developer in the world.

Equity in Algorithms

We also open up ownership of these algorithms – to anybody with an internet connection and a reasonably well-designed cryptowallet. This enables granular price discovery and real-time surfacing of changes in the value of these services. In turn, it also enables new business models where developers and even end-consumers can directly benefit from contributing to the building and management of the algorithms they use every day.

An Entirely New Breed of Application

Although less easy to predict and visualise, the potentially much larger and more important outcome of solving the innovation issues described above is the new types of application it makes feasible.

It is genuinely counter-productive to attempt to predict the potential applications of a major new technology platform – who would have predicted Uber at the launch of the App Store, or consumer drones and VR in the early days of mobile technology?

But one can look at the new vectors of innovation that a new platform reveals. What would a platform that solves these issues afford a new army of data science innovators and entrepreneurs?

  • data release – larger companies are heavily incentivised to relinquish their grip on massive data oceans
  • much more diverse combinations of data – if data is less siloed and fragmented, models can be trained on much more diverse datasets than ever before
  • new data creation – as individuals and organisations begin to see the fruits of participating in such a system, they are compelled to generate vastly more data to contribute

Given these revolutionary new vectors of data science innovation, we can take a stab at some new potential predictive models that could emerge—

  • real-time estimation of impact on health of certain actions
    • "What exactly made me sleep like crap last night?"
  • adaptive learning series
    • "Based on what I've shown you I know, what should I read next to become a Smart Contract Engineer?"
  • entertainment
    • content recommendations based not just on what shows you've previously liked, but based on—
      • every book you've ever read
      • your energy levels based on recent activity
      • your entire search history, compared against the search history of everyone else in the world

Frankly I suspect I could come up with much more illustrative ideas, but hopefully these 3 paint a picture.

Outcome-Based Pricing

Finally, what I consider to be the holy grail of an open-data ecosystem – outcome-based pricing. In the same way that Web2 created a revolutionary business model in SaaS – software-as-a-service – I believe Web3 can have a similar effect with results-as-a-service.

The idea is simple – rather than paying for access to a tool, you pay for a result – but the impact will be profound.

I believe an open ecosystem of predictive models is a prerequisite for this new system. Such a system will require networks of intelligent machine agents, connected to a deep and open data substrate, to navigate and orchestrate complex conditional trees and deliver results on behalf of customers.

A deep treatment of the topic is beyond the scope of this article, but you can read more about it in The Outcome Economy, a post by Rebase founder Jay Bowles.

The Rise of Cryptoeconomics and Open Data Infrastructure

This article is not the first elucidation of the problem. However, as many VCs will tell you, the critical piece in solving any problem is timing. What has happened recently to make this problem solvable for the first time? Below are some ideas.

Cryptowallets Emerge as the First Sovereign Data Store

Many think of cryptowallets such as MetaMask and Bitcoin wallets as a financial innovation first and foremost – "not your keys, not your crypto". But I believe Trent McConaghy is more accurate when he uses the generalisation "not your keys, not your data".

The emergence of cryptowallets is important because it shows demand and capacity for sovereign data custody. Individuals and institutions must begin to take custody of their own data if an open model of predictive capacity is to emerge.

The Franchise Model

The Franchise Model is a design pattern popping up across crypto. In essence, it enables infinite scalability of product diversity whilst retaining a high-level governing body. The model places a governing body at the top which designs, provides resources for and authorises smaller bodies to create and manage productive units below it.

It's visible in Polkadot's parachain/relay chain relationship and Cosmos' Hub/zone model. I outlined a potential upgrade to THORChain which utilises it. José Maria Macedo detailed a design for Aave deploying it called aDAOs. And Ocean Protocol are moving towards it with the datatoken/Balancer pool model.

I believe this is a paradigm shift in allowing crypto projects to scale permissionlessly and is therefore a critical component enabling crypto-based infrastructure to create predictive models.

A (Ludicrous) Abundance of Prediction Markets

Prediction markets have been on the scene since the early days of Ethereum. Vitalik wrote about futarchy and Joey Krug et al released Augur.

Prediction markets thus far have largely been focused on coordinating human predictions and opinions about various matters and events. For whatever reason – my suspicion is a lack of scale and real demand so far – prediction markets have floundered and have not found product/market fit. This is despite ongoing iteration on projects like Augur and countless new competitors popping up – Flux and Omen spring to mind.

I believe that although the time will come for these prediction markets, the focus may be better spent on a simpler, more single-player game – machine-focused prediction markets. These are prediction markets with the unique capacity to be entirely closed-loop – i.e. to receive the result digitally – and entirely automated. Many prediction market designs and live projects could quickly be tweaked to suit this application, particularly those on performant base layers.
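
To make that concrete, here is a minimal, purely illustrative sketch in Python of what a closed-loop, machine-focused prediction market could look like. Everything in it – the Forecast record, the settle function, the bot names – is a hypothetical of mine rather than any live protocol: bots post stake alongside numeric forecasts, the outcome arrives from a digital feed, and settlement is fully automated, with payouts weighted by accuracy.

```python
from dataclasses import dataclass


@dataclass
class Forecast:
    agent: str    # identifier of the predicting bot
    value: float  # the numeric prediction it submitted
    stake: float  # collateral posted alongside the prediction


def settle(forecasts: list[Forecast], outcome: float) -> dict[str, float]:
    """Redistribute the total stake in proportion to prediction accuracy.

    The outcome is assumed to arrive from a digital feed (an oracle),
    so no human reporting or dispute round is needed.
    """
    # Score each forecast by inverse absolute error (closer = larger score).
    scores = {f.agent: 1.0 / (1.0 + abs(f.value - outcome)) for f in forecasts}
    pool = sum(f.stake for f in forecasts)
    total_score = sum(scores.values())
    # Pay out the whole pool, weighted by accuracy score.
    return {agent: pool * score / total_score for agent, score in scores.items()}


# Example round: three bots forecast tomorrow's average temperature.
payouts = settle(
    [Forecast("bot_a", 21.0, 10), Forecast("bot_b", 25.0, 10), Forecast("bot_c", 18.0, 10)],
    outcome=20.5,
)
print(payouts)  # bot_a, the closest forecaster, takes the largest share
```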

In Production: Crypto-Led Predictions and Ensembles

Numerai sources predictions from a sea of data scientists and uses them to build a meta-model – a prediction model which combines lower-level prediction models to form a more accurate aggregate. This is known as an ensemble approach – more on this in a moment. Numerai is revolutionary in many ways: it has proven out a system whereby individual data scientists will accept cryptoassets in return for contributing to the process, and the system seems to be working. Numerai also uses staking, meaning data scientists need to deposit purchasing power in order to play – purchasing power which can be taken from them for bad behaviour.

Despite Numerai's brilliant model, its founder Richard Craib has stated explicitly (though who knows, it may be misdirection) that he believes there is limited potential in generalising the technology beyond stock market predictions. I take the exact opposite stance.

Ensembles

An ensemble in machine learning is built on the same simple idea as the 'wisdom of the crowds'. The wisdom of the crowds principle states that the average opinion of a group of people on a particular question will tend to be far closer to the correct answer than the opinion of any single person chosen at random. This is very similar to the thinking behind markets in general.

Ensembles in machine learning have been shown to work in a similar way. That is, if you combine the predictions of a diverse set of algorithms, the ensemble will reach a much higher level of accuracy than any one algorithm on its own. The problem to date has been that it's too expensive to run an ensemble approach – even at a company as big as Netflix. The reason is that it's too difficult to discover and track which algorithms contribute positively to the accuracy of the ensemble – the meta-model – and which detract from it.
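
As a rough illustration of the principle – a toy example, not Numerai's or Netflix's actual pipeline – the sketch below trains three deliberately different scikit-learn regressors on the same synthetic dataset and simply averages their predictions; the averaged meta-model will typically show a lower error than most of the individual models.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# A synthetic, non-linear regression problem with some noise.
X, y = make_friedman1(n_samples=2000, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Three deliberately different model families, standing in for three data scientists.
models = {
    "ridge": Ridge(),
    "tree": DecisionTreeRegressor(max_depth=6, random_state=0),
    "knn": KNeighborsRegressor(n_neighbors=15),
}

predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions[name] = model.predict(X_test)
    print(f"{name:8s} MAE: {mean_absolute_error(y_test, predictions[name]):.2f}")

# The naive ensemble: average the individual predictions.
ensemble = np.mean(list(predictions.values()), axis=0)
print(f"ensemble MAE: {mean_absolute_error(y_test, ensemble):.2f}")
```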

Here's a great intro to the topic of ensembles in relation to predictive models by Andrew Simmons.

Crypto – with its granular ownership, staking economics and prediction markets – offers a compelling new set of solutions to these issues. Crypto pushes the burden of deciding whether or not a model is additive to the ensemble down to the individual model builder. This creates a new, previously impossible degree of efficiency. Numerai has been proving this out with their platform – particularly with the new Signals product.
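
As a sketch of how that might work mechanically – an assumption-laden toy of my own, not Numerai's actual scheme – the snippet below weights each contributor's predictions by their stake, then slashes the stake of contributors whose models are less accurate than the meta-model and rewards the rest.

```python
import numpy as np


def stake_weighted_ensemble(predictions: dict[str, np.ndarray],
                            stakes: dict[str, float]) -> np.ndarray:
    """Combine contributors' predictions, weighted by how much they have staked."""
    total = sum(stakes.values())
    return sum(stakes[name] / total * preds for name, preds in predictions.items())


def settle_stakes(predictions, stakes, y_true, slash_rate=0.1):
    """Grow the stake of contributors who beat the meta-model's error; slash the rest.

    Real designs tend to score each model's marginal contribution to the
    meta-model rather than its standalone error, but the incentive is the same.
    """
    meta_error = np.abs(stake_weighted_ensemble(predictions, stakes) - y_true).mean()
    new_stakes = {}
    for name, preds in predictions.items():
        error = np.abs(preds - y_true).mean()
        factor = 1 + slash_rate if error < meta_error else 1 - slash_rate
        new_stakes[name] = stakes[name] * factor
    return new_stakes


# Toy round: two accurate contributors and one noisy one, all staking equally.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
predictions = {
    "alice": np.array([1.1, 2.0, 2.9, 4.2]),
    "bob":   np.array([0.9, 2.1, 3.1, 3.8]),
    "carol": np.array([3.0, 0.0, 5.0, 1.0]),  # a model that hurts the ensemble
}
stakes = {"alice": 100.0, "bob": 100.0, "carol": 100.0}

print(stake_weighted_ensemble(predictions, stakes))
print(settle_stakes(predictions, stakes, y_true))  # alice and bob gain, carol is slashed
```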

Tokens and Internet-Native Cooperatives

The fine-grained ownership described above is made possible by recent advances in internet-first organisations. Tooling for and examples of DAOs have exploded in the past year – nearly every DeFi project has designs on becoming decentralised through tokenisation and some model of internet-native governance.

Not only do these enable governance and decision-making at a new scale, they also permit market activity, price discovery and third-party participation. DAOs make it possible to speculate on the value of different types of prediction and on the best models for making those predictions.

On a different note: whilst VCs make much of "10x better" and novel innovations, it's very rare that an entirely novel area of "10x better" emerges. Fine-grained, permissionless ownership is one of those areas. And in this case, I believe it will now be possible to own a direct stake in the demand for specific types of predictive computation – something currently not even available to the heads of Google, Facebook and Amazon or any of their major investors.

Open Data Sources and Infrastructure

The nascent open data economy is also an extremely powerful unlock for higher-level applications. Already, would-be open model builders have access to datasets on Ocean, powered by privacy-preserving compute-to-data.

Users and businesses increasingly have easy ways to unionise their data – to create data DAOs – through services like Streamr. Streamr is also building infrastructure for streaming real-time data in a decentralised way.

Blockchain in/out is looking increasingly like a solved problem, with the likes of The Graph pulling data off-chain, and Chainlink and API3 putting it on-chain seamlessly.

On top of that comes possibly the most important innovation of them all – the flattener, the standard for all data standards: an open schema/document graph being built out by Ceramic.

What's the Missing Piece?

In this post we've got a sense of—

  • the stagnation in predictive model building
  • the incredible things we could have if we improved the system
  • some promising signs of a new substrate for AI development

In the next post I will outline an idea for a cryptonetwork which leverages this new substrate, aiming to create an upshift in predictive model generation and unimaginable new applications for AI.

Follow me or follow Rebase on Twitter to hear when I release part 2.
