Web3 Data Stacks – The Evolution and Challenges

masterdai

Devrel

Ever since Satoshi included a message inside the genesis block, the data structure of the Bitcoin chain has continued to evolve.

I started learning blockchain development in 2022, and the first book I read was 'Mastering Ethereum.' It is a great book that taught me the fundamentals of Ethereum and blockchains. From today's perspective, though, some of the development practices it describes are outdated: the first step is running a node on your own laptop, and even for a wallet dApp you are expected to download a light node yourself. This reflects how early developers and hackers in the blockchain ecosystem worked between 2015 and 2018.

There were barely any node service providers back in 2017. From a supply and demand perspective, user activity was limited and the primary use case was simple transfers. Maintaining or hosting a full node yourself therefore did not cost much: there were few RPC requests to handle, and transfer requests were infrequent.

Most early Ethereum adopters were geeks. These early users had a deep understanding of blockchain development and were comfortable maintaining Ethereum nodes, creating transactions, and managing accounts directly from the command line or an IDE.

Therefore, we can observe that early projects usually had a very simple UI/UX. Some of them didn't even have a front end, and user activity was low. This was actually determined by two factors: user behavior and the chain's data structure.

Starting from the node provider

As more and more users with no coding background joined blockchain networks, the technical structure of decentralized applications shifted from users hosting nodes to projects hosting nodes.

figure1-node.png
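To give a sense of how small this switch looks from the application side, here is a minimal sketch using ethers.js: the only thing that changes when moving from a self-hosted node to a hosted one is the RPC URL. The provider URL below is a placeholder, not a real endpoint.

```typescript
import { ethers } from "ethers";

// Before: talking to a node you run yourself.
const localProvider = new ethers.JsonRpcProvider("http://127.0.0.1:8545");

// After: pointing the same code at a hosted endpoint.
// Placeholder URL; substitute your node provider's RPC endpoint and key.
const hostedProvider = new ethers.JsonRpcProvider("https://example-node-provider.xyz/v1/YOUR_API_KEY");

async function main() {
  // The application logic is identical either way; only the endpoint changes.
  const blockNumber = await hostedProvider.getBlockNumber();
  console.log("hosted node is at block:", blockNumber);
}

main().catch(console.error);
```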

People prefer node hosting services because on-chain data is growing rapidly, which makes it costly for individuals to run a node themselves.

figure2-txTrend.png

However, for small project teams, hosting nodes themselves is still a difficult process that requires continuous spending on maintenance and hardware. This cumbersome work is therefore handed over to companies that specialize in maintaining nodes. It is worth noting that the period in which these companies were founded and financed at scale coincides with the rise of cloud services in the North American technology industry.

figure3-node-year.png

figure4-cloud-computing-googleTrend.png

Remote node hosting alone has not solved the problem, especially with the rise of protocols such as DeFi and NFTs. In recent years, developers have also had to face the issue of data handling. The data a blockchain node provides is what we call raw data; it is neither standardized nor clean, and it needs to be extracted, cleaned, and loaded.

For example, suppose I am building an NFT project and want to support NFT trading or display NFTs. My frontend then needs to read the real-time NFT holdings of individual EOA accounts. An NFT is really just a standardized form of token: owning an NFT means owning one of the tokens generated by an NFT contract (each with a unique ID), and the NFT's image is actually metadata, which could be inline SVG data or a link pointing to an image on IPFS. Although Ethereum's Geth client exposes the calls needed to look this information up, for frontend-heavy projects it is unrealistic in engineering terms to keep querying Geth and relaying the results back to the frontend.
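To make this concrete, here is a minimal sketch (using ethers.js, with a placeholder RPC URL and contract address) of what reading a single token's owner and metadata pointer from the chain looks like. In practice, a frontend cannot afford to issue calls like this against a node on every render, which is exactly the engineering problem described above.

```typescript
import { ethers } from "ethers";

// Minimal ERC-721 fragment: enough to resolve an owner and its metadata pointer.
const erc721Abi = [
  "function ownerOf(uint256 tokenId) view returns (address)",
  "function tokenURI(uint256 tokenId) view returns (string)",
];

const provider = new ethers.JsonRpcProvider("https://example-node-provider.xyz/v1/YOUR_API_KEY"); // placeholder
const nft = new ethers.Contract(
  "0x0000000000000000000000000000000000000000", // placeholder: the NFT contract address
  erc721Abi,
  provider
);

async function inspectToken(tokenId: bigint) {
  const owner = await nft.ownerOf(tokenId); // which account holds this token ID
  const uri = await nft.tokenURI(tokenId);  // metadata: often an ipfs:// or https:// link, or inline JSON/SVG
  console.log({ tokenId: tokenId.toString(), owner, uri });
}

inspectToken(1n).catch(console.error);
```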

Some features, such as order auctions and NFT transaction aggregation, must be implemented off-chain to collect user instructions and then submit them on-chain at the appropriate time.

Therefore, a simple data layer was born. To meet users' requirements for real-time, accurate data, the project team needs to build its own database and data-parsing functions.

figure5-web3projectstructure.png
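What that simple data layer usually amounts to, in a very reduced sketch, is a loop that pulls logs from a node, decodes the events the project cares about, and writes rows into the project's own database for the frontend to query. The example below assumes ethers.js, a placeholder RPC URL and token address, and a hypothetical saveTransfer() persistence function standing in for whatever database the team runs.

```typescript
import { ethers } from "ethers";

const provider = new ethers.JsonRpcProvider("https://example-node-provider.xyz/v1/YOUR_API_KEY"); // placeholder
const TOKEN_ADDRESS = "0x0000000000000000000000000000000000000000"; // placeholder ERC-20 contract
const TRANSFER_TOPIC = ethers.id("Transfer(address,address,uint256)");

const iface = new ethers.Interface([
  "event Transfer(address indexed from, address indexed to, uint256 value)",
]);

// Hypothetical persistence step: in a real project this would insert into Postgres, ClickHouse, etc.
async function saveTransfer(row: { block: number; from: string; to: string; value: string }) {
  console.log("would insert:", row);
}

async function indexRange(fromBlock: number, toBlock: number) {
  const logs = await provider.getLogs({
    address: TOKEN_ADDRESS,
    fromBlock,
    toBlock,
    topics: [TRANSFER_TOPIC],
  });

  for (const log of logs) {
    const parsed = iface.parseLog({ topics: [...log.topics], data: log.data });
    if (!parsed) continue; // not a log this ABI fragment can decode
    const [from, to, value] = parsed.args;
    await saveTransfer({ block: log.blockNumber, from, to, value: value.toString() });
  }
}

indexRange(18_000_000, 18_000_010).catch(console.error);
```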

How did data indexers evolve?

Starting a project is always easy. You come up with an idea, set up some goals, recruit the best engineers, and build a usable prototype, usually a front end and several smart contracts.

However, it's hard to make it scale. You need to think carefully about the design structure from the very first day of the project. Otherwise, you will soon run into problems, a scenario I usually refer to as the 'icing problem'.

figure6-icingproblem.png

I borrowed the term from the Iron Man movie, which amusingly seems very apt for most startup scenarios. When startups soar too high (attracting too many users), they often stumble because they did not anticipate this situation at the outset. In the movie, the protagonist didn't account for the icing problem, as he never expected his war suit to fly into space. Similarly, for many Web3 project developers, their 'icing problem' relates to handling massive adoption. The primary issue they grapple with is a surge in user numbers, which imposes a significant load on the server side. At times, it's an issue with the chain itself, such as network problems or node shutdowns.

figure7-overlod-web3project.png

Most of the time, it's a backend issue. This happens frequently in blockchain gaming protocols: the team never anticipated such a large number of players, and by the time they devise a plan to add more servers and hire more data engineers to decode the on-chain data, it's too late. These technical issues cannot be fixed simply by adding more backend engineers; as I mentioned before, they need to be considered from the start.

The second problem is adding a new chain. You might have avoided the server-side problems and hired a group of decent engineers from the start. However, your users may not be satisfied with the chains you currently support; they want your services deployed on other popular chains as well, such as zk or L2 chains. Your project structure will eventually end up like the one below.

figure8-complex-web3structure.png

In this system, you have full control over your data, which allows for better management and security. The system limits the call requests, reducing the risk of overloading and improving efficiency. And the setup is compatible with your front-end, ensuring seamless integration and user experience.

However, operation and maintenance costs will multiply, which can strain your resources. Every time you add a new chain, repetitive work is required, which is time-consuming and inefficient. Selecting data from a huge dataset drives up your query times, potentially slowing down downstream processes. And data can become polluted by blockchain network problems such as rollbacks and reorgs, compromising its integrity and reliability.

The design of the project reflects your team composition. Adding more nodes and building a heavily backend-oriented system means hiring more engineers to operate the nodes and decode the raw data.

This model is similar to the early days of the internet, when e-commerce platforms and app developers opted to build their own IDC (Internet Data Center) facilities. However, as user requests grow and blockchain networks face state explosion, costs increase in parallel with the intricacy of the program design. Furthermore, this approach hampers rapid market expansion: certain high-performance public blockchains demand hardware-intensive node operations, while data synchronization and cleansing continuously consume human resources and time.

If you are trying to build an NFT marketplace or cool blockchain-based games, isn't it odd that 65% of your team members are backend and data engineers?

Maybe developers wonder why someone can't decode and stream that on-chain data for them, so they can focus on building a better product.

I believe that is the reason indexers showed up.

figure9-apiprovider.png

In an effort to reduce the barriers to accessing web3 applications and blockchain networks, many developers, including our team, have chosen to integrate archive node maintenance, on-chain data ETL (Extract, Transform, Load), and database invocation into one offering. These tasks were traditionally managed by project teams themselves; now they are exposed through multi-chain data and node APIs.

By utilizing these APIs, users can customize the on-chain data they require, from popular NFT metadata and monitoring a specific address's on-chain behavior to tracking transaction data in specific token liquidity pools. This approach is what I often refer to as a modern Web3 project structure.

figure10-modern_we3structure.png

The financing and establishment of data-layer and indexing-layer projects mostly occurred in 2022. I believe the commercial viability of these projects is closely related to the design of their underlying data architecture, specifically the design of OLAP (On-Line Analytical Processing) systems. The choice of core engine is key to the performance of the indexing layer, including indexing speed and stability. Commonly used engines include Hive, Spark SQL, Presto, Kylin, Impala, Druid, ClickHouse, and others. ClickHouse, a powerful database widely used by internet companies, was open-sourced in 2016 and secured $250 million in funding in 2021.
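To illustrate why the OLAP engine matters, the kind of question an indexing layer has to answer is typically a scan-heavy aggregation over millions of decoded rows. The ClickHouse-flavoured SQL below is only a sketch over a hypothetical transfers table, but it is representative of the workload these engines are chosen for.

```sql
-- Hypothetical schema: one row per decoded ERC-20 transfer.
CREATE TABLE IF NOT EXISTS transfers (
    block_number UInt64,
    block_time   DateTime,
    token        FixedString(42),
    from_addr    FixedString(42),
    to_addr      FixedString(42),
    value        UInt256
) ENGINE = MergeTree
ORDER BY (token, block_time);

-- Daily active senders and transfer count for one token over the last 30 days.
SELECT
    toDate(block_time)   AS day,
    uniqExact(from_addr) AS active_senders,
    count(*)             AS transfer_count
FROM transfers
WHERE token = '0x0000000000000000000000000000000000000000'  -- placeholder token address
  AND block_time >= now() - INTERVAL 30 DAY
GROUP BY day
ORDER BY day;
```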

figure11-Web3project.jpeg

As a result, the emergence of a new generation of databases and improved data-indexing architectures has given rise to the web3 data indexing layer, enabling companies in this domain to provide data API services faster and more efficiently.

However, the beauty and clarity of the indexer module are currently obscured by two clouds.

Two clouds

The first cloud is related to the impact of the stability of the blockchain network on the server-side. Blockchain networks, while robust, are not immune to instability and inaccuracies. Events like reorgs and rollbacks can occur, posing challenges for indexers.

Reorgs emerge when blockchain nodes lose sync temporarily, leading to the creation of two distinct blockchain versions. Such events can stem from system glitches, network delays, or even malicious activities. Upon re-syncing, nodes converge on a single, official chain, leaving the blocks from the alternate 'fork' discarded.

In the event of a reorg, an indexer might have processed data from blocks that eventually get discarded. Consequently, the indexer must adapt by discarding data from the invalidated chain and reprocessing data from the newly accepted one.

figure12-reorg-eth.png

Such adjustments can escalate resource usage and potentially delay data availability. In severe cases, frequent or large reorgs can significantly undermine the reliability and performance of services depending on the indexer, including web3 applications that utilize data from the indexer's APIs.
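As a rough sketch of how an indexer copes with this, the usual trick is to compare each new block's parentHash against the hash it last stored: a mismatch means the blocks it indexed are no longer canonical, so it walks back to the fork point, discards what it wrote, and re-indexes from there. The storage helpers below (getStoredBlockHash, rollbackToBlock, saveBlockData) are hypothetical stand-ins for the indexer's own database, and the RPC URL is a placeholder.

```typescript
import { ethers } from "ethers";

const provider = new ethers.JsonRpcProvider("https://example-node-provider.xyz/v1/YOUR_API_KEY"); // placeholder

// Hypothetical storage helpers backed by the indexer's own database.
declare function getStoredBlockHash(height: number): Promise<string | null>;
declare function rollbackToBlock(height: number): Promise<void>;
declare function saveBlockData(height: number, hash: string): Promise<void>;

export async function handleNewBlock(height: number) {
  const block = await provider.getBlock(height);
  if (!block || !block.hash) return;

  // If the parent we previously indexed no longer matches this block's parentHash, a reorg happened.
  const storedParent = await getStoredBlockHash(height - 1);
  if (storedParent !== null && storedParent !== block.parentHash) {
    // Walk back until the stored hash and the canonical hash agree again: that's the fork point.
    let forkPoint = height - 1;
    while (forkPoint > 0) {
      const canonical = await provider.getBlock(forkPoint);
      if (canonical && (await getStoredBlockHash(forkPoint)) === canonical.hash) break;
      forkPoint--;
    }
    // Discard everything indexed above the fork point; it came from the abandoned fork.
    await rollbackToBlock(forkPoint);
  }

  await saveBlockData(height, block.hash);
}
```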

The second cloud obscuring the clarity of the indexer module pertains to data format compatibility and the diversity of data standards across different blockchain networks.

In the world of blockchain technology, there is a vast array of different networks, each with its own unique data standards. For instance, there are EVM (Ethereum Virtual Machine) compatible chains, non-EVM chains, and even zk (zero-knowledge) chains, each with their own unique data structures and formats.

For an indexer, this presents a significant challenge. To provide useful and accurate data through its APIs, the indexer must be able to understand and process these diverse data formats. However, there is no universal standard for blockchain data, and different indexers may use different standards for their APIs. This can lead to compatibility issues, where data extracted and transformed by one indexer may not be compatible with the systems used by another.

figure13-toomuch-indexer.png

Furthermore, as developers navigate this multi-chain world, they are often faced with the challenge of dealing with these different data standards. A solution that works for one blockchain network may not work for another, making it difficult to develop applications that can interact with multiple networks.

Indeed, the challenges faced by the blockchain indexing industry are reminiscent of the "two clouds" that Lord Kelvin identified as major unresolved issues in physics at the turn of the 20th century. Those "two clouds" eventually gave rise to the revolutionary fields of quantum mechanics and relativity.

In the face of these challenges, one might consider measures such as introducing confirmation delays, integrating streaming through a Kafka pipeline, or establishing a standards alliance for the blockchain indexing industry. These measures could help manage the instability of blockchain networks and the diversity of data standards, making it easier for indexers to provide accurate and reliable data.
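One concrete form that "introducing delays" can take, sketched below, is a small finality buffer in the streaming pipeline: blocks are held back and only published downstream (for example, produced to a Kafka topic) once they are a configurable number of confirmations deep, trading a little latency for much more stable data. The publishDownstream() function and the depth of 12 are illustrative assumptions.

```typescript
// Illustrative finality buffer for a streaming pipeline.
interface IndexedBlock {
  number: number;
  hash: string;
}

// Hypothetical sink: e.g. produce the block's decoded data to a Kafka topic.
declare function publishDownstream(block: IndexedBlock): Promise<void>;

const CONFIRMATION_DEPTH = 12; // assumed safety margin; tune per chain
const pending: IndexedBlock[] = [];

export async function onNewHead(head: IndexedBlock) {
  pending.push(head);
  // Emit everything that is now at least CONFIRMATION_DEPTH blocks behind the chain head.
  while (pending.length > 0 && head.number - pending[0].number >= CONFIRMATION_DEPTH) {
    const confirmed = pending.shift()!;
    await publishDownstream(confirmed);
  }
}
```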

However, just as the emergence of quantum theory revolutionized our understanding of the physical world, we might also consider more radical approaches to improving blockchain data infrastructure. After all, the current infrastructure, with its neatly structured data warehouses and stacks, may seem too perfect and beautiful to be true.

Could there be another way?

Possible Future

Find a pattern

Let's roll back to the first topic, when node providers and indexers appeared, and ask a peculiar question: why didn't node operators appear in 2010, and why did indexers get founded and financed so explosively in 2022?

I believe I have already answered part of that question: it comes down to the maturation of cloud computing and data warehouse technology in the software industry outside the crypto sphere.

Something special also happened inside crypto, particularly when the ERC20 and ERC721 standards became popular in public media. Additionally, the DeFi summer made on-chain data much more complex: all kinds of call transactions were routed through different smart contracts, rather than just the simple transfer data of the early days.

figure14-future.png

Despite a certain reluctance within the crypto community to align with so-called web2 entities, it's undeniable that the evolution of crypto infrastructure is intrinsically tied to advancements in fields such as mathematics, cryptography, cloud technology, and big data. Much like the intricate interlocking of a traditional Chinese mortise and tenon structure, each component within the crypto and software ecosystem is closely interconnected.

Technological development and application innovation are invariably influenced by underlying objective principles. Without the foundational support of elliptic curve cryptography, existing cryptocurrencies would be non-existent. Similarly, without a seminal research paper from MIT in 1985, practical applications of ZK would be off the table.

An intriguing pattern emerges from this. The widespread adoption and proliferation of node service providers hinge on the surge of global cloud services and virtualization technology. Meanwhile, the development of the on-chain data layer is predicated on superior open-source database architecture and services. This is the data framework solution that numerous BI products have adopted in recent years. Essentially, these are the technical prerequisites that startups must fulfill to achieve commercial viability. For web3 projects, a rule of thumb is that those employing advanced infrastructure have the upper hand over those reliant on outdated architecture. The erosion of OpenSea's market share by faster and more user-friendly NFT exchanges serves as a prime example.

Moreover, an obvious trend is emerging. AI and LLM technologies have generally matured and hold the potential for massive adoption. The question then arises: how will AI change the game of on-chain data?

Predict the future

Predicting the future is difficult, but we can begin by addressing the challenges encountered during blockchain development. The needs are straightforward: there's a requirement for accurate, timely on-chain data in a format that's easy to understand.

The current problem is that obtaining certain data requires complex SQL queries. I believe this is why Dune's feature of open-sourcing SQL code is popular in the crypto community: ordinary users don't need to reinvent the wheel; they can simply fork a query and modify the contract address to create the graph they want. However, this is still too complex for average individuals who simply want to check liquidity or airdrop data under specific conditions.
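For a sense of what "fork and modify the contract address" means in practice, a typical Dune-style query looks roughly like the one below (table and column names follow Dune's Ethereum dataset; the address is a placeholder). Swapping the address is usually the only edit a user has to make.

```sql
-- Daily transactions sent to one contract; fork the query and swap the address to reuse it.
SELECT
    date_trunc('day', block_time) AS day,
    count(*)                      AS tx_count
FROM ethereum.transactions
WHERE "to" = 0x0000000000000000000000000000000000000000  -- placeholder contract address
GROUP BY 1
ORDER BY 1;
```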

I believe the first approach to helping users resolve this query problem involves the use of LLM.

In light of these complexities, we can pivot towards building a more user-centric 'data query' interface that leverages LLMs. Today, users are expected to grapple with intricate query languages like SQL or GraphQL to pry the corresponding on-chain data out of APIs or Studios. With LLMs in our toolkit, we can offer a more intuitive, human-friendly way of posing queries: users articulate their questions in natural language, the LLM translates them into suitable queries, and the answers come back directly.

figure15-llm.png
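A minimal sketch of that idea: wrap the user's natural-language question in a prompt that constrains the model to emit a single SQL query over a known schema, then run whatever comes back against the data layer. Everything here is illustrative; completeWithLLM() and runQuery() are hypothetical stand-ins for an LLM client and a query runner, and the schema is invented.

```typescript
// Hypothetical LLM client and query runner; neither is a real SDK call.
declare function completeWithLLM(prompt: string): Promise<string>;
declare function runQuery(sql: string): Promise<unknown[]>;

// Invented schema hint describing the tables the model is allowed to use.
const SCHEMA_HINT = `
Table transfers(block_time DateTime, token String, from_addr String, to_addr String, value UInt256)
`;

export async function askOnChain(question: string): Promise<unknown[]> {
  const prompt = [
    "You translate questions about on-chain data into a single SQL query.",
    "Only use the tables described below and return SQL with no commentary.",
    SCHEMA_HINT,
    `Question: ${question}`,
  ].join("\n");

  const sql = await completeWithLLM(prompt);
  // In production, the generated SQL should be validated and sandboxed before execution.
  return runQuery(sql);
}

// A user asks in natural language instead of writing SQL by hand.
askOnChain("How many unique addresses received this token in the last 7 days?").then(console.log);
```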

From a developer's point of view, AI can facilitate better on-chain contract event parsing and ABI decoding. Currently, many DeFi contract details require manual data parsing and decoding by developers. With AI's assistance, various contract disassembly techniques can be significantly improved, enabling the swift retrieval of corresponding ABIs. When paired with a large language model (LLM), this setup allows for intelligent parsing of function signatures and efficient processing of diverse data types. Furthermore, when merged with a stream-computing architecture, the system can parse transaction data in real time and cater to users' immediate needs.
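The mechanical half of that job already looks something like the sketch below: once the ABI fragment is known, a library such as ethers.js can parse a transaction's function signature and arguments from its calldata. The hard part, and the part where AI-assisted tooling would help, is recovering the right fragment when it is not published; the function signature used here is just an illustrative example.

```typescript
import { ethers } from "ethers";

// A known ABI fragment: decoding calldata is mechanical once the signature is available.
const iface = new ethers.Interface([
  "function swapExactTokensForTokens(uint256 amountIn, uint256 amountOutMin, address[] path, address to, uint256 deadline)",
]);

// `calldata` would be the `data` field of an on-chain transaction.
export function decodeCall(calldata: string) {
  const parsed = iface.parseTransaction({ data: calldata });
  if (!parsed) {
    // Unknown selector: this is where signature databases or AI-assisted
    // contract disassembly would have to step in to recover an ABI.
    return null;
  }
  // e.g. the function name plus its decoded argument values.
  return { name: parsed.name, signature: parsed.signature, args: parsed.args };
}
```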

From a broader perspective, the goal of an indexer is to provide accurate data to users. As I emphasized earlier, the underlying problem with the on-chain data layer is that each piece of data sits isolated in a different indexer's database. To serve many kinds of data needs, some designers decided to incorporate all chain data into a single database, allowing users to select from one data cloud table; other protocols include only a small slice of the data, such as DeFi and NFT data. However, the issue of incompatible data standards remains. Sometimes developers need to take data from different sources and reformat it in their own database, which again increases their maintenance cost. And when a given data provider goes down, they cannot migrate to another provider in a timely manner.

figure16-complex-indexer.png

How can LLMs and AI solve this problem? LlamaIndex gave me an insight: what if developers didn't need the indexer at all, and instead used a deployed agent service to read raw on-chain data directly? Such an agent combines the technology of an indexer with an LLM. From the user's point of view, they don't need any API details or query language; they can ask a question directly and get feedback in real time.

figure17-maybe-the-future-aiweb3stack.png

Equipped with LLM and AI technologies, the agent is capable of comprehending and processing raw data, transforming it into a format that end-users can easily digest. It eliminates the need for users to grapple with complex APIs or query languages; they can simply pose their questions in natural language and receive real-time feedback. This feature enhances data accessibility and user-friendliness, inviting a broader user base to engage with on-chain data.

Additionally, the agent's approach addresses the issue of data standard incompatibility. Designed to decipher and process raw on-chain data, it can accommodate data in various formats and standards. Consequently, developers are freed from the task of reformatting data from different sources, thus alleviating their workload.

Of course, this may simply be speculation about the future trajectory of on-chain data. But in the realm of technology, it's often these audacious concepts and theories that spur transformative advances. We should remember that all monumental breakthroughs in history, from the invention of the wheel to the advent of blockchain, started as someone's hypothesis or 'wild' idea.

As we embrace change and uncertainty, we are challenged to persistently expand the limits of possibility. In this context, we envision a world where the convergence of AI, LLM, and blockchain cultivates a more accessible and inclusive technological landscape.

Chainbase champions this vision and tirelessly strives to bring it to fruition.

Our mission at Chainbase is to construct an open, user-friendly, transparent, and sustainable crypto data infrastructure. We aim to streamline the usage of this data for developers, removing the necessity for intricate backend technology stack reconstructions. By doing so, we aspire to herald a future where technology not only serves users—it empowers them.

I must clarify, however, that this is not our roadmap. Instead, these are my personal reflections, as someone working in developer relations, on the recent evolution and progress of on-chain data. I appreciate the insightful suggestions and guidance from Chainbase's developers. As a conduit of communication in the Chainbase community, I invite individuals with related ideas to engage with us in our community.

This is Masterdai, and I will be with you all the time. :-)

About Chainbase

Chainbase is an all-in-one data infrastructure for Web3 that allows you to index, transform, and use on-chain data at scale. By leveraging enriched on-chain data and streaming computing technologies across one data infrastructure, Chainbase automates the indexing and querying of blockchain data, enabling developers to accomplish more with less effort.

Want to learn more about Chainbase?

Visit our website chainbase.com, sign up for a free account, and check out our documentation.

Website · Blog · Twitter · Discord · Link3