How to save users $100M in gas fees & headaches | Paper's playbook for massively scalable on-chain transactions

Stuck queues, gas wars, and unexpected errors. An overview of why these issues occur and how you can build a robust, high-performing system.

Earlier this year, in May, the Otherside Metaverse project by Yuga Labs dropped its NFTs, netting over $300M in revenue at the time. The scary part: users spent over $150M on gas fees trying to get their stuck transactions to go through.

That’s pretty crazy.

If you’re a developer who has encountered any of the following problems:

  • Transactions that are stuck pending for a long time
  • Transactions that return an unexpected error and succeed after a retry
  • One pending transaction that blocks other transactions from completing
  • Gas fees that look ridiculous and have you waiting for the price to come back within reason

…then you’re in for a treat!

Today, I’m going to share the pieces required to build a robust system that addresses these issues and more.

Okay, but why should I believe you?

First off: Hi, I’m Winston, an engineer at Paper who built our backend infrastructure to handle popular NFT drops that require processing hundreds of concurrent on-chain transactions. (We’re hiring!)

Introductions out of the way, let’s see how the Otherside Metaverse could have saved their users $100M on gas!

Background

Pushing transactions onto the blockchain is hard.

The blockchain is public and is shared among everyone. As such, it’s not always possible to guarantee that your requests will be fulfilled immediately, in order, or within a given time duration 😢.

Existing tools also don’t make it easy to track requests, and they offer no built-in way to set up multiple dependent transactions.

I will first share a high-level overview of why some of these issues happen. Then, I’ll lay out practical solutions for each of them to build a robust and high-performing system.

Ready? Let’s get started.

Why issues arise: Pruning transactions

Imagine you’re a schoolteacher and you have 50 total chocolate bars to offer your students.

How would you best distribute them?

The most capitalist strategy is to auction the chocolate off to the 50 highest bids.

Now, replace the chocolate bars with open spots on an Ethereum block and limit the number of bids per student. You essentially have an Ethereum client. For example, the Go Ethereum client has a default limit of 16 executable transaction slots per account.

What does this mean? Each student can only make up to 16 pending bids.

What happens if you submit more than 16 concurrent transactions from a single wallet? Unexpected things happen. Some of your transactions might be stuck pending for a really long time. Some transactions might be dropped. Or maybe, everything will go through 🤞.

But, 16 is a theoretical pending transaction limit. In practice, it’s much more complicated. Transactions can be split among multiple nodes. Nodes can broadcast transactions to other nodes. Your RPC provider might perform optimizations on your behalf and rebroadcast these transactions. There’s clearly a lot more going on behind the scenes.

The actual process for how a transaction is broadcast and managed is beyond the scope of this post, but StackOverflow has some interesting answers to dig further. The result of having these limits means that transactions get dropped, or pruned.
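One cheap safeguard follows from this: never keep more in-flight transactions per wallet than a node is likely to hold. Here is a minimal Python sketch, assuming a cap of 16 to mirror Geth's default. `PendingLimiter` and its method names are hypothetical, made up for illustration, and not taken from any library.

```python
from collections import defaultdict

# Assumption: mirror Geth's default of 16 executable slots per account.
MAX_PENDING_PER_WALLET = 16


class PendingLimiter:
    """Counts in-flight transactions per wallet and refuses to exceed a cap."""

    def __init__(self, max_pending: int = MAX_PENDING_PER_WALLET) -> None:
        self.max_pending = max_pending
        self._pending: dict[str, int] = defaultdict(int)

    def try_acquire(self, wallet: str) -> bool:
        """Reserve a slot before sending; False means 'queue it for later'."""
        if self._pending[wallet] >= self.max_pending:
            return False
        self._pending[wallet] += 1
        return True

    def release(self, wallet: str) -> None:
        """Free a slot once the transaction is mined or given up on."""
        if self._pending[wallet] > 0:
            self._pending[wallet] -= 1
```

Since the real limit in practice depends on your RPC provider and the nodes behind it, treat the cap as a tunable starting point rather than a guarantee.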

Let’s talk about some of the strategies we employ at Paper to ensure the highest success rates.

Retries and nonce management

This is the obvious first thing to do. Each time we see a transaction fail, we simply try again.

How do we know when a transaction fails? It would be nice if the code throws an error. Most of the time it does. Unfortunately, sometimes the transaction gets stuck waiting forever. This can happen anytime a transaction gets dropped without replacement.

Okay, maybe we can retry after some time passes. Problem solved?

Not necessarily. Now we run into a new issue: what if we timed out for some reason other than the transaction getting dropped? Common examples include a congested blockchain, extremely high gas prices, a partial outage on the blockchain, or intermittent issues with our RPC node. If we naively retry after some time, we may end up submitting the same transaction twice.

In the best case, the contract has idempotency (one-time use) guards, the second transaction fails, and we waste gas on a failed transaction. In the worst case, we mint (and pay for) an NFT twice, which is a great surprise for buyers but a headache for us and the merchant.
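To make the timeout part concrete, here is a rough Python sketch of polling for a receipt with a deadline. The `get_receipt` callback is a stand-in for whatever your RPC client offers, and the function name and parameters are my own assumptions. The key thing to notice is what a `None` result means: "not confirmed yet", not "dropped".

```python
import time
from typing import Callable, Optional


def wait_for_receipt(
    get_receipt: Callable[[], Optional[dict]],
    timeout_s: float,
    poll_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Poll until a receipt shows up or the deadline passes.

    Returns the receipt, or None on timeout. None only tells us the
    transaction is unconfirmed; it may still be mined later.
    """
    deadline = clock() + timeout_s
    while True:
        receipt = get_receipt()
        if receipt is not None:
            return receipt
        if clock() >= deadline:
            return None
        sleep(poll_s)
```

The `clock` and `sleep` parameters exist only so the loop can be tested without actually waiting.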

How can we prevent this?

We manage the wallet’s nonce. The nonce is an incrementing numeric identifier that instructs nodes on how to order transactions from the wallet. The nonce starts at 0 and increases by one for each successfully mined transaction. Two transactions from the same wallet cannot use the same nonce, and nonce values cannot be skipped (nonce 5 must be successfully mined before nonce 6 can be mined).

You might be tempted to simply read the wallet’s nonce on-chain before sending each transaction, but this approach quickly breaks down at any scale. Why? The on-chain nonce does not account for pending transactions. Two requests made around the same time might read the same on-chain nonce and try to send transactions with the same nonce. This error might look like you successfully prevented a duplicate claim, but you actually missed a real customer’s order.

Paper assigns and tracks a nonce for each pending transaction and retries failures with that same nonce, which ensures the transaction completes at most once. When it is successfully mined, we flag the nonce as used so future transactions know to use the next available nonce.
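Here is a minimal Python sketch of that idea, not Paper's actual implementation. The `NonceManager` class and the notion of a `request_id` are assumptions for illustration. It hands out nonces off-chain under a lock, and a retry of the same request always gets its original nonce back. Because two transactions with the same nonce cannot both be mined, at most one copy can land.

```python
import threading


class NonceManager:
    """Off-chain nonce allocator for a single wallet (illustrative only)."""

    def __init__(self, starting_nonce: int) -> None:
        self._next = starting_nonce
        self._lock = threading.Lock()
        self._by_request: dict[str, int] = {}

    def nonce_for(self, request_id: str) -> int:
        """Return the nonce for a request.

        New requests get the next free nonce. Retries of a known request
        reuse the nonce they were first given.
        """
        with self._lock:
            if request_id not in self._by_request:
                self._by_request[request_id] = self._next
                self._next += 1
            return self._by_request[request_id]
```

A real system would persist this mapping (a database row per request, for example) so that a restart does not forget which nonces are already in flight.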

Quick recap: We discussed error handling, timeouts, and nonce management. With these pieces, you can now implement reliable retries.

But now we have a new problem: what happens when our tracked off-chain nonce and the wallet’s on-chain nonce are out of sync? Enter self-healing.

Self-healing

Remember when UST de-pegged? The same can happen with the off-chain nonce since it’s managed independently from the on-chain value. Since nonces must be strictly sequential, if our off-chain nonce is off by just one, new transactions may never execute and our whole system stops working. This de-sync risk effectively becomes a single point of failure in our entire system. That’s not great.

So, where does self-healing fit in?

Self-healing is the process of applying heuristics and safeguards to make sure that the backend detects and recovers from issues. At Paper, we set up an automatic detection system to self-heal the nonce managers on our backend wallets.

The system periodically checks the wallet’s on-chain nonce and re-syncs if needed. It also automatically fills nonce gaps. For example, if the wallet’s on-chain nonce is 5 but the tracked nonce is 6, we automatically submit a transaction with nonce 5 so that transactions starting with nonce 6 can proceed. Lastly, if all else fails, our system pings our engineers (I'm there 👦🏻) to manually review and correct any errors.
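The reconciliation step can be sketched as a pure function in Python. This is my simplified reading of the process, with hypothetical names. It assumes `onchain_nonce` is the next nonce the chain expects, `pending` is the set of nonces we believe are still in flight, and `next_tracked` is the nonce our tracker would hand out next.

```python
def plan_heal(onchain_nonce: int, pending: set[int], next_tracked: int) -> dict:
    """Decide how to bring the off-chain tracker back in line with the chain.

    Returns:
      resync_to: the value the tracker's next nonce should be set to
      fill:      nonces that nothing is pending for and that must be
                 filled before later transactions can be mined
    """
    # If the chain is ahead of us, jump the tracker forward.
    resync_to = max(next_tracked, onchain_nonce)
    # Any nonce the chain still expects, that we handed out but have
    # nothing pending for, is a gap blocking everything after it.
    fill = [n for n in range(onchain_nonce, resync_to) if n not in pending]
    return {"resync_to": resync_to, "fill": fill}
```

For the example above (on-chain nonce 5, pending nonces 6 and 7), `plan_heal(5, {6, 7}, 8)` reports that nonce 5 needs filling. What the filler transaction contains is up to you; a zero-value transfer to yourself is one common choice, though that is my assumption rather than something stated here.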

So, we have retries and self-healing queues. What’s next?

I’m glad I made you ask because, as we gain traction, we inevitably start to run into hard limits per wallet. These restrictions, such as the per-account pending transaction limits that nodes enforce, prevent a single wallet from sending too many transactions in a given duration.

We need to scale horizontally.

Fleet of wallets

Managing a fleet of wallets means that we are able to exceed the throughput limits imposed by the node’s software for a single wallet.

What does this mean for our customers? Higher burst throughput and more consistent delivery times.

However, in exchange, we now take on additional operational complexity and risk. How much depends on the amount of funds we hold in each wallet compared to the single-wallet approach.

If we opt to keep the same level of funds per wallet, we now have more total crypto funds in circulation. This means more funds are drained in the event of a bug or hack. We also open ourselves to higher currency risk (ETH goes up and down like me on a Six Flags rollercoaster).

If we opt to keep the same total funds in circulation, each wallet will have fewer funds. Our team would need to be concerned with monitoring and topping up wallets frequently.

We also need a way to choose which float wallet to use for a given transaction attempt. Here KISS makes sense: choosing by round-robin or picking the wallet with the most funds suffices in practice. We also need to be diligent about topping up and balancing funds. As expected, both operations cost gas, so the more wallets in use and the less we keep in each of them, the higher the gas costs to maintain the floats.
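Both selection strategies fit in a few lines of Python. This is a sketch with made-up names (`Wallet`, `pick_richest`, `round_robin`), and the balances would come from your own bookkeeping or an RPC call.

```python
from dataclasses import dataclass
from itertools import cycle
from typing import Iterator


@dataclass
class Wallet:
    address: str
    balance_wei: int


def pick_richest(wallets: list[Wallet], min_balance_wei: int = 0) -> Wallet:
    """Pick the wallet with the most funds, skipping any below a floor."""
    eligible = [w for w in wallets if w.balance_wei >= min_balance_wei]
    if not eligible:
        raise RuntimeError("no wallet has enough funds; time to top up")
    return max(eligible, key=lambda w: w.balance_wei)


def round_robin(wallets: list[Wallet]) -> Iterator[Wallet]:
    """Endlessly cycle through the wallets in order."""
    return cycle(wallets)
```

Round-robin spreads nonce pressure evenly, while picking the richest wallet naturally delays top-ups. Either is fine to start with.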

At Paper, all of the management of funds and scaling of wallets is automatically taken care of for you, so you can simply focus on making sales.

At this point, our system is fairly robust. But, there are still cases where no matter how well-scaled we are, we'll still face issues. This brings us to gas wars. When a gas war happens, you’d want to know and perhaps wait it out…

Let’s take a look at some of the strategies Paper has learned and implemented over the past year to navigate gas wars.

Dealing with gas wars

Remember our students bidding for chocolate example?

The chocolate (spot on the blockchain) ultimately goes to the highest bidders.

At a high level, this means a later request can pay more in network fees to be processed before earlier transactions.

This isn't a problem for us most of the time: No one is really incentivized to pay 20% more and have their purchase completed 20 seconds earlier.

But, when a popular item drops, all bets are off.

There are two sources of gas wars:

  • Gas wars caused by ourselves
  • Gas wars caused by others on the blockchain

In the former case, we limit the rate at which we send transactions onto the blockchain. In the latter case, we dynamically adjust the gas price we are willing to pay based on current on-chain conditions.

Rate limiting

If Paper is currently processing enough transactions to cause a spike in gas prices, we, fortunately, have some control over fixing the issue.

The obvious solution is to slow things down.

How slow? Enough for the blockchain to not raise the base price for gas fees too rapidly.

I won’t go too deep into how gas prices fluctuate on EVM chains, but there are many great articles on the topic. The short version for Ethereum: each block targets 15 million gas, with a hard upper limit of 30 million, and the base fee rises whenever a block uses more than the target. When deciding how fast to send transactions, we therefore adjust our rate limit to target 15 million gas used per block. While sending, we watch the total gas included in each block and how far above or below the target it is. We start slowing down when the gas per block exceeds 15 million gas units and approaches Ethereum’s upper limit of 30 million.
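To see why the 15 million target matters, here is the base fee update rule from EIP-1559 in Python, followed by one possible throttling policy. The `next_base_fee` arithmetic follows the EIP. The `send_rate_multiplier` function is purely a hypothetical policy of mine (full speed at or below target, linearly down to zero at the block limit), not a description of what Paper runs.

```python
GAS_TARGET = 15_000_000
GAS_LIMIT = 30_000_000
BASE_FEE_MAX_CHANGE_DENOMINATOR = 8  # caps the change at 12.5% per block


def next_base_fee(base_fee: int, gas_used: int, target: int = GAS_TARGET) -> int:
    """Base fee of the next block given the parent's base fee and gas used."""
    if gas_used == target:
        return base_fee
    if gas_used > target:
        delta = base_fee * (gas_used - target) // target // BASE_FEE_MAX_CHANGE_DENOMINATOR
        return base_fee + max(delta, 1)
    delta = base_fee * (target - gas_used) // target // BASE_FEE_MAX_CHANGE_DENOMINATOR
    return base_fee - delta


def send_rate_multiplier(gas_used: int) -> float:
    """Hypothetical throttle: 1.0 at or below target, 0.0 at the block limit."""
    if gas_used <= GAS_TARGET:
        return 1.0
    over = (gas_used - GAS_TARGET) / (GAS_LIMIT - GAS_TARGET)
    return max(0.0, 1.0 - over)
```

A completely full block raises the base fee by 12.5%, so a run of full blocks compounds quickly. That is the reason to back off well before hitting 30 million.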

As a naive alternative, simply rate-limiting transactions based on time duration is a good enough solution for most use cases.
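A token bucket is one simple way to do time-based rate limiting. The Python sketch below is generic and not tied to any blockchain library; the class name and parameters are my own choices, and the `clock` argument is there so it can be tested without real waiting.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows bursts up to `capacity`, refilling at `rate_per_sec` tokens/second."""

    def __init__(
        self,
        rate_per_sec: float,
        capacity: int,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def try_send(self) -> bool:
        """Consume one token if available; False means 'hold this transaction'."""
        now = self._clock()
        # Refill based on elapsed time, never above capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False
```

Transactions that get a `False` simply stay in the queue and try again on the next tick.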

Automatic gas adjustments

Sometimes there are other popular dApps, games, or protocols causing high amounts of network congestion and high gas fees.

Their volume on-chain might cause our existing transactions to get stuck.

Why? The gas price we submitted earlier might no longer be high enough for the transaction to be accepted into the next block. As such, our transaction might be stuck for a long time, or even time out if the congestion persists.

With a solid system to handle pruning as mentioned earlier, we incorporate dynamic gas adjustments into our system. For each retry after a transaction times out, we compute a more up-to-date gas price. We bump that value a little higher so we have some buffer if gas prices continue to rise. This is a useful practice since gas estimates are a lagging indicator derived from transactions in previous blocks.

Of course, naively paying the full gas price requested from the blockchain is a great way to get sticker shock from an unexpected bill. Paper has reasonable upper limits in place to keep costs manageable, especially since many of our customers offer a gasless experience for buyers by sponsoring gas fees themselves. During these rare gas spike events, our transactions are queued until gas prices fall back to a reasonable range.
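Putting the bump and the ceiling together might look like the Python sketch below. The 20% buffer and the function name are illustrative assumptions, not Paper's actual numbers. The 10% minimum replacement bump reflects Geth's default rule for replacing a pending transaction with the same nonce, as far as I know; other clients and providers may differ.

```python
from typing import Optional


def retry_fee(
    estimate_wei: int,
    previous_fee_wei: int,
    cap_wei: int,
    buffer_pct: int = 20,        # assumption: headroom over the fresh estimate
    min_replace_pct: int = 10,   # assumption: node's minimum replacement bump
) -> Optional[int]:
    """Fee to use for the next retry, or None if it would exceed our cap.

    None means: keep the transaction queued until prices come back down.
    """
    buffered = estimate_wei * (100 + buffer_pct) // 100
    min_replacement = previous_fee_wei * (100 + min_replace_pct) // 100
    fee = max(buffered, min_replacement)
    return fee if fee <= cap_wei else None
```

Returning `None` instead of clamping to the cap is deliberate in this sketch: a clamped fee that is still too low just produces another stuck transaction.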

Conclusion

And that’s it! This list is non-exhaustive but covers a lot of our learnings at Paper from the past year of research, trial-and-error, retrospectives, and brainstorming. Hopefully, this information will help you get your infrastructure up and running to confidently and reliably handle blockchain transactions at real volume.

And if you’re in the business of providing real utility for NFTs, Paper handles all of this so that you can keep doing what you do best: create value and build things people love.

If this sounds interesting, check us out at Paper.xyz. If you're a dev who is excited about what we're building, Paper is hiring passionate engineers!


About Paper

Paper builds NFT on-ramping infrastructure to help developers and brands bridge web2 and web3. Paper enables email-based wallet creation, NFT purchases with credit cards, and gasless transactions for non-crypto-native users. To date, Paper has worked with over 2,500 developers and brands and has processed over 220,000 credit card and cross-chain crypto NFT transactions.