Introduction
Consider four autonomous agents confronting a fundamental cooperation dilemma that characterises multi-agent systems: whether to contribute resources to collective welfare or pursue individual optimisation. This investigation examines such dynamics through the Public Goods Game (PGG), where each of the $N = 4$ agents possesses an initial endowment $e_i$ and determines a contribution $c_i$ to a shared resource pool.
The mechanism operates as follows: contributions are aggregated, multiplied by a factor $r_t$, then redistributed equally amongst all $N$ participants. Whilst collective contribution yields mutual benefit, individual rationality precipitates a concerning equilibrium outcome.
The Analytical Framework: Agent $i$ evaluates the following payoff structure:

$\pi_i = e_i - c_i + \frac{r_t}{N} \sum_{j=1}^{N} c_j$
The initial term, $e_i - c_i$, represents retained private resources. The subsequent term, $\frac{r_t}{N} \sum_{j=1}^{N} c_j$, constitutes the agent's proportional share of the amplified collective pool. The fundamental tension emerges through marginal analysis:

$\frac{\partial \pi_i}{\partial c_i} = \frac{r_t}{N} - 1$
[Figure: Public Goods Game mechanism showing agent contributions, multiplication factor, and redistribution process]
The Free-Riding Equilibrium: Given the constraint $1 < r_t < N$ (ensuring non-trivial strategic interaction), this derivative remains strictly negative. Each marginal unit contributed imposes greater individual cost than benefit received. Rational agents consequently converge towards zero contribution, the Nash equilibrium $c_i^* = 0$ for all $i$.
The Efficiency Paradox: Conversely, with equal endowments $e$, universal full contribution would yield individual payoffs of $r_t\, e$ rather than merely $e$, representing an efficiency multiplier precisely equal to $r_t$. This Pareto optimal outcome demonstrates potential collective prosperity, whilst individual rationality systematically undermines cooperative achievement.
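To make the payoff structure concrete, the following minimal Python sketch computes individual payoffs for the two benchmark profiles discussed above; the endowment of 10 units and the factor $r_t = 2.5$ are illustrative assumptions, not the paper's experimental settings.

```python
import numpy as np

def pgg_payoffs(endowments, contributions, r_t):
    """Linear Public Goods Game: each agent keeps e_i - c_i and
    receives an equal share of the multiplied common pool."""
    endowments = np.asarray(endowments, dtype=float)
    contributions = np.asarray(contributions, dtype=float)
    pool_share = r_t * contributions.sum() / len(endowments)
    return endowments - contributions + pool_share

# Illustrative values: four agents, equal endowments, 1 < r_t < N.
e = np.full(4, 10.0)
r_t = 2.5

print(pgg_payoffs(e, np.zeros(4), r_t))  # Nash equilibrium: every agent keeps 10
print(pgg_payoffs(e, e, r_t))            # full contribution: every agent earns 25 = r_t * 10
```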
Algorithmic Intervention: The investigation proceeds to examine whether systematic biases in reinforcement learning algorithms might facilitate escape from this cooperation trap. Specifically, we investigate whether Q-learning's notorious overestimation bias, conventionally regarded as an algorithmic deficiency, could paradoxically enhance cooperative outcomes, whilst Double Q-learning's bias correction might inadvertently impede cooperation.
We examine Q-learning with the update mechanism:

$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ R_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right],$
where the $\max$ operator induces systematic value overestimation, contrasted with Double Q-learning's debiasing through decoupled action selection and evaluation.
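A minimal tabular sketch of the two update rules contrasted here; the learning rate, discount factor, and array-based state/action indexing are assumptions for illustration rather than the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def q_update(Q, s, a, reward, s_next, alpha=0.1, gamma=0.95):
    """Q-learning: the max operator both selects and evaluates the next
    action, which is the source of systematic overestimation."""
    target = reward + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

def double_q_update(QA, QB, s, a, reward, s_next, alpha=0.1, gamma=0.95):
    """Double Q-learning: one table selects the greedy next action,
    the other evaluates it, decoupling selection from evaluation."""
    if rng.random() < 0.5:
        a_star = np.argmax(QA[s_next])
        QA[s, a] += alpha * (reward + gamma * QB[s_next, a_star] - QA[s, a])
    else:
        a_star = np.argmax(QB[s_next])
        QB[s, a] += alpha * (reward + gamma * QA[s_next, a_star] - QB[s, a])
```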
Research Trajectory: Through systematic empirical analysis, we determine conditions under which algorithmic bias becomes beneficial, examine how resource heterogeneity influences cooperative dynamics, and establish why environmental context supersedes algorithmic sophistication in determining cooperative outcomes.
[Figure: Methodological flowchart showing experimental design and analysis pipeline]
Results
Algorithm Performance Across Incentive Structures
Our empirical analysis reveals a critical threshold effect at MPCR = 0.625, where algorithmic superiority reverses. Under low-incentive conditions (MPCR < 0.625), Double Q-learning consistently outperforms Q-learning in cooperation levels, while high-incentive environments (MPCR > 0.625) favor Q-learning for collective welfare maximization.
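Assuming the standard definition of the marginal per capita return, $\mathrm{MPCR} = r_t / N$, the reported threshold corresponds with $N = 4$ agents to a multiplication factor of $r_t = 2.5$, since $2.5 / 4 = 0.625$; this places the $r_t = 2.0$ and $r_t = 3.5$ settings examined below on opposite sides of the threshold.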
[Figure: Average contribution as percentage of endowment versus multiplication factor for Q-learning and Double Q-learning over the last 1000 episodes]
Mechanistic Explanation: In resource-constrained scenarios, Q-learning's overestimation bias amplifies the perceived value of selfish strategies, leading to systematic free-riding. Double Q-learning's bias correction enables more accurate value estimation and sustained cooperation. Conversely, in high-incentive environments, overestimation bias functions as optimistic exploration, facilitating discovery of highly cooperative strategies.
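The overestimation mechanism itself can be reproduced with a small numerical sketch, independent of the PGG: when several actions share the same true value but their estimates are noisy, the single-estimator maximum is biased upwards, whereas selecting with one set of estimates and evaluating with an independent second set is approximately unbiased. The noise model and sizes below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setting: every action has true value 0, but each value
# estimate is corrupted by independent zero-mean Gaussian noise.
n_actions, n_trials, noise_sd = 10, 10_000, 1.0

single, double = [], []
for _ in range(n_trials):
    est_a = rng.normal(0.0, noise_sd, n_actions)  # first estimator
    est_b = rng.normal(0.0, noise_sd, n_actions)  # independent second estimator
    single.append(est_a.max())                    # max selects AND evaluates
    double.append(est_b[np.argmax(est_a)])        # select with A, evaluate with B

print(f"single-estimator bias: {np.mean(single):+.3f}")  # clearly positive
print(f"double-estimator bias: {np.mean(double):+.3f}")  # approximately zero
```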
Endowment Heterogeneity Effects
Analysis of agent behavior under heterogeneous initial endowments reveals distinct algorithmic characteristics. In high-incentive environments, both algorithms develop endowment-proportional contribution patterns, though with different stability properties.
[Figure: Individual contributions and payoffs over episodes for $r_t = 3.5$, showing endowment stratification and crossover effects]
Q-learning produces persistent wealth hierarchies accompanied by high payoff volatility, while Double Q-learning enables crossover events in which lower-endowed agents can outperform wealthier counterparts. Both algorithms exhibit emergent conditional cooperation through payoff-based feedback mechanisms, resembling indirect reciprocity without explicit communication.
Fairness and Action Space Complexity
Fairness analysis using Shapley value variance demonstrates context-dependent effects of action space granularity. In low-incentive scenarios, Q-learning achieves equity through uniform non-contribution, while Double Q-learning creates temporary imbalances through exploratory behavior.
[Figure: Variance of Shapley values over episodes for Q-learning and Double Q-learning at $r_t = 2.0$]
High-incentive environments reverse this pattern: Double Q-learning maintains lower variance and greater equity, while Q-learning's bias reinforces inequality through progressive advantage accumulation.
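As a sketch of how a Shapley-value-based fairness metric can be computed for four agents, the code below enumerates all orderings exactly; the characteristic function, which values a coalition as the total payoff its members would earn playing the PGG among themselves with their observed contributions, is an assumption for illustration and not necessarily the paper's exact definition.

```python
import numpy as np
from itertools import permutations

def shapley_values(n, value_fn):
    """Exact Shapley values: average each agent's marginal contribution
    over all orderings (tractable for n = 4)."""
    phi = np.zeros(n)
    orders = list(permutations(range(n)))
    for order in orders:
        coalition = []
        for agent in order:
            before = value_fn(coalition)
            coalition = coalition + [agent]
            phi[agent] += value_fn(coalition) - before
    return phi / len(orders)

# Illustrative endowments, contributions, and multiplication factor.
endowments    = np.array([10.0, 10.0, 10.0, 10.0])
contributions = np.array([0.0, 2.0, 5.0, 10.0])
r_t = 2.0

def coalition_value(S):
    """Assumed characteristic function: total payoff of coalition S
    playing the linear PGG among themselves."""
    if not S:
        return 0.0
    pool = r_t * contributions[S].sum() / len(S)
    return float((endowments[S] - contributions[S] + pool).sum())

phi = shapley_values(4, coalition_value)
print("Shapley values:", phi)
print("fairness metric (variance of Shapley values):", phi.var())
```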
Discussion and Conclusions
This research demonstrates that algorithmic bias properties interact systematically with environmental incentive structures, yielding three key insights: First, algorithm effectiveness exhibits threshold-dependent behaviour at MPCR = 0.625, where Double Q-learning sustains cooperation under resource constraints through bias correction, whilst Q-learning achieves superior collective welfare in high-incentive scenarios via optimistic exploration. Second, heterogeneous endowments reveal distinct wealth distribution dynamics—Q-learning reinforces hierarchies with volatility, whereas Double Q-learning enables mobility through crossover events. Third, fairness outcomes depend critically on bias-environment interactions, with action space complexity amplifying these effects.
Theoretical Implications
These findings challenge the conventional view of overestimation bias as purely detrimental. In high-incentive cooperative settings, bias functions as beneficial optimistic exploration, accelerating discovery of mutually beneficial strategies. The emergence of conditional cooperation through payoff-based feedback demonstrates that indirect reciprocity mechanisms can develop without explicit communication, expanding theoretical frameworks for cooperative emergence.
Practical Applications
For algorithm selection: employ Double Q-learning when prioritising stability and equity under resource constraints (environmental governance, public goods provision); utilise Q-learning when maximising collective performance in high-opportunity environments (innovation ecosystems, collaborative optimisation). The context-dependent nature necessitates adaptive approaches considering environmental incentive structures rather than universal preferences.
Future Research
Priority areas include adaptive bias mechanisms that dynamically adjust correction levels, integration of fairness constraints into learning objectives, and extension to continuous action spaces with deep reinforcement learning architectures.