Mastering Crypto Trading with DRL

Navigating the Wild West: Taming Backtest Overfitting in Deep Reinforcement Learning for Crypto Trading

Ever found yourself staring at a chart, mesmerized by the sheer, exhilarating chaos of cryptocurrency markets? You know, that heart-pounding dance between euphoric highs and stomach-churning lows? It’s a landscape infamous for its volatility, presenting both dizzying opportunities and truly formidable challenges for anyone brave enough to trade within it. For years, quants and traders have sought an edge, a consistent signal amidst the noise. And lately, Deep Reinforcement Learning (DRL) has emerged as a genuinely powerful, almost mystical tool in this arena, allowing intelligent agents to learn and adapt trading strategies by interacting directly with these complex market environments. It sounds like magic, doesn’t it? Like you’ve finally found the secret sauce.

But here’s the thing, and it’s a significant ‘but’: a common, often painful, pitfall in developing DRL-based trading systems is something we call backtest overfitting. You build a model, you pour hours into it, it performs spectacularly on historical data – generating hypothetical millions, perhaps – but then, when you unleash it on future market conditions, it just… fizzles. It fails to generalize effectively. It’s like training a marathon runner only on flat tracks, then expecting them to ace a mountain ultra-marathon; the conditions are simply too different, the learning too specific.

Understanding Backtest Overfitting: The Ghost in the Machine

Let’s get down to brass tacks: what exactly is this elusive beast we call backtest overfitting? Simply put, it occurs when a trading model, especially one as sophisticated as a DRL agent, becomes excessively tailored, or ‘memorized’, to the historical data it was trained on. Think of it like a student who crams for an exam by memorizing every single answer from past tests, rather than understanding the underlying concepts. They might ace those specific past tests, but put a new question in front of them, and they’re completely lost.

In the context of financial markets, this means your model isn’t learning the fundamental, enduring market dynamics or true price action patterns. Instead, it’s inadvertently capturing the noise, the random fluctuations, the one-off anomalies, and even the quirks of the specific historical period you used for training. These ephemeral characteristics, which don’t reflect any underlying, repeatable market behavior, get baked into the model’s ‘brain’. This over-optimization leads to models that appear incredibly robust, almost infallible, during the backtesting phase. Your equity curve looks smooth, profits are consistent, and you might start daydreaming about early retirement. But then, when live trading begins, the performance collapses. Your DRL agent, which achieved those stellar returns on past data, simply can’t adapt to new, unseen market scenarios because it hasn’t learned the general rules of the game; it’s only memorized specific plays.

Why is this particularly insidious in crypto? Well, unlike traditional markets with decades of relatively stable data, crypto is still young and exceptionally non-stationary. What worked yesterday might not work today, and what works today may well not work tomorrow. The market structure shifts, regulatory landscapes evolve, and entirely new assets emerge, sometimes overnight. Training a DRL model in this volatile, ever-changing environment without proper precautions is like trying to hit a moving target while blindfolded. It’s incredibly challenging to build a robust DRL system that can truly thrive in this dynamic ecosystem without falling prey to the overfitting trap. So, how do we tackle this?

Strategies to Mitigate Overfitting in DRL Models: Building Resilience

Alright, so we’ve identified the problem. Now for the solutions. To significantly enhance the reliability and generalization capability of your DRL models in the high-stakes world of cryptocurrency trading, you’ll want to implement a multi-faceted approach. Think of it as building a strong, adaptable foundation rather than a fragile house of cards. Here are some key strategies:

1. The Art of Regularization: Preventing Memorization

Regularization techniques are your first line of defense against a model that’s trying too hard to remember everything. These methods essentially add a penalty to the model’s complexity during training, discouraging it from becoming overly reliant on specific data points and forcing it to learn more generalizable patterns. It’s like telling your model, ‘Hey, don’t just memorize; understand!’

  • L1 and L2 Regularization (Weight Decay): These are often applied to the model’s weights. L1 regularization (Lasso) encourages sparsity by pushing some weights exactly to zero, effectively performing feature selection – pruning away less important connections. L2 regularization (Ridge) penalizes large weights, pushing them closer to zero but rarely to absolute zero. This generally leads to smaller, more spread-out weights, which means no single input feature has an excessively strong influence, making the model less sensitive to noise in individual data points. Imagine trying to balance a scale; L2 regularization ensures you’re not putting all your weight on one side.
  • Dropout: This is a fantastic technique specifically for neural networks. During training, dropout randomly ‘turns off’ a certain percentage of neurons (and their connections) in each layer for each training step. This forces the remaining neurons to learn more robust features because they can’t rely on the presence of any single other neuron. It’s like training several ‘mini-models’ within one, and then averaging their predictions. When the full network is used for inference, all neurons are active, but their weights are scaled down to account for the dropout during training. This prevents complex co-adaptations between neurons, which can lead to memorization of training data quirks. It really enhances a model’s ability to generalize by preventing over-specialization.
  • Batch Normalization: While not strictly a regularization technique in the same vein as L1/L2 or Dropout, Batch Normalization layers, inserted between hidden layers, normalize the inputs to each layer. This helps stabilize and speed up the training process and can also have a subtle regularizing effect. It reduces the internal covariate shift, meaning the distribution of inputs to a layer remains more consistent, making the model less sensitive to the initial parameters or slight changes in subsequent layers.

Implementing these techniques is often a matter of adding a few lines of code to your DRL framework, but their impact on model robustness can be profound. They compel the model to find broader statistical patterns, rather than getting bogged down in the minute details of your training set.
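
To make that concrete, here is a minimal PyTorch sketch of a small policy network that combines Dropout, Batch Normalization, and L2 weight decay. The layer sizes, dropout rate, and penalty strengths are purely illustrative and would need tuning for your own agent.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Small feed-forward policy network with Dropout and BatchNorm.

    Layer sizes, dropout rate, and regularization strengths are illustrative.
    """
    def __init__(self, n_features: int, n_actions: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(n_features, 64),
            nn.BatchNorm1d(64),   # keeps layer inputs on a stable scale
            nn.ReLU(),
            nn.Dropout(p=0.2),    # randomly silences 20% of units per training step
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Dropout(p=0.2),
            nn.Linear(64, n_actions),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.body(x)

model = PolicyNet(n_features=32, n_actions=3)  # e.g. buy / hold / sell

# L2 regularization (weight decay) is applied through the optimizer.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4)

# An L1 penalty can be added to the loss by hand if you want sparsity:
# loss = task_loss + 1e-5 * sum(p.abs().sum() for p in model.parameters())
```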

2. Expanding and Diversifying Training Data: A Richer Diet for Your Model

Just like a human needs a balanced diet, your DRL model thrives on a rich, varied diet of data. Relying solely on a narrow slice of history, say, a bull market, is a recipe for disaster when conditions inevitably shift. You’re simply not giving your model enough exposure to the real world.

  • Broader Range of Historical Data: Start by collecting historical price data that spans various market cycles. This means including periods of intense volatility (like the 2021 bull run or the 2022 bear market), periods of consolidation, sideways markets, and even flash crashes. The more diverse the market conditions your model sees, the better equipped it will be to handle novel situations. Don’t just grab a year’s worth of 1-minute data; grab five years, or even ten, if available and relevant.
  • Diverse Data Sources: Beyond just candlestick data (Open, High, Low, Close, Volume), integrating diverse data sources provides a much more comprehensive understanding of market dynamics. Consider:
    • Order Book Information: Level 2 data, showing bids and asks at various price levels, gives insights into immediate supply and demand pressures, liquidity, and potential market manipulation.
    • Financial News & Sentiment Data: News events, social media trends, and overall market sentiment can significantly impact crypto prices. Natural Language Processing (NLP) techniques can extract sentiment scores from news headlines, Twitter feeds, or Reddit discussions, providing powerful non-numerical features.
    • On-Chain Data: For cryptocurrencies, blockchain data offers unique insights – transaction volumes, active addresses, mining difficulty, exchange inflows/outflows. These fundamental indicators can often paint a very different picture than just price action alone.
    • Macroeconomic Indicators: While crypto is often seen as decoupled, broader economic trends, interest rate changes, or even global political events can exert influence.

By feeding your model this richer, multi-dimensional dataset, you’re giving it more context, more ‘features’ to learn from, reducing its reliance on simple price patterns that might just be noise. It helps the model understand the broader ecosystem within which prices move, rather than just the prices themselves. It’s like watching a movie with sound and subtitles instead of just observing the silent screen; you get so much more context.
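
As a rough illustration of how these sources come together, here is a pandas sketch that aligns candles, sentiment scores, and on-chain metrics on a common timestamp index. The file names and column layouts are assumptions, stand-ins for whatever your own pipeline produces.

```python
import pandas as pd

# Hypothetical frames, each indexed by a UTC timestamp:
#   candles   - OHLCV bars from an exchange API
#   sentiment - aggregated news/social sentiment scores
#   onchain   - daily on-chain metrics (active addresses, exchange inflows, ...)
candles = pd.read_parquet("btc_1h_ohlcv.parquet")       # assumed file
sentiment = pd.read_parquet("btc_sentiment.parquet")    # assumed file
onchain = pd.read_parquet("btc_onchain_daily.parquet")  # assumed file

features = (
    candles
    .join(sentiment, how="left")
    # on-chain data is daily, so forward-fill it onto the hourly index
    .join(onchain.reindex(candles.index, method="ffill"), how="left")
    .ffill()
    .dropna()
)
```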

3. Enhanced Data Splitting and Validation: The Unforgiving Judge

This is perhaps one of the most critical aspects of preventing backtest overfitting, especially with time-series data like financial markets. You can’t just randomly shuffle your data and split it into training, validation, and test sets. Why? Because observations are temporally dependent. If you shuffle before splitting, your training set will contain data points that come after the ones in your test set, so the model has implicitly ‘seen’ the future it is being evaluated on – an unrealistically easy scenario that never occurs in live trading.

  • Walk-Forward Validation: This is the gold standard for financial backtesting. Instead of a single train/test split, you perform a series of backtests. You train your model on an initial historical period (e.g., January 2018-December 2019), then test it on the next consecutive period (e.g., January 2020-March 2020). After that, you ‘walk forward’: you retrain the model by extending the training period to include the data you just tested on (e.g., January 2018-March 2020), and then test it on the next consecutive slice (e.g., April 2020-June 2020). This process repeats, mimicking how you’d truly train and deploy a model in the real world. This approach provides a much more realistic assessment of your model’s performance across different market conditions and allows it to adapt incrementally.
  • K-Fold Cross-Validation (with time-series considerations): Standard K-fold isn’t directly applicable to time series without modification, but time-series-aware variants work well. For instance, you can use blocked or expanding-window cross-validation (the approach behind scikit-learn’s TimeSeriesSplit), where folds preserve temporal order and you always train on data that precedes the validation fold. This still evaluates your model on multiple, distinct time segments, providing a more robust error estimate than a single train/test split.
  • Purging and Embargoing: When you’re dealing with features that are derived from future data (e.g., target labels based on future returns), you need to be extremely careful to prevent data leakage. Purging involves removing training observations that overlap with the testing period (e.g., if a 3-day return is a target, you need to ensure the training data doesn’t contain any overlapping 3-day periods with your test set). Embargoing takes this a step further by removing a small buffer period after the testing set to ensure no information leaks from the test set back into the training data through future-looking features. These are more advanced techniques but absolutely crucial for rigorous backtesting in quantitative finance.

These methods are demanding on computational resources, but they’re indispensable. They enforce a strict separation between what the model has ‘seen’ and what it’s being ‘tested’ on, dramatically reducing the risk of a false sense of security derived from an overfit backtest.
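
Here is one way a simple expanding-window walk-forward splitter might look in Python. The window lengths and the purge gap are illustrative, and a production setup would add proper embargoing as described above; the `train_on` and `evaluate_on` helpers in the usage note are hypothetical.

```python
import pandas as pd

def walk_forward_splits(index: pd.DatetimeIndex,
                        train_years: int = 2,
                        test_months: int = 3,
                        purge_days: int = 1):
    """Yield (train_index, test_index) pairs for an expanding-window walk-forward.

    The purge gap drops a small buffer of observations just before each test
    slice to limit leakage from features that look ahead in time. All window
    lengths here are illustrative.
    """
    train_end = index.min() + pd.DateOffset(years=train_years)
    while True:
        test_end = train_end + pd.DateOffset(months=test_months)
        if test_end > index.max():
            break
        train_mask = index < (train_end - pd.Timedelta(days=purge_days))
        test_mask = (index >= train_end) & (index < test_end)
        yield index[train_mask], index[test_mask]
        train_end = test_end  # expand the training window and walk forward

# Usage sketch (hypothetical helpers):
# for train_idx, test_idx in walk_forward_splits(features.index):
#     agent = train_on(features.loc[train_idx])
#     evaluate_on(agent, features.loc[test_idx])
```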

4. Simplifying Model Complexity: Less Can Be More

It’s a common misconception that more complex models are always better. Sometimes, they’re just more complex. When it comes to DRL, especially for trading, opting for simpler model architectures with fewer parameters can significantly decrease the likelihood of overfitting. This aligns with the principle of Occam’s Razor: ‘the simplest explanation is usually the best one.’

  • Bias-Variance Tradeoff: This is the core concept here. A very simple model (high bias) might underfit; it’s too basic to capture the underlying patterns. A very complex model (high variance) might overfit; it captures too much noise and doesn’t generalize. The goal is to find the ‘sweet spot’. For DRL agents, this might mean using fewer layers in your neural network, fewer neurons per layer, or simpler activation functions. You’re reducing the model’s ‘capacity’ to memorize.
  • Feature Engineering: Sometimes, instead of making your model more complex to interpret raw data, you can make the data more informative. Smart feature engineering – creating relevant technical indicators (RSI, MACD, Bollinger Bands), volatility metrics, or custom fundamental ratios – can provide your simpler DRL agent with clearer signals, reducing the need for it to ‘discover’ these complex relationships on its own. A well-engineered feature set can dramatically improve the performance of a simpler model, often outperforming a complex model on raw data.

Simplified models are less prone to capturing the inherent noise in financial data and are therefore more likely to generalize well to new, unseen market data. They tend to be more interpretable too, which is a massive bonus when you’re trying to understand why your bot is making certain decisions.
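
To see the capacity argument in numbers, here is a quick sketch comparing the parameter counts of a compact network fed a handful of engineered features against an over-parameterized one fed raw, high-dimensional input. The sizes are arbitrary, chosen only to make the contrast obvious.

```python
import torch.nn as nn

def n_params(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters())

# Compact network fed a small set of well-engineered features.
compact = nn.Sequential(nn.Linear(16, 32), nn.ReLU(),
                        nn.Linear(32, 32), nn.ReLU(),
                        nn.Linear(32, 3))

# Over-parameterized network fed raw, high-dimensional input.
large = nn.Sequential(nn.Linear(256, 512), nn.ReLU(),
                      nn.Linear(512, 512), nn.ReLU(),
                      nn.Linear(512, 512), nn.ReLU(),
                      nn.Linear(512, 3))

print(n_params(compact), n_params(large))  # roughly 1.7k vs. 660k parameters
```

The smaller network simply has far less room to memorize noise, which is exactly the point.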

5. Advanced Hyperparameter Tuning & Early Stopping: The Fine Adjustments

Hyperparameters are the settings that control the learning process itself, not parameters learned by the model. Things like learning rate, number of layers, batch size, or the specific DRL algorithm’s parameters (e.g., discount factor, entropy coefficient). Getting these right is crucial, and advanced tuning methods can help you find optimal settings without overfitting.

  • Grid Search & Random Search: Grid search exhaustively tries every combination of specified hyperparameter values. It’s thorough but can be computationally expensive. Random search, on the other hand, samples values randomly from specified distributions. Surprisingly, random search often finds better models faster than grid search, especially in high-dimensional hyperparameter spaces, as it explores more diverse combinations.
  • Bayesian Optimization: This is a more sophisticated approach. It builds a probabilistic model of the objective function (e.g., validation performance) and uses it to select the next promising set of hyperparameters to evaluate. It intelligently balances exploration (trying new, uncertain regions) and exploitation (refining promising regions), often finding good hyperparameters with significantly fewer evaluations than grid or random search. Libraries like Optuna or Hyperopt are fantastic for this.
  • Genetic Algorithms: Inspired by biological evolution, these algorithms treat hyperparameter sets as ‘individuals’ in a population that evolve over generations, with fitter individuals (those with better performance) being selected and mutated. They can explore complex hyperparameter landscapes very effectively.
  • Early Stopping: This is a critical technique during the training process itself. You continuously monitor the model’s performance on a separate validation set (crucially, not the training set). As training progresses, the model’s performance on the training data will typically keep improving. However, at some point, its performance on the validation set will stop improving and might even start to decline. This signals that the model is beginning to overfit to the training data. Early stopping simply halts the training process at this optimal point, preventing the model from learning the noise. It’s like knowing when to stop adding spices to a dish – too much, and you ruin it. This simple but powerful method saves computational resources and, more importantly, prevents the model from over-optimizing to specific training patterns.
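
Putting these tuning ideas together, here is a hedged sketch using Optuna’s default (TPE-based) sampler. The search ranges are illustrative, and `train_agent` is a hypothetical helper that trains your DRL agent with early stopping on a validation slice and returns its validation Sharpe ratio.

```python
import optuna

def objective(trial: optuna.Trial) -> float:
    # Search spaces are illustrative; adapt them to your DRL algorithm.
    params = {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True),
        "gamma": trial.suggest_float("gamma", 0.9, 0.999),
        "hidden_size": trial.suggest_categorical("hidden_size", [32, 64, 128]),
        "entropy_coef": trial.suggest_float("entropy_coef", 1e-4, 1e-1, log=True),
    }
    # train_agent is a hypothetical helper: it trains with early stopping on a
    # validation slice and returns the validation Sharpe ratio.
    return train_agent(params, early_stopping_patience=10)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```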

6. Testing on Unseen Data: The Ultimate Reality Check

After all the training, validation, and tuning, the moment of truth arrives. You simply must evaluate your models on completely new data, data that was never, ever touched during any part of the training or hyperparameter tuning process. This is your truly independent ‘out-of-sample’ test.

  • Dedicated Test Set: This is non-negotiable. Allocate a portion of your most recent historical data solely for final evaluation. This set should strictly mimic future live conditions – no peeking, no adjustments based on its performance. If your model performs well here, you’ve got a much stronger signal of real-world viability.
  • Simulating Trading Scenarios: Beyond just backtesting on static historical data, consider more dynamic simulation environments. Market replay data, where you feed your model data tick-by-tick as if it were happening live, can help in understanding how the model would perform under different real-time market conditions, including slippage, order execution delays, and liquidity constraints. This is a step closer to real trading than just static backtesting.
  • Paper Trading/Live Testing: Before deploying any capital, run your DRL agent in a ‘paper trading’ environment provided by many exchanges or brokers. This is a live simulation using real-time market data but with virtual money. It’s the ultimate proving ground. This allows you to observe its behavior, identify any unforeseen bugs, and fine-tune its risk parameters without any financial risk. This phase often reveals subtle issues that even the most rigorous backtests can miss, like latency issues or unexpected order rejections.

Remember, a great backtest is just a story. The real story begins when your model faces the music of live, unseen data. If it struggles there, you know you’ve got more work to do on the overfitting front.
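
As a sketch of what that out-of-sample replay might look like, here is a bare-bones bar-by-bar evaluation loop with flat fee and slippage assumptions. The `agent.act` interface, the 60-bar lookback, and the cost figures are all stand-ins for your own setup.

```python
import numpy as np

def replay_holdout(prices: np.ndarray, agent,
                   fee: float = 0.001, slippage: float = 0.0005) -> np.ndarray:
    """Replay a held-out price series bar by bar and return the equity curve.

    `agent.act(window)` is a hypothetical interface returning a target position
    in {-1, 0, +1}; the lookback length and cost rates are illustrative.
    """
    position, equity, curve = 0.0, 1.0, []
    for t in range(60, len(prices) - 1):
        target = agent.act(prices[t - 60:t])              # decide on past data only
        cost = (fee + slippage) * abs(target - position)  # pay costs on changes
        position = target
        ret = prices[t + 1] / prices[t] - 1.0             # next-bar return
        equity *= 1.0 + position * ret - cost
        curve.append(equity)
    return np.array(curve)
```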

7. Adopting Ensemble Learning: The Power of Collaboration

Ensemble learning is a sophisticated approach where you combine predictions from multiple individual models to produce a single, more robust, and often more accurate prediction. The magic here lies in the ‘wisdom of crowds’ principle: individual models might have their own biases or weaknesses, but when their diverse perspectives are aggregated, the errors tend to cancel out, leading to a more reliable overall outcome. It’s like assembling a team of expert traders, each with their own unique style, to make a collective decision.

  • Bagging (e.g., Random Forest): Bagging (Bootstrap Aggregating) involves training multiple instances of the same type of model on different random subsets of the training data (with replacement, meaning some data points can be sampled multiple times). For classification, you might take a vote; for regression (like predicting price movements), you’d average their outputs. Random Forests are a classic example: they build an ensemble of decision trees, each trained on a different bootstrapped sample of data and only considering a random subset of features at each split. This diversity helps reduce variance and overfitting.
  • Boosting (e.g., AdaBoost, Gradient Boosting, XGBoost, LightGBM): Boosting takes a sequential approach. It trains weak learners (often simple decision trees) iteratively, with each subsequent learner focusing on the errors made by the previous ones. The ‘weights’ of misclassified or poorly predicted samples are increased, making them more important for the next learner. This creates a strong learner from a collection of weak ones. XGBoost and LightGBM are incredibly popular and powerful boosting algorithms known for their speed and accuracy in various machine learning tasks.
  • Stacking (Stacked Generalization): This is a more advanced ensemble technique. Here, you train multiple diverse models (called ‘base models’ or ‘level-0 models’) on the training data. Then, you train a ‘meta-model’ (or ‘level-1 model’) on the predictions of these base models. The meta-model learns how to best combine the outputs of the base models. For instance, your base models might be a DRL agent, a linear regression, and a support vector machine, and then a meta-model (perhaps a simple neural network) learns to weight their predictions based on the validation set. Stacking often yields impressive performance gains because it leverages the strengths of diverse model types.

Ensemble methods are particularly effective in reducing overfitting because they average out the noise and errors that individual models might pick up. If one model overfits to a specific anomaly, its effect is diluted when combined with others that didn’t. This leads to more robust, reliable, and often more accurate trading strategies, making them a superb choice for DRL in volatile crypto markets.
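
A minimal version of this for discrete buy/hold/sell agents is a simple majority vote. The `agents` list and their `act` method are assumptions about your own models’ interface.

```python
from collections import Counter

def ensemble_action(agents, state) -> int:
    """Majority vote over discrete actions (-1 = sell, 0 = hold, +1 = buy).

    `agents` is a list of independently trained models exposing a hypothetical
    `act(state)` method; without a strict majority we fall back to holding.
    """
    votes = Counter(agent.act(state) for agent in agents)
    action, count = votes.most_common(1)[0]
    return action if count > len(agents) // 2 else 0
```

Averaging continuous position targets works the same way; the key is that the individual agents are trained independently enough to make different mistakes.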

Implementing DRL in Cryptocurrency Trading: A Practical Blueprint

Bringing a DRL agent to life for cryptocurrency trading is a multi-step journey, requiring meticulous attention at each phase. It’s not just about coding; it’s about thoughtful design, rigorous testing, and a deep understanding of market dynamics.

1. Data Collection: The Foundation of Intelligence

This isn’t just about grabbing some CSVs. It’s about laying a solid, clean foundation for your DRL agent to learn from. You need comprehensive historical price data for the cryptocurrencies you’re interested in, ensuring it spans those various market conditions we talked about earlier.

  • Sources & Granularity: Access data via exchange APIs (Binance, Coinbase Pro, Kraken, etc.) for high-frequency data, or historical data providers like CryptoCompare or Kaiko for aggregated data. Decide on your desired granularity: tick data for high-frequency, 1-minute, 5-minute, or hourly for swing trading, or daily for longer-term strategies. The choice impacts your action frequency and computational load.
  • Data Cleaning & Preprocessing: Raw data is messy. You’ll encounter missing values, outliers, corrupted data points, and inconsistencies. Implement robust cleaning pipelines: interpolation for small gaps, outlier detection and handling (e.g., winsorization), and ensuring proper time zone and timestamp alignment. Then, normalize or standardize your features; neural networks perform much better when inputs are scaled.
  • Feature Engineering for DRL: DRL agents learn from ‘states.’ What constitutes a ‘state’ is crucial. Beyond raw price, engineer features that encapsulate market information:
    • Technical Indicators: RSI, MACD, Bollinger Bands, Moving Averages, Stochastic Oscillator – these provide condensed information about price momentum, volatility, and trend. Your agent can learn to interpret these signals.
    • Volume Metrics: Volume spikes, volume-weighted average price (VWAP), and order flow imbalances offer insights into market activity and conviction.
    • Lagged Features: Include past prices, returns, or indicator values to provide the agent with a sense of market history within its current state.
    • Volatility Measures: Historical volatility, implied volatility (if options data is available), or average true range (ATR) can help the agent understand market risk.

Thoughtful feature engineering can drastically improve your agent’s learning efficiency and performance, giving it a clearer picture of the market it needs to navigate.
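
Here is a small pandas sketch of the kind of state features discussed above: log returns, rolling volatility, a 14-period RSI, and a volume z-score. The window lengths are illustrative choices, and the frame is assumed to carry ‘close’ and ‘volume’ columns.

```python
import numpy as np
import pandas as pd

def build_state_features(ohlcv: pd.DataFrame) -> pd.DataFrame:
    """Derive a handful of state features from a frame with 'close' and 'volume' columns."""
    out = pd.DataFrame(index=ohlcv.index)
    close, volume = ohlcv["close"], ohlcv["volume"]

    out["log_return"] = np.log(close / close.shift(1))
    out["volatility_24"] = out["log_return"].rolling(24).std()  # rolling realized vol

    # 14-period RSI (simple moving-average variant)
    delta = close.diff()
    avg_gain = delta.clip(lower=0).rolling(14).mean()
    avg_loss = (-delta.clip(upper=0)).rolling(14).mean()
    out["rsi_14"] = 100 - 100 / (1 + avg_gain / avg_loss)

    # How unusual is current volume relative to the last 24 bars?
    out["volume_z"] = (volume - volume.rolling(24).mean()) / volume.rolling(24).std()

    return out.dropna()
```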

2. Model Architecture Design: The Brain of Your Bot

Developing a neural network architecture suitable for DRL isn’t a one-size-fits-all problem. You need to consider the complexity of the market you’re targeting, the specific trading objectives (e.g., maximizing profit, minimizing drawdown, or maximizing Sharpe ratio), and the nature of your data.

  • Choosing the Right DRL Algorithm:
    • Q-learning/DQN (Deep Q-Networks): Good for discrete action spaces (buy, sell, hold). Simple, but can struggle with continuous action spaces or complex state spaces. Useful for understanding basic agent behavior.
    • Policy Gradient Methods (e.g., REINFORCE): Directly optimize a policy network, allowing for continuous action spaces. Can be unstable.
    • Actor-Critic Methods (e.g., A2C, A3C, PPO, SAC): These combine policy-based and value-based methods. An ‘actor’ learns the policy (what action to take), and a ‘critic’ learns the value function (how good that action is). They often offer a good balance of stability and performance. Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC) are particularly popular and robust choices for continuous control tasks, which often mirror financial trading actions (e.g., ‘buy X amount’).
  • State Representation: How will you present the market information to your DRL agent? This often involves a vector of your engineered features. For more complex inputs like order book snapshots or raw price series, you might consider Recurrent Neural Networks (RNNs) like LSTMs or GRUs, which excel at processing sequential data, or Convolutional Neural Networks (CNNs) if you treat market data as an ‘image’ or grid.
  • Action Space: Define what actions your agent can take. Is it discrete (buy, sell, hold, close position)? Or continuous (what percentage of portfolio to allocate, how many units to buy/sell)? The choice significantly impacts the DRL algorithm you’ll use.
  • Reward Function Design: This is arguably the most important part of DRL for trading. The reward function is how you tell your agent what ‘success’ looks like. A simple reward could be the change in portfolio value. However, you might want to penalize drawdowns, reward risk-adjusted returns (like Sharpe ratio), or encourage consistent positive returns over extreme swings. A poorly designed reward function can lead to an agent that learns to game the system in unexpected, undesirable ways.

This phase requires experimentation. There’s no single ‘best’ architecture; it’s about iterative refinement based on your specific trading goals and data characteristics.
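
As one concrete illustration of the reward-shaping point above, here is a per-step reward that mixes the log portfolio return with a drawdown penalty. The penalty weight is arbitrary and exactly the kind of knob you would iterate on.

```python
import math

def shaped_reward(portfolio_value: float,
                  prev_value: float,
                  peak_value: float,
                  drawdown_weight: float = 0.5) -> float:
    """Per-step reward: log return minus a penalty proportional to current drawdown.

    The 0.5 weight is illustrative; raising it pushes the agent toward smoother
    equity curves at the expense of raw profit.
    """
    step_return = math.log(portfolio_value / prev_value)
    drawdown = max(0.0, (peak_value - portfolio_value) / peak_value)
    return step_return - drawdown_weight * drawdown
```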

3. Training and Validation: The Crucible of Learning

With your data prepared and your architecture designed, it’s time to train your DRL agent. This is where you implement all those crucial regularization and validation techniques we just discussed to mitigate overfitting. Without these, you’re essentially building on quicksand.

  • Environment Design: Your DRL agent needs an environment to interact with. This is typically a custom simulation environment that takes your historical data, processes agent actions (buy, sell), updates the portfolio, calculates rewards, and transitions to the next state. It must accurately reflect market mechanics, including fees, slippage, and minimum order sizes.
  • Training Loop: Implement a robust training loop. This involves running episodes (simulated trading sessions), collecting experiences (state, action, reward, next state), and using these experiences to update the agent’s neural networks. Monitor training progress diligently – not just training loss, but also validation performance (on unseen data slices from your walk-forward setup), and key trading metrics like cumulative return, drawdown, and Sharpe ratio.
  • Hyperparameter Iteration: This is where you continuously iterate on your DRL algorithm’s hyperparameters. It’s often more art than science initially. Small changes in learning rate, discount factor, or network size can have massive impacts. Use the advanced tuning methods discussed earlier to systemize this process.
  • Robustness Checks: Don’t just look at the final equity curve. Analyze individual trades, periods of underperformance, and sudden losses. Did the agent act rationally? Was there a specific market condition that tripped it up? The goal isn’t just high returns, but understandable and repeatable returns.

This phase is iterative. You’ll likely go back and forth between design, training, and analysis many times, refining your environment, reward function, and architecture as you uncover issues.
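
To make the environment-design point concrete, here is a stripped-down simulation skeleton in the spirit of the usual reset/step interface. The discrete action set, flat fee, and reward are simplifying assumptions; a realistic version would add slippage, minimum order sizes, position limits, and richer observations.

```python
import numpy as np

class TradingEnv:
    """Minimal trading simulation over precomputed feature and price arrays."""

    ACTIONS = (-1, 0, 1)  # short / flat / long (illustrative discrete action set)

    def __init__(self, features: np.ndarray, prices: np.ndarray, fee: float = 0.001):
        self.features, self.prices, self.fee = features, prices, fee

    def reset(self):
        self.t, self.position, self.equity = 0, 0, 1.0
        return self.features[self.t]

    def step(self, action_idx: int):
        target = self.ACTIONS[action_idx]
        cost = self.fee * abs(target - self.position)      # trading cost on changes
        self.position = target
        ret = self.prices[self.t + 1] / self.prices[self.t] - 1.0
        self.equity *= 1.0 + self.position * ret - cost
        reward = np.log(1.0 + self.position * ret - cost)  # per-step log return
        self.t += 1
        done = self.t >= len(self.prices) - 1
        return self.features[self.t], reward, done, {"equity": self.equity}
```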

4. Risk Management: The Safety Net

Even the most sophisticated DRL agent is not immune to unexpected market movements. Incorporating robust risk management strategies isn’t an afterthought; it’s a foundational pillar of any successful trading system. Protecting your capital is paramount.

  • Stop-Loss Orders: The simplest yet most effective defense. Set predefined exit points for trades if the market moves against your position. DRL agents can be trained to respect these, or you can hard-code them into your execution layer. Knowing your maximum loss per trade before you enter is invaluable.
  • Position Sizing: Don’t bet the farm on a single trade. Determine the appropriate size for each position based on your total capital and risk tolerance. This often involves strategies like Kelly Criterion (though often too aggressive for real-world use) or fixed fractional sizing (e.g., risking no more than 1-2% of capital per trade).
  • Portfolio Diversification: Don’t put all your eggs in one crypto basket. Diversify across different assets, or even different trading strategies if you have multiple. If one asset tanks, others might hold steady or even rise.
  • Maximum Drawdown Limits: Define the maximum percentage loss your portfolio can endure before you stop trading, or reduce exposure significantly. An automated system should have circuit breakers.
  • Capital Preservation Rules: Implement rules that ensure you maintain a minimum capital level. If your portfolio shrinks below a certain threshold, the system should automatically scale down position sizes or even pause trading until capital recovers.
  • Black Swan Event Planning: While difficult to predict, consider how your system would react to extreme, rare events. Could you implement a ‘manual override’ or a pause button for such scenarios?

Ignoring risk management is like building a Ferrari without brakes. It might be fast, but it’s destined for a spectacular, and potentially catastrophic, crash. A well-designed DRL trading system integrates these risk controls directly into its decision-making process or at an overarching strategic layer.
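
A minimal sketch of such an overarching risk layer, assuming fixed-fractional position sizing and a hard drawdown circuit breaker (both thresholds are illustrative):

```python
def risk_adjusted_size(equity: float,
                       peak_equity: float,
                       stop_distance: float,
                       risk_per_trade: float = 0.01,
                       max_drawdown: float = 0.20) -> float:
    """Position size in asset units, or 0.0 when the circuit breaker has tripped.

    Illustrative thresholds: risk 1% of equity per trade, and halt trading once
    the portfolio is 20% below its high-water mark.
    """
    drawdown = (peak_equity - equity) / peak_equity
    if drawdown >= max_drawdown:                      # circuit breaker
        return 0.0
    # Fixed-fractional sizing: capital at risk divided by distance to the stop.
    return (equity * risk_per_trade) / stop_distance
```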

The Road Ahead: Confidence in an Uncertain World

By thoughtfully addressing backtest overfitting – that silent killer of trading dreams – and diligently implementing these robust strategies, you can develop DRL models for cryptocurrency trading that are not just theoretically sound but genuinely resilient and reliable. This comprehensive approach doesn’t just enhance your model’s performance; it builds a crucial layer of confidence, allowing you to deploy your DRL-based trading strategies in real-world market environments with a much greater sense of security.

The crypto market will always be the wild west, full of unpredictable twists and turns. But with a well-trained, robust DRL agent by your side, you’re no longer just a cowboy riding blindly into the sunset. You’re an explorer, equipped with the best tools, ready to navigate the landscape, making informed decisions, and hopefully, finding some treasure along the way. It’s a challenging journey, but the rewards for those who master it can be significant. So, go forth, build, test, and conquer, but always, always remember the lessons of overfitting.

