Both academia and the financial industry suffer from an extremely high false discovery rate for trading strategies that “produce” alpha. In fact, most of these strategies are false discoveries, owing to research bias, multiple testing, and the very low true probability (<< 1%) of finding a new profitable strategy in a competitive market.
As Marcos Lopez de Prado points out, suppose the true probability that a backtested strategy is profitable is 1% and the test has 80% power (the rate at which true strategies are identified). Testing 1000 trading strategies at the standard 5% significance level then implies that at least 86% of the discoveries are false!
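The arithmetic behind that figure can be reproduced in a few lines. This is only a minimal sketch: the function name `false_discovery_rate` and its argument names are my own and do not come from the lecture.

```python
def false_discovery_rate(n_tests, prior, power, alpha):
    """Expected share of 'significant' results that are false positives."""
    n_true = n_tests * prior            # strategies that genuinely work
    n_false = n_tests - n_true          # strategies with no real edge
    true_positives = n_true * power     # real strategies the test detects
    false_positives = n_false * alpha   # useless strategies that pass by chance
    return false_positives / (false_positives + true_positives)


fdr = false_discovery_rate(n_tests=1000, prior=0.01, power=0.8, alpha=0.05)
print(f"False discovery rate: {fdr:.1%}")  # 49.5 / 57.5, i.e. 86.1%
```

Of the 1000 strategies only 10 are real and 8 of those are detected, while 49.5 of the 990 worthless ones pass the 5% threshold by chance. A lower prior than 1% pushes the rate even higher.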
Today we investigate multiple testing and the false discovery of profitable trading strategies. We develop a momentum-based trading strategy on Apple stock and show the problems that arise from unknowingly running multiple tests on the same dataset.
Papers discussed:
Evaluating Trading Strategies: https://www.stat.berkeley.edu/~aldous/157/Papers/harvey.pdf
The Pitfalls of Econometric Analysis (Marcos Lopez de Prado): https://www.quantresearch.org/Lectures.htm
Scientific method: Statistical errors: https://www.nature.com/articles/506150a
Moving to a World Beyond “p<0.05”: https://www.tandfonline.com/doi/pdf/10.1080/00031305.2019.1583913?
Why your trading strategy doesn’t work
The Perils of Multiple Testing – p-hacking during backtesting.
Here we use the example of a classic Simple Moving Average Crossover strategy, using Backtrader in Python. https://www.backtrader.com/home/helloalgotrading/
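import datetime
import time
import math

import numpy as np
import pandas as pd
import scipy as sc
import scipy.stats  # load the submodule explicitly so that sc.stats is available
import matplotlib.pyplot as plt
from pandas_datareader import data as pdr
import backtrader as bt
import quantstats
import concurrent.futures as cf
from itertools import product

# Jupyter magics: render plots inline (the widget backend is the interactive alternative)
# %matplotlib widget
%matplotlib inline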
# Import data
def get_data(stocks, start, end):
    stockData = pdr.get_data_yahoo(stocks, start, end)
    return stockData

stockList = ['AAPL']
endDate = datetime.datetime.now()
startDate = endDate - datetime.timedelta(days=2000)

stockData = get_data(stockList[0], startDate, endDate)
stockData = stockData.sort_values(by="Date")

# Chronological split: first 75% in-sample (IS), last 25% out-of-sample (OS)
stockData_IS = stockData[:int(len(stockData)*0.75)]
stockData_OS = stockData[-int(len(stockData)*0.25):]
print(len(stockData), len(stockData_IS), len(stockData_OS))

actualStart = stockData.index[0]
data = bt.feeds.PandasData(dataname=stockData_IS)

print('IS DATA: starting ', stockData_IS.index[0], ' finishing ', stockData_IS.index[-1])
print('OS DATA: starting ', stockData_OS.index[0], ' finishing ', stockData_OS.index[-1])
Define your trading strategy as a class in Python.
# Create a subclass of Strategy to define the indicators and logic
class MAcrossover(bt.Strategy):
    # list of parameters which are configurable for the strategy
    params = dict(
        pfast=10,  # period for the fast moving average
        pslow=20   # period for the slow moving average
    )

    def log(self, txt, dt=None):
        dt = dt or self.datas[0].datetime.date(0)
        # print(f'{dt.isoformat()} {txt}')  # Comment this line when running optimization

    def __init__(self):
        sma1 = bt.ind.SMA(period=self.p.pfast)  # fast moving average
        sma2 = bt.ind.SMA(period=self.p.pslow)  # slow moving average
        self.crossover = bt.ind.CrossOver(sma1, sma2)  # crossover signal

    def notify_order(self, order):
        if order.status in [order.Submitted, order.Accepted]:
            # An active Buy/Sell order has been submitted/accepted - Nothing to do
            return

        # Check if an order has been completed
        # Attention: broker could reject order if not enough cash
        if order.status in [order.Completed]:
            if order.isbuy():
                self.log(f'BUY EXECUTED, {order.executed.price:.2f}')
            elif order.issell():
                self.log(f'SELL EXECUTED, {order.executed.price:.2f}')
            self.bar_executed = len(self)
        elif order.status in [order.Canceled, order.Margin, order.Rejected]:
            self.log('Order Canceled/Margin/Rejected')

        # Reset orders
        self.order = None

    def next(self):
        if not self.position:  # not in the market
            if self.crossover > 0:  # if fast crosses slow to the upside
                self.buy()  # enter long
        elif self.crossover < 0:  # in the market & cross to the downside
            self.close()  # close long position
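To see what the crossover signal is doing without Backtrader in the way, here is a rough pure-Python sketch. The helper names `sma` and `crossover` and the toy price list are made up for illustration; this is not Backtrader's implementation, and its handling of exact touches between the two averages may differ.

```python
def sma(values, period):
    """Simple moving average; None until `period` observations are available."""
    out = []
    for i in range(len(values)):
        if i + 1 < period:
            out.append(None)
        else:
            window = values[i + 1 - period:i + 1]
            out.append(sum(window) / period)
    return out


def crossover(fast, slow):
    """+1 when fast crosses above slow, -1 when it crosses below, else 0."""
    signal = [0] * len(fast)
    for i in range(1, len(fast)):
        if None in (fast[i], slow[i], fast[i - 1], slow[i - 1]):
            continue  # not enough history for both averages yet
        prev_diff = fast[i - 1] - slow[i - 1]
        curr_diff = fast[i] - slow[i]
        if prev_diff <= 0 < curr_diff:
            signal[i] = 1
        elif prev_diff >= 0 > curr_diff:
            signal[i] = -1
    return signal


prices = [10, 9, 8, 7, 8, 10, 12, 11, 8, 6]  # toy data, not AAPL
signal = crossover(sma(prices, 2), sma(prices, 4))
print(signal)  # [0, 0, 0, 0, 0, 1, 0, 0, -1, 0]
```

The strategy class above buys on a +1 while flat and closes the position on a -1 while long.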
Define Commission Scheme.
class FixedCommisionScheme(bt.CommInfoBase):
    # Flat fee of 10 per order, regardless of trade size
    params = (
        ('commission', 10),
        ('stocklike', True),
        ('commtype', bt.CommInfoBase.COMM_FIXED)
    )

    def _getcommission(self, size, price, pseudoexec):
        return self.p.commission
Also create a sizing class for each trade, which takes a risk parameter.
class maxRiskSizer(bt.Sizer):
    '''
    Returns the number of shares, rounded down, that can be purchased
    for the max risk tolerance
    '''
    # list of parameters which are configurable for the sizer
    params = dict(
        prisk=0.35
    )

    def __init__(self):
        if self.p.prisk > 1 or self.p.prisk < 0:
            raise ValueError('The risk parameter is a percentage which must be '
                             'entered as a float, e.g. 0.5')

    def _getsizing(self, comminfo, cash, data, isbuy):
        # Whole shares affordable with the chosen fraction of cash; negative for sells
        size = math.floor((cash * self.p.prisk) / data[0])
        return size if isbuy else -size
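The sizing rule itself is simple arithmetic, and can be checked outside Backtrader. A small sketch follows; `max_risk_size` is a hypothetical standalone helper, and the price of 150 is a made-up number rather than an actual AAPL quote.

```python
import math


def max_risk_size(cash, price, prisk):
    """Whole shares affordable when committing a fraction `prisk` of cash."""
    if not 0 <= prisk <= 1:
        raise ValueError("prisk must be a fraction between 0 and 1")
    return math.floor(cash * prisk / price)


shares = max_risk_size(cash=10_000, price=150.0, prisk=0.35)
print(shares)  # 3500 / 150 = 23.33..., rounded down to 23
```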
Now we can perform a number of runs, ‘tuning’ the hyperparameters for the best in-sample performance. We then take the best combination (highest return over the trading period) and test it out-of-sample on the testing set.
optimized_runs = {}

def run(data, params, graph=False, benchmark=False):
    # Add Data
    cerebro = bt.Cerebro()
    cerebro.adddata(data)

    # Analyzers
    cerebro.addanalyzer(bt.analyzers.AnnualReturn)
    cerebro.addanalyzer(bt.analyzers.DrawDown)
    cerebro.addanalyzer(bt.analyzers.SharpeRatio_A, _name='sharpe_ratio')
    cerebro.addanalyzer(bt.analyzers.PyFolio, _name='PyFolio')

    # Broker Information
    broker_args = dict(coc=True)
    cerebro.broker = bt.brokers.BackBroker(**broker_args)
    comminfo = FixedCommisionScheme()
    cerebro.broker.addcommissioninfo(comminfo)
    cerebro.broker.set_cash(10000)

    # Add Strategy
    if benchmark:
        cerebro.addstrategy(BuyAndHold)  # BuyAndHold is not defined in this post
    else:
        cerebro.addstrategy(MAcrossover, pfast=params[0], pslow=params[1])

    # Default position size
    cerebro.addsizer(maxRiskSizer, prisk=params[2])

    strats = cerebro.run()
    if graph:
        cerebro.plot(iplot=False, style='candlestick')
    return strats

# 4 fast periods x 6 slow periods x 9 risk levels = 216 combinations
pfast = range(5, 25, 5)
pslow = range(50, 110, 10)
prisk = np.linspace(0.1, 1, 9)
params = list(product(pfast, pslow, prisk))

start = time.time()
for param in params:
    print(param)
    optimized_runs[param] = run(data, param)
end = time.time()
print(' time taken {:.2f} s'.format(end-start))
Now ‘optimise’ the runs from above.
final_results_list = []
for runs in optimized_runs:
    for strategy in optimized_runs[runs]:
        PnL = round(strategy.broker.get_value() - 10000, 2)
        sharpe = strategy.analyzers.sharpe_ratio.get_analysis()
        final_results_list.append([strategy.params.pfast,
                                   strategy.params.pslow,
                                   round(runs[2], 1),
                                   PnL,
                                   round(sharpe['sharperatio'], 2)])

# Two stable sorts: rank by Sharpe ratio (column 4), breaking ties by PnL (column 3)
sort_by_sharpe = sorted(final_results_list, key=lambda x: x[3], reverse=True)
sort_by_sharpe = sorted(sort_by_sharpe, key=lambda x: x[4], reverse=True)

for line in sort_by_sharpe:
    print(line)

print(len(optimized_runs))
Now we can test out-of-sample and see how we’ve done. Then we can understand why we need to adjust the Sharpe ratio t-statistic to account for multiple testing!
data_OS = bt.feeds.PandasData(dataname=stockData_OS)

# NOTE: `data` is the in-sample feed built earlier; pass data_OS here instead
# to evaluate the chosen parameters on the out-of-sample period.
strats = run(data, (5, 50, 1), graph=True)
strat_0 = strats[0]

portfolio_stats = strat_0.analyzers.getbyname('PyFolio')
returns, positions, transactions, gross_lev = portfolio_stats.get_pf_items()
vol = np.std(returns)*np.sqrt(252)  # annualised volatility from daily returns
returns.index = returns.index.tz_convert(None)

PnL = round(strat_0.broker.get_value() - 10000, 2)
sharpe = strat_0.analyzers.sharpe_ratio.get_analysis()

print('PnL $ : ', round(PnL, 2))
print('Sharpe Ratio : ', round(sharpe['sharperatio'], 2))
print('Annualised Volatility %: ', round(vol*100, 2))
Is this statistically significant?
Here we ask ourselves: is this result statistically different from that of a portfolio with a Sharpe ratio of 0?
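The test statistic for this question is the annualised Sharpe ratio scaled by the square root of the sample length in years, as described in the Evaluating Trading Strategies paper listed above. Written out, with N the number of daily observations and assuming 252 trading days per year:

```latex
t = \widehat{SR}_{\text{annual}} \times \sqrt{\frac{N}{252}},
\qquad H_0 : SR = 0
```

The longer the track record, the smaller the Sharpe ratio needed to reject the null hypothesis at a given significance level.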
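# t-statistic: annualised Sharpe ratio scaled by the square root of the sample length in years
t_statistic = sharpe['sharperatio']*np.sqrt(len(stockData_IS)/252)
print('Our T-statistic: ', round(t_statistic, 2))
# One-sided p-value from the survival function (1 - CDF), not the density (pdf)
print('p_value ', round(sc.stats.t.sf(t_statistic, 999), 3))
print('T_crit at 5% significance', round(sc.stats.t.ppf(1-0.05, 999), 3))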
Based on this we would assume that our result is significant, BUT we need to account for multiple testing. When we ‘tuned’ the hyperparameters of the model, we were essentially performing multiple tests on the same dataset. This is BAD research practice, and one of the key things Dr Marcos Lopez de Prado preaches against, because it is seen again and again in the finance industry.
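A quick simulation makes the danger concrete. The sketch below draws 216 return series with no edge at all and keeps the best t-statistic. It is a simplification of my own: the trials are independent Gaussian series (the function name, the 1% daily volatility and the 1000-day length are all assumptions), whereas the strategies in a real hyperparameter grid are correlated with one another.

```python
import math
import random
import statistics


def best_null_t_stat(n_trials, n_days, rng):
    """Largest t-statistic among n_trials strategies that have NO true edge."""
    best = -math.inf
    for _ in range(n_trials):
        # Daily returns drawn with zero mean: the true Sharpe ratio is 0
        returns = [rng.gauss(0.0, 0.01) for _ in range(n_days)]
        mean = statistics.fmean(returns)
        stdev = statistics.stdev(returns)
        # Equivalent to annualised Sharpe ratio * sqrt(years of data)
        t_stat = mean / stdev * math.sqrt(n_days)
        best = max(best, t_stat)
    return best


rng = random.Random(42)  # fixed seed so the run is repeatable
best = best_null_t_stat(n_trials=216, n_days=1000, rng=rng)
print(f"Best t-statistic out of 216 worthless strategies: {best:.2f}")
```

With this many trials the winner will almost always clear the unadjusted one-sided 5% critical value of roughly 1.65, even though every strategy is pure noise.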
Below we apply the Bonferroni adjustment to our significance level to see if our result is still significant. We divide the significance level by 216, as this is the number of tests (combinations of hyperparameters) we ran on the same dataset.
print('T_crit at Bonferroni-adjusted 5% significance', round(sc.stats.t.ppf(1-0.05/216, 999), 3))
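The same adjustment can also be expressed as an adjusted p-value. The sketch below uses only the standard library and therefore swaps the Student t distribution for a normal approximation, which is close at around 1000 degrees of freedom. The helper names and the example t-statistic of 2.5 are hypothetical, not results from the backtest above.

```python
from statistics import NormalDist


def bonferroni_critical_t(alpha, n_tests):
    """One-sided critical value after dividing alpha across n_tests (normal approx.)."""
    return NormalDist().inv_cdf(1 - alpha / n_tests)


def bonferroni_p_value(t_stat, n_tests):
    """One-sided p-value multiplied by the number of tests, capped at 1."""
    p_single = 1 - NormalDist().cdf(t_stat)
    return min(1.0, p_single * n_tests)


n_tests = 4 * 6 * 9  # fast periods x slow periods x risk levels = 216
single = bonferroni_critical_t(0.05, 1)
adjusted = bonferroni_critical_t(0.05, n_tests)
print(f"Unadjusted critical value: {single:.2f}")
print(f"Bonferroni critical value: {adjusted:.2f}")
print(f"Adjusted p-value for t = 2.5: {bonferroni_p_value(2.5, n_tests):.2f}")
```

A t-statistic that looks comfortable against the single-test threshold can fall well short once the hurdle is raised to reflect all 216 trials.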
Please remember this is an example of what not to do! Don’t carry out multiple testing without adjusting the p-value threshold appropriately.