Australian (ASX) Stock Market Forum

Sample sizes

I am nutting out a strategy by looking at charts manually and making notes with a pen. In the absence of computer knowledge and programming know-how, what kind of sample size would be sufficient to ascertain whether the strategy works, before using more samples to validate it and then paper trading it?

200 different stocks?
100 different stocks, each examined at multiple points in time?
10 stocks with their life history studied?

I am interested in the responses here from all who have dabbled in testing or manually tested with paper and pencil.
 
I don't know the answer based on statistics (but would also like to know), however, I offer the following;

30 trades seems to be a popular number to use to validate.

I am paper trading a strategy at the moment which backtested very well. It produces approximately 30 trades per month. 10 years worth of backtesting revealed the worst run of losses to be 3 losing months in a row and there was never a losing year.

I'm therefore paper trading this to 100 trades as I'm not convinced 30 trades over 1 month is sufficient to validate the system.
 

Thanks for that Michael, I appreciate your response.

Do you paper trade using live data or historical data?
 
Hi Snake --

In my opinion ---

If the trades being analyzed are from an in-sample run, then there is no number of trades that are adequate to validate a trading system. I have documented systems (plural) that each have over one million closed trades in the in-sample period, yet are not profitable out-of-sample.

If the trades being analyzed come from uncontaminated out-of-sample runs, then statistical tests can be performed using any number of trades. The more trades, the better. You will need to set up a benchmark against which your system is to be tested. A common benchmark is random; another is zero profit.

To test against random: this is essentially a test against buy and hold with the same exposure as your system.

Examine the trades from your system. Note the characteristics of each trade, such as percent profit, number of bars / days held, and so forth. Those two might be enough to start. Make random trades from the same price series that you are testing, picking the entries at random so that there are about the same number of trades during each period -- for example, if your system makes six trades a year, have the random system make six entries per year. Have the random system hold as long as your system holds. If your system holds an average of 10 days with a standard deviation of 5 days, have the random system do the same.

Create a spreadsheet with the profit from each of your trades in one column and the profit from each of the random trades in a second column. Using a spreadsheet such as Excel, call the data analysis routines and ask for a t-test with common variance or equal variance between the two columns. If your results are statistically better than the random results, the t score that is reported will show that.

Since the random results could be different if different entry dates were selected, run the random system several times.
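The procedure above can be sketched in Python instead of a spreadsheet. This is a minimal illustration, not Howard's exact method: the system's trade profits below are made-up numbers, and the random benchmark simply draws zero-mean returns scaled by the holding period. `scipy.stats.ttest_ind` performs the equal-variance t-test mentioned.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical out-of-sample trade profits (percent per trade) from the system.
system_profits = np.array([2.1, -0.8, 3.4, 1.2, -1.5, 0.9, 2.7, -0.3, 1.8, 0.6])

def random_benchmark(n_trades, daily_vol=0.015, mean_hold=10, sd_hold=5):
    """One run of a random-entry system with matching trade count and
    holding-period distribution. Profit of a random trade is modeled as a
    zero-mean draw whose volatility grows with the square root of days held."""
    holds = np.clip(rng.normal(mean_hold, sd_hold, n_trades), 1, None)
    return rng.normal(0.0, daily_vol * np.sqrt(holds)) * 100  # in percent

# Run the random system several times, as suggested, and t-test each run.
for run in range(5):
    random_profits = random_benchmark(len(system_profits))
    t, p = stats.ttest_ind(system_profits, random_profits, equal_var=True)
    print(f"run {run}: t = {t:.2f}, p = {p:.3f}")
```

A positive t score with a small p-value across most runs would suggest the system's edge is not explained by random entries with the same exposure.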

The reason that 30 is often mentioned is that a couple of things happen when more than 30 data items are analyzed.
One -- a sample of fewer than about 30 items is treated as a "small" sample, and a correction is applied -- the variance estimate divides by one less than the number of items (n - 1), and the wider t-distribution is used in place of the normal. When more than 30 items are in the sample, the correction matters little and is generally not used.
Two -- the means of samples drawn from almost any distribution tend to follow the "normal" distribution once each sample contains 30 or more items (the central limit theorem). That means that assumptions of normality can be made and statistical techniques are less restricted.

But -- not to worry -- if you have fewer than 30 data points (from out-of-sample results), the tests can still be run. It is just harder to get a t-test (or other test) to show statistical significance with small samples.
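Both points can be checked numerically. A small sketch, using an arbitrary skewed (exponential) distribution rather than market data: means of 30-item samples come out close to normal, and a one-sample t-test against zero profit still runs with only 12 data points.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Means of 30-item samples from a skewed distribution are themselves
# approximately normally distributed (central limit theorem).
sample_means = rng.exponential(scale=1.0, size=(10_000, 30)).mean(axis=1)
print("skew of raw data:    ", round(float(stats.skew(rng.exponential(1.0, 10_000))), 2))
print("skew of sample means:", round(float(stats.skew(sample_means)), 2))  # much closer to 0

# A t-test still runs with fewer than 30 points; it simply needs a larger
# effect to reach significance.
small = rng.normal(0.5, 1.0, size=12)         # 12 hypothetical trade profits
t, p = stats.ttest_1samp(small, popmean=0.0)  # benchmark: zero profit
print(f"n = 12: t = {t:.2f}, p = {p:.3f}")
```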

Thanks for listening,
Howard
 

Hi Howard,

I appreciate your response and expert opinion.

What do you mean by "contaminated"? Could you explain what a contaminated out of sample run would be?
 
Hi Snake --

The process of designing a trading system is:
1. Search / optimize / tweak / etc. over an in-sample period and generate a list of alternatives. Each alternative is a different combination of logic and parameter values. Each time the system is run over the in-sample data, new information is learned and the system is fine-tuned, then run again.
2. Pick the "best" alternative. Best is defined according to an objective function that you select and that incorporates the features you prefer in a trading system. Good features are rapid equity growth, smooth equity growth, and so forth. Poor features are high drawdowns, and so forth. The features that are best for you may not be best for me. Whatever features you build into your objective function will be rewarded by the alternatives at the top of the list.
3. Using a set of data that immediately follows the in-sample data, run the alternative that scored best and generate the trades for that period. The period, and the trades, are called the out-of-sample period and out-of-sample trades.

You already knew all this.

Most people, maybe all people, do not get the system the way they want it on the first pass through. So they look not only at the in-sample results, but also the out-of-sample results. If the out-of-sample results are examined one time, followed by a decision to trade the system or go back to the drawing board, then the out-of-sample results are the best estimates of future trading that are available. If the system is re-tweaked after looking at the out-of-sample results, then the information that came from the out-of-sample data becomes incorporated into the system and that previously out-of-sample data becomes incorporated into the in-sample data. This is the contamination I am talking about.

It takes very little "peeking" at the out-of-sample results, followed by adjustment of the trading system, to contaminate the validation process.

Some systems designers anticipate this, and use three sets of data. The first set is the normal in-sample data. The second set immediately follows the in-sample data and is used to "guide" the development. It is used to adjust the system before going back to the in-sample data for more work. The third set of data immediately follows the second and is the true out-of-sample data. That is used one time, and one time only, to make the go / no go decision.
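The three-set scheme amounts to a simple chronological split of the price history. A minimal sketch follows; the 60/20/20 proportions are illustrative, not prescriptive, and the price series is a placeholder.

```python
import pandas as pd

# Hypothetical daily price history, oldest first.
dates = pd.date_range("2000-01-01", periods=2500, freq="B")
prices = pd.Series(range(2500), index=dates, name="close")

# Chronological split -- never shuffle time-series data.
n = len(prices)
in_sample  = prices.iloc[: int(n * 0.6)]              # optimize / tweak here
validation = prices.iloc[int(n * 0.6): int(n * 0.8)]  # "guide" development
out_sample = prices.iloc[int(n * 0.8):]               # touch once: go / no-go

# Each set immediately follows the previous one in time.
assert in_sample.index.max() < validation.index.min() < out_sample.index.min()
print(len(in_sample), len(validation), len(out_sample))  # 1500 500 500
```

The discipline is in the usage, not the code: once the out-of-sample set has influenced a design decision, it no longer provides an unbiased estimate.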

Thanks,
Howard
 

Strangely the concept of "back fitting" is better understood in racing circles.
One can develop a "hum dinger" of a racing system when the results and R/R (odds) are known in advance.

Change the terminology from betting to trading, system to methodology, odds to R/R, fixed % betting to fixed % risk, staking plans to risk management, and so on, but two killers remain constant: MaxDD, and not knowing when you are back-fitting your method to better encompass what happened in the past.
 

Thanks Howard I appreciate the effort. You have intelligence and the ability to clearly explain things. :)
Expertise is a good thing.
Regards
 