# Need help with a stats problem



## village idiot (29 August 2012)

I am baffled by something of a statistical nature and so would aprreciate help from mazza or any other stats guru, thanks

suppose I have 2000 lines of data in excel, each line being the daily return/change for some security. I want to put a figure on the volatility based on daily returns over this whole period. I have done that a couple of differnt ways and I am baffled by why i am getting different results

one easy method is to take the SD  for the whole sample (or should that be population) using STDEV.P or STDEV.S of lines 1 to 2000, and convert that to HV. simple enough.

another possible method is to break it into 20 day lots, work out HV20 for each 20 day period, and then average the HV of those 100 periods. This gives 'the average HV20 over the last 2000 days'. This *always gives a lower figure than method one*.

as an experiment i tried a middle way, breaking it into 4 differnt 500 line lots, taking the 4 resulting HV of each and averaging them. this results in a figure somewhere between the other two. 

the differences are significant enough to matter eg  one security gives HV of 21.3%/19.8%/17.9% for methods 1/3/2
a currency pair gives 13.07%/12.1%/10.56% 


for the life of me I cannot see why this should be so, since they are measuring the distribution of exactly the same daily returns. I have eliminated as a cause the following;

it isnt that shorter samples cause the average to vary from zero more in a strong trend, meaning each deviation is on average less; this does happen but eliminating it didnt make much difference
it isnt an issue with using population or sample stdev; again it doesnt make much difference which one is used over 2000 lines and for the shorter timeframes i am using sample anyway
i thought it might be the outlier days having a bigger effect on the whole sample v being lost in a couple out of 100 smaller samples, but eliminating the outlier days made no difference
using  log returns or straight returns does not make the differnce


Now if I try the same experiment using random numbers in place of daily returns, the observed effect does not happen; all methods give pretty much the same answer, which is what i expected to happen with stocks but it didnt. I therefore suspect it must be something to do with the normalish distribution characterising security returns, but i cant figure out exactly what.

Hoping someone can shed any light on this for me, thanks 

VI


----------



## craft (29 August 2012)

Are you using a geometric average to average the smaller periods?

Stock volatility clusters whereas random number generation does not - the outcome from averaging, (standard as opposed to geometrical) will make a difference because of this. 

But I've never done stats at school so am probably barking up the wrong tree.


----------



## village idiot (29 August 2012)

Thanks. I was using arithmetic average, but have now tried using GEOMEAN, and if anything that makes the gap bigger, so thats another thing it isnt. 

Yes i am aware volatility clusters, but still dont get why those clusters seem to make more difference in the large sample than in the sum of the small samples. or to look at it the other way, clusters of low volatility making more difference in the small samples than the large one?

thanks anyway


----------



## McLovin (29 August 2012)

Did you calculate the daily price change yourself from the raw price data or was it already there?


----------



## village idiot (29 August 2012)

calculated myself,  in one column. the other methods calcs all access the same column


----------



## McLovin (29 August 2012)

village idiot said:


> calculated myself,  in one column. the other methods calcs all access the same column




Try rounding daily change calculation to three or four dp and see if it helps.


----------



## village idiot (29 August 2012)

McLovin said:


> Try rounding daily change calculation to three or four dp and see if it helps.





now tried, failed, thanks


----------



## craft (29 August 2012)

Second Guess.

Standard deviation is the deviation from the ‘n’ mean.  Given two different time frames you are working with two different means.

If volatility is clustering in the shorter time frames then the two means could be substantially different. Without being prone to clustering , random number generation shouldn’t have such differing means between the time frames.


----------



## McLovin (29 August 2012)

village idiot said:
			
		

> now tried, failed, thanks




No worries, I thought it might have been a floating point issue.

Could you post the data? Might be easier to have a play around with it.


----------



## Punta (29 August 2012)

village idiot said:


> shorter samples cause the average to vary from zero more in a strong trend ... eliminating it didnt make much difference
> 
> Now if I try the same experiment using random numbers in place of daily returns, the observed effect does not happen;
> 
> VI




I would imagine that the first item in your list is the cause.  How did you "eliminate" it?  

What happens if you add some low-frequency variability (or trend) to your random numbers?


----------



## skc (29 August 2012)

Do your random numbes have +ve and -ve like your return population?


----------



## village idiot (29 August 2012)

thanks for suggestions guys



Punta said:


> I would imagine that the first item in your list is the cause.  How did you "eliminate" it?
> 
> What happens if you add some low-frequency variability (or trend) to your random numbers?




I eliminated it by using columns to 'manually' calcluate the dispersion from the mean on each line as calculated by excel , then i replaced the means with zero. Since now they are both using the same 'mean' = zero, the dispersions for each line are identical, in fact i can now use one column. so then if i use that column , i still get differing answers depending on what i do with it.
not sure about your second suggestion



> Do your random numbes have +ve and -ve like your return population?



yes, took .5 away from each random number so they range from -.5 to +.5, presumably with a flat distribution




> Could you post the data? Might be easier to have a play around with it.




the raw data is any time series, and not much point in posting the spreadsheet without the formulas, but i can email the spreadsheet if youre interested . or post a link to it - how do I do that?


----------



## craft (29 August 2012)

village idiot said:


> I eliminated it by using columns to 'manually' calcluate the dispersion from the mean on each line as calculated by excel , then i replaced the means with zero. Since now they are both using the same 'mean' = zero, the dispersions for each line are identical, in fact i can now use one column. so then if i use that column , i still get differing answers depending on what i do with it.
> not sure about your second suggestion




Still on guess two.

10
9
10
8
11
10
9
8
10
9
9
10
10
11
9
10
9
10
9
9

This population mean is 9.5 and your deviation is going to be less measured against 9.5 then it is if your whole population mean is say 12. Can't get my head around what forcing the mean to zero does.


----------



## village idiot (29 August 2012)

I still dont think its the answer to this problem, but is interesting nonetheless. consider this table;
col 1 is a sample of daily changes with a mean of .65, representing a pretty strong uptrend within a larger sample that we will say has a mean change of 0, and lets say a mean absolute change of.3
Col2 measures the dispersion against the sample mean of .65, col3 measures it against the larger sample mean of 0
The SD using the smaller sample mean is smaller than the SD using  the larger sample mean, even though daily movements during this period were all larger than the usual mean abs change, and the overall change for the period was also large. In reality it would be quite a volatile period, certainly you would think so if you were short volatility unhedged,yet HV measured using the mean would give a falsely low reading, and i think this would hold in any strong trend.  Forcing the mean to zero would in this case probably show a truer representation of HV, although not the one in the book


	          mean .65	mean 0
	0.5	0.08	0.25
	1	0.58	1
	1	0.58	1
	1	0.58	1
	1.5	1.08	2.25
	-1	-1.42	1
	2	1.58	4
	1	0.58	1
	-1	-1.42	1
	1	0.58	1
	1	0.58	1
	-1	-1.42	1
	2	1.58	4
	1	0.58	1
	-1	-1.42	1
	-1	-1.42	1
	2	1.58	4
	1	0.58	1
	1	0.58	1
	1	0.58	1
avg;	0.65		
	variance       4.55	29.5
	SD	   2.13     5.43

sorry for table formatting!


----------



## Huskar (3 September 2012)

I am rather clueless when it comes to rigorous statistical methods but I have done some reading demonstrating why standard deviation and the bell curve does not explain asset prices over time: Nassim Taleb et al. 

My guess is that over a larger sample size you are more likely to have an outsized or "Black Swan" event, but now that I read your first post more carefully I see that when you excluded the outliers there was no change in the result so I don't know. 

Would be very interested if you work out a solution!

Perhaps if you see whether the same anomaly applies to other data sets - standard-distributed data like height or age, vs others such as temperature over time or wealth-distribution etc.? Is it something to do with fractals?


----------

