Topic: Determining thresholds: ALL ideas welcomed!

TrueTears

Determining thresholds: ALL ideas welcomed!
« on: January 08, 2014, 10:06:00 pm »
The context of this question is actually finance, but the question at heart is a statistical/mathematical one. I've removed all the finance jargon and left in only the bare minimum of details required. All ideas are welcome.

Question:

Suppose I have the following expression for a statistic, which I'll call S, computed for each entity:

[The equation itself and the definitions of its terms were embedded images and have not survived.]

Assume the constant appearing in the expression is fixed. Note that the total number of observations per entity is also fixed.

I have data for each 'entity' (an entity here simply refers to a firm/company). In total, I have 2228 entities, and for each entity I have the same number of observations.

For each entity, I substitute its observations into the expression and obtain a value of S. Thus, in total, I have 2228 values of S.

Now, a large value of S means the entity is "bad" and a small value of S means the entity is "good". However, the problem is: how large does a value of S have to be in order to classify an entity as "bad"? That is, what is the threshold such that, if S exceeds the threshold value, we can classify the entity as "bad"?

For example, suppose the threshold is some value c. If entity A has S below c while entity B has S above c, then entity A is "good" while entity B is "bad".





My attempts so far:

My first attempt was to try to get data on an entity that is known to be "bad", calculate its S, and use this value as the threshold. The problem is that I cannot obtain data on "bad" entities (due to proprietary data and private licensing issues...).

For my next attempt, I obtained the empirical distribution by applying a kernel density estimator (a fancy term for taking the histogram and using an estimation technique to get an estimated probability density function from it) to the 2228 values of S. I then calculated the 99th percentile of this pooled distribution (for robustness, I also calculated the 97.5th and 95th percentiles) and used this value as the threshold. However, the main critique is that this is too arbitrary and there is not enough rationale for the method.
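[Editor's note: to make the procedure concrete, here is a minimal sketch of the pooled-percentile approach in Python. The array name stats and the simulated skewed data are illustrative stand-ins, not the actual finance data.]

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Stand-in for the 2228 pooled statistics; simulated as positively
# skewed data, matching the shape described later in the thread.
stats = rng.lognormal(mean=0.0, sigma=1.0, size=2228)

# Kernel density estimate of the pooled empirical distribution
# (Gaussian kernel; bandwidth set by Scott's rule by default).
kde = gaussian_kde(stats)

# The thresholds themselves are just high percentiles of the pooled sample.
for q in (95, 97.5, 99):
    print(f"{q}th percentile threshold: {np.percentile(stats, q):.4f}")

# Classify an entity as "bad" if its statistic exceeds the threshold.
threshold = np.percentile(stats, 99)
is_bad = stats > threshold
print(f"flagged {is_bad.sum()} of {stats.size} entities as 'bad'")
```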



Main problem:

So I am wondering if anyone has ideas on what statistical/mathematical techniques I can apply to derive appropriate thresholds for S. Currently, I really have no idea what tools are available for this problem. I have tried a number of techniques: extreme value theory (doesn't really work, because it's used to derive the distributions of maxima), Bayesian sampling (doesn't really apply here, since priors don't help with the apparent problem), and asymptotic distributions (perhaps the most "successful" to date, but this falls short because one can only derive an asymptotic distribution of S for ONE entity, whereas I am really after the distribution of S over the entire POOLED sample of entities).
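[Editor's note: one hedged aside on the extreme value theory point. Block maxima are only one branch of EVT; the peaks-over-threshold variant models the tail of the pooled sample directly with a generalized Pareto distribution, which may sidestep the maxima objection. A sketch using the same illustrative stats array as above:]

```python
import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(0)
stats = rng.lognormal(mean=0.0, sigma=1.0, size=2228)  # stand-in data

# Peaks-over-threshold: fit a generalized Pareto distribution to the
# exceedances above a moderately high starting point u.
u = np.percentile(stats, 90)
exceedances = stats[stats > u] - u
shape, loc, scale = genpareto.fit(exceedances, floc=0.0)

# Tail quantile: the level exceeded with overall probability p.
# P(S > x) = P(S > u) * (1 - F_GPD(x - u)), solved for x.
p = 0.01
p_u = (stats > u).mean()  # empirical P(S > u)
threshold = u + genpareto.ppf(1 - p / p_u, shape, loc=0.0, scale=scale)
print(f"POT-based 99% threshold: {threshold:.4f}")
```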
« Last Edit: January 08, 2014, 10:08:34 pm by TrueTears »

psyxwar

Re: Determining thresholds: ALL ideas welcomed!
« Reply #1 on: January 08, 2014, 10:36:36 pm »
#beef

Wouldn't any method of calculating the threshold be arbitrary, considering you don't actually have any data on "bad" entities? I'm no maths whiz and I have no background in finance, so apologies if this seems like a stupid question. But I just don't see how you can calculate a threshold without having any data, apart from "guessing".

TrueTears

Re: Determining thresholds: ALL ideas welcomed!
« Reply #2 on: January 08, 2014, 10:42:03 pm »
^^^^^^ Exactly my feelings.

Now try convincing a journal referee of that haha...

alondouek

Re: Determining thresholds: ALL ideas welcomed!
« Reply #3 on: January 08, 2014, 10:51:36 pm »
Quote from: TrueTears on January 08, 2014, 10:06:00 pm

Question: when you got the empirical distribution, what approximate shape did it have? If it was approximately normal (or if it could even be approximated to normality by some transformation), then couldn't you just apply a slight bastardisation of the 68-95-99.7 rule, using two of those points on the distribution (or any, really, if you can carry out the approximation to normality) as your upper and lower thresholds?

For example, let's take a density function over some interval [the specific function and interval were images and are missing]. This gives a model (I HOPE) that appears to be approximately normal - and presumably within the range of normality where the 68-95-99.7 rule could apply. With your variables and data sets, you could probably manipulate the above expression (which is admittedly a bit more t-distribution-ish than normal, but it's still symmetrical, and that appears to be the important part here) to fit your data, then apply the reference rule, no?
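[Editor's note: if the distribution really were near-normal, the reference rule reduces to mean-plus-k-standard-deviations cutoffs. A quick sketch of that idea, again on an illustrative simulated stats array rather than the real values:]

```python
import numpy as np
from scipy.stats import skewtest

rng = np.random.default_rng(0)
stats = rng.normal(loc=0.0, scale=1.0, size=2228)  # stand-in data

# Sanity-check approximate normality before leaning on the rule.
print("skew test p-value:", skewtest(stats).pvalue)

mu, sd = stats.mean(), stats.std(ddof=1)
# 68-95-99.7 rule: mean + 2 sd cuts off roughly the top 2.5%,
# mean + 3 sd roughly the top 0.15%.
for k in (2, 3):
    print(f"mean + {k} sd threshold: {mu + k * sd:.4f}")
```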

Disclaimer: I'm bad at maths lol

TrueTears

Re: Determining thresholds: ALL ideas welcomed!
« Reply #4 on: January 08, 2014, 11:00:33 pm »
Quote from: alondouek on January 08, 2014, 10:51:36 pm
Yup, that's what I initially set out to do as well. Sadly, the empirical distribution is HIGHLY positively skewed, which is to be expected since there aren't too many "bad" entities (you'd hope!), hence the long tail to the right.

Being so skewed, I tried to approximate the density with a mixture of normals, i.e. f(x) = w·φ(x; μ₁, σ₁²) + (1 − w)·φ(x; μ₂, σ₂²), where 0 ≤ w ≤ 1. The idea was to put more weight on the normal component centred around the "peak" than on the component approximating the skewed right tail. However, because the data is so positively skewed, the latter component ends up with a huge variance, which complicates things.
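[Editor's note: for what it's worth, a two-component Gaussian mixture like the one described can be fitted in a few lines with scikit-learn. This sketch (again on simulated stand-in data) also shows one hypothetical way to read a threshold off the fit, namely the point where the tail component becomes the more probable explanation of an observation:]

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
stats = rng.lognormal(mean=0.0, sigma=1.0, size=2228)  # stand-in data

# Two components: one for the bulk ("good" entities),
# one for the skewed right tail.
gmm = GaussianMixture(n_components=2, random_state=0)
gmm.fit(stats.reshape(-1, 1))

means = gmm.means_.ravel()
weights = gmm.weights_
tail = means.argmax()  # component approximating the right tail
print(f"bulk mean {means[1 - tail]:.3f} (weight {weights[1 - tail]:.3f})")
print(f"tail mean {means[tail]:.3f} (weight {weights[tail]:.3f})")

# Hypothetical threshold: the point where the tail component becomes
# the more probable explanation of an observation.
grid = np.linspace(stats.min(), stats.max(), 2000).reshape(-1, 1)
post = gmm.predict_proba(grid)
crossing = grid[np.argmax(post[:, tail] > 0.5)]
print(f"posterior-crossing threshold: {crossing[0]:.4f}")
```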

To be honest, I get very, very nice results just using the 99th/97.5th/95th percentiles. However, such an approach does not seem to please the referee enough :(

alondouek

Re: Determining thresholds: ALL ideas welcomed!
« Reply #5 on: January 08, 2014, 11:09:33 pm »
Not sure I agree with the referee hahaha, I'd say the standard normal reference rule is just as arbitrary as yours!

I'm wondering if you could treat your data - given that it's got a significant skew, so certain assumptions about normality and equality of SDs don't hold - with non-parametric tests.

Have you thought about applying a Mann-Whitney test or a Wilcoxon Rank Sum test? Also, maybe you could reduce the skew of the histogram by applying a log-transformation to the original data? This should help normalise the data somewhat (potentially).
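[Editor's note: on the log-transformation point, a minimal sketch of checking how much a log transform tames the skew. The data here is a simulated positive-valued stand-in; the real series, as TrueTears notes below, also contains negatives.]

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
stats = rng.lognormal(mean=0.0, sigma=1.0, size=2228)  # stand-in data

print(f"skewness before: {skew(stats):.3f}")

# Log-transform (valid here because the stand-in data is strictly
# positive; with negative values a shift would be needed first).
logged = np.log(stats)
print(f"skewness after:  {skew(logged):.3f}")

# A threshold chosen on the log scale maps back via exp().
log_threshold = np.percentile(logged, 99)
print(f"back-transformed 99th-percentile threshold: {np.exp(log_threshold):.4f}")
```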
« Last Edit: January 08, 2014, 11:47:40 pm by alondouek »

alondouek

Re: Determining thresholds: ALL ideas welcomed!
« Reply #6 on: January 09, 2014, 02:07:54 am »
I've been thinking about this a bit more, and I think the main issue your referee has is that they can't see a theoretical basis for using the 95th/97.5th/99th percentile as the threshold value. Given that you're getting good results with this, I think you could potentially derive a basis for it in a 'non-mathematical' fashion. Obviously I'm not versed in this area at all, but if your data is highly positively skewed for the reason you've given (that 'bad' entities are far rarer in your data set than those with a lower value of S), then there should be a fairly distinct point on the histogram - and I'm assuming the skew to be steep - where the data begins to move from 'good' entities to 'bad' entities. Let's arbitrarily call this point the "transition state". As such, you could logically state that this point is your threshold.

The whole point of this being useful is that it could back up your initial 95-97.5-99 procedure; if I'm right (and I can't know for sure because I've not seen the histogram), the point at which this "transition state" exists should be between the 95th percentile and the 99th percentile (because of the dramatic skew of the data), and you can therefore label that based on some reasoning as your threshold percentile. (i.e. percentile at which "transition state" exists = threshold)
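[Editor's note: one hypothetical way to operationalise the "transition state" idea is to take the kernel density estimate and look for the first point past the mode where the density has collapsed to a small fraction of its peak. This is only a heuristic sketch on simulated stand-in data, not anything from the thread:]

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
stats = rng.lognormal(mean=0.0, sigma=1.0, size=2228)  # stand-in data

kde = gaussian_kde(stats)
grid = np.linspace(stats.min(), stats.max(), 2000)
density = kde(grid)

# Heuristic "transition state": the first point past the mode where
# the estimated density falls below 1% of its peak value.
mode_idx = density.argmax()
rel = density / density.max()
past_mode = rel[mode_idx:] < 0.01
transition = grid[mode_idx + past_mode.argmax()]

pctile = (stats < transition).mean() * 100
print(f"transition point: {transition:.4f} (~{pctile:.1f}th percentile)")
```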

Otherwise, the only thing I can possibly think of at the moment is what I mentioned above this post about taking the logs of the original data to normalise it somewhat, thereby providing a theoretically clearer threshold value (but it'd still be fairly arbitrary).

TrueTears

Re: Determining thresholds: ALL ideas welcomed!
« Reply #7 on: January 09, 2014, 05:26:54 pm »
Quote from: alondouek
Have you thought about applying a Mann-Whitney test or a Wilcoxon Rank Sum test?
I've tried these two, but they don't seem to work very well. Specifically, since my sample is quite large, the former relies on convergence to normality, which doesn't really apply here. As for the latter, it assumes independence between the two samples, i.e. independence between "good" firms and "bad" firms, which is a very strong assumption to make.

Quote from: alondouek on January 09, 2014, 02:07:54 am
The right side of the tail curves down steadily. Although the 95-97.5-99 method works very well, the referee seems to want some theoretical modelling framework that either derives a threshold on its own or gives support for this method. Sadly, I don't think a worded argument would work here haha.

Quote from: alondouek
Otherwise, the only thing I can possibly think of at the moment is what I mentioned above this post about taking the logs of the original data to normalise it somewhat, thereby providing a theoretically clearer threshold value (but it'd still be fairly arbitrary).
Log-transforming the data is certainly a good idea; however, since my sample contains both positive and negative values, we can only model the positive side of the distribution with a lognormal. I have tried this, omitting the negative values, but since they make up a fair portion of the left-hand side, the results become highly inaccurate.
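[Editor's note: a possible workaround, which is an editorial suggestion rather than anything tried in the thread, is a shifted log: translate the whole sample into positive territory before taking logs, so the negative observations are kept rather than discarded. A sketch on simulated stand-in data with both signs:]

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
# Stand-in data containing both positive and negative values,
# as described in the thread.
stats = rng.lognormal(mean=0.0, sigma=1.0, size=2228) - 0.5

# Shifted log: translate the sample so every value is positive, then log.
# This keeps the negative observations instead of discarding them.
shift = 1.0 - stats.min()
logged = np.log(stats + shift)
print(f"skewness before: {skew(stats):.3f}, after: {skew(logged):.3f}")

# A threshold picked on the shifted-log scale maps back via exp() - shift.
threshold = np.exp(np.percentile(logged, 99)) - shift
print(f"back-transformed 99th-percentile threshold: {threshold:.4f}")
```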

Thanks for all the ideas though!

alondouek

Re: Determining thresholds: ALL ideas welcomed!
« Reply #8 on: January 09, 2014, 06:11:23 pm »
hmm, this is an interesting dilemma!

As you've pointed out, the best way to utilise the method you've found that actually works is to provide a theoretical basis for the 95-97.5-99 approach. Given that the context of this research is finance, I suspect there would be an underlying financial explanation for your data and the resultant distribution. If you could explain the skew and the shape of the histogram of S-values based on financial theory, then I reckon you could provide support for your method that lends it credence in a practical setting.

With respect to log-normalling, could you potentially take the absolute values of the negative data, or would that render those datapoints uninterpretable? (I'm kind of disappointed that this didn't work lol).

But yeah, your best bet appears to be justifying your method. Can you think of any established financial reasons why your method works with the data (unequal distribution of income and profitability across firms, etc.)?