This post is based on Bishara and Hittner's (2012) article, introduced to me by Dr. Takashi Yamauchi. Correlation is a simple yet essential tool for establishing the relationship between two variables. However, if Pearson's r is used and the data are not normally distributed, the rate of Type I errors (false positives, i.e. seeing a correlation where there truly is none) might increase dramatically. To offset this, the authors recommend a data normalization technique called the rank-based inverse normal (RIN) transformation.
This transformation is performed according to the following formula:
\[ f(x) = \Phi^{-1} \Big(\frac{x_{r} - \frac{1}{2}} {n} \Big), \]
where "$x_{r}$ is ascending rank of $x$, such that $x_{r} = 1 $ for the lowest value of $x$" (p. 401), $\Phi^{-1}$ is the inverse of the standard normal cumulative distribution function, and $n$ is the number of observations (sample size). Let's now write a function to calculate it in Python. Note that ppf (percent point function) is SciPy's name for the quantile function, i.e. the inverse of the cumulative distribution function:
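from scipy.stats import norm
import pandas as pd

def rinfunc(ds):
    # Ascending ranks: 1 for the lowest value (tied values get the average rank)
    ds_rank = ds.rank()
    # (rank - 1/2) / n maps every rank to a probability strictly inside (0, 1)
    numerator = ds_rank - 0.5
    par = numerator / len(ds)
    # Inverse normal CDF (quantile function) of those probabilities
    result = norm.ppf(par)
    return result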
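As a quick hand check of the formula, take a hypothetical sample of $n = 6$ distinct observations (the sample size is chosen here only for illustration). The lowest value has $x_{r} = 1$, so
\[ f(x) = \Phi^{-1} \Big(\frac{1 - \frac{1}{2}} {6} \Big) = \Phi^{-1} \Big(\frac{1}{12} \Big) \approx -1.383, \]
and the highest value has $x_{r} = 6$, so
\[ f(x) = \Phi^{-1} \Big(\frac{6 - \frac{1}{2}} {6} \Big) = \Phi^{-1} \Big(\frac{11}{12} \Big) \approx 1.383. \]
Subtracting $\frac{1}{2}$ from the rank keeps the argument of $\Phi^{-1}$ strictly between 0 and 1, so the transformed values are always finite and, when there are no ties, symmetric around zero.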
Let's test it.
# Small example data set; the 'index' list becomes the row labels
d = {'one': pd.Series([10, 25, 3, 11, 24, 6]),
     'two': pd.Series([10, 20, 30, 40, 80, 70]),
     'index': ['p', 'r', 'o', 'g', 'r', 'a']}
df = pd.DataFrame(d)
df.set_index('index', inplace=True)
# Apply the RIN transformation to each column separately
df['one_transformed'] = rinfunc(df['one'])
df['two_transformed'] = rinfunc(df['two'])
df
| index | one | two | one_transformed | two_transformed |
|---|---|---|---|---|
| p | 10 | 10 | -0.210428 | -1.382994 |
| r | 25 | 20 | 1.382994 | -0.674490 |
| o | 3 | 30 | -1.382994 | -0.210428 |
| g | 11 | 40 | 0.210428 | 0.210428 |
| r | 24 | 80 | 0.674490 | 1.382994 |
| a | 6 | 70 | -0.674490 | 0.674490 |
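Since the point of the transformation is to run Pearson's r on the transformed values, here is a minimal sketch of that last step. It is not taken from the article: the helper name rin_transform is made up for this example, the data are the same two toy columns as above, and scipy.stats.pearsonr is just one convenient way to get r and its p-value. Six observations are far too few for any real inference, so this only illustrates the mechanics.

```python
import pandas as pd
from scipy.stats import norm, pearsonr


def rin_transform(values: pd.Series) -> pd.Series:
    """Hypothetical helper: RIN transformation, ppf((rank - 0.5) / n)."""
    # Ascending ranks; pandas assigns tied values their average rank by default
    ranks = values.rank()
    # Assumes no missing values, so len(values) equals the sample size n
    return pd.Series(norm.ppf((ranks - 0.5) / len(values)), index=values.index)


# Same toy data as in the example above
one = pd.Series([10, 25, 3, 11, 24, 6])
two = pd.Series([10, 20, 30, 40, 80, 70])

# Pearson's r on the raw values and on the RIN-transformed values
r_raw, p_raw = pearsonr(one, two)
r_rin, p_rin = pearsonr(rin_transform(one), rin_transform(two))

print(f"raw: r = {r_raw:.3f}, p = {p_raw:.3f}")
print(f"RIN: r = {r_rin:.3f}, p = {p_rin:.3f}")
```

Because the transformation depends only on ranks, an extreme raw value is pulled in to the same normal quantile as any other top-ranked observation before Pearson's r is computed.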
References
- Bishara, A. J., & Hittner, J. B. (2012). Testing the significance of a correlation with nonnormal data: Comparison of Pearson, Spearman, transformation, and resampling approaches. Psychological Methods, 17(3), 399.