| | Maths | Physics | Humanities |
|---|---|---|---|
| Male | 100 | 250 | 150 |
| Female | 20 | 50 | 30 |
Sunday, March 24, 2024
clean (“tidy”) data: one observation in each row, one variable in each column, one value in each cell
“Like families, tidy datasets are all alike but every messy dataset is messy in its own way.” Hadley Wickham (2014) Tidy Data. Journal of Statistical Software, 59(10).
numeric (discrete or continuous)
non-numeric (ordered or not)
Pay attention to how variables are encoded!
Purpose: summarize a series of values by one numeric value
central characteristics (indicateurs de tendance centrale)
dispersion characteristics (indicateurs de dispersion)
Mean (Moyenne): \(\overline{X} = \frac{1}{n} \sum_{i=1}^n x_i\)
Median (Médiane): value that splits the sample into two subsamples with equal sizes
Mode (Mode): most frequently observed values
Quartiles (Quartiles): 3 values that split the sample into 4 equal size subsamples
Deciles (Déciles): 9 values that split the sample into 10 equal size subsamples
Percentiles (Percentiles): 99 values that split the sample into 100 equal size subsamples
Quantiles (Quantiles): generalize the others
How to increase the mean salary in a company?
increase all salaries by \(x\)%
increase the salary of the highest-paid person
cut a few jobs with low salaries
10 marks for 5 students: same mean, same median
\(\mbox{Var}(X) = \frac{1}{n} \sum_{i=1}^n (x_i - \overline{X})^2\)
and standard deviation (écart type): \(\sigma_X = \sqrt{\mbox{Var}(X)}\)
range (étendue): difference between the largest and the smallest values
inter-quartile range (écart inter-quartile): difference between the 1st and the 3rd quartiles (half of the observations lie between these two quantities)
positive (or null if all the observations have the same value)
does not change when values are translated (shifted by a constant)
sensitive to extreme values (like the mean)
expressed in the same unit as the original variable (like the mean)
Consequences
mean and standard deviation can be added (confidence interval)
they can also be divided:
\(\mbox{CV}(X) = \frac{\sigma_X}{\overline{X}}\)
the coefficient of variation (coefficient de variation) can be used to compare the respective variability of two series
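These summaries are directly available in R; a minimal sketch (note that R's var and sd use the unbiased denominator \(n-1\) rather than the \(n\) of the formulas above):

```r
x <- c(10, 12, 9, 15, 14)    # a small series of values

mean(x)                      # central characteristics
median(x)
quantile(x)                  # minimum, quartiles, maximum

var(x); sd(x)                # dispersion (computed with n - 1 in R)
diff(range(x))               # range
IQR(x)                       # inter-quartile range
sd(x) / mean(x)              # coefficient of variation
```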
“Girls shine in class, boys shine in competitive exams”
LE MONDE | 07.09.09 - Article published in the edition of 08.09.09. Philippe Jacqué
Girls get better marks during their school years, but succeed less well than men in the entrance exams of the top grandes écoles.
[…]
To test [this hypothesis], three economists - Evren Örs, professor at HEC, Eloïc Peyrache, director of HEC, and Frédéric Palomino, alumnus of the Parisian school and currently associate professor at Edhec Lille - scrutinized the results obtained between 2005 and 2007 at the first-year entrance exam of HEC, one of the most renowned management schools.
[…]
“From a technical point of view, it seems that the structure of the HEC exam creates more heterogeneity among men than among women,” says Mr. Peyrache. While “on average” the performances of men and women are similar, “women's marks are concentrated around the mean, whereas men's marks are very dispersed, with many very good and many very bad marks. Mechanically, when the top 380 results are selected, you get slightly more men.”
Which solution sounds the best? What are the advantages/drawbacks of such a transformation?
\(z_i = \frac{x_i - \overline{X}}{\sigma_X}\)
After centering and scaling, the mean of the variable is 0 and its standard deviation is 1.
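In R, centering and scaling is done with scale; a minimal sketch:

```r
x <- c(10, 12, 9, 15, 14)
z <- scale(x)        # centers (subtracts the mean) then scales (divides by the sd)

mean(z); sd(z)       # 0 (up to rounding) and 1, as expected
```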
useful for asymmetric distributions, to make the variable closer to a Gaussian distribution (after transformation) \(\Rightarrow\) often performed before tests
useful for ratios (because a value twice or half another one has a log of the same magnitude, with opposite signs)
for \(p\)-values, \(\log_{10}\) is often used
most frequent logs:
\(y = \log_2(x) \Leftrightarrow x = 2^y\)
\(y = \log_{10}(x) \Leftrightarrow x = 10^y\)
\(y = \ln(x) \Leftrightarrow x = \exp(y)\)
compute ratios
normalization
other functions ( \(\sqrt{.}\) , …)
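A minimal R sketch of these transformations:

```r
x <- c(0.5, 1, 2, 100)

log2(x)      # log base 2: log2(2) = 1 and log2(0.5) = -1 (same magnitude, opposite signs)
log10(x)     # log base 10, often used for p-values
log(x)       # natural logarithm
sqrt(x)      # another common transformation
```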
In theory, a graphic should:
show the data
help the reader look at the data and understand their structure
avoid data distortion
display many data in a simple way
References
Edward Tufte (1983) The Visual Display of Quantitative Information, Graphics Press.
The type of chart depends on the variable type!
Try to keep the lie factor close to 1: \(\frac{\textrm{effect size on graphic}}{\textrm{effect size in data}}\)
One year of Hollande's presidency in figures (“Une année de présidence Hollande en chiffres”, Le Monde, 05/06/2013)
Comments by readers:
… with variants (row / column profiles, percentages, …)
| | Maths | Physics | Humanities |
|---|---|---|---|
| Male | 100 | 250 | 150 |
| Female | 20 | 50 | 30 |
How to summarize factor vs factor?
Properties:
\(0 \leq V \leq 1\)
\(V = 0 \Leftrightarrow X\) and \(Y\) are perfectly independent
\(V = 1 \Leftrightarrow\) knowing the value of \(X\) (resp. \(Y\)), you know the value of \(Y\) (resp. \(X\))
/!\ Cramér's \(V\):
is a descriptive statistic (it does not provide any evidence of a meaningful relation)
tends to be biased (overestimates the strength of the relation)
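A minimal R sketch computing Cramér's \(V\) on the contingency table above, using the standard formula \(V = \sqrt{\chi^2 / (n \, (\min(r, c) - 1))}\):

```r
tab <- matrix(c(100, 20, 250, 50, 150, 30), nrow = 2,
              dimnames = list(c("Male", "Female"),
                              c("Maths", "Physics", "Humanities")))
chi2 <- chisq.test(tab)$statistic
n <- sum(tab)
V <- sqrt(chi2 / (n * (min(dim(tab)) - 1)))
V   # here V = 0: the two rows are exactly proportional (perfect independence)
```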
How to summarize numeric vs numeric?
Properties:
the covariance, \(\mbox{Cov}(X,Y) = \frac{1}{n} \sum_{i=1}^n (x_i - \overline{X})(y_i - \overline{Y})\), is a sum of negative and positive signed areas
scales as the product of the variable units \(\Rightarrow\) coefficient of correlation
Properties:
ranges between -1 and 1, with \(\pm 1\) indicating perfect linear correlations
positive when the two variables vary in the same direction
is sensitive to outliers and can only detect linear relations
Properties:
ranges between -1 and 1, with \(\pm 1\) indicating identical or opposite ranks between the two variables
positive when the two variables vary in the same direction
is not sensitive to outliers and can detect any monotonic relation
With \(n=50\) observations, the correlation coefficient is significantly different from 0 (at a 5% risk) from approximately 0.27.
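A minimal R sketch of both coefficients and of the associated test:

```r
x <- iris$Sepal.Length; y <- iris$Petal.Length

cor(x, y)                        # Pearson correlation (linear relations)
cor(x, y, method = "spearman")   # Spearman correlation (monotonic relations)
cor.test(x, y)                   # test of H0: the correlation is 0
```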
Spurious correlations:
The medical journal New England Journal of Medicine has just published a study linking high chocolate consumption to the awarding of Nobel prizes. The New England Journal of Medicine, an American weekly published since 1812, is considered the most prestigious medical journal. […]
Dr. Franz Messerli, from Columbia University in New York and author of the study, explains that “there is a surprising significant correlation between per-capita chocolate consumption and the number of Nobel laureates per ten million inhabitants, over a total of 23 countries”. […]
The only exception in the study: Sweden. Its inhabitants consume “only” 6.4 kilos of chocolate per year and per person, for a total of 32 Nobel prizes. Never mind: according to the researchers, this would simply be favoritism from the Nobel committee. And while a correlation is shown at the country level, the study says nothing about the individual chocolate consumption of Nobel laureates.
23/11/2012 | By Hayat Gazzane
http://plus.lefigaro.fr/lien/le-chocolat-engendre-des-tueurs-en-serie-20121123-1589103
British researchers have taken it upon themselves to debunk Franz Messerli's study establishing a strong correlation between chocolate consumption and Nobel prizes. Using the same methodology, they manage to “prove” that the countries where the most chocolate is eaten are also those that produce the most serial killers and road accidents (study in English).
and also “la statistique expliquée à mon chat” (“statistics explained to my cat”):
Shark attacks are correlated with ice cream sales: why?
If you know the potential confounding variables, use partial correlation!
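A minimal sketch of partial correlation via residuals (the correlation between x and y after removing the linear effect of a confounder z; the ppcor package provides the same computation):

```r
set.seed(1)
z <- rnorm(100)          # confounder (e.g., temperature)
x <- z + rnorm(100)      # e.g., ice cream sales
y <- z + rnorm(100)      # e.g., shark attacks

cor(x, y)                                  # spuriously high: both depend on z
cor(resid(lm(x ~ z)), resid(lm(y ~ z)))    # partial correlation: close to 0
```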
How to summarize numeric ( \(X\) ) vs factor ( \(Y \in \{1, ..., K\}\) )?
\(Y\) can be seen as a variable that defines groups of individuals
It turns out that: \(\mbox{Var}_{\textrm{intra}} + \mbox{Var}_{\textrm{inter}} = \mbox{Var}(X)\), where \(\mbox{Var}_{\textrm{intra}}\) is the (weighted) average of the within-group variances and \(\mbox{Var}_{\textrm{inter}}\) is the (weighted) variance of the group means
The correlation ratio is defined as: \(\eta(X|Y) = \sqrt{\frac{\mbox{Var}_{\textrm{inter}}}{\mbox{Var}(X)}}\)
its square, \(\eta^2(X|Y)\), is the proportion of the variance explained by the groups (see the sketch below)
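A minimal R sketch computing the correlation ratio on iris (Sepal.Width by Species):

```r
x <- iris$Sepal.Width
y <- iris$Species

n <- length(x)
means <- tapply(x, y, mean)      # group means
sizes <- tapply(x, y, length)    # group sizes
var_inter <- sum(sizes * (means - mean(x))^2) / n
var_total <- mean((x - mean(x))^2)   # variance with denominator n
sqrt(var_inter / var_total)          # the correlation ratio eta
```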
The type of chart depends on the variables’ types!
two factors: e.g., grouped or stacked barplots, mosaic plots
two numeric variables: e.g., scatterplots
From a sample, obtain general conclusions (with a control of the error) on the whole population from which the sample has been taken.
confidence interval: from the sample, define an interval in which the average value for a given variable is likely to be
statistical test: from the observations made on a sample, can we invalidate an assumption made on the whole population?
formulate an hypothesis \(H_0\)
from observations, calculate a test statistic (e.g., the standardized difference between the means in the two samples)
find the theoretical distribution of the test statistic under \(H_0\)
deduce the probability that the observations occur under \(H_0\): this is called the p-value
conclude: if the p-value is low (usually below \(\alpha=5\)% as a convention), \(H_0\) is unlikely: we say that “\(H_0\) is rejected”. We have that:
\(\alpha = \mathbb{P}_{H_0} (H_0\mbox{ is rejected})\)
\(H_0 \Rightarrow\) theoretical distribution for a given test statistic
then
observed value has a low probability under the theoretical distribution \(\Rightarrow\) \(H_0\) is unlikely
\(\mathbb{P}(\mbox{Type I error}) = \alpha\) (risk)
\(\mathbb{P}(\mbox{Type II error}) = 1-\beta\) with \(\beta\): power
Hence:
the smaller the p-value, the smaller the risk to make an error while rejecting \(H_0\)
results of tests are not just black or white: the p-value gives a degree of confidence in the conclusion
The only way to simultaneously control \(\alpha\) and increase \(\beta\) is to increase the sample size.
Basics on computing sample size
Facts on \(\beta\): \(\beta\) increases when:
\(\alpha\) increases
the sample size increases
the effect size increases (i.e., the alternative hypothesis becomes more distinct from the null hypothesis)
\(\Rightarrow\) defining a priori values for \(\alpha\), \(n\) and the effect size can be used to give an estimate of the power (and reciprocally, fixing a power can be used to target a relevant sample size)
Example output of power.t.test (see ?power.t.test):
Two-sample t test power calculation
n = 25
delta = 1
sd = 1
sig.level = 0.05
power = 0.9337076
alternative = two.sided
NOTE: n is number in *each* group
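The output above can be reproduced with the call below; the same function, given a target power instead of n, returns the required sample size:

```r
# power of a two-sample t test: n = 25 per group, true difference delta = 1,
# common standard deviation sd = 1, risk alpha = 5%
power.t.test(n = 25, delta = 1, sd = 1, sig.level = 0.05)

# conversely: sample size needed to reach a power of 90% in the same setting
power.t.test(power = 0.9, delta = 1, sd = 1, sig.level = 0.05)
```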
from observations, calculate a test statistic \(S_n\)
find the theoretical distribution of the test statistic
using this theoretical distribution, find an interval, IC, with a high probability ( \(1-\alpha\) ) to find the test statistics (as defined in the population) in: \[ \mathbb{P}(S_n \in \textrm{IC}) \geq 1-\alpha \]
Testing \(H_0\): “\(S\) is equal to 0”: \(H_0\) is not rejected \(\Leftrightarrow\) \(0 \in \textrm{IC}\)
comparison of a mean to a given value ( \(H_0\) : the mean of \(X\) is equal to 0): \(t\) test (Student)
\(X\) is supposed to follow a Gaussian distribution (with unknown variance)
the test statistic is: \(T = \frac{\overline{X} - 0}{\sigma_X / \sqrt{n}}\)
under the null hypothesis, the theoretical distribution is the Student distribution with \(n-1\) degrees of freedom
accepting the null hypothesis is equivalent to having the tested value (0 in the example above) included in the confidence interval for the mean of \(X\)
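A minimal R sketch on simulated data:

```r
set.seed(1)
x <- rnorm(30, mean = 0.5)   # sample of size 30

t.test(x, mu = 0)            # one-sample t test of H0: the mean of x is 0
# the printed 95% confidence interval contains 0 exactly when the p-value > 5%
```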
goodness of fit (adéquation) of the distribution of \(X\) to a given distribution (very often, the Gaussian distribution)
no assumption on the distribution of \(X\)
median: Wilcoxon test
normality: Shapiro-Wilk
any theoretical distribution: Kolmogorov-Smirnov, \(\chi^2\) (compare observed frequencies in intervals with theoretical frequencies)
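A minimal R sketch of these tests (note that passing estimated parameters to ks.test, as below, is a common shortcut that slightly biases its p-value):

```r
set.seed(1)
x <- rnorm(100)

wilcox.test(x, mu = 0)                # median (Wilcoxon signed rank test)
shapiro.test(x)                       # normality (Shapiro-Wilk)
ks.test(x, "pnorm", mean(x), sd(x))   # fit to a given distribution (Kolmogorov-Smirnov)
```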
Titanic (males, adults): contingency table and row profiles
Independence means the same level of survival whatever the class (for instance, 20% of survival whatever the class).
independence between the two variables:
\(\chi^2\) test (non parametric) of a contingency table
Fisher exact test (non parametric but limited to small contingency tables and sample sizes)
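A minimal R sketch using the Titanic dataset shipped with R:

```r
tab <- Titanic[, "Male", "Adult", ]   # Class x Survived table for adult males
tab
chisq.test(tab)    # chi-squared test of independence
fisher.test(tab)   # exact test (small contingency tables)
```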
test for correlation/association between two numeric variables
the factor can be seen as a group variable \(\Rightarrow\) these tests are called comparisons of \(K\) samples
tests can be for paired samples or unpaired samples
Comparison of the distribution of \(X\) in group 1 and in group 2 (unpaired samples, non parametric): Kolmogorov-Smirnov test
Comparison of a central characteristic of \(X\) in group 1 and in group 2: \(t\) test (Student; parametric, \(X\) supposed Gaussian) or Mann-Whitney / Wilcoxon rank sum test (non parametric)
Comparison of the variance / dispersion of \(X\) in group 1 and in group 2:
Fisher test ( \(K=2\) ): \(X\) is supposed to be Gaussian (parametric)
Siegel-Tukey test (non parametric, can be used with an ordinal variable)
Comparison of a central characteristic of \(X\) in \(K\) groups: ANOVA (parametric) or Kruskal-Wallis test (non parametric)
Comparison of the variance / dispersion of \(X\) in \(K\) groups: Bartlett test (parametric) or Levene test
GIYF (Google is your friend): search for “choose a statistical test”
not paired
   extra group ID
9    0.0     A  9
12   0.8     B  2
6    3.4     A  6
15  -0.1     B  5
18   1.6     B  8
7    3.7     A  7
paired
  before  after
1  200.1  392.9
2  190.9  393.2
3  192.7  345.1
4  213.0  393.0
5  241.4  434.0
6  196.9  427.9
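A minimal sketch of the corresponding R calls, assuming the two datasets above are stored in data frames d1 (columns extra, group, ID) and d2 (columns before, after):

```r
# unpaired comparison of a central characteristic between groups A and B
t.test(extra ~ group, data = d1)        # Student (Welch) t test
wilcox.test(extra ~ group, data = d1)   # non-parametric alternative

# paired comparison: each row is the same individual measured twice
t.test(d2$before, d2$after, paired = TRUE)
```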
Case where \(Y\) (numeric) is explained by one or several explanatory variables \(X_j\) (all numeric); with a single variable: \[ Y = a + b X + \epsilon \] Very important remark: Testing \(H_0:\ b=0\) is exactly equivalent to testing \(\textrm{Cor}(X,Y) = 0\) (Pearson correlation test)!
\[ Y = a + b_1 X_1 + b_2 X_2 + \epsilon \]
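In R, such models are fitted with lm and inspected with summary; the sketch below produces the output that follows (a simple regression, as shown in its Call):

```r
model <- lm(Sepal.Width ~ Petal.Width, data = iris)   # simple linear regression
summary(model)
```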
Call:
lm(formula = Sepal.Width ~ Petal.Width, data = iris)
Residuals:
Min 1Q Median 3Q Max
-1.09907 -0.23626 -0.01064 0.23345 1.17532
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.30843 0.06210 53.278 < 2e-16 ***
Petal.Width -0.20936 0.04374 -4.786 4.07e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.407 on 148 degrees of freedom
Multiple R-squared: 0.134, Adjusted R-squared: 0.1282
F-statistic: 22.91 on 1 and 148 DF, p-value: 4.073e-06
interpretation: an increase of one unit in Petal.Width corresponds to a decrease of 0.209 unit in Sepal.Width
\(t\) value corresponds to a Student test for \(H_0:\ b = 0\)
this test is equivalent to the (Pearson) correlation test between \(X\) and \(Y\)
Case where \(Y\) (numeric) is explained by one or several explanatory variables \(X_j\) (all numeric): \[ Y = a + b_1 X_1 + b_2 X_2 + b_3 X_3 + \epsilon \]
Call:
lm(formula = Sepal.Width ~ Sepal.Length + Petal.Width + Petal.Length,
data = iris)
Residuals:
Min 1Q Median 3Q Max
-0.88045 -0.20945 0.01426 0.17942 0.78125
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.04309 0.27058 3.855 0.000173 ***
Sepal.Length 0.60707 0.06217 9.765 < 2e-16 ***
Petal.Width 0.55803 0.12256 4.553 1.1e-05 ***
Petal.Length -0.58603 0.06214 -9.431 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3038 on 146 degrees of freedom
Multiple R-squared: 0.524, Adjusted R-squared: 0.5142
F-statistic: 53.58 on 3 and 146 DF, p-value: < 2.2e-16
\(X\) with \(K\) levels is (silently) recoded into 0/1 variables:
if \(X\) is {blue, red}, then the linear model is: \(Y_i = \alpha_b \mathbf{1}_{\{X_i \textrm{ is blue}\}} + \alpha_r \mathbf{1}_{\{X_i \textrm{ is red}\}} + \epsilon\)
or (usually preferred version) \(Y_i = \underbrace{\beta_0}_{\textrm{basal level of }Y} + \underbrace{\beta_r}_{\textrm{additional level when }X_i\textrm{ is red}} \mathbf{1}_{\{X_i \textrm{ is red}\}} + \epsilon\)
with the relation: \(\left\{\begin{array}{l} \alpha_b = \beta_0\\ \alpha_r = \beta_0 + \beta_r \end{array}\right.\)
Interpretation of coefficients:
\(Y_i = \alpha_b \mathbf{1}_{\{X_i \textrm{ is blue}\}} + \alpha_r \mathbf{1}_{\{X_i \textrm{ is red}\}} + \epsilon\): testing \(H_0:\ \alpha_r = \alpha_b\) is exactly equivalent to an ANOVA (or Student test) of \(Y\) between the two groups of \(X\) (\(Y \sim X\))
\(Y_i = \underbrace{\beta_0}_{\textrm{basal level of }Y} + \underbrace{\beta_r}_{\textrm{additional level when }X_i\textrm{ is red}} \mathbf{1}_{\{X_i \textrm{ is red}\}} + \epsilon\): testing \(H_0:\ \beta_r = 0\) is also exactly equivalent
Call:
lm(formula = Sepal.Width ~ Species, data = iris)
Residuals:
Min 1Q Median 3Q Max
-1.128 -0.228 0.026 0.226 0.972
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.42800 0.04804 71.359 < 2e-16 ***
Speciesversicolor -0.65800 0.06794 -9.685 < 2e-16 ***
Speciesvirginica -0.45400 0.06794 -6.683 4.54e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3397 on 147 degrees of freedom
Multiple R-squared: 0.4008, Adjusted R-squared: 0.3926
F-statistic: 49.16 on 2 and 147 DF, p-value: < 2.2e-16
\(X\) with \(K\) levels is recoded automatically into ( \(K-1\) ) 0/1 variables:
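The global test of the whole factor can be obtained with drop1 (a sketch producing the output below):

```r
model <- lm(Sepal.Width ~ Species, data = iris)
drop1(model, test = "F")   # F test of removing the whole Species factor
```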
Single term deletions
Model:
Sepal.Width ~ Species
Df Sum of Sq RSS AIC F value Pr(>F)
<none> 16.962 -320.95
Species 2 11.345 28.307 -248.13 49.16 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
residuals ( \(\epsilon\) ) are uncorrelated, have a common variance (homoscedasticity), are uncorrelated with \(X\), and are Gaussian
the number of observations is larger (much larger is better) than the number of variables
Call:
glm(formula = Species ~ Sepal.Width, family = binomial(link = logit),
data = iris)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 20.230 4.165 4.857 1.19e-06 ***
Sepal.Width -6.552 1.350 -4.853 1.22e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 138.629 on 99 degrees of freedom
Residual deviance: 71.421 on 98 degrees of freedom
AIC: 75.421
Number of Fisher Scoring iterations: 6
The model is \[ \log\left[\frac{\mathbb{P}(Y = \textrm{versicolor})}{1-\mathbb{P}(Y = \textrm{versicolor})}\right] = a + b X \]
\(b\) is also interpretable: \(\exp(b)\) is the odds ratio of the outcome when \(X\) increases by one unit (or when \(X\) is 1 if \(X\) is a binary variable)
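A sketch reproducing a fit like the one above; the 100 observations and the sign of the slope suggest that only the first two species (setosa vs versicolor) were kept, which is an assumption here:

```r
iris2 <- droplevels(iris[1:100, ])   # assumed subset: setosa vs versicolor
model <- glm(Species ~ Sepal.Width, family = binomial(link = logit),
             data = iris2)

exp(coef(model)["Sepal.Width"])   # odds ratio for a one-unit increase of Sepal.Width
```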
Framework: Suppose you are performing \(G\) tests at level \(\alpha\), \(\mathbb{P}(\mbox{at least one FP if }H_0\mbox{ is always true}) = 1 - (1-\alpha)^G\)
Ex: for \(\alpha=5\)% and \(G=20\) , \(\mathbb{P}(\mbox{at least one FP if } H_0\mbox{ is always true}) \simeq 64\) %!!!
Ex: For more than 75 tests and if \(H_0\) is always true, the probability to have at least one false positive is very close to 100%!
Number of decisions for \(G\) independent tests:
Instead of the risk \(\alpha\), control:
familywise error rate (FWER): FWER \(= \mathbb{P}(U>0)\), where \(U\) is the number of false positives among the \(G\) tests (i.e., probability to have at least one false positive decision)
false discovery rate (FDR): FDR \(= \mathbb{E}(Q)\) with \(Q = \left\{ \begin{array}{cl} U/R & \mbox{if }R>0\\ 0 & \mbox{otherwise} \end{array} \right.\), where \(R\) is the total number of rejections
Settings: p-values \(p_1\), …, \(p_G\) (e.g., corresponding to \(G\) tests on \(G\) different genes)
adjusted p-values are \(\tilde{p}_1\), …, \(\tilde{p}_G\) such that rejecting the tests with \(\tilde{p}_g < \alpha\) ensures (depending on the adjustment procedure) \[ \qquad \left\{ \begin{array}{l} \mathbb{P}(U > 0) \leq \alpha \quad \mbox{(FWER control)}\\ \mathbb{E}(Q) \leq \alpha \quad \mbox{(FDR control)} \end{array} \right.\]
Calculating adjusted p-values (for independent tests)
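In R, p.adjust implements the usual procedures; a minimal sketch with made-up p-values:

```r
pvals <- c(0.001, 0.008, 0.039, 0.041, 0.160, 0.380)   # hypothetical p-values

p.adjust(pvals, method = "bonferroni")   # controls the FWER
p.adjust(pvals, method = "BH")           # Benjamini-Hochberg: controls the FDR
```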
When using a multivariate model like:
\[ Y = a + b_1 X_1 + b_2 X_2 + b_3 X_3 + \epsilon \]
you usually perform the tests in two steps:
step 1: test the full model against the null model \(Y = a + \epsilon\);
step 2: if test at step 1 is significant, test each coefficient using a correction called Tukey’s Honestly Significant Difference (Tukey’s HSD).
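For a model where the explanatory variable is a factor, a minimal sketch of the two steps (using aov, whose fit is equivalent to lm):

```r
model <- aov(Sepal.Width ~ Species, data = iris)

summary(model)    # step 1: global F test of the full model against the null model
TukeyHSD(model)   # step 2: pairwise group comparisons with Tukey's HSD correction
```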
Main objective: summarize a large number of numeric variables using a small number of combinations of these variables
\(n \textrm{ individuals } \left\{ \begin{array}{c} ...\\ ...\\ \underbrace{...}_{p \textrm{ variables}} \end{array} \right.\)
data: \(X = \left( \begin{array}{cc} 1 & 3 \\ 2 & 4 \\ 2 & 1 \end{array}\right)\) can be represented by:
But what can we do if more than 2 or 3 columns?
PCA:
tries to maximize variability in the projection.
creates components that are linear combinations of the original variables
Data: 50 individuals, 3 variables
Example 1: no correlation between variables
Example 2: linear correlation between x1 and x2
Example 3: linear correlation between x1 and x2 and x3
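Beyond these simulated examples, a minimal R sketch of PCA with prcomp on iris:

```r
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)   # PCA on scaled variables

summary(pca)    # proportion of variance captured by each component
pca$rotation    # components as linear combinations of the original variables
biplot(pca)     # individuals and variables on the first two components
```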
Purpose: group individuals that look alike
Depends on:
the number of groups
what “look alike” means (distance choice)
Here: two types of clustering (HC and \(k\)-means)
Is based on:
a distance between individuals (usually, Euclidean distance)
linkage (distance between groups of individuals): common linkage is Ward’s linkage
Is iterative:
each individual is a group
find the two closest groups and merge them into a new group (for Ward's: minimize the loss of within-group variability)
end when there is only one group with all individuals
Distances between individuals are read following branches:
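A minimal R sketch producing such a dendrogram:

```r
d <- dist(iris[, 1:4])                # Euclidean distances between individuals
hc <- hclust(d, method = "ward.D2")   # hierarchical clustering with Ward's linkage

plot(hc)                  # dendrogram
groups <- cutree(hc, 3)   # cut the tree into 3 clusters
```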
HC provides a solution for any number of clusters but can be very difficult to use if \(n\) is large
\(k\)-means requires that the number of clusters is chosen in advance but it is faster
\(k\)-means is stochastic: it gives different solutions at each run depending on the initialization
HC with Ward’s linkage can be seen as an approximate solution of the “best” \(k\)-means
In practice: use HC first and initialize \(k\)-means with HC
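A sketch of this strategy, reusing the hc tree from the previous sketch:

```r
clusters <- cutree(hc, 3)                                       # HC solution
centers <- aggregate(iris[, 1:4], list(clusters), mean)[, -1]   # group means

km <- kmeans(iris[, 1:4], centers = centers)   # k-means started from the HC centers
table(km$cluster, clusters)                    # compare the two partitions
```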
Heavily inspired by Sébastien Déjean's previous version of the class.
With ideas taken from http://r-graph-gallery.com/ as well (for the part on graphics)
slide 31: contingency table from Wikimedia Commons, by ASnieckus “Table of gender by major.png”
slide 46: dot plot from http://www.sthda.com/english/wiki/wiki.php?id_contents=7868
slide 47: barplot from ggplot2 documentation
slide 48: scatterplot matrix from http://www.sthda.com/english/wiki/scatter-plot-matrices-r-base-graphs
slide 64: Galton boxes from Wikimedia Commons, by Marcin Floryan “Galton_Box.svg” and Antoine Taveneaux “Planche_de_Galton.jpg”
slide: \(k\)-means clustering from Wikimedia Commons, by Mquantin https://commons.wikimedia.org/w/index.php?curid=61321400