| | Maths | Physics | Humanities |
|---|---|---|---|
| Male | 100 | 250 | 150 |
| Female | 20 | 50 | 30 |
Sunday, March 24, 2024
clean (“tidy”) data: one observation in each row, one variable in each column, one value in each cell
“Like families, tidy datasets are all alike but every messy dataset is messy in its own way.” Hadley Wickham (2014) Tidy Data. Journal of Statistical Software, 59(10).
numeric (discrete or continuous)
non-numeric (ordered or not)
Pay attention to how variables are encoded!
Purpose: summarize a series of values by one numeric value
central characteristics (indicateurs de tendance centrale)
dispersion characteristics (indicateurs de dispersion)
Mean (Moyenne): \(\overline{X} = \frac{1}{n} \sum_{i=1}^n x_i\)
Median (Médiane): value that splits the sample into two subsamples with equal sizes
Mode (Mode): most frequently observed values
Quartiles (Quartiles): 3 values that split the sample into 4 equal size subsamples
Deciles (Déciles): 9 values that split the sample into 10 equal size subsamples
Percentiles (Percentiles): 99 values that split the sample into 100 equal size subsamples
Quantiles (Quantiles): generalize the others
How to increase the mean salary in a company?
increase all salaries by \(x\)%
increase the salary of the highest-paid person
cut a few jobs with low salaries
10 marks for 5 students: same mean, same median
\(\mbox{Var}(X) = \frac{1}{n} \sum_{i=1}^n (x_i - \overline{X})^2\)
and standard deviation (écart type): \(\sigma_X = \sqrt{\mbox{Var}(X)}\)
range (étendue): difference between the largest and the smallest values
inter-quartile range (écart inter-quartile): difference between the 1st and the 3rd quartiles (half of the observations lie between these two quantities)
positive (or null if all the observations have the same value)
does not change when values are translated (shifted by a constant)
sensitive to extreme values (like the mean)
expressed in the same unit as the original variable (like the mean)
Consequences
mean and standard deviation can be added (confidence interval)
they can also be divided:
\(\mbox{CV}(X) = \frac{\sigma_X}{\overline{X}}\)
the coefficient of variation (coefficient de variation) can be used to compare the respective variability of two series
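These summaries are directly available in R; a minimal sketch (note that R's var and sd use the unbiased denominator \(n-1\) rather than the \(n\) of the formulas above):

```r
x <- c(10, 12, 9, 15, 14)    # a small series of values

mean(x)                      # central characteristics
median(x)
quantile(x)                  # minimum, quartiles, maximum

var(x); sd(x)                # dispersion (computed with n - 1 in R)
diff(range(x))               # range
IQR(x)                       # inter-quartile range
sd(x) / mean(x)              # coefficient of variation
```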
“Girls shine in class, boys shine in competitive exams”
LE MONDE | 07.09.09 - Article published in the edition of 08.09.09. Philippe Jacqué
Girls get better marks during their school years, but succeed less well than men in the entrance exams of the top grandes écoles.
[…]
To test [this hypothesis], three economists - Evren Örs, professor at HEC, Eloïc Peyrache, director of HEC, and Frédéric Palomino, alumnus of the Parisian school and currently associate professor at Edhec Lille - scrutinized the results obtained between 2005 and 2007 at the first-year entrance exam of HEC, one of the most renowned management schools.
[…]
“From a technical point of view, it seems that the structure of the HEC exam creates more heterogeneity among men than among women,” says Mr. Peyrache. While “on average” the performances of men and women are similar, “women's marks are concentrated around the mean, whereas men's marks are very dispersed, with many very good and many very bad marks. Mechanically, when the top 380 results are selected, you get slightly more men.”
Which solution sounds the best? What are the advantages/drawbacks of such a transformation?
\(z_i = \frac{x_i - \overline{X}}{\sigma_X}\)
After centering and scaling, the mean of the variable is 0 and its standard deviation is 1.
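In R, centering and scaling is done with scale; a minimal sketch:

```r
x <- c(10, 12, 9, 15, 14)
z <- scale(x)        # centers (subtracts the mean) then scales (divides by the sd)

mean(z); sd(z)       # 0 (up to rounding) and 1, as expected
```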
useful for asymmetric distributions, to make the variable closer to a Gaussian distribution (after transformation) \(\Rightarrow\) often performed before tests
useful for ratios (because a value twice or half another one has a log of the same magnitude, with opposite signs)
for \(p\)-values, \(\log_{10}\) is often used
most frequent logs:
\(y = \log_2(x) \Leftrightarrow x = 2^y\)
\(y = \log_{10}(x) \Leftrightarrow x = 10^y\)
\(y = \ln(x) \Leftrightarrow x = \exp(y)\)
compute ratios
normalization
other functions ( \(\sqrt{.}\) , …)
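A minimal R sketch of these transformations:

```r
x <- c(0.5, 1, 2, 100)

log2(x)      # log base 2: log2(2) = 1 and log2(0.5) = -1 (same magnitude, opposite signs)
log10(x)     # log base 10, often used for p-values
log(x)       # natural logarithm
sqrt(x)      # another common transformation
```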
In theory, a graphic should:
show the data
help the reader look at the data and understand their structure
avoid data distortion
display many data in a simple way
References
Edward Tufte (1983) The Visual Display of Quantitative Information, Graphics Press.
The type of chart depends on the variable type!
Try to keep the lie factor close to 1: \(\frac{\textrm{effect size on graphic}}{\textrm{effect size in data}}\)
One year of Hollande's presidency in figures (“Une année de présidence Hollande en chiffres”, Le Monde, 05/06/2013)
Comments by readers:
… with variants (row / column profiles, percentages, …)
| | Maths | Physics | Humanities |
|---|---|---|---|
| Male | 100 | 250 | 150 |
| Female | 20 | 50 | 30 |
How to summarize factor vs factor?
Properties:
\(0 \leq V \leq 1\)
\(V = 0 \Leftrightarrow X\) and \(Y\) are perfectly independent
\(V = 1 \Leftrightarrow\) knowing the value of \(X\) (resp. \(Y\)), you know the value of \(Y\) (resp. \(X\))
/!\ Cramér's \(V\):
is a descriptive statistic (it does not provide any evidence of a meaningful relation)
tends to be biased (overestimates the strength of the relation)
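A minimal R sketch computing Cramér's \(V\) on the contingency table above, using the standard formula \(V = \sqrt{\chi^2 / (n \, (\min(r, c) - 1))}\):

```r
tab <- matrix(c(100, 20, 250, 50, 150, 30), nrow = 2,
              dimnames = list(c("Male", "Female"),
                              c("Maths", "Physics", "Humanities")))
chi2 <- chisq.test(tab)$statistic
n <- sum(tab)
V <- sqrt(chi2 / (n * (min(dim(tab)) - 1)))
V   # here V = 0: the two rows are exactly proportional (perfect independence)
```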
How to summarize numeric vs numeric?
Properties:
the covariance, \(\mbox{Cov}(X,Y) = \frac{1}{n} \sum_{i=1}^n (x_i - \overline{X})(y_i - \overline{Y})\), is a sum of negative and positive signed areas
scales as the product of the variable units \(\Rightarrow\) coefficient of correlation
Properties:
ranges between -1 and 1, with \(\pm 1\) indicating perfect linear correlations
positive when the two variables vary in the same direction
is sensitive to outliers and can only detect linear relations
Properties:
ranges between -1 and 1, with \(\pm 1\) indicating identical or opposite ranks between the two variables
positive when the two variables vary in the same direction
is not sensitive to outliers and can detect any monotonic relation
With \(n=50\) observations, the correlation coefficient is significantly different from 0 (at a 5% risk) from approximately 0.27.
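A minimal R sketch of both coefficients and of the associated test:

```r
x <- iris$Sepal.Length; y <- iris$Petal.Length

cor(x, y)                        # Pearson correlation (linear relations)
cor(x, y, method = "spearman")   # Spearman correlation (monotonic relations)
cor.test(x, y)                   # test of H0: the correlation is 0
```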
Spurious correlations:
The medical journal New England Journal of Medicine has just published a study linking high chocolate consumption to the awarding of Nobel prizes. The New England Journal of Medicine, an American weekly published since 1812, is considered the most prestigious medical journal. […]
Dr. Franz Messerli, from Columbia University in New York and author of the study, explains that “there is a surprising significant correlation between per-capita chocolate consumption and the number of Nobel laureates per ten million inhabitants, over a total of 23 countries”. […]
The only exception in the study: Sweden. Its inhabitants consume “only” 6.4 kilos of chocolate per year and per person, for a total of 32 Nobel prizes. Never mind: according to the researchers, this would simply be favoritism from the Nobel committee. And while a correlation is shown at the country level, the study says nothing about the individual chocolate consumption of Nobel laureates.
23/11/2012 | By Hayat Gazzane
http://plus.lefigaro.fr/lien/le-chocolat-engendre-des-tueurs-en-serie-20121123-1589103
British researchers have taken it upon themselves to debunk Franz Messerli's study establishing a strong correlation between chocolate consumption and Nobel prizes. Using the same methodology, they manage to “prove” that the countries where the most chocolate is eaten are also those that produce the most serial killers and road accidents (study in English).
and also “la statistique expliquée à mon chat” (“statistics explained to my cat”):
Shark attacks are correlated with ice cream sales: why?
If you know the potential confounding variables, use partial correlation!
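A minimal sketch of partial correlation via residuals (the correlation between x and y after removing the linear effect of a confounder z; the ppcor package provides the same computation):

```r
set.seed(1)
z <- rnorm(100)          # confounder (e.g., temperature)
x <- z + rnorm(100)      # e.g., ice cream sales
y <- z + rnorm(100)      # e.g., shark attacks

cor(x, y)                                  # spuriously high: both depend on z
cor(resid(lm(x ~ z)), resid(lm(y ~ z)))    # partial correlation: close to 0
```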
How to summarize numeric ( \(X\) ) vs factor ( \(Y \in \{1, ..., K\}\) )?
\(Y\) can be seen as a variable that defines groups of individuals
It turns out that: \(\mbox{Var}_{\textrm{intra}} + \mbox{Var}_{\textrm{inter}} = \mbox{Var}(X)\), where \(\mbox{Var}_{\textrm{intra}}\) is the (weighted) average of the within-group variances and \(\mbox{Var}_{\textrm{inter}}\) is the (weighted) variance of the group means
The correlation ratio is defined as: \(\eta(X|Y) = \sqrt{\frac{\mbox{Var}_{\textrm{inter}}}{\mbox{Var}(X)}}\)
its square, \(\eta^2(X|Y)\), is the proportion of the variance explained by the groups (see the sketch below)
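A minimal R sketch computing the correlation ratio on iris (Sepal.Width by Species):

```r
x <- iris$Sepal.Width
y <- iris$Species

n <- length(x)
means <- tapply(x, y, mean)      # group means
sizes <- tapply(x, y, length)    # group sizes
var_inter <- sum(sizes * (means - mean(x))^2) / n
var_total <- mean((x - mean(x))^2)   # variance with denominator n
sqrt(var_inter / var_total)          # the correlation ratio eta
```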
The type of chart depends on the variables’ types!
two factors: e.g., grouped or stacked barplots, mosaic plots
two numeric variables: e.g., scatterplots
From a sample, obtain general conclusions (with a control of the error) on the whole population from which the sample has been taken.
confidence interval: from the sample, define an interval in which the average value for a given variable is likely to be
statistical test: from the observations made on a sample, can we invalidate an assumption made on the whole population?
formulate an hypothesis \(H_0\)
from observations, calculate a test statistic (e.g., the standardized difference between the means in the two samples)
find the theoretical distribution of the test statistic under \(H_0\)
deduce the probability that the observations occur under \(H_0\): this is called the p-value
conclude: if the p-value is low (usually below \(\alpha=5\)% as a convention), \(H_0\) is unlikely: we say that “\(H_0\) is rejected”. We have that:
\(\alpha = \mathbb{P}_{H_0} (H_0\mbox{ is rejected})\)
\(H_0 \Rightarrow\) theoretical distribution for a given test statistic
then
observed value has a low probability under the theoretical distribution \(\Rightarrow\) \(H_0\) is unlikely
\(\mathbb{P}(\mbox{Type I error}) = \alpha\) (risk)
\(\mathbb{P}(\mbox{Type II error}) = 1-\beta\) with \(\beta\): power
Hence:
the smaller the p-value, the smaller the risk to make an error while rejecting \(H_0\)
results of tests are not just black or white: the p-value gives a degree of confidence in the conclusion
The only way to simultaneously control \(\alpha\) and increase \(\beta\) is to increase the sample size.
Basics on computing sample size
Facts on \(\beta\): \(\beta\) increases when:
\(\alpha\) increases
the sample size increases
the effect size increases (i.e., the alternative hypothesis becomes more distinct from the null hypothesis)
\(\Rightarrow\) defining a priori values for \(\alpha\), \(n\) and the effect size can be used to give an estimate of the power (and reciprocally, fixing a power can be used to target a relevant sample size)
Example output of power.t.test (see ?power.t.test):
Two-sample t test power calculation
n = 25
delta = 1
sd = 1
sig.level = 0.05
power = 0.9337076
alternative = two.sided
NOTE: n is number in *each* group
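The output above can be reproduced with the call below; the same function, given a target power instead of n, returns the required sample size:

```r
# power of a two-sample t test: n = 25 per group, true difference delta = 1,
# common standard deviation sd = 1, risk alpha = 5%
power.t.test(n = 25, delta = 1, sd = 1, sig.level = 0.05)

# conversely: sample size needed to reach a power of 90% in the same setting
power.t.test(power = 0.9, delta = 1, sd = 1, sig.level = 0.05)
```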
from observations, calculate a test statistic \(S_n\)
find the theoretical distribution of the test statistic
using this theoretical distribution, find an interval, IC, with a high probability ( \(1-\alpha\) ) to find the test statistics (as defined in the population) in: \[ \mathbb{P}(S_n \in \textrm{IC}) \geq 1-\alpha \]
Testing \(H_0\): “\(S\) is equal to 0”: \(H_0\) is not rejected \(\Leftrightarrow\) \(0 \in \textrm{IC}\)
comparison of a mean to a given value ( \(H_0\) : the mean of \(X\) is equal to 0): \(t\) test (Student)
\(X\) is supposed to follow a Gaussian distribution (with unknown variance)
the test statistic is: \(T = \frac{\overline{X} - 0}{\sigma_X / \sqrt{n}}\)
under the null hypothesis, the theoretical distribution is the Student distribution with \(n-1\) degrees of freedom
accepting the null hypothesis is equivalent to having the tested value (0 in the example above) included in the confidence interval for the mean of \(X\)
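A minimal R sketch on simulated data:

```r
set.seed(1)
x <- rnorm(30, mean = 0.5)   # sample of size 30

t.test(x, mu = 0)            # one-sample t test of H0: the mean of x is 0
# the printed 95% confidence interval contains 0 exactly when the p-value > 5%
```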
goodness of fit (adéquation) of the distribution of \(X\) to a given distribution (very often, the Gaussian distribution)
no assumption on the distribution of \(X\)
median: Wilcoxon test
normality: Shapiro-Wilk
any theoretical distribution: Kolmogorov-Smirnov, \(\chi^2\) (compare observed frequencies in intervals with theoretical frequencies)
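A minimal R sketch of these tests (note that passing estimated parameters to ks.test, as below, is a common shortcut that slightly biases its p-value):

```r
set.seed(1)
x <- rnorm(100)

wilcox.test(x, mu = 0)                # median (Wilcoxon signed rank test)
shapiro.test(x)                       # normality (Shapiro-Wilk)
ks.test(x, "pnorm", mean(x), sd(x))   # fit to a given distribution (Kolmogorov-Smirnov)
```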
Titanic (males, adults): contingency table and row profiles
Independence means the same level of survival whatever the class (for instance, 20% of survival whatever the class).
independence between the two variables:
\(\chi^2\) test (non parametric) of a contingency table
Fisher exact test (non parametric but limited to small contingency tables and sample sizes)
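A minimal R sketch using the Titanic dataset shipped with R:

```r
tab <- Titanic[, "Male", "Adult", ]   # Class x Survived table for adult males
tab
chisq.test(tab)    # chi-squared test of independence
fisher.test(tab)   # exact test (small contingency tables)
```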
test for correlation/association between two numeric variables
the factor can be seen as a group variable \(\Rightarrow\) these tests are called comparisons of \(K\) samples
tests can be for paired samples or unpaired samples
Comparison of the distribution of \(X\) in group 1 and in group 2 (unpaired samples, non parametric): Kolmogorov-Smirnov test
Comparison of a central characteristic of \(X\) in group 1 and in group 2: \(t\) test (Student; parametric, \(X\) supposed Gaussian) or Mann-Whitney / Wilcoxon rank sum test (non parametric)
Comparison of the variance / dispersion of \(X\) in group 1 and in group 2:
Fisher test ( \(K=2\) ): \(X\) is supposed to be Gaussian (parametric)
Siegel-Tukey test (non parametric, can be used with an ordinal variable)
Comparison of a central characteristic of \(X\) in \(K\) groups: ANOVA (parametric) or Kruskal-Wallis test (non parametric)
Comparison of the variance / dispersion of \(X\) in \(K\) groups: Bartlett test (parametric) or Levene test
GIYF (Google is your friend): search for “choose a statistical test”
not paired
   extra group ID
9    0.0     A  9
12   0.8     B  2
6    3.4     A  6
15  -0.1     B  5
18   1.6     B  8
7    3.7     A  7
paired
  before  after
1  200.1  392.9
2  190.9  393.2
3  192.7  345.1
4  213.0  393.0
5  241.4  434.0
6  196.9  427.9
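A minimal sketch of the corresponding R calls, assuming the two datasets above are stored in data frames d1 (columns extra, group, ID) and d2 (columns before, after):

```r
# unpaired comparison of a central characteristic between groups A and B
t.test(extra ~ group, data = d1)        # Student (Welch) t test
wilcox.test(extra ~ group, data = d1)   # non-parametric alternative

# paired comparison: each row is the same individual measured twice
t.test(d2$before, d2$after, paired = TRUE)
```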
Case where \(Y\) (numeric) is explained by one or several explanatory variables \(X_j\) (all numeric); with a single variable: \[ Y = a + b X + \epsilon \] Very important remark: Testing \(H_0:\ b=0\) is exactly equivalent to testing \(\textrm{Cor}(X,Y) = 0\) (Pearson correlation test)!
\[ Y = a + b_1 X_1 + b_2 X_2 + \epsilon \]
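In R, such models are fitted with lm and inspected with summary; the sketch below produces the output that follows (a simple regression, as shown in its Call):

```r
model <- lm(Sepal.Width ~ Petal.Width, data = iris)   # simple linear regression
summary(model)
```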
Call:
lm(formula = Sepal.Width ~ Petal.Width, data = iris)
Residuals:
Min 1Q Median 3Q Max
-1.09907 -0.23626 -0.01064 0.23345 1.17532
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.30843 0.06210 53.278 < 2e-16 ***
Petal.Width -0.20936 0.04374 -4.786 4.07e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.407 on 148 degrees of freedom
Multiple R-squared: 0.134, Adjusted R-squared: 0.1282
F-statistic: 22.91 on 1 and 148 DF, p-value: 4.073e-06
interpretation: an increase of one unit in Petal.Width corresponds to a decrease of 0.209 unit in Sepal.Width
\(t\) value corresponds to a Student test for \(H_0:\ b = 0\)
this test is equivalent to the (Pearson) correlation test between \(X\) and \(Y\)
Case where \(Y\) (numeric) is explained by one or several explanatory variables \(X_j\) (all numeric): \[ Y = a + b_1 X_1 + b_2 X_2 + b_3 X_3 + \epsilon \]
Call:
lm(formula = Sepal.Width ~ Sepal.Length + Petal.Width + Petal.Length,
data = iris)
Residuals:
Min 1Q Median 3Q Max
-0.88045 -0.20945 0.01426 0.17942 0.78125
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.04309 0.27058 3.855 0.000173 ***
Sepal.Length 0.60707 0.06217 9.765 < 2e-16 ***
Petal.Width 0.55803 0.12256 4.553 1.1e-05 ***
Petal.Length -0.58603 0.06214 -9.431 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3038 on 146 degrees of freedom
Multiple R-squared: 0.524, Adjusted R-squared: 0.5142
F-statistic: 53.58 on 3 and 146 DF, p-value: < 2.2e-16
\(X\) with \(K\) levels is (silently) recoded into 0/1 variables:
if \(X\) is {blue, red}, then the linear model is: \(Y_i = \alpha_b \mathbf{1}_{\{X_i \textrm{ is blue}\}} + \alpha_r \mathbf{1}_{\{X_i \textrm{ is red}\}} + \epsilon\)
or (usually preferred version) \(Y_i = \underbrace{\beta_0}_{\textrm{basal level of }Y} + \underbrace{\beta_r}_{\textrm{additional level when }X_i\textrm{ is red}} \mathbf{1}_{\{X_i \textrm{ is red}\}} + \epsilon\)
with the relation: \(\left\{\begin{array}{l} \alpha_b = \beta_0\\ \alpha_r = \beta_0 + \beta_r \end{array}\right.\)
Interpretation of coefficients:
\(Y_i = \alpha_b \mathbf{1}_{\{X_i \textrm{ is blue}\}} + \alpha_r \mathbf{1}_{\{X_i \textrm{ is red}\}} + \epsilon\): testing \(H_0:\ \alpha_r = \alpha_b\) is exactly equivalent to an ANOVA (or Student test) of \(Y\) between the two groups of \(X\) (\(Y \sim X\))
\(Y_i = \underbrace{\beta_0}_{\textrm{basal level of }Y} + \underbrace{\beta_r}_{\textrm{additional level when }X_i\textrm{ is red}} \mathbf{1}_{\{X_i \textrm{ is red}\}} + \epsilon\): testing \(H_0:\ \beta_r = 0\) is also exactly equivalent
Call:
lm(formula = Sepal.Width ~ Species, data = iris)
Residuals:
Min 1Q Median 3Q Max
-1.128 -0.228 0.026 0.226 0.972
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.42800 0.04804 71.359 < 2e-16 ***
Speciesversicolor -0.65800 0.06794 -9.685 < 2e-16 ***
Speciesvirginica -0.45400 0.06794 -6.683 4.54e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3397 on 147 degrees of freedom
Multiple R-squared: 0.4008, Adjusted R-squared: 0.3926
F-statistic: 49.16 on 2 and 147 DF, p-value: < 2.2e-16
\(X\) with \(K\) levels is recoded automatically into ( \(K-1\) ) 0/1 variables:
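The global test of the whole factor can be obtained with drop1 (a sketch producing the output below):

```r
model <- lm(Sepal.Width ~ Species, data = iris)
drop1(model, test = "F")   # F test of removing the whole Species factor
```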
Single term deletions
Model:
Sepal.Width ~ Species
Df Sum of Sq RSS AIC F value Pr(>F)
<none> 16.962 -320.95
Species 2 11.345 28.307 -248.13 49.16 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
residuals ( \(\epsilon\) ) are uncorrelated, have a common variance (homoscedasticity), are uncorrelated with \(X\), and are Gaussian
the number of observations is larger (much larger is better) than the number of variables
Call:
glm(formula = Species ~ Sepal.Width, family = binomial(link = logit),
data = iris)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 20.230 4.165 4.857 1.19e-06 ***
Sepal.Width -6.552 1.350 -4.853 1.22e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 138.629 on 99 degrees of freedom
Residual deviance: 71.421 on 98 degrees of freedom
AIC: 75.421
Number of Fisher Scoring iterations: 6
The model is \[ \log\left[\frac{\mathbb{P}(Y = \textrm{versicolor})}{1-\mathbb{P}(Y = \textrm{versicolor})}\right] = a + b X \]
\(b\) is also interpretable: \(\exp(b)\) is the odds ratio of the outcome when \(X\) increases by one unit (or when \(X\) is 1 if \(X\) is a binary variable)
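A sketch reproducing a fit like the one above; the 100 observations and the sign of the slope suggest that only the first two species (setosa vs versicolor) were kept, which is an assumption here:

```r
iris2 <- droplevels(iris[1:100, ])   # assumed subset: setosa vs versicolor
model <- glm(Species ~ Sepal.Width, family = binomial(link = logit),
             data = iris2)

exp(coef(model)["Sepal.Width"])   # odds ratio for a one-unit increase of Sepal.Width
```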
Framework: Suppose you are performing \(G\) tests at level \(\alpha\), \(\mathbb{P}(\mbox{at least one FP if }H_0\mbox{ is always true}) = 1 - (1-\alpha)^G\)
Ex: for \(\alpha=5\)% and \(G=20\) , \(\mathbb{P}(\mbox{at least one FP if } H_0\mbox{ is always true}) \simeq 64\) %!!!
Ex: For more than 75 tests and if \(H_0\) is always true, the probability to have at least one false positive is very close to 100%!
Number of decisions for \(G\) independent tests:
Instead of the risk \(\alpha\), control:
familywise error rate (FWER): FWER \(= \mathbb{P}(U>0)\), where \(U\) is the number of false positives among the \(G\) tests (i.e., probability to have at least one false positive decision)
false discovery rate (FDR): FDR \(= \mathbb{E}(Q)\) with \(Q = \left\{ \begin{array}{cl} U/R & \mbox{if }R>0\\ 0 & \mbox{otherwise} \end{array} \right.\), where \(R\) is the total number of rejections
Settings: p-values \(p_1\), …, \(p_G\) (e.g., corresponding to \(G\) tests on \(G\) different genes)
adjusted p-values are \(\tilde{p}_1\), …, \(\tilde{p}_G\) such that rejecting the tests with \(\tilde{p}_g < \alpha\) ensures (depending on the adjustment procedure) \[ \qquad \left\{ \begin{array}{l} \mathbb{P}(U > 0) \leq \alpha \quad \mbox{(FWER control)}\\ \mathbb{E}(Q) \leq \alpha \quad \mbox{(FDR control)} \end{array} \right.\]
Calculating adjusted p-values (for independent tests)
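In R, p.adjust implements the usual procedures; a minimal sketch with made-up p-values:

```r
pvals <- c(0.001, 0.008, 0.039, 0.041, 0.160, 0.380)   # hypothetical p-values

p.adjust(pvals, method = "bonferroni")   # controls the FWER
p.adjust(pvals, method = "BH")           # Benjamini-Hochberg: controls the FDR
```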
When using a multivariate model like:
\[ Y = a + b_1 X_1 + b_2 X_2 + b_3 X_3 + \epsilon \]
you usually perform the tests in two steps:
step 1: test the full model against the null model \(Y = a + \epsilon\);
step 2: if test at step 1 is significant, test each coefficient using a correction called Tukey’s Honestly Significant Difference (Tukey’s HSD).
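For a model where the explanatory variable is a factor, a minimal sketch of the two steps (using aov, whose fit is equivalent to lm):

```r
model <- aov(Sepal.Width ~ Species, data = iris)

summary(model)    # step 1: global F test of the full model against the null model
TukeyHSD(model)   # step 2: pairwise group comparisons with Tukey's HSD correction
```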
Main objective: summarize a large number of numeric variables using a small number of combinations of these variables
\(n \textrm{ individuals } \left\{ \begin{array}{c} ...\\ ...\\ \underbrace{...}_{p \textrm{ variables}} \end{array} \right.\)
data: \(X = \left( \begin{array}{cc} 1 & 3 \\ 2 & 4 \\ 2 & 1 \end{array}\right)\) can be represented by:
But what can we do if more than 2 or 3 columns?
PCA:
tries to maximize variability in the projection.
creates components that are linear combinations of the original variables
Data: 50 individuals, 3 variables
Example 1: no correlation between variables
Example 2: linear correlation between x1 and x2
Example 3: linear correlation between x1 and x2 and x3
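Beyond these simulated examples, a minimal R sketch of PCA with prcomp on iris:

```r
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)   # PCA on scaled variables

summary(pca)    # proportion of variance captured by each component
pca$rotation    # components as linear combinations of the original variables
biplot(pca)     # individuals and variables on the first two components
```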
Purpose: group individuals that look alike
Depends on:
the number of groups
what “look alike” means (distance choice)
Here: two types of clustering (HC and \(k\)-means)
Is based on:
a distance between individuals (usually, Euclidean distance)
linkage (distance between groups of individuals): common linkage is Ward’s linkage
Is iterative:
each individual is a group
find the two closest groups and merge them into a new group (for Ward's: minimize the loss of within-group variability)
end when there is only one group with all individuals
Distances between individuals are read following branches:
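A minimal R sketch producing such a dendrogram:

```r
d <- dist(iris[, 1:4])                # Euclidean distances between individuals
hc <- hclust(d, method = "ward.D2")   # hierarchical clustering with Ward's linkage

plot(hc)                  # dendrogram
groups <- cutree(hc, 3)   # cut the tree into 3 clusters
```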
HC provides a solution for any number of clusters but can be very difficult to use if \(n\) is large
\(k\)-means requires that the number of clusters is chosen in advance but it is faster
\(k\)-means is stochastic: it gives different solutions at each run depending on the initialization
HC with Ward’s linkage can be seen as an approximate solution of the “best” \(k\)-means
In practice: use HC first and initialize \(k\)-means with HC
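A sketch of this strategy, reusing the hc tree from the previous sketch:

```r
clusters <- cutree(hc, 3)                                       # HC solution
centers <- aggregate(iris[, 1:4], list(clusters), mean)[, -1]   # group means

km <- kmeans(iris[, 1:4], centers = centers)   # k-means started from the HC centers
table(km$cluster, clusters)                    # compare the two partitions
```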
Heavily inspired by Sébastien Déjean's previous version of the class.
With ideas taken from http://r-graph-gallery.com/ as well (for the part on graphics)
slide 31: contingency table from Wikimedia Commons, by ASnieckus “Table of gender by major.png”
slide 46: dot plot from http://www.sthda.com/english/wiki/wiki.php?id_contents=7868
slide 47: barplot from ggplot2 documentation
slide 48: scatterplot matrix from http://www.sthda.com/english/wiki/scatter-plot-matrices-r-base-graphs
slide 64: Galton boxes from Wikimedia Commons, by Marcin Floryan “Galton_Box.svg” and Antoine Taveneaux “Planche_de_Galton.jpg”
slide: \(k\)-means clustering from Wikimedia Commons, by Mquantin https://commons.wikimedia.org/w/index.php?curid=61321400