How to shuffle an array
Nov. 24, 2021
Suppose we have an array of N elements and we need to reorder it randomly. That is a common scenario in machine learning problems when we want to extract a random sample of observations from a dataset. Another scenario is when we want to divide the dataset into random blocks, for example train and test subsets. A solution is to shuffle the dataset and take n sequential elements for each block.
To shuffle a dataset, we can start with an array of N sequential numbers that represents the order of the dataset observations. We shuffle that array and next, we use it as indices to retrieve the observations. A pseudocode solution is shown below.
Input:
D : dataset
N : number of observations
Output:
D' : suffled dataset
x := 1..N
x' := shuffle(x)
D' := D[x'[1]], D[x'[2]], ..., D[x'[N]]
return D'
Therefore, we need a shuffle() function to efficiently shuffle the array of indices. For that, we can use the Fisher-Yates algorithm (1948) that runs in O(n) time. It uses a random number r between 1 and N-1 to swap the element at position N with the element at position r. As a result, the element at position N is already shuffled. The procedure is recursively repeated for the N-1 remaining elements.
Function: shuffle
Input:
N : array size
Output:
x : shuffled array
x := 1..N
i := N
while i > 1:
r := random(i - 1)
v := x[i]
x[i] := x[r]
x[r] := v
i := i - 1
return x
The implementation in C of the shuffle() function is presented below. A random number generator from the GNU Scientific Library (GSL) has been used. Indices have been adjusted because in C arrays start at 0 instead of 1.
#include <gsl/gsl_rng.h>
int *shuffle(int N, int seed) {
int i;
int r, v;
int *x = malloc(sizeof(int) * size);
/* Random number generator allocation */
gsl_rng *rndgen = gsl_rng_alloc(gsl_rng_taus);
gsl_rng_set(rndgen, seed);
/* Vector initialization */
for (i = 0; i < size; i++)
x[i] = i;
/* Shuffling algorithm */
for (i = size - 1; i > 0; i--) {
r = gsl_rng_get(rndgen) % i;
v = x[r];
x[r] = x[i];
x[i] = v;
}
/* Cleanning */
gsl_rng_free(rndgen);
return x;
}
References:
- Fisher, R. & Yates, F. (1948). Statistical tables for biological, agricultural, and medical research (3rd. Ed.). London: Oliver & Boyd. pp. 26-27.
- Fisher–Yates shuffle. In Wikipedia. https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle