How to shuffle an array

Nov. 24, 2021

Suppose we have an array of N elements and we need to reorder it randomly. That is a common scenario in machine learning problems when we want to extract a random sample of observations from a dataset. Another scenario is when we want to divide the dataset into random blocks, for example train and test subsets. A solution is to shuffle the dataset and take n sequential elements for each block.

To shuffle a dataset, we can start with an array of N sequential numbers that represents the order of the dataset observations. We shuffle that array and next, we use it as indices to retrieve the observations. A pseudocode solution is shown below.

Input:
    D  : dataset
    N  : number of observations
Output:
    D' : suffled dataset

x  := 1..N
x' := shuffle(x)
D' := D[x'[1]], D[x'[2]], ..., D[x'[N]]

return D'

Therefore, we need a shuffle() function to efficiently shuffle the array of indices. For that, we can use the Fisher-Yates algorithm (1948) that runs in O(n) time. It uses a random number r between 1 and N-1 to swap the element at position N with the element at position r. As a result, the element at position N is already shuffled. The procedure is recursively repeated for the N-1 remaining elements.

Function: shuffle

Input:
    N  : array size
Output:
    x : shuffled array

x := 1..N
i := N

while i > 1:
  r := random(i - 1)
  v := x[i]
  x[i] := x[r]
  x[r] := v
  i := i - 1

return x

The implementation in C of the shuffle() function is presented below. A random number generator from the GNU Scientific Library (GSL) has been used. Indices have been adjusted because in C arrays start at 0 instead of 1.

#include <gsl/gsl_rng.h>

int *shuffle(int N, int seed) {
    int i;
    int r, v;
    int *x = malloc(sizeof(int) * size);

    /* Random number generator allocation */
    gsl_rng *rndgen = gsl_rng_alloc(gsl_rng_taus);
    gsl_rng_set(rndgen, seed);

    /* Vector initialization */
    for (i = 0; i < size; i++)
        x[i] = i;

    /* Shuffling algorithm */
    for (i = size - 1; i > 0; i--) {
        r = gsl_rng_get(rndgen) % i;
        v = x[r];
        x[r] = x[i];
        x[i] = v;
    }

    /* Cleanning */
    gsl_rng_free(rndgen);
    return x;
}

References:

Jaime López
Centereach, NY

Sections:

Profiles: