In short: PHP's array_rand function picks elements randomly, but their
          relative ordering in the output is completely predictable.
          This predictability was broken by an ancient and buggy bugfix
          which makes array_rand shuffle the output in a very chaotic way.

This is a review of array_rand() from php-5.2.9/ext/standard/array.c.
It analyzes an issue in the function which appears if $num_req > 1.
It is based on a posting on php.net which points out the problem.

There, it was demonstrated that although array_rand chooses each element
randomly, the order in which the elements appear in the result is hihgly
non-uniform. Initial elements tend to appear first, and similarly elements
at the end of the array appear last. Often the result's keys appear sorted.

The algorithm used in array_rand is both simple and ingenious at the same time.
It does a single pass over the input array left-to-right, and in each step it
uses rand() to decide whether to add the current element to the output.
The probability of this occuring is equal to nRequested / nRemaining.
As nRemaining decreases on each loop, the fraction will ultimately reach 1
(unless the requested number of elements has been exhausted already).
At this point the algorithm enters an implicit 'panic mode' where it will
accept each and every remaining element. Then the loop ends. And then there's
a final step where the result array is shuffled if a special condition is met.

It should be possible to prove that the main part of the algorithm preserves
the order of the input elements when generating the output array, and for
some reason also satisifies the condition that each entry is picked at random.

Regarding tests, the original poster provided a sample script to demonstrate
the issue with the output ordering. Here it is in a version modified for
$req_num = 3 instead of 2, since it demonstrates the problem a bit better.

$n = 1000000; $k = 3; $a = array(0,1,2,3,4); $b = array(); for( $i = 0; $i < sizeof($a); $i++ ) $b[$i] = array(0,0,0); // counters for( $i = 0; $i < $n; $i++ ) { $keys = array_rand($a,$k); $b[$keys[0]][0]++; $b[$keys[1]][1]++; $b[$keys[2]][2]++; } for( $i = 0; $i < sizeof($b); $i++ ) printf("%d: %04.1f%% %04.1f%% %04.1f%% | %04.1f%%\n", $i, 100*$b[$i][0]/$n/$k, 100*$b[$i][1]/$n/$k, 100*$b[$i][2]/$n/$k, 100*($b[$i][0]+$b[$i][1]+$b[$i][2])/$n/$k);
For a sufficient amount of trials, it produces the following outcome: 0: 13.3% 03.2% 03.2% | 19.8% 1: 06.8% 09.9% 03.3% | 20.1% 2: 03.2% 10.3% 06.8% | 20.3% 3: 03.2% 03.4% 13.6% | 20.1% 4: 06.8% 06.5% 06.5% | 19.8% These test data demonstrate that every element has a very likely chance of appearing in the same position in the output that it has in the input, relative to the other values. Also note the anomaly in the last position: it has the most uniform values than all other positions. Tests show that it is not related to divisibility since it occured for all tested input and output array sizes. What follows is a chunk of php source code related to the problem.
/* {{{ proto mixed array_rand(array input [, int num_req]) Return key/keys for random entry/entries in the array */ PHP_FUNCTION(array_rand) { zval **input, **num_req; long randval; int num_req_val, num_avail, key_type; char *string_key; uint string_key_len; ulong num_key; HashPosition pos; if (ZEND_NUM_ARGS() < 1 || ZEND_NUM_ARGS() > 2 || zend_get_parameters_ex(ZEND_NUM_ARGS(), &input, &num_req) == FAILURE) { WRONG_PARAM_COUNT; } if (Z_TYPE_PP(input) != IS_ARRAY) { php_error_docref(NULL TSRMLS_CC, E_WARNING, "First argument has to be an array"); return; } num_avail = zend_hash_num_elements(Z_ARRVAL_PP(input)); if (ZEND_NUM_ARGS() > 1) { convert_to_long_ex(num_req); num_req_val = Z_LVAL_PP(num_req); if (num_req_val <= 0 || num_req_val > num_avail) { php_error_docref(NULL TSRMLS_CC, E_WARNING, "Second argument has to be between 1 and the number of elements in the array"); return; } } else num_req_val = 1; /* Make the return value an array only if we need to pass back more than one result. */ if (num_req_val > 1) { array_init(return_value); } /* We can't use zend_hash_index_find() because the array may have string keys or gaps. */ zend_hash_internal_pointer_reset_ex(Z_ARRVAL_PP(input), &pos); while (num_req_val && (key_type = zend_hash_get_current_key_ex(Z_ARRVAL_PP(input), &string_key, &string_key_len, &num_key, 0, &pos)) != HASH_KEY_NON_EXISTANT) { randval = php_rand(TSRMLS_C); if ((double)(randval/(PHP_RAND_MAX+1.0)) < (double)num_req_val/(double)num_avail) { /* If we are returning a single result, just do it. */ if (Z_TYPE_P(return_value) != IS_ARRAY) { if (key_type == HASH_KEY_IS_STRING) { RETURN_STRINGL(string_key, string_key_len-1, 1); } else { RETURN_LONG(num_key); } } else { /* Append the result to the return value. */ if (key_type == HASH_KEY_IS_STRING) add_next_index_stringl(return_value, string_key, string_key_len-1, 1); else add_next_index_long(return_value, num_key); } num_req_val--; } num_avail--; zend_hash_move_forward_ex(Z_ARRVAL_PP(input), &pos); } if (num_req_val == num_avail) { array_data_shuffle(return_value TSRMLS_CC); } } /* }}} */
Here's a PHP implementation of PHP's internal array_rand() function: (code to handle the case $num_val = 1 is not interesting and was omitted)
function php_array_rand(array $input, $num_req_val = 1) { if( $num_req_val < 2 ) return false; // not interested in this case $num_avail = count($input); $return_value = array(); foreach( $input as $pos => $value ) { if( $num_req_val == 0 ) break; $randval = rand(); if( $randval/(getrandmax()+1) < $num_req_val / $num_avail ) { $return_value[] = $value; $num_req_val--; } $num_avail--; } if( $num_req_val == $num_avail ) shuffle($return_value); return $return_value;
This code produces identical statistics to that of the builtin array_rand. This allows us to experiment with it. For example, this is how the statistics look like if the final 'shuffle' phase is removed: 0: 19.9% 00.0% 00.0% | 19.9% 1: 10.1% 10.0% 00.0% | 20.1% 2: 03.3% 13.3% 03.3% | 19.9% 3: 00.0% 10.1% 09.9% | 20.1% 4: 00.0% 00.0% 20.1% | 20.1% In this case every sequence generated has its element order preserved. If we remove the condition that prevents the shuffle() function from being called, we get a nice uniform distribution. So either way looks good enough. After some more research, I found the cause of this chaotic behavior. It was this ancient php bugreport from october 2000 where a nub user complained that array_rand($array, count($array)) was returning the same array, and was not "random". Here both the user and the assigned developer should have understood that it's not this function's task to "randomize the output". The only thing array_rand should do is to "pick K elements out of N at random". However, they failed to do so and so the following patch patch was applied to PHP. @- Fixed array_rand() to shuffle results when the number of requested @ elements is the same as the number of elements in the array. (Andrei)
--- /repository/php-src/ext/standard/array.c 2000/10/25 17:40:10 1.79 +++ /repository/php-src/ext/standard/array.c 2000/10/27 14:08:33 1.80 @@ -20,7 +20,7 @@ +----------------------------------------------------------------------+ */ -/* $Id: array.c,v 1.78 2000/10/22 11:18:21 venaas Exp $ */ +/* $Id: array.c,v 1.79 2000/10/25 17:40:10 andrei Exp $ */ #include "php.h" #include "php_ini.h" @@ -2764,6 +2764,13 @@ num_avail--; zend_hash_move_forward(Z_ARRVAL_PP(input)); } + + if (num_req_val == num_avail) { + if (zend_hash_sort(Z_ARRVAL_P(return_value), (sort_func_t)mergesort, array_data_shuffle, 1) == FAILURE) { + zval_dtor(return_value); + RETURN_FALSE; + } + } } /* }}} */
Not only is this patch wrong on the design level, it also fails on the implementation level: the check that was added uses variables that change during execution, and reach the same value ('0') in random cases, even those which this patch was not intended for. I have submitted my findings to the php bugtracker as bug #48224. Here is a cvs diff/patch for current CVS revision of php:
Index: array.c =================================================================== RCS file: /repository/php-src/ext/standard/array.c,v retrieving revision 1.475 diff -r1.475 array.c 4117,4120d4116 < < if (num_req == num_avail) { < php_array_data_shuffle(return_value TSRMLS_CC); < } >
However, it looks like this bug that makes array_rand randomize the array if you request all elements will have to become a feature due to backwards compatibility. UPDATE: The bugreport has been marked as 'Closed' and a change change was applied that removed the offending lines. Affects php versions 6, 5.3 and 5.2. Yay. Author: umage <umage@netvor.sk> Created: 10. may 2009 Modified: 29. sep 2009 (bug fixed) Modified: 16. may 2011 (fixed php repository urls after cvs -> svn migration)