Consider a model of computation (MOC) similar to the common PC (a 'RAM'), where every operation is associated with a cost bounded by a constant independent of what the operand registers contain. If we, in our computational model, do not bound the word size of this machine, we can do things like linear-time sorting and solving NP-hard problems in linear time. We do this simply by packing enough values into our registers, allowing us to even "perform an exponential number of operations in one step".
Worse yet, owing to the MOC's ease of analysis and similarity to algorithms for modern PC's, it is today the standard in papers analyzing polynomial-time algorithms. If you don't "cheat", the difference with, e.g., the bounded-word RAM or the logarithmic-cost model is just a logarithmic factor and also what we see in practice. As a theorist, I find this situation disturbing, however. Papers on, e.g., sorting clearly state the word size they are considering, I presume, for this very reason.
In other words, the constant-cost unbounded-word RAM cannot be simulated by a universal Turing machine in polynomial time.
Before we go to the proof sketch, let's give an example of the algorithm's progress.
Let the input be 4, 1, 7, 3, and 5. The maximum is 7, so we can represent the numbers with the
binary strings 100, 001, 111, 011, and 101, respectively.
We construct the following padded string. The inputs are surrounded by parentheses for ease of reading.
1( 100 )1( 001 )1( 111 )1( 011 )1( 101 )
We rotate four times and set the padded 1s to 0s
0( 001 )0( 111 )0( 011 )0( 101 )0( 100 )
0( 111 )0( 011 )0( 101 )0( 100 )0( 001 )
0( 011 )0( 101 )0( 100 )0( 001 )0( 111 )
0( 101 )0( 100 )0( 001 )0( 111 )0( 011 )
Subtract these strings from the original string, one by one.
1( 011 )0( 010 )1( 100 )0( 110 )1( 001 )
0( 101 )0( 110 )1( 010 )0( 111 )1( 100 )
1( 001 )0( 100 )1( 011 )1( 010 )0( 110 )
0( 111 )0( 101 )1( 110 )0( 100 )1( 010 )
The 1s outside the parentheses are the interesting bits. We get rid of the garbage and shift everything to the right three steps.
0( 001 )0( 000 )0( 001 )0( 000 )0( 001 )
0( 000 )0( 000 )0( 001 )0( 000 )0( 001 )
0( 001 )0( 000 )0( 001 )0( 001 )0( 000 )
0( 000 )0( 000 )0( 001 )0( 000 )0( 001 )
Note that these 1s indicate for which pairs of input numbers, the first is strictly smaller than the second.
We sum all four of them together to get a tally of the numbers strictly smaller than the first.
0( 010 )0( 000 )0( 100 )0( 001 )0( 011 )
In other words, '4' is strictly greater than 010b = 2 of the inputs, '1' is strictly greather than 000b = 0 etc.
We therefore place ''4' in the third "slot", '0' in the first etc.
0( 000 )0( 001 )0( 011 )0( 100 )0( 111 )
Finally, we can read and give as input the sorted sequence, 1 3 4 5 7.
Now to describe the above algorithm. If you're confident that the above procedure works, you may skip to Example 2.
Proof by algorithm construction. Let be given an arbitrary sequence of integers x1, ..., xn. Let the size of the input, in bits, be N, where n ≤ N ∈ O(n + log max xi). We show that there is an O(N) algorithm for this problem. If we assume that the above numbers are given in n different registers, this gives O(n) time instead. (Luite Stegeman notes that radix sort is O(N) in both cases.)
We construct n binary strings Si which looks like the following, with the xis padded to m-1 bits each.
s1 = 1 x1 1 x2 ... 1 xn,
s2 = 1 x2 1 x3 ... 1 x1,
s3 = 1 x3 1 x4 ... 1 x2,
...
sn-1 = 1 xn-1 1 xn-1... 1 xn-2,
sn = 1 xn 1 x1... 1 xn-1,
How to do this in linear time? Identify the number of bits used by the largest xi and call it m. Construct the first string by, for each k, OR-ing the k:th input number shifted (n-k-1) * m steps to the left, onto the first string, and similarly for the 1s. This is uses a total number of operations linearly bounded. We can produce the k:th string by shifting the (k-1):th string m steps to the left. This will cause m zeroes to appear on the right. We can move the leftmost integer and a 1 to this place by ANDing out the leftmost appearance and ORing on the right number and a 1. Hence we can get every subsequent string for constant work each, a total of linear work.
Let us refer to xk in s1 as the number in the k:th
position in s1. Similarly, we let the number in the k:th position in si
refer to x(i+k-2)%n+1. We name the following binary string 'the mask'.
The mask has a 1 in the n times m bits used by our input numbers and 0s elsewhere.
m = 1 0...0 1 0...0 ... 1 0...0
We take our second string, subtract the mask from it to get rid of the extra 1s, and then subtract it from the first string. In the
new string, the 1 on the left of the k:th position in the first string will have disappeared if and only if the number in
the k:th position of the second string is strictly greater than xk.
We do the same for every other string, i.e. subtracting the mask, subtracting the result from the first string, and then masking the
new string to get the 0s and 1s. All of this in total linear time. We shift the new strings to the right m steps and the 0s and 1s will be
the elements of the new strings. We add all of the new strings together which will produce a tally string where the k:th
element counts the number of the input strings i ≠ k strictly smaller than xk.
We prepare a sorted string in the following way. Set the sorted string to 0. For every k, we read the k:th element by shifting and ANDing. We multiply this number by m,
shift xk this many bits to the left, and OR the result onto the sorted string. We now have easy access
to the sorted numbers and can produce other output formats, e.g. one of length O(N). All in all, we got away with paying only O(N).