Minimizing Worst-Case Complexity

d-viz

I recently received a very interesting submission to the Typescript Programming Test Problems Repo. The problem was given to a senior developer as part of a homework exercise in which he was supposed to analyze the problem, prepare a brief presentation, and then present the results at the on-site interview. This is, of course, a very realistic simulation of the type of work environment in which the candidate may have to perform. I also thought the problem was interesting in that it dealt with a topic that is not often considered when analyzing complexity, namely minimizing worst-case complexity.

Problem: You are given an array, A, of numbers of length N. You are also given a function, F, that accepts a single number as an argument and it returns a boolean. The function body is not relevant; it returns true if a certain condition is met and false otherwise. It is known that if the function returns true for A[j], j = 0, 1, … N-1, then it returns true for j+1, … N-1. Suggest a strategy that identifies the smallest index, j, such that F( A[j] ) is true with no more than two calls to F where the function returns true. F may return false as many times as needed for the algorithm to work. Return -1 if no such index exists. The strategy for this problem should be designed to minimize the worst-case complexity. Your presentation should include at least detailed pseudocode for your suggested algorithm (bonus points for actual code and specs).

Although the stated problem may sound highly artificial, I did work on an application some time ago in which a complex true/false test was involved. In some cases, a couple conditions could be tested and false could be quickly proven.  Proving true or false was much more computationally intense in absence of those conditions. We ended up strategically re-ordering the data so that it was possible to eliminate large chunks of data from the analysis. So, the problem may be less outlandish that it appears on first reading.

The person submitting the problem came to the conclusion that the ‘two true returns’ requirement be relaxed to three and use binary search to identify a smaller interval, followed by a linear search to locate the exact index.

It is natural for a good programmer to think of binary search when scanning an array for a condition in which an order relationship is known to hold in advance. In terms of worst case for the stated problem, it will not generally be possible to locate the required index and make no more than two calls to F where the return is true.

For purposes of the discussion, array elements are referred to as first, second, … Nth in terms of one-based indices, so aj refers to the jth element, j = 1, 2, … N. The function is referenced in lowercase, i.e. f().

A simple implementation of a linear search is as follows:

1 – Test aN. If f returns false, then return -1

2 – Otherwise, start at a1 and test each element in sequence through N-1. If f returns true for any index, return either that index or -1 if all array elements test false.

In terms of worst-case complexity, the worst case is when f(aN-1) is true. Although only one true return is required, all N array elements are tested.

The linear search may be improved by using a constant step size, k. In this case, the array is broken into intervals and a ‘forward sweep’ is performed by skipping over k elements each test. The goal is to find any array element in which the test is true by ‘leaping’ over many in which it is false. Once a true test is discovered, the previous k-1 array elements must be tested in sequence to isolate the first index for which the test is true.

This approach is guaranteed to have no more than two true results and it will always locate the required index. The worst case for this algorithm occurs when the minimum array index is next to last. Let’s temporarily presume that k is chosen to divide N evenly so that the chosen step size produces exactly N/k = Nk-1 intervals. The expected number of function evaluations, e(k), is given by

e(k) = Nk-1 + k – 1

Although this is a problem in discrete math, we can use a continuous model to obtain some insight into the best choice for k. A necessary condition for a minmimum of e is that the first derivative is zero,

e'(k) = -Nk-2 + 1 = 0 => k-2 = N-1

or k = sqrt(N).

The second derivative, e”(k), is 2Nk-3 which is everywhere positive for k > 0, which indicates that the extreme point is a minimum. Now, it may be that sqrt(N) is not exactly integral; take N = 50, for example. sqrt(50) ~ 7.07, so we might surmise that k = 7 is a good choice, although this leaves an interval of width 1 at the end. The ones-based array elements tested are 7, 14, 21, 28, 35, 42, 49, 50 (there are actually 8 intervals). The worst case occurs at array element 48. This requires testing 7 (F), 14 (F), 21 (F), 28 (F), 35 (F), 42 (F), 49 (T), 43 (F), 44 (F), 45 (F), 46 (F), 47 (F), 48 (T), or a total of 13 tests.

Note that we could have taken the ceiling of sqrt(N) in this case and achieved the same result. With k = 8, the worst-case scenario is the first true test at the 47-th element. Tests are 8 (F), 16 (F), 24 (F), 32 (F), 40 (F), 48 (T), 41 (F), 42 (F), 43 (F), 44 (F), 45 (F), 46 (F), 47 (T), for a total of 13.

Although this heuristic provides a ‘good’ value for the constant step-size model, nothing has been shown to indicate that this is the best possible model out of all possible models for this problem. Consider, for example, the following variable step-size model,

ki = ki-1 + d

where d is a non-zero integer. In this case, a starting value of k is chosen and then subsequent step sizes become larger or smaller by a fixed amount. For d > 0, the first interval is of width k, the second is of width k+d, the third is of width k+2d, etc. This trades jumping through large parts of the array that hopefully return false for an increasing linear search interval. The latter tradeoff is the problem with such a model in terms of worst-case scenario. Consider an example with N = 50, k = 8, and d = 1. Initial elements tested are 8, 17, 27, 38, 50. The worst-case scenario is when the first true result occurs at the 49th element. This requires 5 tests to locate the interval, followed by testing elements 39 through 49 in sequence for a total of 16 tests. The situation does not improve by increasing the initial value. Take 10, for example. Initial elements tested are 10, 21, 33, 46, 50. Worst case scenario is at element 45 for a total of 4 + 13 or 17 tests.

While not a conclusive proof, these counter-examples illustrate the problem with positive step-size increases, even with the smallest-possible value of d = 1. The problem is only exacerbated with larger values of d, so let’s discard that part of the model and consider negative values of d. Again, we begin with the smallest-magnitude value, d = -1, which yields intervals of widths k, k-1, k-2, k-3, …

Suppose that k is chosen so that there are exactly m intervals (and this may not always be true in the general case).

In that instance,

k + (k-1) + (k-2) + (k-3) + … + (k-m-1) = N

mk – (1 + 2 + … + (m-1)) = N

The second term on the left-hand side is the sum of the first m-1 integers which is (m-1)m/2 , so

mk – ((m-1)m)/2 = N or mk – m2 + m = 2N

This yields a quadratic equation just to isolate the number of intervals and that is only in the ideal case. Analyzing the optimal choice for initial k with a continuous model is a rabbit hole for which we do not wish to discover the depth :)

There is some potential value with this model as we could start with a larger choice of k than sqrt(N). Since the interval width decreases by one each step, each successive ‘forward’ step is balanced by a corresponding reduction in the number of sequential tests required in that interval. The ideal starting value of k, however, is still an open question.

In many cases such as this, we can gain some insight by a analyzing a good choice for successive values of N and see if a pattern emerges. If such a heuristic holds for expected values of N, then we have a reasonable starting point for the problem that is commensurate with a proper timebox for this type of analysis. Remember that the goal is to investigate a problem and prepare a report, not write a paper for a referred journal. Demonstrating how you timebox and apply yourself to such an analysis is likely to be an important part of the interview process.

For N = 1, k = 1. For N = 2 and 3, k = 2. The following sequence uses a value of k that is optimal based on explicit enumeration. The interval width is decreased by one until it is not possible to decrease any further without exceeding N or the interval width is reduced to one.  There are as many of these singleton intervals as needed to reach the end of the array.

e* is the expected number of worst-case function evaluations to solve the problem.

N = 4
k = 2
intervals: 2, 1, 1
e* = 3

—–

N = 4
k = 3
intervals: 3, 1
e* = 3

N = 5
k = 3
intervals: 3, 2
e* = 3

N = 6
k = 3
intervals: 3, 2, 1
e* = 3

—–

N = 7
k = 4
intervals: 4, 3
e* = 4

N = 8
k = 4
intervals: 4, 3, 1
e* = 4

N = 9
k = 4
intervals: 4, 3, 2
e* = 4

N = 10
k = 4
intervals: 4, 3, 2,1
e* = 4

—–

N = 11
k = 5
intervals: 5, 4, 2
e* = 5

N = 12
k = 5
intervals: 5, 4, 3
e* = 5

.
.
.

N = 20
k = 6
intervals: 6, 5, 4, 3, 2
e* = 6

.
.
.

N = 50
k = 10
intervals: 10, 9, 8, 7, 6, 5, 4, 1
e* = 10

The noticeable pattern from these enumerations is that k is to be chosen such that the sum of the first k integers is greater than or equal to N. This value also happens to be the number of worst-case evaluations to locate the required index.

No proof is supplied; the analysis is completely heuristic. In any event,

k(k+1)/2 ≥ N

k2 + k ≥ 2N or k2 + k – 2N ≥ 0    [1]

If N = 100, for example, k = 14, which yields e* = 14. This is better than e* = 19, which is the best result from a constant step size. In general, one step of refinement of a direct solution may be required to satisfy the inequality since k is constrained to be integral.

The next consideration is a variable step size model. Choose some initial interval width, k0. The i-th interval width is given by ki = ki-1 + di, where di is a nonero integer. You should have observed by now that the worst-case complexity associated with each interval is the number of steps to get to the interval endpoint plus the interval width minus one. Based on this observation, k0 should be less than that indicated by eq. [1]. No subsequent interval width could equal the optimal value of k from [1], otherwise the worst-case complexity would be no better than the (constant) decreasing interval model. With this restriction on k0, the number of steps to reach the final interval increases, which means that it is likely that no variable step size could improve upon the minimal worst-case complexity from the constant decreasing step-size model. This is an intuitive argument – not a prooof – which is suitable for an initial presentation on a topic such as this one.

The conclusion of this brief analysis is that the constant, decreasing step-size model is the best one to apply for the task of minimizing worst-case complexity for the stated problem. The bonus code is available from the TS Programming Test Problems Github repo.

I sure hope the bonus is a Starbucks gift card :)

Comments are closed.