## Recitation 08: Big-O and Sorting

### Runtimes and Big-O

Last week, as we discussed two different searching algorithms, it was pretty clear that the runtime for each was very different. Ultimately, we reasoned that for an input of size $n$, linear search ran $\propto n$ and binary search $\propto \log_2(n)$. Today, we will look at a way to generalize analysis of the runtime of a particular function in terms of the size of its inputs.

Big-O analysis is a way of reasoning about the asymptotic runtime of a function. If we really wanted to, we could count operations to find a $T(n)$ expression for some function. One of the important aspects of Big-O is that the constant factors in this expression don't matter and, more generally, only the largest term is meaningful. We also don't care about small input--only large inputs.

One justification for this is that the constants we determine are based on some incorrect assumptions. For example, we've assumed that addition takes just as long as multiplication and the function calls have no overhead, which is not accurate. Even if we could determine these relations on a particular system, the results wouldn't generalize to different hardware anyway. There are situations where these constant factors would make a difference, but we will not be concerned with that in this class.

Below is a graph of some common runtime classes, with various constants. Hopefully this helps to convince you that the various constants we use don't matter by the time we get to large input. In some cases, the functions that are better at large inputs are slower for small inputs. However, most functions run "instantly" on small input anyway, so this is also not a problem.

We've already talked a little about being able to use functions with the same prototype interchangeably, so what does this mean in terms of big-O. First of all, if we swap out two functions with the same big-O runtime in some code we've written, we can assume that the runtime of each will be similar. More importantly, say we've written a large program that uses (sorted) linear search, then we can replace it with binary search and expect that it will run faster with no side effects (as long as we aren't assuming anything beyond what's in the prototype!).

Formally, one way to define big-O is $$O(g) = \{f| \exists c > 0 \textrm{ and } n \geq 0 \textrm{ such that } \forall n \geq n_0 f(n) \leq cg(n)\}$$ Don't let the set notation scare you. All we're saying here is, for example, all functions that run in linear time are in the same class of functions.

Now for a simple example: consider $T(n) = 34n^2$. If we let $c = 34$ and $n_0 = 0$ then our analysis satisfies the formal definition.

How about a more complicated case? Consider $T(n) = 9n^3 + 32n^2 + 99$. Intuitively, we can see that this is $O(n^3)$, but how do we show this using the formal definition? $9n^3 + 32n^2 + 99 \leq? cn^3$. We make approximations. For $n \geq 1$ we can say that $n^3 \geq n^2 \geq 1$. Therefore, we can write an inequality stating that $$9n^2 + 32n^2 + 99 \leq 9n^3 + 32n^3 + 99n^3 = 140n^3$$ Therefore $c = 140$. Now, how about $n_0$? This doesn't work for $n_0 = 0$, but works if we let $n_0 = 1$.

Suppose we have two functions $f$ and $g$ where $T_f(n) = 4n$ and $T_g(n) = 33n + 5$. A big-O analysis would allow us to say that $O(f) = O(g)$. You may have noticed that our definition isn't very strict. For example, in addition to saying that $O(f(n)) = O(n)$ it is also true that $O(f(n)) = O(n^2)$. Since we've already defined big-O in terms of set notation, we can also say that $O(n) \in O(n^2)$. In other words, every function that has a linear bound on runtime also has a quadratic bound on runtime. You should be familiar with these relations and be comfortable with the notation. However, if we ask you for the big-O runtime of a function, we are looking for the smallest bounding function. We could explicitly ask for the tight bound (big-$\theta$), but the proof would be more involved.

Here is another interesting result of big-O. Consider functions $f$ and $g$ where $T_f(n) = \log_2(n)$ and $T_g(n) = \log_3(n)$. What can we say about their big-O runtimes? It turns out that $O(f) = O(g)$. Recall the change of base formula for logs: $$\log_a(n) = \frac{\log_b(n)}{\log_b(a)}$$ This holds for all $a, b$. So the two differ by only a constant factor. For this reason, we actually usually don't write any base on our logs in big-O analysis.

How about exponents? Consider functions $f$ and $g$ where $T_f(n) = 2^n$ and $T_g(n) = 3^n$. These are not the same. There is no $c$ such that $3^n \leq c2^n$, regardless of $n_0$. The key to understanding this is that $c$ must be generic, i.e. not tied to $n$. Yes, for a particular value of $n$ we can find such a constant, but we can increase $n$ to the point where that constant would not longer work. More formally, another way of expressing big-O (that you don't need to know) is that $f(n) \in O(g(n))$ iff $\lim_{n\to \infty} \frac{f(n)}{g(n)} = c, c \neq 0$. In this case, the result would be $\left(\frac{2}{3}\right)^n$, which goes to 0 as $n$ goes to $\infty$.

So, how do we take this very abstract stuff and bring it back to analyzing actual code? Well, we sure don't want to have to continue counting operations in everything we write. In general, this comes down to recognizing general patterns in code as being similar to one another. Here are some examples:

• $\log(n)$: Think about binary search. Each iteration through the loop, we effectively cut our input size in half. Any time this happens, we're looking at logarithmic runtime.
• Polynomials: This generally consists of nested loops that iterate over a linear range. Each nested loop adds a higher degree to the polynomial. So what's the runtime of this example? Don't automatically count a new loop as a new power of $n$. One of the loops iterates over a constant range, so it has no effect. Also, keep in mind that the functions we call in the loop body have runtimes of their own. In the example, I say that it's constant time. If, instead, we call a quadratic function inside the inner loop, then the runtime goes up two degrees on $n$. On the other hand, suppose I have two for loops that loop over $n$, that are not nested. This would still be linear. Doing two consecutive linear things is not the same as doing a linear thing inside of a linear thing.
• $a^n$: An example of an exponential function is the recursive version of fib. Each call to this function makes two recursive calls. One way to represent this would be a tree of recursive calls, of depth $n$. A perfect tree has $2^n - 1$ nodes. In this case, each node is one call to the function.

One final note: big-O runtimes can involve more than one variable. Suppose you have nested for loops that loop over $m$ and $n$, with only constant time operations inside, then it would be $O(mn)$. Don't feel compelled to find a way to reduce this to one variable.

Runtime analysis will be pivotal to our discussions of data structures (introduced next week) for the remainder of the semester.

### Selection Sort

The main idea behind selection sort is that we want to go through the array repeatedly and find the smallest element to build a sorted array. So, on the first iteration, we scan $n$ elements to find the smallest one. Then, we swap this element with the one at index 0. We now know that a subarray that only includes the first element is sorted. On the second iteration, we look at the remaining $n - 1$ elements to find the minimum and swap this with index 1. Now we have a sorted subarray of size 2. This continues $n$ times until our sorted "subarray" is simply the entire array. As an independent exercise, you should be able to prove this using the contracts.

After what we covered last time, a logical question to ask is what our runtime is. However, the answer to this question may not be immediately obvious. We have a for loop over $[0, n)$ that calls the linear function min_index. However, the number of indices looked at by min_index changes each time! The first time, there are $n$ comparisons made, $n - 1$ the second time, down to just 1 the final time. So, how do we make any sense of this? Trying to add the total number of indices examined, we get a sum with a well-known closed form: $$1 + 2 + \cdots + (n - 1) + n = \sum_{i = 1}^n i = \frac{n(n+1)}{2} = \frac{n^2 + n}{2}$$ Formal definition aside, we can now clearly see that this function has a quadratic runtime. As an exercise for yourself, show this with a formal proof.

### Quicksort

So, the obvious question at this point is: can we do better? The answer: of course. The quicksort algorithm has three major steps:

1. Choose an arbitrary (random) element of the array. We will refer to this as the "pivot."
2. Go through the remaining elements and sort them into two categories: those that are smaller than the pivot and those that are greater. This is called the partitioning phase.
3. Recursively sort these two subarrays.

I haven't yet mentioned what the actual runtime of this algorithm is. For this, we introduce the idea of an average runtime. So far, when we've discussed big-O, we've been concerned with the worst case runtime. We will not rigorously prove this, but I claim that quicksort has an average runtime of $O(n\log n)$ and a worst case runtime of $O(n^2)$. A rough justification for this relies on how the size of the two partitions compare. In the average case, we expect that each partition will be roughly equal in size. Therefore, the partition phase is linear and we will have a logarithmic number of recursive calls. On the other hand, in the worst case, one partition will contain all of the elements and the other none. The reason that we choose a pivot randomly is that, for any systematic method of choosing pivots, I could devise a particular list that exhibits this worst case behavior. For example, suppose I always choose the first element as the pivot. In this case, a list that is already sorted has this worst case behavior. It turns out that, randomly choosing pivots gives an $n\log n$ runtime on average. An alternate method is to choose 3 random elements and take the middle of the three to be the pivot.

As we can see in the implementation, the bulk of the work of quicksort is actually done in the partition function. We also use a few tricks to write a good implementation. Initially, the pivot is just some element in the middle of the list. As is, there isn't really an easy way to partition the remaining elements, since we don't know how big each partition will be. We also don't want to have to allocate any extra to do this job. The trick here is that we swap the pivot with the last element in the list. This means that all of the elements we need to partition are all together in the array.

But how do we actually create the partitions? We know that, for each of the remaining elements, it is either less than or equal to the pivot, or greater than it. This may seem trivial, but it's actually really important. Suppose that the pivot has index $i$ in the final, sorted list. Therefore, we know that every element greater than the pivot must be at an index greater than $i$. The reverse argument is true for the elements less than or equal to the pivot.

So, we begin by letting left = lower and right = upper -2 (-2 because A[upper - 1] is holding the pivot). Essentially, starting at lower, we inspect each element. If A[lower] <= pivot then it belongs in the left-hand partition. So, we leave it there and move on to the next element. On the other hand, if A[lower] > pivot, then it belongs in the right-hand partition. We deal with this by swapping it as far to the right-hand side of the array as possible, just to the left of the last element we've placed in the right-hand partition. The element we swap with is being pointed to by upper. Initially, this is just the last element. Afterwards, it is the highest index we have not yet explicitly placed an element at. In other words, we build the 'less than or equal to' partition up from the smallest index and we build the 'greater than' partition down from the largest index.

You may wonder why our loop condition is left <= right and not just left < right. If you carefully consider the boundary case, quitting when they are equal means that we haven't really examined one element yet. We still need to decide which partition this last element belongs in. A subtle, but important point.

After we've finished the partition, we do one last swap between the pivot and left. At this point, left actually points to the first element in the right-hand partition. So, what does this accomplish? It turns out that now the pivot is right in between the two partitions. More importantly, since everything to the left is less than or equal to the pivot and everything to the right greater, we can actually say that the index of the pivot is its index in the final sorted list! We then return this index so we know where to divide the array for the recursive calls. We have two recursive calls: one on the resulting left-hand partition and the other on the right-hand partition. Eventually, we hit our base case--we know that the single-element array and the empty array are sorted.

For example (29 is the randomly chosen pivot):

 96 25 18 31 60 30 17 12 38 29 swap pivot (29) with 60 96 25 18 31 29 30 17 12 38 60 initially: left=0, right=8 38 25 18 31 60 30 17 12 96 29 Swap 96 and 38 left=0, right=7 12 25 18 31 60 30 17 38 96 29 Swap 12 and 38 left=0, right=6 12 25 18 31 60 30 17 38 96 29 Keep 12 left=1, right=6 12 25 18 31 60 30 17 38 96 29 Keep 25 left=2, right=6 12 25 18 31 60 30 17 38 96 29 Keep 18 left=3, right=6 12 25 18 17 60 30 31 38 96 29 Swap 31 and 17 left=3, right=5 12 25 18 17 60 30 31 38 96 29 Keep 17 left=4, right=5 12 25 18 17 30 60 31 38 96 29 Swap 60 and 30 left=4, right=4 12 25 18 17 30 60 31 38 96 29 Swap 30 with itself left=4, right=3 12 25 18 17 29 60 31 38 96 30 Replace pivot: swap 30 and 29

After many recursive calls, you will have the sorted array.

### Aside: Mergesort

We also talked about mergesort in lecutre. Mergesort actually has an average and worst case runtime of $n \log n$. However, in practice, mergesort generally runs slower than quicksort. This seems a little disappointing, since big-O would tell us that mergesort is probably the better choice. In addition to time complexity, there's also something called space complexity--how much extra memory a program uses. Mergesort works by copying the elements into new memory at each step. As we noted earlier, we implement quicksort cleverly so that we don't need to allocate any extra memory. This is called an "in place" algorithm. It turns out that copying data in memory is relatively expensive, so the in place algorithm is faster for this reason. Here's a comparison of all 3 sorts (selection sort in yellow, mergesort in blue, quicksort in red).

Code used to generate this data.

If you have any comments or see any errors, please let me know.