Okay. Welcome to 364a, which is convex optimization one. Unfortunately, I have to start class by apologizing. If you're watching this -- I guess you'd say, if you're watching this at the scheduled time of the class -- you'll notice that I'm not there. That's because I'm in India. You can actually get very confused doing this. There's actually one more confusing thing than this, and that's when you tape ahead of class before it actually -- it goes after a class you haven't given yet. That gets very complicated. I'm sorry to miss the first day of the class. As I said, I'm in India or will be, depending on where the time reference is. That's very tacky. I've never done it before, so this is a first. Actually, we're going to make a lot of firsts today. I believe we may be making SCPD history because we're taping this lecture ahead not only the quarter before but the year before. A lot of people are very excited about this. If this works out well, who knows where this is going to go. I could teach all of my spring quarter class the last few weeks of winter quarter. We'll see how this works. We're possibly making tape-ahead history here. A couple things I should say about the class, other than apologizing for not being present or, for that matter, in the country for the first day of it. I'll say a couple of things, and then we'll get started. For those of you watching this, since you're not going to get to see the audience and know that I sent out a pitiful plea to come if you're around for the tape-ahead, you should know the room is not empty. Let me say a little bit about the class. I think some of it is kind of obvious. We barely got the website up and running. We change it every day. Let me say a few things about it.
The first is the requirements -- there is going to be weekly homework, which if you are interpreting this relative to January 8, I think the correct statement would be "has already been assigned." If you're interpreting it in real time, it's actually fall, so we haven't even thought about homework one, but we will. It'll all be ready to go. There'll be weekly assignments, and there will be a take-home final exam, which will be the standard 24-hour take-home exam -- the usual format. You'll do serious work. The other requirement for the class -- it's not really a requirement, but there will be minimal MATLAB programming. It's not a big deal, but there will be MATLAB programming, and that will be done using a system called CVX, which if you go to the website, you can find out about. You're welcome to install it and read the user guide now. There's no reason not to. You'll be using it throughout the quarter. You might as well look at the user guide, install it, see if it works. Not everything in the user guide will make sense until you get to week four of the class, but there's no reason you can't start poking around with it now. When we get to it, you should be extremely grateful for the existence of this stuff, because it trivializes the "programming" that we'll do in the class. Programming is in quotes, and the reason is we're talking things like ten lines. It's not a big deal. By the way, on that topic, I would say this -- CVX is based on MATLAB. That's the way it is. If there's anyone who wants to try to use a real language, like Python, I will be very happy to support you in that. The TAs are Yung Lung, who I guess you can't see, and Jacob Mattingly, who's in New Zealand, but will not be in New Zealand when this is played back. Jacob would be very happy to help if there's someone who wants to do all the numerical stuff in Python. I'll just say that, and feel free to come forward and let us know.
The only other thing I would say is something about the prerequisites. The prerequisites are basically a working knowledge of linear algebra. 263 absolutely -- it's clearly the perfect background. If not, what we mean is a working knowledge, so not just the ability to know about these things but actually how you use them in modeling and things like that. The other one is the mathematical sophistication level is going to be a jump up from 263. This is a 300-level class, so the mathematical sophistication is going to jump up. The cool part is the mathematical sophistication jumps up, but it actually is still in some sense a very applied course, in the sense that we're not just going to talk about optimality conditions. You will do portfolio optimization. You will do optimal control. You will do machine learning. I mean actually do numerical stuff. It won't just be stupid programming. It's going to be figuring out what's going on. It won't be obvious. It'll be good stuff. I'm trying to think what else to say before we start in on the class. I'll just set the context of how it's different from 263. 263, which is a linear dynamical systems class -- I would call that basic material. It's useful in tons of fields. It's very widely used. A lot of people know it. It's used in economics, navigation -- all of these areas. Control and signal processing is basically all done this way. Communications -- a lot of stuff is done this way. It's a great first look at how these mathematically sophisticated methods actually end up getting used. 364 revisits it. It's quite different in a couple ways. The first way is you might even just say this is 263 beyond least squares. In other courses similar to that, so in your first look at statistics and your first course on statistical signal processing or whatever -- it's the same sort of thing. Everything is Gaussian. All distributions are Gaussian. All objectives and constraints are quadratic. You do this because there are analytical formulas for it.
Behind the analytical formulas, we have very good computational tools to give computational teeth to the theory. Here, it is going to be way beyond that. We're going to have inequalities. We're going to have all sorts of other interesting norms and other functions. It's a much richer class of problems. These are problems much closer to a lot of real problems, ones where -- you don't have the concept of an inequality in 263. You have no way to deal with noise with a [inaudible] distribution in your first class on statistical estimation. We're going to look at things like that. I should also say that the number of people who know this material is, relative to 263, very small. It's probably a couple thousand. It's growing rapidly, but it's just a couple thousand. That means you're going to move from material -- a lot of the 263 material is kind of from the 60s and 70s. It's not stuff that's new. This class, in contrast -- you're at the boundary of knowledge, which makes it fun. Maybe you'll even push the boundary. I hope you do, since that's the point of the class: to train you to find your corner of the boundary and start prospecting. I can't think of anything generic to say, so maybe we'll actually start the class, unless there are any questions. [Student:] If you haven't taken 263, is it a big problem if you have a working knowledge of linear algebra? No, no problem. What'd you take instead? No problem. I should also say this -- something about the background. The class is listed in electrical engineering. I'm not actually sure why. The last I checked, that's the department I'm in, so it seemed like a good idea all around. You don't need to know anything about electrical engineering -- in fact, we'll talk about lots of applications throughout the quarter, and honestly, probably for half of them, I don't even know what I'm talking about. That will happen. You'll be in some field. I'll super-oversimplify it. You're welcome to point it out.
The point is, I'll talk about circuit design. We'll talk about machine learning. On all of these topics, there will be people in the class who know more than I do about it, and there will be a lot of people who don't know anything about it. If you're one of the latter, don't worry about it. They're just examples. There is one prerequisite, although I don't know what would happen if you didn't have it. That is that I will assume that you know about basic probability, estimation, and things like that. Do I have to say anything about the textbook? Go to the website. That's all I have to say. There's a textbook. You'll find it at the website. I'll start by covering broad basics. It's not in any way representative of what the class is going to be like, what I'm going to talk about today. Don't think it is. We're going to cover the basics, and maybe I'll get into something that's real. This is going to be at the very highest level. We'll talk about mathematical optimization. We'll talk about least squares and linear programming, convex optimization. We'll look at a very simple example, and I'll say a little bit about the course goals and style. Let's start with this. Optimization. Depending on the company you're in, you may have to put a qualifier -- mathematical. If you say just optimization, it goes to something on compiler optimization or something in business practices. If you're in a broad crowd like that, you need to put mathematical in front to make it clear. Here's what it is. You want to minimize an objective function. You have a variable X with N components. These are loosely called the optimization variables. Colloquially, you would say that X1 through XN are the variables. By the way, there are other names for these. People would call them decision variables. That's another name for it. In different contexts, they would have other names. Decision variables is a nice name because it tells you these are things you have to decide on.
When you make a choice X, you'll be scored on an objective. That's F0 of X. We'll choose to minimize that. In fact, you ask, what if you wanted to maximize it? Then you would just minimize the negative of that function, for example. It's convenient to have a canonical form where you minimize instead of maximize. This will be subject to some constraints. In the simplest case, we'll just take a bunch of inequalities -- that's FI of X is less than BI. Here, without any loss of generality, the BIs could be zero, because I could subtract BI from FI and call that a new FI. There'd be no harm. This is a nice way to write it because it suggests what these really are. In a lot of cases, these are budgets. It's a budget on power. It often has an interpretation as a budget. There are other types of constraints we'll look at. What does it mean to be optimal for this problem? By the way, this is redundant. You should really just say an optimal point or the solution. In fact, you shouldn't trust anyone who says optimal solution, because that would be like people who say the following is a true theorem. It's kind of weird. It's redundant. It doesn't hurt logically, but it kind of -- what about the times when you forgot to add the attribute true? What were those theorems? I'd say an optimal point or a solution is a point that satisfies these inequalities here and has smallest objective value among all vectors that satisfy the constraints. That's what you mean by a solution. Of course, you can have everything possible. You can have a problem where there is no solution. You can have a problem where there's one, where there are multiple, and you can have other things, which we'll get to more formally. This is just our first pass. Let's quickly look at examples. Here are some absolutely classic examples from all over. The first is portfolio optimization. In portfolio optimization, the decision variables X1 through XN -- these might represent the amount you invest in N assets.
For example, if XI is negative, it means you're shorting that asset, unless that asset is cash, in which case you're borrowing money. The X1 through XN is actually a portfolio allocation. You'll have constraints. Obviously, there will be a budget constraint. That tells you the amount that you have to invest. There might be maximum and minimum investments per asset. For example, the minimum could be zero. You're not allowed to actually purchase a negative quantity of an asset, which would be shorting that asset. You can have no shorting. You might have a maximum amount you're allowed to invest in any asset. You might have a minimum return that you expect from this portfolio. Your objective might be something like the overall risk or variance in the portfolio. You'd say here's 500 stocks you can invest in. My mean return absolutely must exceed 11 percent. Among all portfolios that meet that -- I'm going to have no shorting, and the sum of the XI is going to equal one because I'm going to invest one million dollars. The question is, among all portfolios that guarantee a mean return of 11 percent, I want the one with the smallest variance in return. That would be the one with the least risk. We'll look at these in detail later. This is one where, when you solved the optimization problem, it would probably be advisory in the sense that you'd look at what came back and say, well, how interesting. Maybe I'll execute the trades to get that portfolio or maybe not. This could also be real time if you're a hedge fund and you're doing fast stuff. This could be programmed [inaudible]. There are lots of different things this could be used for. Let's look at [inaudible] electronic circuit. Here, you have a circuit. The variables, for example, might be the device widths and lengths. It could be wire widths and lengths, something like that. You have constraints. Some of these are manufacturing limits.
For example, depending on what fabrication facility is making your circuit, you have a limit on how small the length of a gate in a transistor can be. It can't be any smaller than 65 nanometers. That's the smallest it can be. That's a manufacturing limit. You would also have performance requirements. Performance requirements would be things like timing requirements. You could say this circuit has to clock at 400 megahertz, for example. That tells you that the delay in the circuit has to be less than some number, because it has to be done doing whatever it's doing before the next clock cycle comes up. That's a timing requirement. You might have an area requirement. You could say if I size it big, I'm going to take up more area. This portion of the circuit can't be any more than one millimeter squared. The objective in this case might be power consumption. You'd say among all designs that can be manufactured, in the sense that they don't reject your design instantly, and meet the timing requirements -- among all of those, you want the one or a one with least power. That would be minimizing power consumption. Our last one is just from statistics. It's a generic estimation model. Generic estimation looks like this. What's interesting here is these are from engineering and finance and stuff like that. In these cases, the Xs are things you're going to do. In this case, they're actually portfolios you're going to hold, and in this case, they will translate into polygons that get sent off to the fabrication facility. The [inaudible] are actions. Here, it's very interesting, because the XIs are not actions. They're actually parameters that you're estimating. Here, you have a model. You can take any kind of data or something like that and you'll have parameters in your model. You want to estimate those parameters. These XI are not things you're going to do. These are numbers you want to estimate. That's what it is. You have constraints.
For example, let's say that you're going to estimate a mean and a covariance of something. It can be much more sophisticated, but let's suppose that's the case. We have a constraint here. There might be some constraints. Here's a constraint. The covariance matrix has to be positive semidefinite. That's a constraint, because if you created a covariance matrix that wasn't, you'd look pretty silly. That's a nonnegotiable constraint. You could also have things like this. You could say these Xs must be between this limit and that limit. For example, suppose you're estimating some diffusion coefficient or some parameters known to be positive. Then you should make it positive. These are parameter limits. The objective is, generally speaking, dependent on how you want to describe the problem. If you describe the problem in an informal way, it's a measure of misfit with the observed data. For example, if I choose a bunch of parameters, I then propagate it through my forward model and find out what I would have had, had this been the real parameter. The question is, does it fit the observations well? Another way to say it -- it's a measure of implausibility. That's really what it is. It's a measure of implausibility. In this case, we're minimizing it. In many contexts, you'll see it turned around so it's maximized. If you're a statistician, you would reject the idea of a prior distribution on your parameters, and your objective would be to maximize the likelihood function or the log likelihood function. That's what you'd be maximizing. That's essentially a statistical measure of plausibility. You'd minimize the negative log likelihood function, which I think they call loss in some of these things. Implausibility in a [inaudible] framework -- a measure of implausibility would be something like the negative log posterior probability. That would be a measure of implausibility. If you minimize that, you're actually doing MAP, which is maximum a posteriori probability estimation.
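The maximum-likelihood-versus-MAP distinction above can be made concrete in a few lines. This is only a sketch with invented data: the linear measurement model, sizes, random seed, and prior weight `lam` are all hypothetical. The point is that under Gaussian noise, minimizing the negative log likelihood reduces to a least squares problem, and adding a Gaussian prior turns it into a MAP (regularized least squares) estimate.

```python
import numpy as np

# Hypothetical linear measurement model: y = A x + noise, noise ~ N(0, sigma^2 I).
# Minimizing the negative log likelihood over x is then exactly least squares;
# a Gaussian prior x ~ N(0, tau^2 I) turns it into MAP (ridge) estimation.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 3))
x_true = np.array([1.0, -2.0, 0.5])
y = A @ x_true + 0.1 * rng.standard_normal(50)

# Maximum likelihood estimate = least squares solution (A^T A)^{-1} A^T y
x_ml = np.linalg.solve(A.T @ A, A.T @ y)

# MAP estimate with Gaussian prior: minimize ||A x - y||^2 + lam ||x||^2
lam = 1.0  # hypothetical prior weight
x_map = np.linalg.solve(A.T @ A + lam * np.eye(3), A.T @ y)

print(x_ml)
print(x_map)  # shrunk toward zero relative to x_ml by the prior
```

The prior acts exactly like the "measure of implausibility" described above: the `lam * np.eye(3)` term penalizes parameter vectors that are implausible under the prior.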
By the way, we'll cover all those things again, so this is just a very broad brush. If you don't know what these things are, you will, if you take the class. These are examples of optimization. Everyone in their intellectual life goes through a stage. It can happen in early graduate school, mid graduate school. It can also happen in later life, which is bad. It's not good to have it when you're an adult. Let me describe this stage of intellectual development. You read a couple of books and you wake up at 3:00 in the morning and say, oh my god, everything is an optimization problem. Actually, a lot of books start this way. My answer to that is -- you have to go through this stage, so that's fine. But get over it quickly, please. Of course, everything is an optimization problem. What you'll find out quickly is it doesn't mean anything to say that. It says nothing. What matters is what optimization problem it is, because most optimization problems you can't solve. They don't tell you that. Typically, they don't tell you this. Or, what they do is they do tell you, but they distribute it through the whole quarter. It turns out if you just say a little bit of that message every day, at the end of the quarter, no one can accuse you of not having admitted that we don't know how to solve most optimization problems. However, because it was below the threshold of hearing in each lecture, as a result, all these students went through and the big picture didn't come out, which is basically -- the way you cover it up is by going over 57 different types of algorithms for solving things, which basically is a cover-up for the fact that you can't solve anything. We'll get to that. This is related to the following -- solving optimization problems. To say that everything is an optimization problem is a stupid tautology. It all comes down to this. How do you solve them? It turns out it's really hard, and basically, in general, I think it's fair to say it can't be done.
I can write down shockingly simple optimization problems and you can't solve them, and it's very likely no one else can. I can write down very innocent-looking optimization problems and they'll be NP-hard. What do people do about it? Well, there are a couple of ways to do it. What do I mean by NP-hard? Nondeterministic polynomial time. You could take whole quarters on this in a computer science class. I'm going to give you the 15-second version. I'll tell you about it from a working perspective. In the 70s, people identified a bunch of problems that so far no one had figured out a good polynomial-time algorithm for solving. Polynomial time means there's an algorithm where the problem comes in, you measure the size by the number of bits required to describe the data, and then you look at the number of operations you have to do. If you can bound that by a polynomial, then you'd say that's polynomial time. Bear in mind, I'm compressing a ten-week course to about 14 seconds. I'm going to gloss over some of the details. There were a bunch of problems where people -- the famous one would be the traveling salesman problem. No one had found a polynomial-time method. A guy named Cook, maybe, did the following. He catalogued a bunch of problems and said if you can solve any of these, then if you make a method for solving that, I can make a reduction. I can take your problem and map it to this, and with that, I can solve the traveling salesman problem. Then you had an intelligent way of saying, of two problems, one is just as hard as the other. NP-hard means, and this is really -- people are going to cringe if they have a background in this -- it means it's at least as hard as a catalog of problems thought to be very hard. That's your prototype. Basically, what I'm saying is not true, but as a working definition of what it means -- for your purposes, this is going to be fine.
It means the problem is at least as hard as traveling salesman. Let me tell you what that means. It is not known that these things cannot be solved in polynomial time. That's not known. It is thought that that's the case, and it may be that at some point, somebody will show that you can't solve these. Right now, they're thought to be harder. I think there's an insurance policy. Let me tell you why it's an insurance policy. A ton of super smart people have worked on all these problems, and now all of these things are banded together, as you would do in insurance. You band a whole bunch of things together. What would happen is if tomorrow, somebody, probably a Russian, proves P equals NP, meaning you can solve all of these problems, it would indeed be embarrassing for a lot of people. However, the embarrassment is amortized across -- people could say you went around and made a big deal about convex problems and polynomial time. I'd say, look at all the other people. The embarrassment is spread across a great swath of people. It's an insurance policy. It is thought that they're really hard. If they're not, you're in very good company with the people who also thought they were hard. [Student:] It's just in its own field. It's not proven to be certain. It's not mathematically proven to be non-polynomial. That's right. That's why it's still an open conjecture. If, in fact, it turns out that these are theoretically not hard, the [inaudible] could end up being huge, and that would also blunt the embarrassment. In any case, the embarrassment is spread across a lot of people. We'll come back to that problem several times in the class. Methods to solve a general optimization problem -- they always involve a compromise. The two main categories of compromise are as follows. The first one is to not back down on the meaning of solve. Solve means solve. It means find the point that has least objective and satisfies the constraints. That's the definition of solve.
You leave that intact, but you leave open the possibility that the computation time might involve thousands of years or something like that. People call that global optimization, and it's a big field. It is almost entirely based on convex optimization. The other one, which is the one most widely done by people who do optimization -- they do the following. It's very clever what they do. They go and solve -- they put a footnote, and then way down at the bottom, they write this -- they write, and now I'll have to ask you to zoom in on that. There it is. It says not really. What that means is they make a big story about this, and they say it's a local solution. What happens is they don't -- it gets tiresome to say solve locally, plus, it doesn't look good. What happens is you drop it after a while, and then you say I solved that problem, and people give talks and say I minimized that. The point is they didn't. What they did was they got a local solution. We'll talk more about that as well. There are exceptions. If you were going to do that, I would maintain, although this is not widely held -- you'll find that many of my opinions are not widely held. If you were going to do that, I would do it via convex optimization. [Student:] But you will be doing it in the spring? Yes. It's scheduled to be in the spring. Who knows when I'll do it. What are the exceptions? There are cases where you can solve these problems with no asterisk. The most famous one in the world by far is least squares, least norm. No asterisk there. It's not like, oh, yeah, well, A transpose A inverse A transpose B -- yeah, that often works super well in practice. The status of that is: that is the global solution. There are a couple others that you might not know about, and that would be things like linear programming problems. We'll get to those quite quickly. Those are ones where you get the answer. There are asterisks in these, but they're in much smaller fine print. I'll get to them later in the class.
When there's an admission that has to be made, I will make it. The parent of these, and a considerable generalization, is the convex optimization problem. This is really the exception. The rough idea, to put it all together, is something like this. Most optimization problems -- let's review what we've learned so far. A -- everything is an optimization problem. B -- we can't really solve most optimization problems. The good news is that there are huge exceptions. This entire class is about one of the exceptions. When you do convex optimization, there are no asterisks. You solve the problem and there are no apologies. It's the global solution. Is life convex? I would have to say no, it's not. I hope it's not. It's not sad. If we get to that later today, you'll know why it's not. To check convexity, you have to take two instances and then form weighted averages in between. What would the average of yours and your life look like? The other thing that has to happen is that that life has to be better than the average of the qualities of your lives. Let's keep that as a running topic that we'll revisit periodically. For the moment, my position is it's not. [Student:] Are there other exceptions besides convex optimization? Yes, there are. Singular value decomposition. That's one where you can compute the singular values -- I can write it out as an optimization problem pretty easily. I could say, find me the best rank-two approximation of this matrix. That's way non-convex. Yet, we can solve it and we get the exact global solution. That's an example. There are some combinatorial problems. So if you've taken things on combinatorial algorithms in computer science -- combinatorial algorithms on their face would appear to be non-convex. It turns out a lot of them are convex. It's quite strange. You take something that's a combinatorial optimization problem that on its face would not be.
It turns out if you let the variables vary between zero and one, you can prove that there's always a solution which is on the vertices, and so there are actually a lot of problems that we can solve but are not convex. Some of them can be secretly turned into convex problems. Getting a rank-two approximation of a matrix is an excellent example. We can definitely do that and it is definitely not convex. We have least squares or, if you're in statistics, regression. It may have other names. I don't know. Here's the problem. You want to choose X to minimize the [inaudible] norm squared of AX minus B. If A is full rank and skinny, you get an analytical solution, which you know if you know about linear algebra. It's just A transpose A inverse A transpose B. In this case, it's a unique solution. In fact, we have a formula for it, which is overrated. In this class, we're going to look at tons of problems. There will be analytical formulas for almost none of them. You'll have to wean yourself away from analytical formulas. The sociology of that is very interesting. You've been trained for 12, 14 years on 19th-century math, which was all based on analytical formulas, but we're going to move beyond that. If you look most deeply into what it means to have an analytical formula, it turns out -- we're going to solve AX minus B with an infinity norm there or a one norm. There's no analytical formula for it now. But it turns out we can calculate that in the same time it takes for you to calculate this. Having a formula is something that will mean less to you by the end of this class. [Student:] So not all optimization problems have that subject to. When we go over this in hideous detail later, you'll see that's the case. I should mention you should be reading chapter one and chapter two. Chapter one will have a lot of this vague stuff, and chapter two will start in on the real stuff. You're absolutely right.
An optimization problem -- you do not have to have any constraints, in which case it's called unconstrained, and you don't even have to have an objective. If you have neither, that's called a stupid problem. It's minimized. The universal solution -- X equals zero. If you don't have an objective, it's called a feasibility problem. In some fields, they call it a satisfaction problem. You have a bunch of inequalities and you want to find a point that satisfies all of them. That's what that is. Back to least squares. Much more important than the formula -- it turns out you can write down a formula for a lot of stuff, and it doesn't do you any good if you actually want to calculate it. Here, there are reliable and efficient algorithms that will actually carry out this computation for you. By the way, that's why least squares is so widely used: because it has computational teeth. Instead of just talking about it, which is what a formula would allow you to do, you can actually use it. You can actually compute. We're not going to go into too much detail on the numerics. At one point in the class, we will. We'll talk about how you exploit structure and things like that. Here, just so you know, the computation time to solve a least squares problem goes like N squared K. N is the number of variables, and that's the small dimension. K is the number of rows in A. It's a good thing to know. It's the small squared times the big. You can do this, as I said not long ago in 263, with modern computing. It's amazing what you can solve. There, we did a couple-thousand-row least squares problem. You can call it a regression problem in 2,000 dimensions with 1,000 regressors. It was three seconds. By the way, if A is sparse or has special structure -- suppose part of A has an [inaudible] embedded in it. That would come up in medical imaging. You can do that faster. In image processing, you'd have transformations. You can solve least squares problems that are way bigger.
I would say that least squares is a mature technology. When I do this, people who worked on all of this -- it's a huge, active field in lots of areas -- they get extremely angry when I use the word technology. I say, by the way, I mean technology here. This is the highest praise. This is not an insult. What it means is that other people know enough about the theory and the algorithms, and the implementations are so good and so reliable, that the rest of us can just type A backslash B. Of course, if you're building some real-time system or the sizes get to a million or ten million, you're not going to be using backslash. But that's it. That's a boundary that's growing with time. That's the wonderful thing about anything that has to do with computers. Just take a vacation for a year. My colleagues at the other end of the [inaudible] who actually do real things, they and all their friends around the world will make stuff twice as fast. You just go away. You go to the Greek Islands for three weeks. You come back and computers are faster. It's great. Of course, I'm not telling people to just use A backslash B. Everyone here has done A backslash B. Probably only a few know what it actually did. Nothing terrible has happened. I'll come back to them, and when they're yelling at me, I'll say, back off. Do you use TCP/IP? Sometimes, they won't even know what I'm asking. Then I'll say, are you using TCP/IP as a black box? You don't even know what's inside it? Some of them will say, yeah, I do. Even communications on your laptop between different components -- I'll say, do you know what it does? Most of the time, they'll say no. Here's what you need to know. It is not trivial by any means to make a reliable bit pipe to transfer bits from one place to another with an unreliable medium inside. It's no more trivial than it is to solve this problem numerically. It is not trivial. All you need to know is this. Very intelligent people have thought about this problem very deeply.
They've thought about what can go wrong, deadlocks, all sorts of crazy stuff, and they have come up with algorithms about which they know an incredible amount. Part two -- other people have implemented them, and these are very reliable implementations. The result is that for most of us, we can just consider certain things to be reliable bit pipes. We don't have to care about how many packets were retransmitted or how the flow control went and all that kind of stuff. Like least squares -- if you're doing real time control or something like that, or if you're doing some computing that requires extreme reliability, then you can't treat TCP/IP as a black box. You might ask, does this calm them down? The answer is no. This makes them more angry. I mean technology here in praise of this. When you use least squares -- if you've just come from 263, you've used least squares. That's just the beginning. The way you use least squares is this. It's easy to recognize a least squares problem. Sometimes, it comes to you on a platter. Somebody walks up to you and says, I took these measurements. It's a linear measurement model. Help me guess these parameters. I received a signal that went through this channel -- this kind of thing. If you learn a few tricks in least squares, you can do well. If you learn how to do weight twiddling and you learn about regularization, those two alone -- you are now trained and ready to do battle using least squares as a tool. You will do really well. Weights is basically -- you go into a problem and someone says, there's no limit on the sum of the squares. The limit is on this. You say, no problem. I have these weights here. You look at the least squares solution. You don't like what you see. You change the weights and you do it again. In engineering, we do this all the time. It's called weight twiddling. It's a slightly derogatory term, but not bad. I'm sure that you do it in statistics, but I don't know that they admit it. Good.
When I'm making fun of a department, I like to have a face to look at. They like to stick to -- if you're doing real statistics, you go back in and change the prior distribution. I have to warn you, all of these lectures are being put on the web. It's weird and fun. We'll fuzz out your voice, and if the camera focused on you, we'll put the squares on it. Don't worry about it. If you'd like, we can obscure your department. We can beep it out. I'm just going to guess that in statistics, they make a big story about the prior distribution. I bet if they don't like the way it comes out, they go back and change that prior distribution. They cover up their tracks. We do it in engineering, and we're not embarrassed. I bet they do it. The next topic is linear programming. Some of you have seen this. How many people have seen linear programming somewhere? A bunch. Okay. Linear programming -- in my opinion, it's what people should learn immediately after least squares: singular value decomposition, linear programming. If you're going to stop somewhere, that's a very good place. I'm talking about if you really want to go out and field algorithms, do real stuff -- that's a great place to stop. Everybody should know about it. If you take this class, you're going to know a lot about it. Linear programming is a problem that looks like this. Minimize a linear function subject to a bunch of linear inequalities. We're going to talk about it in horrible detail later in the class, so I'm not going to go into too much detail now. I want to talk about the attributes of linear programming. The first is that in general, except for very special cases, there's no analytical formula for the solution. There is no (A transpose A) inverse A transpose b. By the way, don't confuse that with an inability to say anything -- you can say a huge amount of qualitative, intelligent things about a linear program. What you can't do is write down a formula for the solution. There are reliable and efficient algorithms and software.
In fact, as a homework exercise, you will write a top class linear program solver -- it will be 50 lines of MATLAB or Python, and you'll write something that will solve huge linear programs quite well. There's no asterisk on solve -- you get the solution. The computation, by the way, is proportional to m squared n. That's exactly the same as least squares. If you equate rows of the least squares objective with constraints, it's identical. That's not an accident. m is the number of constraints here, and k is the height of A. [Student:]So this m -- there's still that many rows in A. Yes, if I write that as a matrix inequality, Ax less than b, yes, that would be -- this m would be that k. We spent hours discussing whether this should be m or k, and we finally -- it probably should be k. [Student:]What about c? x is in R^n, so c is in R^n, too. Actually, linear programming -- it's a very old topic. Fourier wrote a paper about it. The modern era starts in the 30s -- where else but Moscow. The modern era traces back to Stanford and George Dantzig. LP was something that you just talked about until you had computers, at which point LP looked a lot more interesting. That was 1948. I think a lot of people knew about linear programming. Something like this coupled with computers -- that's a mature technology. Linear programming -- it looks like it would be easy to recognize, and in some cases, a problem really comes to you on a platter like this. Someone comes to you and says, help me solve this problem. I want to minimize this weighted sum of the variables and I have some budget constraints. There are three problems that have exactly this form. Here's the really cool part about linear programming. You will be stunned later in the class when you see the kind of problems that we can reduce to linear programming. Things that do not look like linear programming at all we will reduce to linear programming. What that means is they're solved.
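The standard form discussed above -- minimize c^T x subject to A x <= b -- can be illustrated with a deliberately naive pure Python sketch for two variables: enumerate the intersections of pairs of constraint boundaries and keep the feasible one with the smallest objective. This is not how the homework solver or any real LP code works (those use simplex or interior-point methods), but it shows the structure of the problem. The example LP is made up.

```python
from itertools import combinations

def solve_lp_2d(c, A, b, tol=1e-9):
    """Minimize c^T x subject to A x <= b (two variables) by brute-force
    vertex enumeration: an optimum of a bounded, feasible LP lies at an
    intersection of two constraint boundaries."""
    best_x, best_val = None, None
    for i, j in combinations(range(len(A)), 2):
        det = A[i][0] * A[j][1] - A[i][1] * A[j][0]
        if abs(det) < tol:
            continue  # parallel boundaries, no unique intersection
        # Cramer's rule for the 2x2 system A[i] x = b[i], A[j] x = b[j].
        x0 = (b[i] * A[j][1] - A[i][1] * b[j]) / det
        x1 = (A[i][0] * b[j] - b[i] * A[j][0]) / det
        # Keep the point only if it satisfies every inequality.
        if all(A[k][0] * x0 + A[k][1] * x1 <= b[k] + tol for k in range(len(A))):
            val = c[0] * x0 + c[1] * x1
            if best_val is None or val < best_val:
                best_x, best_val = (x0, x1), val
    return best_x, best_val

# Toy LP: minimize -x - y subject to x <= 1, y <= 1, x >= 0, y >= 0.
c = [-1.0, -1.0]
A = [[1, 0], [0, 1], [-1, 0], [0, -1]]
b = [1.0, 1.0, 0.0, 0.0]
x_opt, v_opt = solve_lp_2d(c, A, b)
```

The optimum sits at the corner (1, 1) of the feasible box, which is the geometric fact real LP algorithms exploit far more cleverly.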
Unless your problem is huge or you have some super real time thing like in communications, then there's a sense in which you're kind of done at that point. If you do medical imaging, that's a mistake, because the problems are too big. You can't say it's a linear program and walk away. You can't do communications. It depends on the time scale. You have to adapt to things. You can't detect the bits transmitted in a packet, because that's coming at you way too fast. I would recommend that you go into fields that are in between in size. We're going to see a bunch of tricks. Finally, I'll say what convex optimization is, because it would seem in a class with that title, one should say what it is in the first lecture. Here's what it is. It's an optimization problem -- minimize an objective subject to constraints. Here's what has to happen. The function f0 and the fi's have to be convex. You know what convex is. It means that the graph looks like that. That's a convex function. The graph should have positive curvature. The least squares problem has that form because if I look at the least squares objective and I look at the plot of it, it's a norm squared -- a quadratic function -- and basically, it looks like a bowl. If you take a slice at a level set, you get an ellipsoid. It's convex. Linear programming also is a convex problem because all of the objective and constraint functions are linear. Linear functions are convex. Linear functions are right on the boundary. They have zero curvature. One way to say convex is just positive curvature. This includes least squares, and kind of the central idea at the highest level of this class is this. If you want to solve a convex optimization problem, there are no analytical solutions. There are in special cases. We'll see tons of cases in communications and various other places where they have special analytical solutions. You've seen one in least squares already. In general, there isn't an analytical solution.
However, there are going to be reliable and efficient algorithms for solving these with no asterisk. You will get the solution. In fact, if someone came from another planet and landed here and asked you what you're doing, it would be very difficult to make an argument that solving a convex optimization problem, compared to a least squares problem, meant, for example, that you'd been reduced to a numerical solution -- which is what a lot of people might say with a hint of -- they don't like the idea. They say numerical method in a way that makes you want to go wash your hands. The computation time for solving convex problems is roughly proportional to something like the maximum of n cubed -- the number of variables cubed -- n squared m, and F, where F is the cost of evaluating the functions and their first and second derivatives. We don't have to get into that, but the point is it's like least squares. You can basically solve these. You will solve them in this class. You'll know how to write the algorithms to solve them and stuff like that. It is an exaggeration, in fact, to say it's a technology. It's almost a technology, and every time I give this class, it's getting closer and closer. I should say something about -- we'll get to it later, but this is a very profound statement. It basically identifies a huge class -- let's review what we know so far. A -- everything is an optimization problem. B -- we can't solve any optimization problems, despite ten or twenty weeks of lectures on it. Now what I'm saying is on the positive side. It's really cool, and it's not obvious at all. It says if the objective and the constraints all have positive curvature, then this thing is just like least squares. You can solve it exactly. You can solve it quickly. Although I'm not going to focus on it, you can say a whole lot about it theoretically. We will say some stuff about it theoretically, but that's not going to be the focus of the class. This is a pedantic way.
You might prefer it if I wrote it this way: f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y), for θ in [0, 1]. You might prefer it that way. [Student:]I don't know why it has to be zero one, though. Why can't it be 99 and 100.5 or negative five? There are versions where there are different constraints. If I just said f of alpha x plus beta y is less than or equal to alpha f of x plus beta f of y for all alpha and beta, the function would be called sublinear. It's a different thing. This is just the definition of convex. [Student:]It doesn't have any physical basis or anything. It's not gonna turn concave if you suddenly make it more than one. We'll answer your question. This is not accidental. For now, that's the definition. [Student:][Inaudible] just a statement of the [inaudible] of the line between -- That's what it is. It's this picture. By the way, we're going to get to this later, so you're not supposed to be understanding everything. First of all, I'm not saying that much. The stuff I am saying is kind of vague. You're not supposed to be parsing this. You're supposed to have your relaxed parsing mode on and let me get away with stuff. There'll be a time for, is that really a minus sign? That's coming later. You already know something you didn't know. Optimization problems where the objectives and constraints have positive curvature, roughly speaking, we can solve. The theoretical implication, I think, is extremely interesting. The practical ramifications of this are absolutely immense. It means you're messing around with some problem in your own field, and if you get it to a problem of this form -- it won't be obvious. It's much cooler when you get to it and you turn things around and you switch some variables around. All the smoke clears. There are some functions there where it's not remotely obvious they're convex. You show they are, and then you're a hero, and it's fun. It means you can now solve those problems. I give talks about these things lots of places. People say, that's really cool.
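The defining inequality f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y) can be spot-checked numerically. A small sketch, with made-up test points: sweep θ over [0, 1] for the convex function x squared, and watch the inequality fail for its negation, which is concave.

```python
# Spot-check the convexity inequality
#   f(theta*x + (1-theta)*y) <= theta*f(x) + (1-theta)*f(y),  theta in [0, 1],
# on a one-dimensional example. Not a proof -- just the picture in numbers.

def violates_convexity(f, x, y, n=101, tol=1e-12):
    """Return True if the convexity inequality fails for some sampled theta."""
    for k in range(n):
        theta = k / (n - 1)  # sweep theta over [0, 1]
        lhs = f(theta * x + (1 - theta) * y)          # chord's x-coordinate
        rhs = theta * f(x) + (1 - theta) * f(y)       # chord's height
        if lhs > rhs + tol:
            return True
    return False

convex_ok = not violates_convexity(lambda t: t * t, -2.0, 3.0)   # bowl: holds
concave_fails = violates_convexity(lambda t: -t * t, -2.0, 3.0)  # dome: fails
```

Geometrically this is exactly the picture referred to in the transcript: the graph of a convex function lies below the chord between any two of its points.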
How do you learn about this? How do you know if a problem is convex or not? I go, no problem. You just have to come take my class. You do ten homeworks, each of which takes 20 hours, and then you do this 24-hour final. After that, for sure you'll be quite well trained. Then they're less enthusiastic about it. It's actually often difficult to recognize convex problems. That's going to be a theme of the class. Let me distinguish a few things there. I would distinguish problems that come on a platter convex from the ones where you have to do some work and transform them and stuff like that. Let me move on. I want to do an example just to give a rough idea of what this looks like. For people who did 263, this will kind of tie into that. Here's our problem. I have a surface here with patches, and I have some lamps here. We're going to choose the illuminating powers of the lamps. That's going to be our variable. The illumination on a patch here is going to be the sum of the illumination from all the lamps here. It's going to be proportional to the lamp power, and then the proportionality constant is going to be an inverse square law. r is the distance between the patch and the lamp. There'll be something which is a shading effect, because obviously, if you're coming in straight on here, then you're going to get the full power. This lamp, for example, puts very little illumination on here, and if this were below its horizon -- if there were a lamp here, it would not illuminate this patch at all. This is a simple thing. The problem is to achieve a desired illumination, which is given. You want to do this with bounded lamp powers. I have a maximum power. Powers cannot be negative. I care on a log scale, because what I care about is missing the desired illumination by two percent or twelve percent. I don't care about absolute error. It's the ratio compared to the desired value. The question is, how would we solve it? Let's take a look and see how you might solve it.
If your question is, do I shamelessly reuse material from one place in another, I can confirm that, yes. Are you asking, have you seen that figure before? The answer could well be yes, you have. Let's talk about how to solve it. The first thing you might do, and I would recommend this -- the first thing you do is you say, let's try some simple suboptimal schemes. One is just this -- you set all the lamps at the same power. You vary it and you plot this objective. You do a search over it. There's no dishonor in it. You do that. That might work, but it might not. That's a good baseline design. You could say, well, I just learned least squares from 263. You're going to use least squares. I'm going to do this. The objective here is not the sum of the squares. This is real life. These are the illumination engineers. Everyone uses the absolute value -- the sums of the absolute values of the log percentage error. I'm making it up, of course. You say, well, good for you. We use the sum of the squares. You solve this problem. When you solve this problem, I guarantee you some of the p's are going to come out negative. By the way, you're going to do super well. You're going to get a nice, flat illumination pattern. What will happen is you'll look at the powers and a whole bunch of them will be negative. It turns out you can do really well in illumination with lamps that can spray negative power out. That's not good. Then the heuristic. What do you do? Here's what you do. You say, if a p is bigger than the maximum power, I just set it equal to p max. If it's a negative lamp power, I turn that lamp off. Once again, you see how well you did, and you might do better than uniform -- maybe worse. I don't know. Now, this is what someone trained in least squares would do. They'd say, not a problem. They'd go over here and say, I want p_j to be in the interval zero to p max. Therefore, I'm going to add a regularization term which charges p for being away from the center of that interval.
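The clip-to-bounds heuristic just described -- solve the unconstrained least squares problem, then push each power back into its feasible interval -- is a one-liner. The "unconstrained solution" below is just hypothetical numbers standing in for whatever the least squares solve returned.

```python
# The heuristic from the lecture: clip each lamp power into [0, p_max].
# p_ls stands in for a made-up unconstrained least squares output,
# some entries of which came out negative or over the limit.

def clip_powers(p, p_max):
    """Project lamp powers onto the feasible interval [0, p_max]."""
    return [min(max(pi, 0.0), p_max) for pi in p]

p_ls = [1.7, -0.3, 0.9, 2.4]          # hypothetical LS solution
p_feasible = clip_powers(p_ls, p_max=2.0)
```

As the lecture warns, the clipped point is feasible but generally not optimal; the clipped pattern can be better or worse than the uniform-power baseline.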
Everybody see what this is? You start with all the w's at one, and you solve this problem. Then, you just start weight twiddling. You'll do quite well here. You'll come up with some algorithm that twiddles the weights, and how you twiddle them is totally obvious. If p comes out outside that interval, that weight needs to go up in the next cycle. If p is timid because your weight is so high, you want to decrease that weight. A couple of cycles of this, and you're going to get a very, very good solution. Unfortunately, you might also go on to then write an impossible to read 12-page paper with pages and pages describing your iteratively reweighted illumination thing. Hopefully, you won't. You could also use linear programming. If you know about linear programming, you could actually solve this L1 problem, where you minimize an L1 norm. This is closer than that. Linear programming handles the inequalities. There's no twiddling there. This would probably do the best of all of them. The problem is convex, so this objective function -- it just looks like this. It's linear over here, and then it's an inverse over here. You look at that function and you realize a couple of things. The important part -- that's what you're going to be trained in over the next ten weeks: looking for convexity. This is what we like to see. By the way, you will see that that function is not differentiable. In a lot of other treatments of optimization, differentiability has this very high role, because a lot of things are based on gradients and derivatives, and there's no derivative there. So in convex optimization, differentiability is going to actually play a much lower role. In fact, it's a non-role. This problem, even though it is non-differentiable here, can be solved as fast as least squares if you know what you're doing. We might even have you write from scratch a solver for it. We could also assign it.
It would be four lines in a higher-level language or something like that to solve this. That's it. This is just an example of a problem where it's not obvious exactly how to solve it, and you can imagine a lot of people not knowing. Let's look at a couple of additional constraints. [Student:]Where did that curve come from? I plotted it. [Student:]I mean the equation. How did you know to do [inaudible]? If you go back and exponentiate my other objective, you'll find this. [Student:]So you just did it on both sides. Yes. If you minimize something, it's the same as minimizing the exp of that thing, because exp is monotone increasing. Let's add some additional constraints here. Here's one -- no more than half the total power is in any collection of ten lamps. That would be one. Another one is no more than half the lamps are on, but you get to choose which half. I won't go into all the gory details, because for one thing, I'm running out of time, but it's not -- both of these sound complicated, or easy, or something like that. If you don't know about convexity, you wouldn't know that the second one is an unbelievably difficult problem. In fact, you'd have to check all n [inaudible] n over two sets of half the lamps, and for each one, solve the problem. Basically, everything would come down to that to actually get the real answer. You could get very good heuristics that would guess an answer, but they would not be certified as the actual correct one. This one -- no more than half the total power is in any ten lamps -- that actually is convex, and it's not obvious. By week three of this class, you will know things like that. This is actually cool, because these are things that look very similar. If I said them the right way, I could probably make you think that they are kind of the same question. I'll often do that in talks, except I don't give the answer, and then I pick some poor person. The point is these are just not obvious at all, these things. Why is it famous? It's not famous.
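The "no more than half the total power in any ten lamps" constraint is convex because the sum of the k largest entries of a vector is a convex function (it's a maximum of linear functions). A toy numerical spot-check, on made-up random vectors and with k = 3 out of 10 lamps rather than 10, is easy to write:

```python
import random

# "Power in any collection of k lamps <= half the total" is the constraint
#   f(p) <= 0.5 * sum(p),  where f(p) = sum of the k largest entries of p.
# f is convex; here we spot-check the convexity inequality on random pairs.

def sum_k_largest(p, k):
    """Sum of the k largest entries of p -- a convex function of p."""
    return sum(sorted(p, reverse=True)[:k])

def convexity_holds(f, x, y, steps=11, tol=1e-9):
    """Check f(th*x + (1-th)*y) <= th*f(x) + (1-th)*f(y) on sampled th."""
    for s in range(steps):
        th = s / (steps - 1)
        z = [th * xi + (1 - th) * yi for xi, yi in zip(x, y)]
        if f(z) > th * f(x) + (1 - th) * f(y) + tol:
            return False
    return True

random.seed(0)
ok = all(
    convexity_holds(lambda p: sum_k_largest(p, 3),
                    [random.random() for _ in range(10)],
                    [random.random() for _ in range(10)])
    for _ in range(100)
)
```

The other constraint -- choose which half of the lamps may be on -- is combinatorial, which is exactly why the two superficially similar statements land on opposite sides of the tractability line.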
[Student:]I mean, people have [inaudible] papers and have investigated [inaudible]. You mean the illumination problem? I think I probably made it up one day, and then actually [inaudible] -- I can say this because he's a colleague of mine at MIT. The next thing I knew, it was in his book, the famous illumination problem. You've been subjected to it in 263. That would be the 263 version of the illumination problem. Notice that we didn't ask you to find any powers, because you would have had this annoying problem of negative powers coming out. That's selective, though. I don't think the problem is any more famous than it's been here. I don't know. No, because this would subsume all -- you have to choose which half of them are actually going to be possibly on, not which are actually on. It's exponential in n. Let me just say a little bit and then I'll quit. The course goals and topics -- the goal is to allow you to recognize and formulate problems. You should know enough to develop code for these things. You'll see it's very simple. You might not think that about seven weeks into the quarter, but you'd be shocked at how simple some of these things are. We'll focus a bit on the theory of this, and I do want to skip forward to something at the very end here. I know I'm over, so I'll just take a minute or two to do this. I've already mentioned this. First of all, the theory of convex optimization is more than 100 years old. The first paper on this is from 1890 or something like that. In fact, the ideas trace back earlier. By 1970, the theory was very well known. There are some points here that I think I don't have time to go into, but you can read about this in the first chapter. What I do want to say is something about why it's interesting now, if this is 100 years old as a mathematical subfield. What makes it interesting now?
What I can say is that since about 15 years ago, people have been finding applications in lots and lots of areas, and it just sort of starts coming up. It may not particularly help you in some area to know that a problem is convex, but it sure doesn't hurt. It might in some cases allow you to solve it. This was sort of not the case, with the exception of least squares and linear programming. These have been widely used and widely applied since the 1950s. Basically since the 1990s, there were a lot of things found in areas like machine learning and statistics and control and signal processing and stuff like that. You get a nice, positive feedback loop, because once more people start discovering these problems and start writing papers on, for example, the famous illumination problem -- once they start appearing, what happens is people who write the algorithms see two of these and say, somebody should write a code for the illumination problem. Then, they write a beautiful code, well tested, numerically stable, hopefully open source and all that, and then everyone can now solve the famous illumination problem. Only then do people realize that there never was such a problem, hopefully. I quit here, and I'll see you, in theory in two days, but more like two months or something like that.