Hey, I guess we'll start. Let me just make a couple of announcements. We've combined the rewrite of the preliminary project proposal with the mid-term project review, and I think that's due this Friday. Good, okay, it's due Friday. So that's going to be that. You're welcome to show it to me or the TAs if you want a format scan. The TAs are now as nasty as I am, and they can scan something, just walk down it and say, 'That's a typesetting error. I'm not going to read any farther.' Or they'll read down a half-page and say, 'What are the variables?' and then throw it at you with a sneer. So I've trained them pretty well. The danger, of course, is that they start applying that to the stuff I write, which has happened in the past. They say things like, 'This isn't consistent. You use this phrase on this page and this one on the other one.' And you look at the two, and you say, 'Yeah, that's true, all right.' The executive decision was made: we're going to switch things around. It's not the natural order of the class, in fact, but it fits better with people who are doing projects. A bunch of people are doing projects that involve non-convex problems, so today we switched around, and we're going to do sequential convex programming first, and then we'll switch back to problems that are convex and things like that. We'll do large-scale stuff next. The good news is that we can actually pull this off, I think, in one day. So we're going to come back later. There will be several other lectures on problems that are not convex and on various methods. There's going to be a problem on relaxations. We're going to have a whole study of l1-type methods for sparse solutions. Those will come later. But this is really our first foray outside of convex optimization. It's a very simple, basic method. There's very, very little you can say about it theoretically, but that's fine; it's something that works quite well. Don't confuse it with -- although it's related to -- something called sequential quadratic programming. That's something you'll hear about a lot if you go to Google and things like that; those are the things that would come up. But I just wanted to collect together some of the topics that come up. Okay, so let's do sequential convex programming. Let's see here. There we go. Okay, so I guess it's been sort of implicit for the entire last quarter and this one: the whole point of using convex optimization methods on convex problems is that you always get the global solution, up to numerical accuracy. So, modulo numerical accuracy, you always get the global solution, and it's always fast. And that's fast according to theorists: if you want a complexity theory, there are bounds that grow like polynomials in various things -- problem size, log of one over epsilon, where epsilon is the accuracy. And in practice as well, these methods are very fast and extremely reliable. Okay, now for general non-convex problems, you really have to give up one of these: either you give up 'always global' or you give up 'always fast.' That's basically the big bifurcation. There are things in between, but roughly, this is the idea. On one side, you have local optimization methods. These are methods that are fast -- plenty fast, just as fast as convex optimization methods -- but they relax what they mean by a solution. They find some kind of a local solution to the problem. It might not be the global one -- it can well be the global one, but you won't know it.
That's local optimization. At the other extreme, you have global optimization methods, and we'll talk about these later in the class. These actually find the global solution, and they certify it. So when you solve the problem, when it stops, it basically says, 'I have a point with this objective value, and I can prove that I'm not more than 1e-3 away from the optimum, period.' Right? Now, in convex optimization, we do that very easily with a duality certificate. So when you finish your problem, you say, 'Here's this point. It's feasible, and it achieves this objective value. Here's a dual feasible point, which proves a lower bound that is epsilon away from the objective of your feasible point,' and that's how you prove it. For non-convex problems, the proofs -- the certificates, as we'll see -- are bigger and longer. We'll see what they are. Okay, now what we'll talk about here is a specific class of local optimization methods, based on solving a sequence of convex programs. The semantics of everything we're going to do today has to be understood: we're not actually solving any problems at all. When we 'solve' a problem -- I'll stop saying quote, unquote, but you'll hear it; I'll aspirate the q and there'll be an air puff after the t, and that's the quotes -- it has to be understood that when we say we 'solve' this problem or 'here's the result we get,' these are not the solutions of the problems, as far as we know. They might be, and they are pretty good. But it's to be understood that we are now in the world of non-convex optimization, and there may be bounds you can get, but that's a separate story. Sequential convex programming goes like this: it's going to solve a sequence of convex problems. And in fact it fits into a big hierarchy. If someone says to you, 'How do you solve this convex problem?' it's actually a sequence of reductions, and the reductions go like this -- I just want to put this whole thing in the main hierarchy. Someone says, 'Well, how do you solve this problem with inequality constraints and all that kind of stuff?' And you say, 'No problem. I use a barrier method, which basically reduces the solution of that problem with constraints to a sequence of smooth convex minimization problems.' And someone says, 'Well, yeah, how many of those do you have to solve?' And you say, '20.' And they say, 'Really, 20?' And you go, 'Okay, sometimes 40.' That's the right way to say it. Then it's, 'How do you solve a smooth minimization problem?' And you say, 'Well, I solve each of those by solving a sequence of quadratic convex problems.' That's exactly what Newton's method is. And someone says, 'Yeah, how many of those do you have to solve?' And you go, 'I don't know, 5 each time,' or something, roughly. And then they say, 'How do you solve a quadratic minimization problem with equality constraints?' The answer is, 'I solve linear equations.' If they keep going, then you can explain all the methods for exploiting structure in linear equations. So anyway, you can see that each level is based on reducing the solution of that problem to solving a sequence of the ones below it. And ultimately, it all comes down to linear equations; in fact, if you profile any code -- almost any code -- all it's doing is solving linear equations, nothing else. People who work on numerical linear algebra are always very happy to point that out. Okay, so this fits on top of that hierarchy.
So you have a problem that's not convex and you want to 'solve' it, and you do that by reducing it to a sequence of convex programs. Okay, all right. Now, the advantage of this is that the convex portions of the problem are handled exactly and efficiently, because in a lot of problems -- and we'll see this just from our baby examples -- a lot of the problem is convex. Huge parts of the objective are convex, lots of constraints are convex; I mean just very typical constraints, like upper and lower bounds on variables. So a lot of these are going to be convex, and those are going to be handled exactly. That's the good news. On the other hand, I've already said this a couple of times: we have to understand that this is a heuristic. It can fail to find even a feasible point, even if one exists. And even if it finds a point and does something, there's absolutely no reason to believe it's the global optimum. I think I've said that enough times, but in the context of this course, it's worth repeating a bunch of times. Now, in fact, the results also can and often do depend on the starting point. You can look at that from the glass-half-empty or glass-half-full perspective. The half-empty approach says, 'Well, this shows you this is all stupid and heuristic. If it depends on where you started from, then why should I trust you?' I sort of agree with that. Here's the half full. The half-full approach says, 'Oh, no problem. That's fantastic. We'll run the algorithm many times from different starting points, and we'll see where it converges to.' If it always converges to the same point, and the person we're defending this to isn't too sharp, we can say, 'Oh, we think this is the global solution, because I started my method from many different points, and it always came to the same thing.' If you're an optimist again, and it converges to many different points, you say, 'No problem. I'll run it 10 times and take the best solution found in my 10 runs, and therefore, you see, it's an advantage. It's a feature that you do this.' Now, actually, these methods often work really well. That means they find a feasible point with a good -- if not optimal; actually, often it is optimal, you just don't know it -- objective. That's the background. So we're going to consider a non-convex problem, a standard optimization problem, and you can have equality constraints that are not affine, and you can have inequality constraints that are not convex, and so on. And the basic idea of sequential convex programming is extremely simple. It goes like this. At each step, you're going to maintain an estimate of the solution. We're going to call that x^k, where the superscript k denotes the iteration counter. And you're going to maintain something called a 'trust region.' I'll get to what a trust region is in a minute, or we'll see how it acts. It's basically a little region that includes x^k -- it surrounds x^k -- and it's the region over which you propose to find a slightly better solution. That's the idea. If the trust region is small, it says you're only looking very locally, and x^{k+1} is going to be local. If it's bigger, then it means you're willing to take bigger steps. So here's what you do. You're given a trust region, and you ask each inequality constraint function and the objective to form a convex approximation of f_i over the trust region. And we'll see there are going to be lots of ways to do that, some simple, some complex, and so on.
And then for the equality constraints, you're going to ask each possibly non-affine constraint to generate an affine approximation of itself. Now, obviously, if the objective or a bunch of the inequality constraint functions are convex, a perfectly good approximation of a convex function is itself. So forming a convex approximation of a convex function is very easy, and the same goes for affine. So I simply replace the objective, the inequality constraint functions, and the equality constraint functions with their appropriate approximations, and I impose this last condition, which is the trust region constraint. And now I solve this problem, which is a convex problem. That's the idea. And the trust region constraint is not a real constraint in the problem, obviously, because T^k is made up by us. It's there to make sure that the approximations are still, roughly speaking, valid. Now, for the trust region, a typical choice -- by the way, this is really a set of ideas; the algorithm we show here is simplified, and when you actually do these things, depending on your particular problem, you'll switch all sorts of things around, use different approximations, different trust regions, it all depends on the problem, so it's really more of an idea -- a typical trust region would be a box. Now, if a variable appears only in convex inequalities and affine equalities, then you can take that rho_i to infinity, because you don't need to limit a variable that appears only in convex constraints. By convex equality, of course, I mean an affine equality. How do you get these approximations? Well, let's see. The most obvious thing is to go back to the 18th century and look at Taylor expansions. In fact, that's all of calculus. That's what calculus is: nothing but an answer to the question, 'How do you get an affine approximation of a function in an organized way?' Roughly, right? And the answer is you torture children for 14 years memorizing stupid formulas, long after -- centuries after -- anyone has remembered what any of it is for. So that's the answer to that question, how you get that approximation. So here's your basic 18th-century approximation. If you want a convex approximation, there's something interesting you can do here. Of course, the second-order Taylor expansion would have the Hessian right here -- that's this. But if the Hessian is indefinite and you want a convex approximation, you can simply drop the negative part of the matrix. By the way, how do you get the positive and negative parts of a matrix? In this case, it's an eigenvalue expansion -- it's like the SVD, but not quite; it's the symmetric eigenvalue expansion. You take the eigenvalue decomposition, you set the negative eigenvalues to 0, and that's the projection onto the positive semidefinite cone -- that's the PSD part. Now, these are local approximations. That's the whole point of calculus, and they don't depend on the trust region radii. But that's kind of the idea. So calculus is vague, and it works like this. If you say, 'Yeah, how good an approximation is this?' you get, 'Oh, it's very good.' And you say, 'Well, what does that mean?' Well, very good means that if this is small, then the difference between this and this is on the order of that small quantity squared.
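In code, that convexified second-order model looks like the following. This is a minimal numpy sketch of the "PSD part" operation just described; the function name is mine, not anything standard.

```python
import numpy as np

def psd_part(H):
    """Project a symmetric matrix onto the PSD cone: eigendecompose,
    clip the negative eigenvalues at zero, and rebuild."""
    w, V = np.linalg.eigh(H)                      # symmetric eigenvalue decomposition
    return V @ np.diag(np.maximum(w, 0.0)) @ V.T

# Convexified second-order Taylor model of f around x0, with gradient g and
# (possibly indefinite) Hessian H:
#   fhat(x) = f(x0) + g @ (x - x0) + 0.5 * (x - x0) @ psd_part(H) @ (x - x0)
```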
You say, 'Yeah, but which constant sits in front of that small-squared quantity?' And the answer is, 'I don't know. It depends on the problem and all that.' It's vague; it basically has to do with limits and all that kind of stuff. Actually, I'll show you something more useful, and this is more modern. In my opinion, this is the correct way -- not to knock calculus or anything, but it is. Actually, the reason I like to emphasize all of this is that all of you have been brainwashed for so long, especially when you were very young, calculating derivatives. So if someone says 'affine approximation,' it comes straight out of the little part of your brain right in the center that controls breathing and stuff like that, depending on when and how long you were tortured learning how to differentiate t^2 sin t. Okay, so that's why I like to emphasize that there are newer methods. They're relatively modern; they came up first in estimation and then in other areas. They're called particle methods, for a bunch of reasons, because they came up first in estimation. I'll show you what they are. Here's a particle method. This is the modern way to form an affine or convex approximation, and it goes like this. What you do is you choose some points. There are problems with this, and I'll talk about that in a minute -- I'm not saying throw the calculus out. But what you do is choose a bunch of points in the trust region, and there's a big literature on this. It's not even a literature, because they don't really say anything; it's more like a lore, a craft, actually. I've heard people say things like, 'What did you use?' 'Oh, I used the sigma points or something.' What's a sigma point? It's some heuristic for determining some points or something. Anyway, here are the types of things you could use. You could use all the vertices; it depends on the dimension of the problem. If the dimension of the problem is, say, 5, all the vertices is 32 points, which you might actually do. You could do the center and all the vertices and so on, because you want to sample the function in a little box. This works much better when each function -- even if the total number of variables is large -- only involves a couple of variables. That's what you're hoping for. Other options: you can use some of the vertices, you can use a grid, you can generate random points. There's a whole lore on this. And you simply evaluate your function, and now you have a pile of data: the argument and the value of the function. And what you do is take this pile and fit it with a convex or affine function, depending on what was asked for. So that's the picture. And, of course -- it's going to come up -- that fit is a convex optimization problem as well, not surprisingly. There are a lot of advantages to this. One is that it actually works pretty well with non-differentiable functions, or functions for which evaluating the derivatives is difficult. So for example, if you want to do an optimal control problem with some vehicle or something like that, some air vehicle or whatever, and someone says, 'Here's my simulator,' and it's some giant pile of code -- I guarantee you it will have all sorts of look-up tables, horrible polynomial approximations, and things like that.
You'll look deep, deep, deep into this differential equation or something, and no one will even know what it is -- some function that calculates the lift or the drag as a function of angle of attack and dynamic pressure or something like that -- and it's going to be a look-up table obtained from wind tunnel tests. That's what it's going to be. If it's anything else, it's because someone else fit it for you, but you shouldn't let someone else fit something for you, because they didn't care about convexity, and you do. That's why you shouldn't just let them fit things. The joke is, a lot of times people fit things and, by accident, the fits are convex. That often happens. So this works really well, especially for functions for which evaluating derivatives is difficult, or which are given by look-up tables and all that kind of stuff. Now, when you get a model this way, it's actually very interesting. You shouldn't call it a local model. A local model basically refers to calculus. If someone says, 'Here's my local model,' and you say, 'Well, how accurate is it?' they can say, 'Extremely accurate, provided you're near the point around which it's developed.' So it's kind of a vague answer. This one is also not a global model. A global model says, 'No, that's the power. That's an accurate expression for the power over three orders of magnitude. That's the power. I'm telling you.' That's a global model. I call these regional models, because the model depends on the region. Let me just draw some pictures here. We should have done this for the lecture, but oh well -- we just wrote these. Let's draw a picture. Could you raise this just a few inches here? I think we'd be better off. Cool, that's good. Even better, that's perfect. Here's some function that we want to fit; it doesn't even really matter what it is. So if you take this point and someone says, 'Make me an affine model,' then of course calculus returns the tangent approximation, right? A regional model would do something like this. In a regional model, you have to say what the region is, and the model is going to depend on the region asked for. So if you say, 'Would you mind making me a regional model over -- that's too small, sorry -- let's make it here to here,' so we want a regional model over that range, I don't know exactly what it's going to be, but it's going to look something like that. Now, here's the cool part. It doesn't even have to go through that point. And you can see that you're going to get much better results with this, in terms of getting a better point. Now, it also means that those trust regions need to tighten before you can make any claim about high accuracy or whatever. But that's the idea. Is this clear? I think these are totally obvious but really important ideas. So how do you fit an affine or quadratic function to data? Well, affine is EE263: it's least squares. In the simplest case, it's just EE263, so the affine model is just least squares, and I'm not even going to discuss it. By the way, you don't even have to do least squares. Once you know about convex optimization, you can make it anything you like. If you want to do a minimax fit, fine; if you want to allow a few weird outliers, throw in an l1 or Huber fit or something. You know what to do. If you don't care about small errors -- if an error of +/- 0.1 doesn't bother you, and beyond that you start getting irritated -- put in a dead zone. All this goes without saying, so use everything you want to use.
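As a concrete sketch of that least-squares route (my own illustration, not code from the notes): sample points in the box trust region around x^k, evaluate the true function, and fit a regional affine model.

```python
import numpy as np

def fit_affine_regional(f, xk, rho, n_samples=50, rng=None):
    """Fit a regional affine model fhat(x) = a @ x + b to samples of f
    taken in the box trust region {x : |x_i - xk_i| <= rho}."""
    rng = np.random.default_rng() if rng is None else rng
    n = xk.size
    Z = xk + rng.uniform(-rho, rho, size=(n_samples, n))   # particles in the box
    y = np.array([f(z) for z in Z])                          # true function values
    A = np.hstack([Z, np.ones((n_samples, 1))])              # design matrix [x, 1]
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)             # least-squares fit
    return coef[:-1], coef[-1]                                # (a, b)
```

Swapping the least-squares objective for a Huber, l1, or dead-zone loss, as mentioned above, only changes this small fitting problem, not the overall SCP loop.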
I should add that at every iteration, for every function appearing in your problem, you're going to be doing this. So is that going to be slow? Yes, if it's written in Matlab. But if it's done properly, you're solving extremely small convex problems. If this is done correctly -- if it's written in C -- these things will just fly. These are sub-millisecond problems. So that's also not a problem, unless you do it in some very high-level interpreted thing. Here's an example showing how you fit a convex quadratic, and you would do it this way. It's an SDP, because you have a positive semidefinite constraint here, and this here is, of course, the convex quadratic that you're fitting. Your variables are P, q, and r. These are data, and this objective here is, of course, a convex quadratic in the data. So it's a least-squares problem with a semidefinite constraint. Now, another method, which we will see shortly, is quasi-linearization. You could certainly say it's a cheap and simple method for affine approximation. To be honest with you, I can't think of anything good about it except that it appeals to people's laziness; I can't think of any other good reason to do it. We use it later in an example, but here it is. It's kind of dumb. It basically says this: you write your non-affine function h(x) as A(x) x + b(x), where you allow A and b to vary with x. Now, that's kind of trivial, because one example is just to take A to be 0 and say that b is h. As you can see right out of the gate, this method is not looking too sophisticated -- and it's not. But then what you do is say, 'Well, look, if x hasn't changed much, I'll just use the previous value of A(x), and b will be the previous value too.' This is, in some sense, even dumber than calculus, because it isn't even a local approximation. It's not a good approximation, but it's often good enough. So here's an example. Here's a quadratic function. I'm going to rewrite it -- there are many ways to do this, but I'll rewrite it this way, as A(x) x + b. So if I quasi-linearize it, I simply plug the previous x into A. Whereas the Taylor approximation, the correct way to do this, is to take this thing out, and these are not the same if you multiply them out. It's quite weird, but they're not the same. Let's do an example. Our example is going to be a non-convex quadratic program. We're going to minimize this quadratic over the unit box -- by the way, it's a famous NP-hard problem. It doesn't matter; it's a famous hard problem. Here P is symmetric but not positive semidefinite. By the way, if P is positive semidefinite, this is a convex problem, it's completely trivial, and you get the global solution. So it's only interesting if P has a couple of negative eigenvalues -- one will do the trick. We're going to use the following approximation: the Taylor expansion, but truncated -- we remove the negative part of P, because that would contribute a negative curvature component. So here's an example in R^20. We run SCP, sequential convex programming, where the trust region radius rho is 0.2. Now, all the action goes down in a box of +/- 1: the range of each x_i is from -1 to 1, which is a distance of 2. So this basically says that in each step, you can only travel 10 percent of your total range.
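Backing up for a second to that convex-quadratic fit, here's a hedged sketch of it as a small SDP, written in CVXPY (a Python stand-in for the CVX the lecture mentions). The data layout and the 1/2 convention are my own choices, not the notes'.

```python
import cvxpy as cp
import numpy as np

def fit_convex_quadratic(Z, y):
    """Least-squares fit of fhat(z) = 0.5 z' P z + q' z + r to data (Z, y),
    with the semidefinite constraint P >= 0 so the fit is convex."""
    m, n = Z.shape
    P = cp.Variable((n, n), PSD=True)   # the semidefinite constraint
    q = cp.Variable(n)
    r = cp.Variable()
    resid = cp.hstack([0.5 * cp.quad_form(Z[j], P) + q @ Z[j] + r - y[j]
                       for j in range(m)])
    cp.Problem(cp.Minimize(cp.sum_squares(resid))).solve()
    return P.value, q.value, r.value
```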
So, back to the example: with a trust region of 0.2 on a box of width 2, in the worst case it will take you at least 10 steps just to move across the range -- to move from one corner of the feasible box to the other. You start from 10 different points, and here are the runs. This is the iteration number, going out to maybe 20 or so, and you can see that basically by around 25 steps these things have converged. And you can see they indeed converge to very different things. So here, if we were to do 10 runs, this would be the best value we got. It's around -59 or so; it's nearly -60. That's this one. Now, we don't know anything about where the optimal value is. This is a lower bound -- I'll show you how to derive it in just a second, but it's a lower bound from Lagrange duality. So what we can say at this point is that we have absolutely no idea where the optimal value is, except that it's in this interval: between -59 and whatever this is, -66.5. So there's a range in there, and we have absolutely no idea where in it the optimum is. My guess is it's probably closer to this one than that one, but it'll be somewhere in there. You want a better lower bound? Wait until the end of the class, and we'll do branch and bound, and we'll show you how to bring the gap to 0 at the cost of computation -- and the computation will be a totally different order of magnitude. It won't be 25 convex programs; it will be 3,000 or something like that, but you'll get the answer, and you'll know it. So that's it. Let me just show you how this lower bound comes up. You know about this; it's just Lagrange duality. You write down the box constraints as x_i^2 <= 1, and you form a Lagrangian, which looks like this. This you can minimize over x: for any lambda it's a quadratic in x, and if you minimize it, this is roughly what you get. There are some stupid boundary cases, which don't matter; anyway, you get the same answer. This matrix here has to be positive definite -- that's not quite true, because it can be positive semidefinite if some other condition holds, but let's ignore that boundary case. If this matrix here has a negative eigenvalue, then for sure the infimum of this quadratic is minus infinity. That's for sure. If it's positive definite, though, you can just set the gradient equal to 0, evaluate, and you get this expression here. And you can see this expression is -- as it has to be -- concave, because we know the dual function g is always concave. We know, by the way, that even for a non-convex problem, the Lagrange dual is convex. Therefore, roughly speaking, it's tractable. We can solve it, and when you solve it, you get a lower bound on the original problem. So this is the dual problem, and it's easy to solve. You can convert it to an SDP or -- why bother -- let CVX do it for you, because that function exists there. So it's literally like one or two lines of CVX to write this down, and then you get a lower bound. If you try it on this problem instance, you get this number here. And in a lot of cases, actually, when these things are used, you don't even attempt to get a lower bound. When you're doing something like sequential convex programming, you're generally doing it in a different context. You don't have a lot of academics around asking you, 'Is that the absolute best you can do? Can you prove it?' and stuff like that. You're just basically asking, 'Can I get a good trajectory? Can I get a good design? Does the image look good?' -- that kind of thing.
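For reference, here's roughly what those 'one or two lines of CVX' look like, sketched in CVXPY. I'm assuming the instance has the form minimize x'Px + q'x subject to x_i^2 <= 1, so the dual function is g(lambda) = -(1/4) q'(P + diag(lambda))^{-1} q - 1'lambda whenever P + diag(lambda) is positive definite.

```python
import cvxpy as cp
import numpy as np

def box_qp_dual_bound(P, q):
    """Lagrange-dual lower bound for: minimize x' P x + q' x, subject to |x_i| <= 1."""
    n = q.size
    lam = cp.Variable(n, nonneg=True)
    # matrix_frac(q, X) = q' X^{-1} q, and it encodes X >> 0 implicitly.
    g = -0.25 * cp.matrix_frac(q, P + cp.diag(lam)) - cp.sum(lam)
    prob = cp.Problem(cp.Maximize(g))
    prob.solve()
    return prob.value   # a lower bound on the nonconvex QP's optimal value
```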
Okay, that's it. Now you know what sequential convex programming is. But it turns out that to actually make it work, there are a few details. We're going to go over the level-0 details. There are level-1 details as well, and level-2, but we'll just look at level 0. These are the ones you're going to run into immediately if you do this stuff. By the way, how many people here are doing projects that involve non-convexities or something like that? Okay, so a fair number. These are the highest-level things you have to know about immediately. The first is this. You can form this convex approximate problem -- let's go back to what it actually is; there you go -- you can form this thing, but what if it's not feasible? That could happen. Say you start with a point that's way off -- way, way in the wrong neighborhood, just totally off -- and you have a small trust region, so you can't move very far. That means this problem is probably not going to be feasible. So right now, if you were to run this, depending on the system -- if you run, for example, CVX -- it will simply come back and tell you it's infeasible. Instead of returning an approximate x, it will return a bunch of NaNs, along with some dual information certifying that the problem you passed in was infeasible -- that's just in case you didn't trust it. A certificate is not what you want here; what you really want is to make progress. So that's the first thing. Now, the other issue is the following: even if this problem is feasible and you take a step, how do you know you're making progress on the main problem? To evaluate progress, you're going to have to take into account a bunch of things. First of all, the objective -- that's the true objective, not the approximate objective. Obviously, you want this to go down, and if the point is feasible, we'll see cases where that has to happen. If the point is feasible, then it's easy to measure progress: by the objective, because that's the semantics of the problem. The semantics of the problem is literally this: if you have two feasible points and you want to compare them, the one with the better objective is better, end of story. The problem is, what if the approximate problem is infeasible, or the exact constraints are violated at the current point? You have to have some measure that tells you whether you're making progress. So you might say that making progress means you would like these violations to go down. That's sort of one measure of making progress. This is like having a Lyapunov function. There are lots of other names for these things; 'merit function' is another name used to describe a scalar-valued function that tells you whether or not you're making progress. That's a standard and coherent approach. Now, if you're not making progress, it could be for lots of reasons. It could be that your trust region is too big, and so what's happening is your function is wildly non-linear over the trust region. Your least-squares thing is providing some least-squares fit, but it's making horrible errors. By the way, when you call the method on the functions that says 'return an affine approximation,' it can also return the errors. And if the errors are huge, there's probably no point in forming and solving the convex program.
I'm talking about a more sophisticated setup there, where the caller will actually say, 'Oh, that error is too big; reduce your trust region by a factor of 2 and try again,' or something like that. So one reason you're not making progress is that your trust region is too big and you're making poor approximations. And then what happens is you form the approximate problem, and on the approximate problem you have to make progress, because it's a convex problem -- unless it just says you're already optimal, it will suggest a new point that makes progress on the approximation. Any time you form a measure of progress that is convex, the subproblem will make progress on it, period. But there's no reason to believe the true non-linear problem improves. On the other hand, if the trust region is too small, several things happen. One is that the approximations are good, because you're basically taking little mouse steps in x, so your approximations are quite good, but the progress is slow -- you're going to solve a lot of convex programs. And there's one other problem, and this is a bit weird: you are now more easily trapped in a local minimum, because you're just going to slide downhill and arrive at one point. If your trust region is big at first and your function has all sorts of wiggles in it, since you're looking down on your function from high up and getting a very crude approximation, you're actually going to do better; if you have really tiny rho's, you're going to get caught in the first local minimum. As you're figuring out, all of this non-convex optimization is some combination of art, heuristics, and other stuff. [Student:][Inaudible] for the rho then? Yes, we're going to get to that. There is, yes. I'll show you about the simplest thing that might work. And the truth is, none of this works, in the true sense of the word. But if you put quotes around it, then I can tell you what 'works,' and that means something comes out and you get something that's reasonable and close to feasible. Is it optimal? Who knows? But you get a nice picture, a nice trajectory, a nice estimate of some parameters, or whatever it is you want. So a very simple way of constructing a merit function is this. You take the true objective -- not the approximated one -- and you add a positive multiple of the violation: that's the total, true violation, where this is the inequality constraint violation and that's the equality constraint violation. Something very important here: these are not squared. They're not squared, and let me say a little bit about that. They would be squared if the context were conventional optimization, where differentiability is a big deal. Someone would ask, 'Well, why are you squaring h_i?' and you'd say, 'Because now it's differentiable.' But if you know about convex optimization, you know that non-differentiability is nothing to be afraid of. That's why we do it this way. But there's also something else very important about this. It turns out this is what's called an exact penalty function. So this, by the way, is a penalty method. It's the opposite of a barrier method. A barrier method forces you to stay in the interior of the feasible set by adding something that goes to plus infinity as you approach the boundary. So you don't even dare get near the boundary -- well, you will eventually get near the boundary; your level of daring depends on how small the barrier weight is.
In the barrier method, by the way, some people write the weight here, or we would put the t over there and make t get big, but it's the same thing. That's the barrier method. A penalty method is the opposite: it allows you to wander outside the feasible set, but it's going to charge you, and this is the charge. So that's a penalty. Now, this is what's called an exact penalty, and let me tell you what that means. It's kind of intuitive, and it can be shown, that if you increase lambda here -- the penalty for being outside the set -- you're going to violate the constraints less and less. That's obvious, and you can show it. But here's the cool part. For some penalties, the following occurs: for lambda bigger than some finite critical value, the minimizer of this thing is exactly the solution of the original problem. It's not just close. It's not like, 'Oh, if I charge you $50.00 per violation unit, you get close, and at $80.00 you get closer, but you're still violating a little bit, and then I say, okay, now it's a thousand dollars per violation unit, and you violate 1e-3.' No, it works like this: at $50.00 per violation unit you violate a bit; I get it up to $200.00, your violation is 0, and you have therefore solved the original problem exactly. Everybody see what I'm saying? By the way, it's very easy to show this for convex problems. This is non-convex, and in any case it's sort of irrelevant here, because nobody can minimize this exactly -- so all of this is a heuristic on top of a heuristic on top of a heuristic, but anyway. So that's an exact penalty method. By the way, this is related to l1-type things; you've seen it there. If you add an l1 penalty -- we did a homework problem on it, didn't we? Yeah -- and you crank lambda up big enough, it's not just that x gets small. You go over a critical value, and for a finite value of lambda, x is 0. It's just 0, period. So this is the same story. Now, we can't minimize this non-convex penalty function -- it's no easier or harder than the original problem -- so instead we'll use sequential convex programming and simply minimize it approximately. Now, the cool part about this is that you can't be infeasible here. I mean, assuming the domains of all the f's and h's are everything, which is not always the case, but roughly speaking, you cannot be infeasible now. You can plug in any old x. You can take a tight trust region. If you move just a little bit, the hope is that the violation will go down. But the point is anything is allowed: you can violate constraints, equality and inequality; you just pay for them. That's what this is. So that deals with the feasibility issue, and in fact, a lot of people just call this a phase 1 method. So it's a good method. I should add -- I'm going to ask you a couple of questions about it, just for fun, to see how well you can do street reasoning on convex optimization. So here it is. When you minimize this, that's a convex problem, and here's what I want to know. I've already told you one fact: if lambda is big enough, all of these will be less than or equal to 0, and all of these will be exactly 0. It's just like l1. So let me ask you this. When lambda is not big enough, and you solve this, and some of these are non-zero -- lambda is not big enough to drive all of these to be less than or equal to zero and all of these to be zero --
what do you expect to see, just very roughly? [Student:] A small violation for most of the f_i's, but just a few f_i hanging around, then. That's a good guess. So a small violation for all of these guys and a few -- now, why did you say a few here? [Student:] l1. Cool -- l1, exactly, because, for example, if these are affine, then that part really is an l1 norm. And the l1 part of your brain is next to your sparsity part; the neurons are growing between the sparsity and the l1 thing. That's a very good guess. Actually, I'll tell you exactly what happens: a whole bunch of these will be 0, and you'll have sparse violations. That comes directly from l1, and you get exactly the same thing here. So it's actually very cool. This is good to know even outside this context, just for engineering design. You have a convex problem with a whole bunch of constraints, and it's infeasible. That's irritating. At that point, one option is to say, 'Sorry, it's just not feasible.' That's not a failure of your method if it's convex optimization, because there really is no feasible point -- it's not that there's one and you failed to find it. But if you minimize something like this, what happens is really cool. Say you have 150 constraints: 100 of these and 50 of these. Here's what will happen if you're lucky. This thing will come back, and it will say, 'Well, I've got good news and bad news. The bad news is it's not feasible; there is no feasible point. The good news is, of your 50 equality constraints, I satisfied 48 of them, and of your 100 inequality constraints, I satisfied 85 of them.' Everybody see what I'm saying? So this is actually a heuristic for satisfying as many constraints as you can. That was a bit of an aside -- it doesn't really have to do with this particular problem, but it's a good thing to mention. Now, the question earlier was: how do you update the trust region? Let's look at how this works. Here's what's going to happen. I'm going to solve a convex problem. That will suggest an x^{k+1}. I will then simply evaluate this merit function. This term might have gone down; these might have gone up; who knows? The only thing I care about is this phi, and the question is: if it went down, I made progress, and roughly speaking, I should accept the move. If phi didn't go down -- if it went up -- that was not progress, and it could be not progress for many reasons. It means your trust region was too big, you had very bad approximation errors, the approximate problem you solved wasn't close enough -- it gave you bad advice, basically. So here's a typical trust region update. By the way, this goes back to the '60s. It's related to various trust region methods -- not in this context, but the same idea; it's identical to what was done back in the '60s. So what happens is this. You look at the predicted decrease from the convex problem. This you get when you solve the convex problem, and it basically says: if the approximations you handed me were exact, you would decrease phi by some amount delta-hat. This is always nonnegative -- it could be 0, which basically means that according to the local model you're optimal and there's nowhere to go -- but otherwise it's positive, and it predicts a decrease in phi. Then you simply calculate the actual, exact decrease. Now, if your model of phi is very close to the truth, these two numbers are about the same. So if your approximations are accurate, these two numbers are the same.
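Here's a compact sketch of that accept/reject logic, with the exact penalty merit function phi from above. The constants (alpha = 0.1, the growth and shrink factors) and all of the names are illustrative, not canonical.

```python
import numpy as np

def merit(f0, fs, hs, x, lam):
    """Exact penalty merit function: true objective plus lam times the total
    (unsquared) constraint violation."""
    viol = sum(max(fi(x), 0.0) for fi in fs) + sum(abs(hi(x)) for hi in hs)
    return f0(x) + lam * viol

def trust_region_step(phi, phi_hat, x_k, x_tilde, rho,
                      alpha=0.1, grow=1.1, shrink=0.5):
    """Accept or reject the proposed point x_tilde.
    phi: the true merit function; phi_hat: the convex model's predicted value at x_tilde."""
    predicted = phi(x_k) - phi_hat          # delta-hat, the predicted decrease
    actual = phi(x_k) - phi(x_tilde)        # the exact decrease; could even be negative
    if actual >= alpha * predicted:         # enough of the predicted decrease realized
        return x_tilde, grow * rho          # accept the step and enlarge the trust region
    else:
        return x_k, shrink * rho            # reject the step and shrink the trust region
```

The convex subproblem that proposes x_tilde is formed with the approximations f-hat, h-hat and the box trust region, exactly as sketched earlier.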
By the way, when those two numbers agree closely, that generally says your trust region is too small and you should be more aggressive. So if the actual decrease is at least some fraction alpha -- typically something like 10 percent -- of the predicted decrease, then you accept this x-tilde as your next point, and you crank your trust region up by some success factor. By the way, this is just one method; there are many others. That's the nice part about heuristics: once you get into heuristics, there's a lot of room for personal expression. You can make up your own algorithm, and it's fun, I guess. Notice that in convex optimization there's not much room for personal expression, if you think about it carefully. You really can't say, 'Use my LP code because it's better.' 'What do you mean?' 'Oh, you're going to get way better results with mine, way better.' You can't, because any LP solver that's worth anything gets the global solution, so they're always the same. You can have second-order fights about which one is faster or slower, but that's another story; you certainly can't talk about quality of solution, because they all get the exact global optimum. In non-convex optimization, you can actually talk about quality. You can stand up and say, 'My algorithm is better than yours,' and it will actually mean that yours sometimes gets better solutions. So here you increase the trust region; typically, you might increase it by 10 percent or so. Now, if the actual decrease is less than alpha times the predicted decrease, then what it says is you'd better crank your trust region down. The typical method is to divide by two. As I said, there's lots of room for personal expression, and you can try all sorts of things, but this is a typical choice. By the way, one terrible possibility is that the actual decrease is negative. That means the model said, 'I predict you will decrease phi by 0.1,' and when you actually calculate it, phi not only did not go down, it went up. That's definitely not progress. But that's the way to do this; this is a trust region update. Like I say, these ideas are easily 40 or 50 years old -- not in this context, but in the context of other problems. So now we're going to look at an example. It's an optimal control problem. Actually, I know we have a couple of people working on some of these. It's for a two-link robot, and it's not vertical -- it's horizontal, so there's no gravity; that doesn't matter, it's just to show how this goes. So there are two links, and our inputs are going to be a shoulder torque and an elbow torque. Those are the two inputs we're going to apply, and we're going to want to move this thing from one configuration to another or something like that. We'll get to that. So here are the dynamics. The details of this not only don't matter, but there's a reasonable probability, maybe 5 percent, that they're wrong, because we don't know mechanics -- somebody who knows mechanics can correct us if it's wrong. Anyway, it's an inertia matrix multiplied by the angular acceleration, plus a velocity-dependent term, and that's equal to the applied torque. That's a two-vector: tau one and tau two. Here is, allegedly, the mass matrix. I can tell you this much: it has the right physical units, so that part we can certify. It's probably right. Now, the main thing about these is that M(theta) and W(theta) are horribly non-linear.
They involve products of sines and cosines of angles and things like that, so it's not pretty. So this is a non-linear DAE, a differential algebraic equation. Of course, simulating it is nothing, because we know how to numerically simulate a non-linear DAE. Let's look at the optimal control problem. We're going to minimize the sum of the squares of the torques, and we'll start from one position, at rest -- that's a two-vector -- and we want to go to a target position, again at rest. We'll have a limit on the torque you can apply, so there's a maximum torque. This is an infinite-dimensional problem, because tau is a function here, but let's look at the parts. That, at least, is a convex functional. These are all cool. This is cool too. The only part that is un-cool is the dynamics equality constraint, which is non-linear. If it were linear, this would not be a problem. So we'll discretize it. We'll take N time steps, write the objective as approximately this, and write the derivatives as symmetric differences. By the way, once you get the idea of this, you can figure out how to do it in much more sophisticated ways, using fancier approximations of the derivatives and all that, but that doesn't matter here. So we'll approximate the angular velocities this way, and we'll get the angular accelerations from this -- that's approximate -- and the initial conditions correspond to something like this: for two steps you're at the initial condition, and for the final two steps you're at the final one. Holding it for two steps is what forces the velocity to be 0 there; that guarantees the velocity being 0. And we'll approximate the dynamics this way. Now, that's a set of horrible non-linear equations, because remember this mass matrix. If this were a fixed matrix, and if this one were fixed, it would be a set of linear equations. So it's written like that, and the first, lazy thing to do is quasi-linearization; that's kind of the obvious thing to do here. So let's do that. Here's our non-linear optimal control problem, and by the way, this kind of gives you the rough idea of why things like sequential convex programming make sense: that's convex, these are convex, these are all convex; the dynamics are not convex. That's the problem. We'll use quasi-linearized versions. You'd get much better results, I believe, if you actually used linearized versions -- we haven't done it -- and if you used trust region versions, you'd probably do better still, but in any case, we'll just do this. And by the way, the physical meaning of this is very interesting. It's a set of linear equality constraints on theta, but it's not quite right: it says it's the dynamics, but you're using the mass matrix and the Coriolis matrix from where the arm was on the last iteration, because that's what this is. Now, hopefully, in the end, it's going to converge, and the new theta is going to be close to the old theta, in which case you're converging to physics consistency or something like that. That's the idea. We'll initialize with the following: we want the thetas to go from an initial point to a target point, and we'll just draw a straight line. By the way, there's no reason to believe such a trajectory is achievable -- just with the dynamics, there's no reason to believe it's achievable, and certainly not with torques that respect the limits. There's no reason to believe that.
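Before the numbers, here's a small sketch of the discretization just described, using my own array names: given joint angles theta[0..N] sampled with step h, symmetric differences give the angular velocities and accelerations at the interior steps.

```python
import numpy as np

def angular_derivatives(theta, h):
    """theta: array of shape (N+1, 2), joint angles at times 0, h, ..., N*h.
    Returns symmetric-difference velocities and accelerations at steps 1..N-1."""
    vel = (theta[2:] - theta[:-2]) / (2.0 * h)                 # first derivative
    acc = (theta[2:] - 2.0 * theta[1:-1] + theta[:-2]) / h**2  # second derivative
    return vel, acc
```

In the quasi-linearized subproblem, M and W would be evaluated on the previous iterate's theta, so the discretized dynamics become linear equality constraints in the new theta and tau.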
Okay, so this is the idea. We'll do the numerical example now with a bunch of parameters and things like that, and it says the arm starts like this, with the other link bent that way or something, and then it's going to swing around like this, and the other one will rotate all the way around. There's a movie of this, but I don't think I'm going to get to it today. Anyway, it's not that exciting unless your eye is very, very good at tracking errors in Newtonian dynamics. There are probably people who could do that. They could look at it and go, 'Ow, ooh, ahh.' And you go, 'What is it?' They go, 'You're totally off.' And then at the end, they could look at it and go, 'Yep, that's the dynamics.' So here are the various parameters: we'll bump the trust region up by 10 percent every time we accept a step, and divide by two if we fail. The initial trust region is going to be huge -- +/- 90 degrees on the angles, when all the action goes down over +/- pi, or +/- 180 degrees. And we take lambda = 2; that was apparently large enough to do the trick. So the first thing to look at is the progress, and remember what this is: it combines both the objective and a penalty for violating physics. That second term -- I'll say what it is in a minute -- the equality constraints are in Newton-meters. At every time step, there are two equality constraints on two torques. The residual is in Newton-meters; it's basically how much divine intervention you need to make this trajectory actually happen. It's the torque residual -- how much unexplained torque there is. So this plot says, yes, it's progressing. By the way, it doesn't go to 0, of course, because it contains that first term. We'll see how that looks. So here's the convergence of the actual objective and the torque residuals. You start off with an objective of 11. That's great. In a few steps, though, it's up to 14, and you think that's not progress. The point about that 11 is: yes, it's a great objective value; there's only one minor problem -- you're not remotely feasible. So it's a fantastic objective value, it's just not relevant, because you're not feasible. Presumably what happens -- in fact, we can track through this and figure out exactly what happens -- is that in the first two steps our trust region was too big. We were proposed a new point, we evaluated the progress, we didn't make enough progress -- we'll see exactly what happened -- and we divided by 2 twice. So in fact, it was +/- 90, then +/- 45, then +/- 22.5. At that point, the trust region was small enough that we actually got a step that decreased things. And what happened was, since phi goes down, this thing went way down, and yet your objective went up. What that strongly suggests is that your torque residual went down, and in fact, it did. This is on a log plot, so you can't see it, but that little bump there is a big decrease in the torque residual. You can see here: this is the torque residual going down like that. So when we finish here, after 40 steps, the residual is surely much smaller than the error we made by discretizing. So there's an argument that we should have stopped somewhere around here, but this is just to show how it works. You can guess that this is the value of the objective we get in the end, and you can see that by 20 steps you already had a reasonable approximation of it.
So this can be interpreted as follows: this part is making progress towards feasibility -- you've got rough feasibility by about 15 or 20 steps -- and this part is just maintaining feasibility and fine-tuning the objective. What you see here are pictures of the actual and predicted decrease in phi, the blended objective. And you can see, on the first step you predicted a decrease of 50, but in fact, when you actually checked, what happened? Not only did phi not go down, it went up -- that's negative; it went up by 5. So you reject it. On the next step, you divide again: the trust region goes from +/- 90 to +/- 45. You try again at +/- 45, and once again it's a failure. Now you go down to 22.5. You predict a decrease of 50, you get a decrease of about 50, and that step is accepted. This shows how the trust region evolves. It starts at +/- 90 degrees, you fail once, twice, and you're down to 22.5 degrees. Then you go up to 26 degrees; I guess you then fail three times in a row or something like that -- I can't quite follow it, but it's about three times in a row. Then you go up, fail once more, and that's the picture of how this goes. And then this is the final trajectory plan you end up with. These are not obvious; I don't really know of many other ways to get this. So you end up with this torque profile, and it kind of makes sense. You accelerate one joint for a while, and then everywhere it's negative over here, that's decelerating. And you accelerate this one to flip it around, and there's a portion at the end where it's decelerating, like that. And then here's the final trajectory. The initial guess just went like this: it was a straight line from here up to here, and this one was a straight line from here to here. So I guess theta 2 actually turned out to be pretty close to our crude initial guess. That's the example. Let me say a little more about it. Once you see how to do this and get the framework going, anything you can do with convex programming can now be put in here, and that's the kind of thing you couldn't do with a lot of other methods. You can do robust versions now. You can put all sorts of weird constraints on things. Since the taus actually appear completely convexly -- that's not a word, but let's imagine that it is -- in the problem, and there are no trust region constraints on them, you can put a one-norm on the taus instead of the sum of squares. In that case, you'd get little weird thruster-like firings all up and down the trajectory. That would be something classical methods wouldn't be able to do for you. This specific problem right here you could do by various methods involving shooting and differential-equation-constrained optimization. You could do it -- it'd be very complicated, by the way, and this would be up and running way, way faster than those other methods. But anyway, when you start adding other weird things like l1 objectives and all that kind of stuff, it's all over: all that stuff just goes away, and this approach will work just fine. We're going to look at just a couple of other examples of this. They're special cases, really, or variations on the same idea. Here's one that comes up actually much more often than you'd imagine. It's called difference of convex programming -- although this has got lots and lots of different names; one is called DCC or something like that, but anyway. So here it is. You write your problem this way.
You write every function this way -- forget the equality constraints; just leave those alone. Every function is written as a difference of two convex functions. And you might ask, by the way, 'Can you express every function as a difference of two convex functions?' What do you think? The answer is you can. We could probably find five papers on Google that go into horrible detail and make it very complicated, but you can do it, obviously. So this is utterly general, but it's not that useful except in some cases -- it's only useful when you can actually identify very easily what the two pieces are. By the way, this comes up in global optimization too; we'll see that later in the class. The name is weird because it doesn't quite parse. I don't know a better name for this -- we can help propagate one, by the way, if you can think of it. 'Difference of convex functions programming' is closer, but it's not really working. So I don't know. Anyway, if someone can think of one. There's a very obvious convexification of a function that's a difference of convex functions. It goes like this. Suppose I ask you to approximate f(x) - g(x) with a convex function. We can do a little thought experiment. If I came up to you and said, 'Please approximate f by a convex function,' what would your approximation be? f -- it's convex. Now suppose I ask you to approximate a concave function, and let's think locally. How do you approximate a concave function by a convex function? You can guess: linear. Are you sure you couldn't do any better? Well, it depends on your measure of best or whatever -- you could write some things down, and surely people have written papers on this and made a big deal out of it -- but very roughly speaking, depending on how you define best, the best convex approximation of a concave function is an affine function. If you're approximating something that curves down, and you put in anything that curves up, you're going the wrong way, and you'll do worse than just making it flat. I'm allowed to say stuff like that in the second half of an advanced 300-level class. So what you do is simply this: you linearize g, and you get this thing. This, of course, is convex, because f is convex, that's constant, and this is affine -- it's the affine approximation of g. Now here's the cool part. When you linearize a concave function, your linear function is above the concave function at all points. Everybody agree? The function curves down, your approximation is like this: you have a global upper bound. And what that means is the following. You have replaced g with something that is bigger than g everywhere -- and that means that your f-hat is actually -- have I got this right? [Student:][Inaudible]. Minus g -- right, so this means that f-hat is bigger than f(x) - g(x) for all x. That's what it means. Let me interpret it. Roughly speaking, a convex function in an optimization problem is one where smaller is better: it's either an objective or a constraint function. If it's an objective, smaller means better. If it's a constraint function and you're positive, smaller means you're closer to feasibility, roughly; if you're negative, it means you're 'more feasible.' That doesn't quite make sense, but it's good enough -- it certainly means you're still feasible. So this is really cool. It says there are no surprises, and there's no trust region.
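Here's a hedged sketch of that convex-concave step on a toy instance of my own (not the lecture's example): minimize f(x) - g(x) with f(x) = ||Ax - b||^2 and g(x) = ||x||^2, over the unit box. Linearizing g at the current iterate gives a convex global upper bound on f - g, so a full step never hurts.

```python
import cvxpy as cp
import numpy as np

def convex_concave(A, b, x0, iters=20):
    """Toy convex-concave procedure: minimize ||A x - b||^2 - ||x||^2 over the
    unit box by repeatedly linearizing the subtracted convex term."""
    x_k = x0.copy()
    for _ in range(iters):
        x = cp.Variable(x_k.size)
        # Linearization of g(x) = ||x||^2 at x_k (a global lower bound on g,
        # so subtracting it gives a global upper bound on f - g):
        g_lin = float(x_k @ x_k) + 2.0 * x_k @ (x - x_k)
        surrogate = cp.sum_squares(A @ x - b) - g_lin
        cp.Problem(cp.Minimize(surrogate), [cp.norm(x, "inf") <= 1]).solve()
        x_k = x.value
    return x_k
```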
If you simply form the convexified problem with these f hats and globally minimize it, here's what's cool: you just take a full step. Follow the logic -- there is no trust region; you just remove it entirely. This says that when you optimize with f hat and then plug the result back into the true functions, your results on the true non-convex problem can only be better than predicted: all your constraint functions are at least as small as the convexified ones said, so they stay satisfied, and your objective goes down. So that's how this works. This method is much simpler, and it's sometimes called the convex-concave procedure. It has been invented by many people independently and periodically, and surely will be re-invented roughly every 10 years. I have no idea who invented it first, but I think I can guess where. I won't do it; I'll check actually. I do know: it was invented there, Moscow of course. I know that because it's part of potential methods.

So here's an example out of the book, in the chapter on approximations. Here's the problem. You're given a bunch of samples from a distribution; you've subtracted off the mean or something. They come from something with a covariance matrix Sigma_true. Our job, given a bunch of samples, is to estimate the covariance matrix. So the negative log-likelihood function, which we are going to minimize -- you just work this out, it's right out of the book anyway -- is log det Sigma + Tr[Sigma^-1 Y], where Y is the empirical covariance of your samples. There's no reason to believe that the y_i's sum to zero; they won't. You want to minimize this function, so let's look at the different parts. The trace term, Tr[Sigma^-1 Y], is good: that's a convex function of Sigma. Unfortunately, log det Sigma is bad: it's a concave function of Sigma. Now, the usual approach to this is not to estimate Sigma but to estimate Sigma inverse. You're welcome to; there's a change of variables. So if you take R = Sigma^-1 as the variable, the trace term becomes Tr[RY], which is linear in R, and log det Sigma becomes log det R^-1, which is convex in R. Everybody got this? If you want to do covariance estimation, there's a very fundamental message, and the message is this: you shouldn't be doing covariance estimation, at least from the optimization perspective; you should be doing information fitting. That is, the inverse of the covariance matrix is the information matrix, and that's the variable you should be optimizing over. At the end, you can invert it. Now, the problem with that is that the only constraints you can handle are ones that are convex in the inverse of the covariance matrix. By the way, that includes a lot of constraints, and some really interesting ones. One is conditional independence: if you take samples of the data and you want a Bayes network, you want Sigma inverse to be sparse, because a zero entry in the inverse of a covariance matrix means conditional independence.

So how many people have taken those classes? A couple, okay. How come the rest of you haven't taken those classes? What's your excuse? You just got here or something this year? [Student:]I did. That's your excuse? All right, you're excused. Have you taken it? [Student:]No. Why not? [Student:]Same excuse; came here last fall. Just got here, I see, but you're going to take it next year. [Student:]Yes. Good, right. It's Bayes networks. All right, that was all an aside.

So that problem you solve by just working with Sigma inverse. However, we're going to do something different. I want to say the following. I want to say, 'No, no, no.
Please solve for me this problem, but I'm going to add the following constraint: all entries of the covariance matrix are non-negative.' Now of course the diagonal entries are obviously non-negative; that's obvious. But this says that all of the components of the variable y are non-negatively correlated with each other. I picked this just to get the simplest possible constraint that is not convex in the inverse of the matrix, as far as I know. So the set of inverses of element-wise non-negative covariance matrices, that's not a convex set, as far as I know. So we're back to that: we have to solve this problem as it is, and it is not convex, although it's very cool. The element-wise non-negativity constraint is convex, the trace term is convex, but log det Sigma is concave. A concave function is a negative convex function, so it's difference of convex programming. We can't call it DCP -- that's disciplined convex programming. DCFP -- that's not working either. Someone has got to figure out a name for this. So this is a difference of convex functions, convex minus convex. And if you linearize the log determinant at the current Sigma, you get the log det of the current Sigma, which is now constant, plus the trace of its inverse times the difference (Sigma minus the current Sigma), which is affine in Sigma; the trace term Tr[Sigma^-1 Y] is convex in Sigma, and so we'll minimize that.

Here's a numerical example. This was just started on five problems, and these are the iterations, and this is the negative log-likelihood. Now, by the way, in this problem I'm willing to guess that this is actually the global solution. To be honest, I don't know; I didn't calculate a lower bound or anything like that. It's just to show how these things work. [Student:][Inaudible]. Yes. [Student:][Inaudible] sample of the various [inaudible]. It depends. You don't want to be too smart with these methods in initialization. You can start with a smart initialization, but then you want to rerun it from different ones just to see which regime you're in. Is it always finding the same one? In which case you can suspect it's the global solution, maybe; who knows? Or if you get different solutions, you just return the best one you find. Yes, you could initialize it with Y if you wanted and go from there. How was this one initialized? [Student:][Inaudible]. Random, for these 10, there you go. That's how this was done. And actually, you want to do that anyway: you don't want to run any of these just once. You try them and see what happens, see if you get different answers. I think we'll quit here and then finish this up next time.
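For reference, here's roughly what the linearize-and-solve iteration for that covariance example looks like written out. This is only a sketch of the procedure described above, not the code behind the plots shown in lecture; CVXPY is just one convenient way to write the convex subproblem, all the names (covariance_fit_ccp, Y, Sigma0, and so on) are made up for illustration, and it assumes the empirical covariance Y is positive definite.

```python
import numpy as np
import cvxpy as cp

def covariance_fit_ccp(Y, Sigma0, iters=20):
    # Linearize-and-solve iteration for
    #   minimize   log det(Sigma) + Tr(Sigma^{-1} Y)
    #   subject to Sigma_{ij} >= 0 for all i, j        (the non-convex twist)
    # log det(Sigma) is concave in Sigma, so each step replaces it by its
    # linearization at the current iterate and solves the resulting convex
    # problem with a full step (no trust region).
    n = Y.shape[0]
    Y_half = np.linalg.cholesky(Y)        # factor with Y_half @ Y_half.T == Y (Y assumed PD)
    Sigma_k = Sigma0
    for _ in range(iters):
        Sigma = cp.Variable((n, n), symmetric=True)
        Sk_inv = np.linalg.inv(Sigma_k)
        # Linearization of log det(Sigma) at Sigma_k; the constant log det(Sigma_k)
        # is dropped since it doesn't change the minimizer.
        lin_logdet = cp.trace(Sk_inv @ (Sigma - Sigma_k))
        # Tr(Sigma^{-1} Y) = Tr(Y_half^T Sigma^{-1} Y_half), which is convex in Sigma.
        trace_term = cp.matrix_frac(Y_half, Sigma)
        constraints = [Sigma >= 0,                     # elementwise non-negativity
                       Sigma >> 1e-6 * np.eye(n)]      # keep iterates invertible (a practical safeguard)
        cp.Problem(cp.Minimize(lin_logdet + trace_term), constraints).solve()
        Sigma_k = Sigma.value
    return Sigma_k
```

You would call it as something like covariance_fit_ccp(Y, Sigma0=np.eye(n)) and, per the discussion above, rerun it from a few random positive definite starting points and keep the best result you find.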