All right, I think this means we are on. There is no good way in this room to know when the lecture starts. Okay, well, we are down to a skeleton crew here, mostly because it's too hot outside. So we'll continue with L_1 methods today.

So last time we saw the basic idea. The simplest idea is this: if you want to minimize the cardinality of x -- find the sparsest vector x that's in a convex set C -- the simplest heuristic, and today we'll see lots of variations on it that are more sophisticated, but the simplest one by far, is simply to minimize the one norm of x subject to x being in C. By the way, for all of the thousands of people working on L_1, this is all they know. So the things we are going to talk about today, basically most people don't know. We looked at that, and last time we looked at polishing, and now I want to interpret this -- I want to justify this L_1 norm heuristic.

So here is one: we can interpret this as a relaxation of a Boolean convex problem. What we do is this. I am going to rewrite the cardinality problem this way. I am going to introduce some Boolean variables z; these are basically indicators that tell you whether each component is zero or nonzero. And I'll enforce it this way: I'll say that the absolute value of x_i is less than R z_i. Now R is some number that bounds the set -- it could be just a bounding box for C, or it can come naturally from the constraints; it really doesn't matter. The point is that any feasible point has an infinity norm less than R. If we do this, we end up with this problem, and this problem is Boolean convex. What that means is that everything is convex in the variables, that's x and z, except for one minor problem, and that is that the z_i are 0-1. Okay? So this is a Boolean convex problem, and it's absolutely equivalent to the original one. It is just as hard, of course.

So we do the standard relaxation: if you have a {0,1} variable, we change it into a [0,1] variable, which means it's a continuous variable. This is a relaxation. And here -- we have actually worked this out, and it's obvious enough -- the interval [0,1] is simply the convex hull of the Boolean points. Now if you stare at this long enough, you realize you have seen this before: this is precisely the linear program that defines the L_1 norm. Here, for example, z_i is an upper bound on (1/R)|x_i|, and so, in fact, this relaxed problem is absolutely the same as the L_1 problem. So now you see what you have: that Boolean problem is equivalent to the original, and its relaxation is the L_1 problem.

By the way, this tells you something. It says that when you solve that L_1 problem, not only is it a heuristic for solving the hard Boolean problem, it's a relaxation, and you get a bound. The bound is this: you put a one over R in there, where R is an upper bound on the absolute value of any entry of any point in C -- or, for that matter, of any potential solution. And this tells you that when you solve the L_1 problem, not only is it a heuristic for getting a sparse x, but (1/R) times its optimal value is a lower bound on the minimum cardinality. So that's that.
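Here is a minimal sketch of that heuristic and the lower bound, in Python, assuming numpy and cvxpy are available. The set C, the data, and the bound R are all invented for illustration; this is not the code behind the lecture's examples, just one way you might write the idea down.

```python
import cvxpy as cp
import numpy as np

# Invented problem data: C = {x : A x <= b, |x_i| <= R} stands in for the convex set,
# with R an explicit bound on |x_i| over C (so the lower bound below is valid).
np.random.seed(0)
n, m = 30, 60
A = np.random.randn(m, n)
b = A @ np.random.randn(n) + np.abs(np.random.randn(m))   # built so C is nonempty
R = 10.0

x = cp.Variable(n)
prob = cp.Problem(cp.Minimize(cp.norm1(x)), [A @ x <= b, cp.abs(x) <= R])
prob.solve()

card_heuristic = int(np.sum(np.abs(x.value) > 1e-5))   # sparsity of the L_1 solution
card_lower_bound = np.ceil(prob.value / R)             # (1/R)*(optimal value) lower-bounds min cardinality
print(card_heuristic, card_lower_bound)
```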
By the way, that's a pretty crappy lower bound in general, but nevertheless it's a lower bound.

Okay. Now we can also interpret this in terms of a convex envelope, and let me explain what that is. If you have a function f on a set C -- and we'll assume C is convex; in fact, I should probably change this on the slide to say convex set -- the function itself is not convex, and its envelope is the largest convex function that is an under-estimator of f on C. Let me just draw a picture, and I'll show you how this works for the function we are interested in. The function we are interested in looks like this: it's zero here at the origin, and it's one everywhere else. That's our cardinality function. And, if you like, we can do this on the interval [-1, 1]. So what we want to know is: what is the largest convex function that fits everywhere beneath this function on that interval? Any convex function below it has to go through this point and through that point, and therefore this gives you -- that's it, the absolute value. Now no one would call the absolute value function a good approximation of the cardinality function, for sure. But it does happen to be the largest convex under-estimator of it on that interval. And then, actually, you can restrict it like this, if you like, because it's just on this one set. So that's the convex envelope.

Now we can relate this to all sorts of interesting things. One way to talk about the envelope of a function in terms of sets is this: you form the epigraph of the function and then simply take the convex hull, and it turns out that's the epigraph of the envelope. You can see that over here, too. The original function has an epigraph that looks like this: it's all of this stuff, plus one little tendril that sticks down there -- everything up here, and one little line segment sticking out. The convex hull of that fills in this part and this part, and what you end up with is the absolute value restricted to [-1, 1]. So that's the convex envelope here. Another interesting way to say it is that, in general, the envelope is f**. I don't want to get into technical conditions -- actually, the conditions aren't that big a deal; it's just that the function should be closed or something like that, and actually this function here is not closed -- but anyway, it looks like this: it's the conjugate of the function, which is always convex, and then that starred. So it's the conjugate of the conjugate. And if the function f were closed and convex originally, this would recover f. That's kind of obvious.

Okay. So for x scalar, the absolute value of x is the convex envelope of card(x) on [-1, 1]. And if you have a box of size R -- an L-infinity ball of radius R -- then (1/R) times the one norm of x is the convex envelope of card(x). So that gives you another interpretation. If someone says, "What are you doing?" you say, "I am minimizing the L_1 norm in place of the cardinality." They say, "Why?" You say, "Well, it's a heuristic for getting something sparse." And they go, "Well, that's not very good. Can you actually say anything about it?" And you go, "Actually, I can.
One over R times my optimal value is a lower bound on the number of nonzeros that any solution can have." By the way, if you understand this, you already know something that most of the many thousands of people working on L_1 don't know. You are actually now much more sophisticated. And I'll show you something that is going to sound stupid, but it's actually completely correct. Suppose for some reason I told you that x_1 lies between, say, minus one and two, so it lies in that interval. What is the convex envelope of the cardinality on that interval? Well, it's this; it looks like that. And what you see is that the function is now no longer an absolute value -- it's asymmetric. You can write out what it is. Would it be a better thing to use if you wanted to minimize the cardinality? In this case, the answer is absolutely it would be better. It would work better in practice, and in theory, of course, it's better because it gives you the actual convex envelope and so on.

So what that says is that when you minimize the one norm of x as a surrogate for minimizing cardinality, you are actually making an implicit assumption. The implicit assumption is that the bounding box of your set is uniform: all of the edges are about the same, and they are centered around zero. If you ever had a problem where you are minimizing cardinality and x is not centered -- for example, some entries lie between other numbers -- you would be better off using a skewed, weighted thing like that, with different positive and negative weights. Oh, and what if I told you that x_2 lies, for example, between two and five? What can you say then? Then x_2 is not a problem, because x_2 will never be zero; it's a non-issue, and it's easy.

Okay, and we just talked about this: if you knew a bounding box, with an l_i and a u_i -- and by the way, you can find bounding box values very easily, by simply maximizing and minimizing x_i over the set, so you can always calculate them. Now if the upper bound is negative or the lower bound is positive, including that component is pointless, because x_i has a fixed sign and there is no issue there. If they straddle zero, that means there is the possibility that x_i is zero, and in that case the correct thing to do is to minimize this asymmetric objective. If the bounds are symmetric, it reduces to L_1 norm minimization. So that's what that is, and this will also give you a lower bound on the cardinality. (There is a small sketch of this idea just below.)

Okay, so let's look at some examples. I think we looked at this last time briefly; I'll go over it in a little more detail this time. This is a regressor selection problem. You want to match b with linear combinations of columns of A, except I am telling you that you can only use at most K columns. So the heuristic would be to add, for example, a one norm here and adjust lambda until x has no more than K nonzeros, and then you'd look at the value you get. Here is a picture of a problem. It's got 20 variables; 2 to the 20 is around a million, and therefore you can actually calculate the global solution. You can do it by branch and bound -- we are going to cover that later in the quarter -- but in this case you can also just work it out directly: you check all possible sparsity patterns, which is about a million least squares problems.
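Going back to the bounding-box idea for a moment, here is a hedged sketch of it: compute l_i and u_i by minimizing and maximizing x_i over the set, then minimize the sum of the envelope terms max(x_i/u_i, x_i/l_i) over the components whose box straddles zero. The set C here (an affine slice of a box) and all the sizes are invented for illustration; this is not from the lecture's slides.

```python
import cvxpy as cp
import numpy as np

# Invented set C: an affine slice of a box (so all bounding-box values are finite).
np.random.seed(1)
n, p = 20, 8
F = np.random.randn(p, n)
g = F @ np.random.uniform(-5, 5, n)

def C_constraints(x):
    return [F @ x == g, x >= -10, x <= 10]

# Bounding box: minimize and maximize each coordinate over C (2n small LPs).
l, u = np.zeros(n), np.zeros(n)
for i in range(n):
    z = cp.Variable(n)
    l[i] = cp.Problem(cp.Minimize(z[i]), C_constraints(z)).solve()
    u[i] = cp.Problem(cp.Maximize(z[i]), C_constraints(z)).solve()

# Envelope heuristic: charge max(x_i/u_i, x_i/l_i) where the box straddles zero;
# components whose box misses zero can never be zero, so they are not charged.
x = cp.Variable(n)
terms = [cp.maximum(x[i] / u[i], x[i] / l[i]) for i in range(n) if l[i] < 0 < u[i]]
cp.Problem(cp.Minimize(cp.sum(cp.hstack(terms))), C_constraints(x)).solve()
print(int(np.sum(np.abs(x.value) > 1e-5)))   # heuristic cardinality
```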
So yes, it really is about a million: you just solve a million least squares problems, one for each sparsity pattern. The global optimum is given here, like this, and this other curve is the one obtained by the heuristic. And you can see a couple of things. I am not quite sure here, but for most of it you are never off by much -- well, no, here you are off by two, and that's a substantial error; here you are off by one; sometimes you are exactly on, and so on. But the point is that this curve is obtained by the heuristic, which was one-millionth the effort of that one.

Okay. So now we'll look at sparse signal reconstruction. It's actually the same problem, just a different interpretation. You want to minimize the norm of Ax minus y. Here y is a received signal and x is the signal you want to estimate. The fact that there is a two norm here might mean, for example, that you are doing maximum likelihood estimation of x with Gaussian noise on your measurement, so you have y = Ax + v. Then this is prior information: the x you are looking for has no more than K nonzeros. So the heuristic would be to minimize this two norm subject to the one norm of x being less than beta. In statistics, this is called the LASSO. I can't pronounce it properly -- you have to pronounce it with Trevor Hastie's charming South African accent; I tried to learn it the last time I went over this material, but I never succeeded, so I'll just call it LASSO. Another form is simply to add the one norm as a penalty and then sweep gamma, and in that case it's called basis pursuit denoising or something like that. I can sort of explain why it's basis pursuit: if you are selecting columns of A, you can think of that as selecting a basis. Don't ask me why it's called basis pursuit, but that's another name for it.

Okay, let's do an example. When you see these things, they are actually quite stunning, and they are rightly making a lot of people interested in this topic now, although, as I said earlier, the ideas go way, way back. So here it is. I have a signal a thousand samples long, and I am told that it only has 30 nonzeros, so it's a spike signal. Just for the record, I want to point out that a thousand choose 30 is a really big number, so the number of possible patterns of where the spikes occur is very, very large -- for all practical purposes, infinitely large. And here is what's going to happen: we are going to be given 200 noisy measurements. We'll just generate A randomly, and there will be Gaussian noise here. And then you are asked to guess x.

Now, by the way, I should point something out. If someone walks up to you on the street and says, "I have a thousand numbers I want you to estimate; here are 200 measurements," you should simply turn around and walk the other way very quickly. Get to a lighted place, or a place with other people, as soon as you can. The reason is that this is totally ridiculous, right? Everyone knows that if you are going to estimate a thousand numbers, you need a thousand measurements -- at least a thousand, and better 2,000 or 3,000 or something like that, to get some redundancy in your measurements, especially if there is noise.
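Before going on with the thousand-spike example, here is a hedged sketch of the penalized form just described (the basis-pursuit-denoising flavor), including the gamma sweep until at most k nonzeros remain and the polishing step from last lecture. The sizes, data, and thresholds are all made up; the actual demos on the course site are not this code.

```python
import cvxpy as cp
import numpy as np

# Invented data: m noisy measurements of a k-sparse signal.
np.random.seed(2)
n, m, k, sigma = 200, 80, 10, 0.05
A = np.random.randn(m, n) / np.sqrt(m)
x_true = np.zeros(n)
x_true[np.random.choice(n, k, replace=False)] = np.random.randn(k)
y = A @ x_true + sigma * np.random.randn(m)

def l1_estimate(gamma):
    x = cp.Variable(n)
    cp.Problem(cp.Minimize(cp.sum_squares(A @ x - y) + gamma * cp.norm1(x))).solve()
    return x.value

# Sweep gamma (a coarse grid here; bisection also works) until at most k nonzeros remain.
for gamma in np.logspace(-3, 1, 25):
    x_hat = l1_estimate(gamma)
    support = np.abs(x_hat) > 1e-4
    if support.sum() <= k:
        break

# Polishing: fix the sparsity pattern and re-solve least squares on that support.
x_pol = np.zeros(n)
x_pol[support], *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
print(int(support.sum()), np.linalg.norm(x_pol - x_true))
```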
Anyway, as I was saying, the idea that someone would give you one-fifth the number of measurements as you have parameters to estimate, and expect you to estimate them, is kind of ridiculous. Now, the flip side is this: if someone told you which 30 entries were nonzero, you move from five-to-one more parameters than measurements to the other way around. If I tell you there are 30 numbers I want you to estimate, and I give you 200 measurements, now you are roughly seven-to-one in the right direction -- you have about seven times more measurements than unknowns. Everyone following this? Okay.

So what happens is, if you simply run this L_1 thing, you just recover it -- I guess it's essentially perfect. It's not completely perfect in the values, but I think the sparsity pattern is maybe perfect; it looks to me like it's perfect, and if it's not, it's awfully close. By the way, what that means is that if you polish, your error will go down lower, and you'll get these values very close. You won't get them exactly, because you have noise. But the point is this is really quite impressive. If, in contrast, you used an L_2 reconstruction, a [inaudible], and you solved this problem, you can adjust gamma, but it basically never looks good; this would be an example of what you might reconstruct, something like that. So this is the rough idea.

So these are actually pretty cool methods. I don't really know any other really effective method for doing something like this -- for saying, look, here are 200 measurements of a thousand things; oh, and here is a hint, prior information: the thing you are looking for is sparse; please find it. And these work. I should also mention something, in contrast to least squares. Least squares is kind of nice; it's a good way to blend a bunch of measurements, and it can work beautifully well if you have got like 50 times more measurements than variables to estimate. You use least squares and, almost like magic -- it's at the 263 level, right -- all of a sudden, from all of these crazy [inaudible] with noise, out comes a head that you are imaging or something like that. It's really quite spectacular, and it kind of fails gracefully. I should add something here: these L_1 methods don't fail gracefully, and I bet you are not surprised at that. So if you take this example -- which you can; all of the source is on the web -- and you just start cranking up sigma, it will work quite well up until some pretty big sigma, and then you'll give sigma one more crank and, boom, the reconstruction will go from pretty good to just complete nonsense very quickly. So I just thought I'd mention that. Somehow it's not surprising, right?

Okay. Let me mention some of these theoretical results. Obviously, I am going to say very little about them. They are extremely interesting; in fact, just the idea that you can say anything at all about this I find fascinating. But here it is. The set C is going to be very embarrassing: it's going to be an affine set. So basically, suppose you have y = Ax, where the cardinality of x is less than K, and you want to reconstruct x. Now, obviously, the minimum number of measurements you could possibly get away with would be K. So in other words, if someone comes up to you and says, "Wow, that's good.
You have got my spike signal with its 30 nonzeros. What if I give you 19 measurements?" That's not even enough to pin down 30 numbers. On the other hand, if the number of measurements is bigger than the number of parameters, then we are back in 263 land, and everything is easy; that's trivial. So the interesting regime is where m lies between K -- the known sparsity of the signal -- and the number of parameters. And the question is: when would the L_1 heuristic, minimizing the one norm of x subject to Ax = y -- and notice how simple the set is here; C is just the set of x with Ax = y -- when would it reconstruct x exactly? That's the question.

And you can actually say things that are quite impressive. It basically says this: it depends on the matrix A, but there are a lot of matrices that actually work here, and there is all sorts of stuff known about exactly what it is about the matrix that does the trick; it has to do with coherence or something like that. And it says, basically, that if m is bigger than some factor times K -- so, if I gave you the hint, if I told you what the sparsity pattern is, you would need m greater than or equal to K; the extent to which this factor goes above one is how much more you need than that minimum, the minimum you'd need if you had the secret information about the sparsity pattern -- and the factor is an absolute constant C times log n. If that holds, then the L_1 heuristic will reconstruct the exact x with overwhelming probability. What that means is that as you go above this threshold, the probability of error goes down exponentially. Okay? That's it. And some valid A's would be these: if the entries of the matrix are picked randomly, that does the trick; if A is a set of rows of a DFT matrix -- a discrete Fourier transform -- then it works, as long as the rows are not bunched up or something like that; and there are lots of others. So this is it. These are beautiful results, and it would take you about four seconds to find them on the web with Google, to find references and get the papers.

So, okay, we will go on to the second part of this lecture, which is going to be here. There we go. Great. We are going to look at some more advanced topics -- other applications, variations, things like that. One is total variation reconstruction. I should add that this predates the current fad -- the current L_1 fad. It depends on the field: statisticians have been doing it for a long time, like 12 or 15 years now; people in geology, I think, have been doing it for 20; others have been doing it for probably 20 or something like that. But the recent burst of activity was spurred by these results, and that's in the last five years, let's say. Total variation, though, goes back, I believe, easily to the early '90s or something like that.

So it works like this. You have a corrupted signal, and you want to fit it with a piecewise constant signal with no more than K jumps. A simple way to do that is to trade off -- x hat is going to be your estimate of the signal -- you trade off your fit here; by the way, if this were Gaussian noise, that fit term would be something like a negative log likelihood.
You trade off that fit against the cardinality of Dx, where D is the first-order difference matrix, and what you do is vary gamma. If you make gamma big enough, the solution x is constant -- and then, of course, it's equal to the average value. As you crank gamma down from that number, what happens is that x hat first has one jump, so it's piecewise constant with one jump, then two, then three, and so on. The heuristic replaces the cardinality of Dx with the one norm of Dx, which, by the way, is the sum of the absolute values of the differences of the signal -- this is a scalar signal for now. That quantity has a very old name, going back to the 1800s: it's called the total variation of a signal, or of a function. And this is called total variation reconstruction. There are a lot of things you can say about TV reconstruction, but what happens is that it is able to remove high-frequency noise without smoothing the signal out. We'll see how that works; an L_2 regularization, by contrast, will just give you a low-pass filter, and it will smooth everything out. These methods are very famous because they were among the ones used to do things like recover the original audio from the wax recordings of Caruso; they reconstructed all sorts of jazz recordings from the '20s using these methods, and they are amazing -- the clicks and pops just go away; they are just removed.

Okay, so here is an example. Here we have a signal that looks like this: it's slowly varying, but it's got these jumps as well -- this is just to make a visual point. There is the signal, and the corrupted one looks like this, with high-frequency noise added. And if we do total variation reconstruction, these are three values of gamma, chosen so that one is too much smoothing, one is not enough, and one is about right. But you can see something very cool: the jumps are preserved. They are not smoothed out at all; they come through perfectly. This one is sort of too much, because I have actually flattened out the curvature that you saw here. In this one maybe you haven't removed enough of the noise, but you still get this jump exactly. And this one might be just right. By the way, if you were listening to this -- in fact, I should probably produce some little audio files or something so you can hear total variation denoising; it's very impressive. L_2 denoising, which we'll see in a minute, is just low-pass filtering: you have a lot of high-frequency stuff, and everything gets muffled; you hit a drum, and it's muffled. The L_1 [inaudible] will actually cut out the high-frequency noise, but when someone hits a snare drum, it's right there. So it's pretty impressive. It would be fun to do that, actually.

Okay. Here is the L_2 reconstruction, just to give you a rough idea of what happens. With L_2 reconstruction, to really smooth out the noise, you basically lose your big edge here. This is maybe the best you can do with L_2, the best trade-off. And by the way, even here the jump has not been preserved exactly; it's actually been smoothed a little bit -- you just can't see it at this scale.
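Here is a hedged sketch of the 1-D total variation reconstruction just described; the signal, the noise level, and the gamma values are invented, and cvxpy/numpy are assumed available. Swapping cp.norm1 for cp.sum_squares in the penalty gives the L_2 (smoothing) reconstruction just shown for comparison.

```python
import cvxpy as cp
import numpy as np

# Invented corrupted signal: a slow component plus jumps, plus high-frequency noise.
np.random.seed(3)
npts = 500
t = np.arange(npts)
x_true = 0.5 * np.sin(2 * np.pi * t / npts) + 1.0 * (t > 150) - 1.5 * (t > 350)
x_corr = x_true + 0.2 * np.random.randn(npts)

def tv_reconstruct(gamma):
    xhat = cp.Variable(npts)
    fit = cp.sum_squares(xhat - x_corr)   # Gaussian-noise (negative log-likelihood) fit term
    tv = cp.norm1(cp.diff(xhat))          # total variation: sum of |x_{i+1} - x_i|
    cp.Problem(cp.Minimize(fit + gamma * tv)).solve()
    return xhat.value

# Sweep gamma: big gamma flattens everything, small gamma keeps the noise.
reconstructions = {gamma: tv_reconstruct(gamma) for gamma in (50.0, 5.0, 0.5)}
```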
And even so, you are still not getting enough noise attenuation -- so that gives you the picture. Yeah? [Student:][Inaudible] that even this sounds very, very good. This one? [Student:]Yes, L_2, because that's the one we didn't do extraneous [inaudible]. Yes. [Student:]This one sounds very good. So why should we go the extra mile for L_1? I mean [inaudible]. Oh, in cases -- I mean, to do things like remove clicks and pops. And if you started listening carefully, you would find that this did not sound good at all -- not at all. [Student:]Okay. Yeah. Because you'd either hear the noise, or you'd start muffling the signal, and that makes a drum sound like -- then you are not tapping a drum; you are tapping a pillow or something, and it's no longer a drum. If you listen to these things, the difference is quite audible. Of course, we can always adjust the parameters so that the 263 methods work well, which naturally is what we do in 263. Wouldn't we? So, okay.

But actually, total variation reconstruction is really done more often for images. I believe it's also done in 3-D, and I believe it's even done in 4-D, for 3-D movies, space-time. But we'll look at it in 2-D, and it's quite spectacular. So here is the idea, and this is going to be very crude; I'll make some comments later about how this is really done. You have x in R^n, and these are the values on an N by N grid, so our grid has about a thousand points on it. It's small, but that's it. And the prior knowledge is that x is sort of piecewise constant: it looks like a cartoon, with big regions where it's constant, separated by boundaries. Everybody see what I'm saying? It's cartoon looking -- it looks like a cartoon, or a line drawing, or whatever.

All right, now, in this problem you get 120 linear measurements, which is of course a big joke: you are some big factor undersampled -- eight, I think. In other words, I want you to estimate about 960 numbers, and I'll give you 120 measurements. And these measurements are exact. So the way we'll do this is we'll say, look: among all of the x's that are consistent with our measurements -- and that's a huge set; in fact, what's the dimension of the set of x's that satisfy this? You don't have to get it exactly, just roughly. What, 840 dimensions? Yeah. You have 961 points and 120 measurements, and the null space dimension is on the order of the difference. So a huge number of x's are consistent with our measurements: roughly an 840-dimensional set of images. But among those, we'll pick one by doing this: we would like to minimize the sum of the cardinalities of the differences, and let me show you what that means over here (I'll explain in a minute how to make this better -- better looking, anyway). We have our grid, and basically we are going to charge for the number of times two adjacent values are different, both this way and this way. So, for example, that big objective would be zero if the entire image were constant.
Otherwise, everywhere there is a boundary, you are going to get charged. So that's the picture. Now we can't solve that problem, but we can solve this variation on it, with the one norm in place of the cardinality. Now, by the way, when you do L_1 this way on an image -- when you charge for differences this way and this way -- what happens is you tend to get images that prefer this direction or this direction, and you get weird artifacts. I think we had that in maybe a homework -- no, a final exam problem in 364? Was it? I can't remember. I think it was. We made Jacob's happy face. Midterm? Final? Midterm. [Student:][Inaudible]. What? [Student:]Homework. Homework. Okay, homework, sure. Anyway, let's see how this works. So here is the total variation reconstruction, and the summary is: it's perfect. [Student:][Inaudible]. Yeah, I know. [Student:]I just [inaudible]. Great. Okay. Good. Good, okay. I forget, too. Wait until you see the [inaudible] 364c. [Student:][Inaudible] 364c? Oh, yeah. Yeah. We are starting it this summer just for you. You are already enrolled. You'll like it. We're bringing the final exam back for that one, except it's going to be every weekend, 24 hours. But you are going to learn a lot.

Okay, so you get the idea. In this case, you recovered the image exactly. These are kind of impressive when you see them. And variations on this, by the way, are quite real. This is a fake toy example -- you can go look at the source code yourself, which is probably all of ten lines or something; the plotting, needless to say, is many more lines than the actual code -- but these are actually quite real methods. There is stuff going on now where you do total variation reconstruction of MRI from a third of the scan lines, and you get just as good an image. Okay, and this is what happens if you do L_2. This is what you would imagine it to look like -- that's kind of what you'd guess. And you can adjust your gamma and make the bump look higher or lower or whatever, but it's never going to look that great. This is your 263 method; it's essentially a least-norm problem. Actually, I'm sorry, there is no gamma in this problem; it's just least-norm.

Oh, and I promised I was going to mention how these methods are actually done. What you really want for an image is for your estimate to be approximately rotation invariant. So, in fact, you get a much better looking result, visually, if you do this not by just taking your differences this way and this way, but by also taking the diagonal difference as well, divided by square root of two or something like that. And the most sophisticated way is to take your favorite multi-point approximation of the gradient, take the two norm of it, and minimize the sum of the two norms of those gradients. That's the correct analog of total variation reconstruction for an image, and it will give you beautiful results that are approximately rotation invariant.
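Here is a hedged sketch of the 2-D version just described: a made-up piecewise-constant image, 120 exact random linear measurements, and an anisotropic TV objective (the rotation-friendlier variant mentioned above would instead sum 2-norms of discrete gradients, e.g. cvxpy's tv atom). None of the numbers or data are from the actual demo.

```python
import cvxpy as cp
import numpy as np

# Invented data: a 31x31 "cartoon" image and 120 exact random linear measurements,
# each measurement being the inner product of the image with a random mask.
np.random.seed(4)
N, m = 31, 120
X_true = np.zeros((N, N))
X_true[8:20, 10:24] = 1.0
masks = np.random.randn(m, N, N)
y = np.array([(M * X_true).sum() for M in masks])

X = cp.Variable((N, N))
# Anisotropic TV: sum of absolute horizontal plus vertical differences.
tv = cp.sum(cp.abs(X[1:, :] - X[:-1, :])) + cp.sum(cp.abs(X[:, 1:] - X[:, :-1]))
constraints = [cp.sum(cp.multiply(masks[i], X)) == y[i] for i in range(m)]
cp.Problem(cp.Minimize(tv), constraints).solve()
print(np.max(np.abs(X.value - X_true)))   # reconstruction error (no guarantee it's zero)
```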
Okay. So let me talk about some other methods. This is also just starting to become fashionable; there are other variations on L_1, and I think people call this "beyond L_1" or something like that. It goes like this. One way is to iterate, and I'll give a very simple interpretation of this in a minute. I want to minimize the cardinality of x over x in C. What you do is this: instead of minimizing the L_1 norm, we'll minimize a weighted L_1 norm. Now we have already seen good reasons to do that -- one over the weights corresponds to your prior about the bounding box; that's one way to justify weights. But the idea here is this: you solve an L_1 problem, and then you update the weights. And the weight update is extremely interesting. It works like this. If you run this L_1 thing and one of the entries comes out zero, then it gets the biggest weight you can give, which is one over epsilon. What that means is that thereafter it's probably going to stay zero: once you are zero, in the next iteration your weight goes way up, and there is very strong encouragement not to become nonzero. What's cool about this is that if x_i turns out to be small but not zero -- so you are actually being charged for it cardinality-wise -- this puts a big weight on it, which makes it a very attractive entry to zero out. It basically mops up the small ones; it just gets rid of them. On the other hand, if x_i came out big, you take that as a heuristic to mean something like: look, that entry is not going to be zero anyway; you are not going to drive it to zero; so relax and reduce the weight on it -- if it wants to get bigger, let it get bigger. So that's the picture. And this will typically give you a modest but real improvement over the basic L_1 heuristic, and it typically converges in five or fewer steps. By the way, a more sophisticated version is not symmetric in the weights -- we'll see what the more sophisticated version is.

So here is the interpretation. We'll work with the case where x is greater than or equal to zero, and we can arrange that by splitting x into positive and negative parts, where both are non-negative. The cardinality of x is then the same as this thing, provided one of the two parts is always zero. And we'll use the following approximation -- let me show you this; in fact, this is kind of the idea behind all of these new methods that are beyond L_1. I mean, all of these things go back to stuff that is very stupid. By the way, these things are very stupid, and yet that doesn't stop people from writing fantastically complicated papers about them, and making them look not stupid. But they are stupid. So let's look at one and stop there. Okay, so here is the function we want: it jumps up at zero and then goes flat like that. Here is our first approximation -- not what you would call an impressively good approximation. And this method says you replace it with log(1 + x/epsilon). If you allow me to shift and scale it, that's a function that looks like this -- let me just make it go through there -- okay, so it looks like that. And this little curve at the bottom, that's on the scale of epsilon. I drew it with epsilon exaggerated.
I really shouldn't have. Let me redraw it: it looks like this, with just a little curve like that near the origin. That's how you are supposed to see it. And you are supposed to say, well, yeah, sure: this function here is a way better approximation of the cardinality than that one. What? You don't think it is? [Student:]It's not [inaudible]. It's not convex. You are very well trained, right. I can tell you a story. A student of Abbas's was talking to Abbas, and Abbas said, "Yeah, but then you can do this problem and maximize the energy lifetime of this thing and blah, blah, blah." And the student stepped back and said, "Are you crazy?" And Abbas said, "No. What's wrong with that?" And the student looked at him and said, "That's not convex." Then he came and complained to me: "What are you doing over there?" So you are right, that's not convex. But it's okay -- you are okay now; you can handle it. In fact, this function is not only not convex, it's concave. Now, if you have to minimize a concave function over a convex set, then, when we did sequential convex programming, you saw that there is a very good way to do that, and it's really dumb: you take a point, you linearize the function at that point, and you optimize -- no trust region, nothing -- and you just keep going. If you linearize this log term, you get this thing here; this part is a constant, and it's totally irrelevant once you linearize, and part of the rest is a constant too, like this part. What's left is the same as minimizing the sum of x_i divided by (epsilon plus the current x_i). And guess what: sequential convex programming applied to this non-convex problem, the one that is supposed to fit the cardinality function better, yields exactly that re-weighted iteration. So that's what this iterative heuristic really is: it's a convex-concave procedure for minimizing a non-convex, smooth function that is supposed to approximate the card function.

By the way, there are lots of other methods, and I can say what they are -- here is a very popular one. All of these work; that's my summary of them, and there are lots of papers coming out on all of them. Here is another one: you don't like x, how about x to the p, for p less than one? We'll just do it on the positive axis. These functions start looking like that, and if you make p really small, they look just like the card function. And by the way, this leads some people to refer to the cardinality as the "L_0 norm". Now let's just back up a little bit there. Two of you have already said that, so you can say it. I cannot bring myself to say it, because there is no such thing as an L_p norm with p less than one -- it's not convex; the unit balls look like this -- and so I can't even say it. You see, that's what happens if you learn math when you are young: that is not the unit ball of anything, and you should not say that. And yet you will hear people talk about the L_p norm with p less than one. It's not convex; it's not a norm; blah, blah, blah. So these methods work like this -- in fact, you tell me. Let's invent a method right now. How would you minimize this x-to-the-p thing approximately, heuristically, and so on? By the way, if you minimized that, you would get a very nice sparse solution, very nice.
How would you do it? You just linearize this thing. And what, in fact, would you be doing at each step if you ran the convex-concave procedure on this guy? You would be solving an iteratively re-weighted L_1 problem. The only thing that would change is that your weight update would be slightly different from the log one, but it would still be reasonable, and it always does the same kind of thing. What happens in a weight update is this: entries that are big -- you just say, ah, screw it, that one is probably not going to be zero, and you reduce the weight. Entries that are small, you crank the weight up. If an entry is already zero, that's a strong inducement to pin it at zero; if it's small, though, that makes it a very attractive target for being zeroed out. And that's what drives the cardinality down. Everybody got this? So that's the idea.

Okay, here is a very typical example: you want to minimize the cardinality of x over some polyhedron. Plain L_1 drops the cardinality from 50 to 44 -- not that impressive. If you run this re-weighted heuristic, it converges in, I guess, six steps -- actually, after a couple it has basically stopped -- and it gets you to 36. The global solution, found for this problem with a lot of effort by methods we'll see later in the class, is 32. So, to emphasize again: we are not actually solving these problems; these are heuristics. But they are fast, and they are good, and so on. By the way, the fact that you are not solving the problem exactly -- if the problem is arising, for example, in a statistical context -- I think means it doesn't matter at all. Right? Because you don't get a prize for getting the exact maximum likelihood estimate. The maximum likelihood estimate is itself, in some sense, just a procedure for making a really good guess as to what the parameter is, backed up by a hundred years of statistics. And, by the way, even if you do perfect maximum likelihood, as any statistician will tell you, you are not going to get the exact answer anyway; it's just something which, asymptotically, will do as well as any method could, or something like that. Now, for engineering design, that's a different story: if you find a placement of some modules on a chip that takes 1.3 millimeters as opposed to 1.6, that's real, unlike the statistical interpretation. But still, okay.
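Here is a hedged sketch of that iterated (re-weighted) L_1 loop on an invented polyhedron; epsilon, the box, the sizes, and the thresholds are arbitrary choices, and the numbers it prints won't match the 44/36/32 example from the slides.

```python
import cvxpy as cp
import numpy as np

# Invented instance: minimize card(x) over a polyhedron {x : A x <= b}, with a box to keep it bounded.
np.random.seed(5)
n, m, eps = 50, 100, 1e-3
A = np.random.randn(m, n)
b = A @ np.random.randn(n) + np.abs(np.random.randn(m))   # built so the set is nonempty

w = np.ones(n)                        # first pass is plain (unweighted) L_1
for it in range(5):                   # typically a handful of iterations suffices
    x = cp.Variable(n)
    obj = cp.Minimize(cp.sum(cp.multiply(w, cp.abs(x))))
    cp.Problem(obj, [A @ x <= b, cp.abs(x) <= 10]).solve()
    print(it, int(np.sum(np.abs(x.value) > 1e-5)))
    # Weight update: entries near zero get weight up to 1/eps (pin them at zero);
    # large entries get small weight (let them be big, they won't be zero anyway).
    w = 1.0 / (eps + np.abs(x.value))
```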
So let's look at an example of that. It's a fun example: detecting changes in a time series model. We have an AR(2) model -- I almost said ARMA, but it's AR; I think people call this AR(2). It looks like this: y(t+2) equals one coefficient times y(t+1), plus another coefficient times y(t), plus a noise term, which is Gaussian. Now the assumption is this: these coefficients are mostly constant, and then every now and then there is a change in the dynamics of the system, and one or both of those numbers changes. You observe only y, and your job is to estimate a and b and, in particular, to find where the changes are. I was going to make up an application of it -- it basically says the changes could tell you something about a failure in a system, or a shock in a financial or economic system, something like that.

Okay. So here is what we'll do. Given y, this is a negative log likelihood term here, up to some constants -- an implausibility term: if you run up a giant bill here, you are being asked to believe that the noise v did some very, very unlikely things; that's what this term is. And then here we add in a total variation cost: it penalizes jumps in the coefficients a and b. Now, by the way, if I make gamma big enough, a and b will be constant. If I make gamma zero, then I can make the fit term zero, because I can adjust a and b -- in many ways, in fact -- to get an absolutely perfect fit.

Okay, so here is an example. Here is how a and b change: a is this and, let's say, at t equals a hundred it changes to a new value; b is here, and at 200 it pops up here. So there are three changes. And visually, if you squint your eyes, you can see the change in the dynamics of the system: if you look left of there, you get one kind of dynamics; with a little squinting, you can see that the dynamics between a hundred and 200 look different, and again between 200 and 300 -- well, it helps that I have told you what happened, but it's certainly consistent. Now the interesting thing, though, is: imagine I hadn't told you. I don't know that it's that obvious. Certainly this doesn't look very much like that, but would you really know that something happened here? I don't know. I could have made the coefficients change in such a way that your eyeball couldn't detect it.

So if you run this total variation heuristic, on the left is the estimate of the parameters, and I want to point out that this is already very good. It has estimated the parameters to be here; it jumps down; and it does some weird things here, which I can explain: those are little false positives. That's a false positive where this thing jumps up; every time the signal jumps, there is a bunch of false positives in here, and some false-positive jumps in here, and so on. But actually, you know what, this is not bad at all. It's kind of saying that there are changes in here and here. If you do the iterated heuristic, you can actually see visually what happens. By the way, this guy is pulled down here because jumps are charged a lot: it only makes this big shift, for which it pays a lot in the objective, in order to make the overall objective small. But what happens next is actually really, really cool. If you iterate, this difference is really big, so on the next step it gets less weight -- it basically says, oh, you really want to jump here? I am going to charge you less for a jump at this time step. And where it's flat, it says, okay, you don't want to jump at all; I am going to charge you the maximum amount -- that's the one over epsilon in the weight. And these little ones -- that's what these iterated L_1 heuristics do: they go in and clean things up; the little spurious jumps just get totally nailed.
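Before looking at the final estimates, here is a hedged sketch of the fitting problem just described: a quadratic fit term plus a total variation penalty on the coefficient sequences. The data, the noise level, and gamma are invented, and the iterated (re-weighted) refinement is only indicated in a comment.

```python
import cvxpy as cp
import numpy as np

# Invented AR(2) data whose coefficients jump a couple of times.
np.random.seed(6)
T = 300
a_true = np.where(np.arange(T) < 100, 0.5, -0.3)
b_true = np.where(np.arange(T) < 200, 0.2, 0.6)
y = np.zeros(T + 2)
for t in range(T):
    y[t + 2] = a_true[t] * y[t + 1] + b_true[t] * y[t] + 0.1 * np.random.randn()

a, b = cp.Variable(T), cp.Variable(T)
residual = y[2:] - cp.multiply(a, y[1:-1]) - cp.multiply(b, y[:-2])
fit = cp.sum_squares(residual)                      # the (negative log-likelihood) fit term
tv = cp.norm1(cp.diff(a)) + cp.norm1(cp.diff(b))    # charge for jumps in the coefficients
gamma = 5.0                                         # arbitrary; sweep it in practice
cp.Problem(cp.Minimize(fit + gamma * tv)).solve()
# Iterated version: re-solve with weights 1/(eps + |diff(a)|), 1/(eps + |diff(b)|) on the jump terms.
```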
Coming back to the plots: that iterated result is the final estimate there, and you can see that it's much, much better -- you are actually tracking the parameters very nicely. You might ask: why the error here? Why did it miss the time point here? And the answer would be: because there was noise. That's why. But it's still awfully good. It works very, very well.

Okay. And our last topic is going to be the extension of these ideas to matrices and rank. If the cardinality of a vector is the number of nonzeros, there is a very natural analog for matrices, and that's the rank. And by the way, both of these come up as measures of complexity of something: the complexity of a set of coefficients is, very roughly, the number of nonzeros or something, and the complexity of a matrix also comes up a lot, and that's the rank. Now, a convex-rank problem -- a problem that is convex except that you have a rank constraint or a rank objective -- these come up all the time. And the two are actually related: if you have a diagonal matrix, its rank is the cardinality of the diagonal; that's kind of obvious. But the interesting part is: what's the analog of the L_1 heuristic? It turns out it's the nuclear norm, which is the dual of the spectral norm -- the maximum singular value -- and it's the sum of the singular values of the matrix. So that's it. It's not simple, but that's what it is: the sum of the singular values, which is the dual of the spectral norm. You probably didn't know that, but it kind of makes sense, because somewhere in your mind you should have a map that associates L_p and L_q where 1/p + 1/q = 1, and particular pairings should be burned into neurons directly: the dual of L_2 is L_2; the dual of L_1 is L-infinity; the dual of L-infinity is L_1. These you should just know. So it shouldn't be surprising that, if the spectral norm is the maximum singular value -- roughly an L-infinity norm on the singular values -- then the dual norm should be the sum of the singular values. That's it. Now, if a matrix is symmetric positive semi-definite, then the eigenvalues are the singular values, and the sum of the singular values is therefore the sum of the eigenvalues -- that's the trace. Whereas for a vector: if I have a non-negative vector, the one norm is the same as the sum. Oh, and by the way, that's why there are a lot of things I would still call L_1 heuristics where you might grep through the paper and never see L_1 mentioned: if a vector is non-negative, the one norm is just the sum. But "L_1" sounds fancier than "the sum"; the sum heuristic sounds kind of dumb. Okay, so this is the nuclear norm.

And we'll do an example -- actually, a very interesting example. It goes like this. You are given a positive semi-definite matrix, and what you would like to do is put it into a factor model. A factor model looks like this: a low-rank outer-product part plus a diagonal. The way it would come up is this: it basically says that, if this is a covariance matrix, that random variable is explained by a small number of factors -- in fact, R of them, where R is the number of columns of F. A small number of factors.
And then D is sort of an extra variation on top of that. So that's the factor model. Now, by the way, there are some easy versions of the factor modeling problem. If D is zero, how would you approximate a positive semi-definite matrix as just low rank? How do you do that? I give you a covariance matrix -- say, the covariance matrix of 500 returns -- and I want you to approximate it as a matrix of rank five. How do you do it? [Student:]It's the [inaudible]. [Student:]It's [inaudible] symmetric. Right -- you could say eigenvalue decomposition, and for a symmetric matrix it's the same as the SVD. So you take the eigenvalue decomposition and keep the top five terms. So we know how to do factor modeling without the diagonal part. Let's do one more: how about factor modeling where, instead of a general D, this is sigma squared times the identity? Can you do factor modeling for that? How? [Student:][Inaudible] eigenvalues almost [inaudible]. Exactly. The way you'd know is you look at the eigenvalues, and you would see five large ones and a whole bunch of small ones all clustered together, and that would be your hint. So that would be one way to do it. But when the diagonal entries are actually all different, no one can solve that problem -- it's a hard problem in general.

Well, in any case, this is the simplest factor modeling problem. C is a set of acceptable approximations to sigma. It could be something simple, like a norm ball, or it can be very complicated and statistically motivated -- for example, it could be a Kullback-Leibler divergence, which would be this expression for a Gaussian random variable. That's just a very sophisticated way of saying that the two covariance matrices are close; it's the one that would give you the statistical stamp of approval. Then the trace heuristic for minimizing the rank is pretty simple. It goes like this: X plus D has to be an acceptable approximation of your original matrix, and your variables are X and D -- oh, and I should say that capital D is the diag of little d; it doesn't say that on the slide, which is weird, but that's what it is. So this is the problem, and it's a convex problem. If you put rank here instead of the trace, it's a convex-rank problem.

So we'll look at an example. Here is an example where I have a bunch of data which -- well, we know it -- is actually generated by three factors. What happens is you get snapshots of 20 numbers; they are all varying; they are random. You look at a bunch of these snapshots, and you get a full covariance matrix, full rank. But it turns out that three factors describe it, plus the diagonal elements. By the way, if these were asset returns, the diagonal elements would be called the firm-specific variation: it says you have a bunch of factors -- one is typically the overall market, and then you get some other, fairly obvious things if you look at factor models in finance -- and then the d's are the firm-specific volatilities. Okay. We'll just use a simple norm approximation for the fit, and you get a trace heuristic that looks like this; it's a convex problem.
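Here is a hedged sketch of that trace heuristic for the factor model; the data generation, the Frobenius-norm fit, and the value of beta are invented stand-ins for whatever the actual example used.

```python
import cvxpy as cp
import numpy as np

# Invented data: samples from a 3-factor model plus firm-specific (diagonal) noise.
np.random.seed(7)
n, r, T = 20, 3, 3000
F = np.random.randn(n, r)
d_true = 0.1 + np.random.rand(n)
samples = np.random.randn(T, r) @ F.T + np.random.randn(T, n) * np.sqrt(d_true)
Sigma = np.cov(samples, rowvar=False)

X = cp.Variable((n, n), PSD=True)   # low-rank part; trace(X) is the rank surrogate
d = cp.Variable(n, nonneg=True)     # diagonal (firm-specific) variances
beta = 0.1                          # sweep beta to trace out the rank/fit tradeoff
obj = cp.Minimize(cp.norm(Sigma - X - cp.diag(d), "fro") + beta * cp.trace(X))
cp.Problem(obj).solve()

eigs = np.linalg.eigvalsh(X.value)
print(int(np.sum(eigs > 1e-3 * eigs.max())))   # numerical rank of the recovered X
```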
In the example, we generate 3,000 samples. Now, by the way, we are estimating a 20-by-20 covariance matrix, which is about 200 numbers, so you have maybe 15 times as many samples as numbers you are supposed to estimate -- and each sample is 20 numbers, so in terms of raw scalars it's a couple of hundred times; but you asked about the covariance, so okay. And this is what happens. The true model is rank three. As you crank up beta -- that's the multiplier on the trace term -- the rank of X goes down, and down, and down. The top rank is 20, but you immediately get a rank-15 model, and then you get a very steep drop down to three. That, by the way, is the hint that rank three is going to give you a nice fit. If you keep going -- if you increase beta enough -- it goes down to two and then stays there; actually, I guess it would go down to zero at some point, but we didn't show beta [inaudible] large. What's interesting is, as we scan beta, this plot shows the eigenvalues. Up here you have 15 nonzero ones, and you can see that at different values of beta, eigenvalues get extinguished: they go to zero. And right here you end up with three from here on, and in fact they are the right ones. So if we take beta to be some number in this range, we actually do get a rank-three model. You can do polishing in a case like this, too, obviously -- you can figure out what polishing means here. In this case, if we take 0.1357 from the trade-off curve, you find that the angle between the estimated subspace -- the range of X -- and the true factor subspace is 6.8 degrees, and that we nailed the diagonal entries: we actually got the firm-specific volatilities to within about 7 percent. So this is just an example of this kind of thing.

Okay, so this pretty much covers the whole topic of L_1 and cardinality. And the idea is that, instead of just thinking of it as a basic method where you reflexively throw in an L_1 norm, these extensions show, first of all, that it comes up in other areas, like rank minimization. These iterated methods -- actually, very few people know about them, and they are not much used. Most people aren't even using polishing. Most people aren't even using the asymmetric L_1 stuff. So if you are interested in getting sparse solutions, there are actually better things than plain L_1 available. And certainly these things like the L_p -- I can't say it; L_p norms, L_p measures, I don't know what word to use there that is okay; I'll just say it: L_p, quote, norms, unquote, for p less than one -- those log approximations and those iterations: all of these things work.

And I should also say: Emmanuel Candès and I had a long conversation about a year ago, and he said that L_1 is the least squares of the 21st century. And, okay, that's good, actually; it's not bad. I think it's a little bit of an exaggeration, but it's not too far off. It basically says that, the same way everybody needed to know about least squares throughout the 20th century -- and eventually everybody did -- the same is going to happen here. And by the way, by least squares I mean fancy methods like Kalman filtering, and quadratic control, and things like that.
If you call all of that least squares, then a lot of signal processing, image processing, and all sorts of other stuff ended up being least squares, period. And I think a lot is going to end up being L_1 as people move forward. So, okay, we'll quit here unless there are some questions about this material. The next topic is going to be model predictive control -- we are going to jump to that next. Good. We'll quit here.