Great, I guess this means we've started. So today, we'll continue with the conjugate gradient stuff. Last time -- let me just review where we were. It was actually yesterday, but logically we can pretend it's five days ago or whatever it would be. So we're looking at solving symmetric positive definite systems of equations, and this comes up in Newton's method, in interior point methods, least squares, all these sorts of things. Last time we talked about the CG method. The basic idea is it's a method which solves Ax = b where A is positive definite, but it does so in a different way. The ways you've seen before are factor-solve methods. In those methods what you need is the matrix itself: you pass a pointer to an array of numbers, roughly, and then you work on that. What's extremely interesting about the CG method is that the way A is described is completely different. You do not give a set of numbers. In fact, in most of the interesting applications of CG you will never form or store the matrix A, ever, because in most cases it's huge -- some vast thing. That's kind of the point. Instead, what you need is simply a method for calculating A times a vector. So what you really pass into a CG method is a pointer to a function that evaluates this product. How it evaluates it is its business, none of yours. That's the idea. Okay, last time we started looking at this a little bit. We looked at two different measures of the error. One measure is this number tau, and it's the decrease you've achieved in f, the quadratic function you're minimizing -- sorry, it's the suboptimality divided by the original suboptimality. That's probably what you're interested in. But another one, which in many applications is actually easier to get a handle on, is the residual. The residual is nothing but b - Ax; it's basically how far you are from solving Ax = b. Now the CG method, and actually many others, work with the idea of a Krylov subspace. Just to rapidly review what we did last time: the Krylov subspace is the span of b, Ab, up to A^(k-1)b. That essentially means it's all vectors that can be written as a polynomial of A, of degree k - 1, times b. That's a subspace of dimension at most k, but it can actually be less -- and if it's less than k, it means you've solved the problem. Okay, so x_k is the minimizer of f, this quadratic function you're minimizing, over the Krylov subspace. The CG algorithm, and several others, generate the Krylov sequence; that's actually the important part. Now, along the Krylov sequence this quadratic function obviously decreases. That's obvious because you're minimizing over a bigger and bigger subspace at each step, and that can't get worse. The residual, however, can actually increase; it's not monotone. Now it turns out that if you run n steps you get x*, which follows from the Cayley-Hamilton theorem.
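Just to make that calling convention concrete -- this is a toy sketch with made-up names, not any particular library's API:

```python
import numpy as np

# Factor-solve methods need the entries of A explicitly, e.g.
#   x = np.linalg.solve(A, b)   # A is passed as an array of numbers
#
# CG instead takes a black-box matrix-vector multiply. For example,
# if A = I + C C^T with C tall and skinny, we never form the huge
# n-by-n matrix A:
def mat_vec(z, C):
    # computes (I + C C^T) z with two skinny multiplies,
    # never building the n-by-n matrix
    return z + C @ (C.T @ z)

# A CG routine is then called with the function, not the matrix, e.g.
#   x = cg(lambda z: mat_vec(z, C), b)   # hypothetical cg signature
```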
And the kth iterate of the Krylov sequence is actually a polynomial of degree k - 1 or less multiplied by b. Now the interesting part, and the reason these methods work really well -- although, by the way, there are plenty of cases where you're gonna do this for such a small number of steps that this is not that relevant -- is that there's a two-term recurrence. The two-term recurrence is this: you can compute the next point in the Krylov sequence as a linear combination of not just the previous point but the previous one and the one before that, and then there are some coefficients here, and we'll see what the coefficients are later. Actually, they're not that relevant; what matters is that they exist and are easily calculated. Okay, so we've seen this, and what I want to do now is understand how the CG method works, or how well it does. It's extremely important to get a good intuitive feel for when it's gonna work well. By the way, notice that you would never even say anything like that when you talk about direct methods. You wouldn't say, "Well, here's a 1000 by 1000 matrix. Oh yeah, here's one that's gonna work well." It always works well. You have a positive definite matrix, you take a Cholesky factorization; to zeroth and first order it does not depend on the data. Okay. With these, though, you need to know when it's gonna work well, because that's actually the key to all of these things. So the way to do this is essentially to diagonalize the system. Now, once you diagonalize it's kind of stupid, because if A is diagonal and I ask you, "How do you solve Ax = b?" that's easy -- A is diagonal, you just invert. So the solution, if I diagonalize, is just the ith entry of the transformed b divided by the ith diagonal entry of the transformed A, this lambda thing here. Very simple. But this is gonna give us an idea of when this works well. The optimal value is just this thing; it's nothing but -(1/2) b^T A^(-1) b. Okay? Now the Krylov sequence in terms of y is the same, except now we can look very carefully at this thing, because a polynomial of a diagonal matrix is really, really simple: it's a diagonal matrix whose entries are the polynomial evaluated at the diagonal entries, which are the eigenvalues of A. So you get very simple expressions, all in terms of the eigenvalues now. So it says that p_k is the polynomial of degree less than k that minimizes this sum, and notice it's got some nonnegative weights here. And over here you can kind of see what's going on. If you look carefully at this -- well, we'll say what p should look like in a minute. P should look like, you know, one over lambda or something like that to make this work well. Okay, another way to write it -- let's keep going down here and look at the second expression. The second expression says that this error is the minimum, with these positive weights, of terms that look like lambda_i times (lambda_i p(lambda_i) - 1) squared. And what that says is that if you can make p(lambda) look like 1/lambda on the eigenvalues of A, you're done -- in fact, if you could have p(lambda_i) = 1/lambda_i exactly, it's done.
This is zero, and then that says that this would in fact have to equal f*. Okay? So that's the idea. And we saw already, via the Cayley-Hamilton theorem, that there's an nth degree polynomial -- in fact we saw exactly what p_n is; it has to do with the characteristic polynomial -- that in all cases satisfies p(lambda_i) = 1/lambda_i, and that's why this converges in n steps. Now what's interesting here is this is gonna give us the intuition about when this works well. There are lots of other ways to say it. One is to say -- well, look, something that looks like that is a general polynomial of degree k with the property that if you plug in 0, this goes away and you get one. I mean, I switched the sign on it. So that's another way to say it. There are lots of these; I won't go into too many of the details, but the important part is the conclusions. So here are the conclusions. If you can find a polynomial of degree k that starts at one and is small on the spectrum of A, then no matter what the right-hand side b is, f(x_k) - f* is gonna be small. In particular, this says that if the eigenvalues are clustered into, let's say, k groups, then y_k is gonna be a really good approximate solution, because if I had k clusters of eigenvalues I could take a kth order polynomial and put it right through, say, the centers of those clusters, or near the centers, and then that polynomial would be really small on each of those clusters and we'd get a very good fit here. Now there's another way to do well, and that's to have y*_i small for all but a handful of indices. This says that if your solution is approximately a linear combination of k eigenvectors, then y_k is a good approximate solution. So that's another way to say it. Notice that the first statement is independent of the right-hand side and depends only on the matrix A. This one depends on the right-hand side but doesn't depend on the clustering: it basically says if the solution is a linear combination of k eigenvectors, then this must be a good approximation. So, okay. Now you can derive classical bounds from this. Classical bounds would be things like this: suppose the only thing I told you about A was its condition number kappa -- say I give you lambda_min and lambda_max, allowing for scaling. If you put a Chebyshev polynomial on that interval, that's a polynomial that's uniformly small across it, you end up with a conclusion that says this convergence measure goes down like that. And, by the way, a simple gradient method would have a kappa here and a kappa here -- if you just used the gradient method to minimize that function f, you'd get kappa where CG gets the square root of kappa. So you're supposed to say, "Wow, that's much better, because you get a square root." So that's the idea. It turns out this is interesting and that's fine, but where you actually want to use CG is where -- I mean, like many other things, this is an upper bound, and in practice the convergence you get is usually much, much faster. Not to high accuracy, but -- so let's do an example here. These plots show the polynomials q; they all start at 1 here. It's a 7 by 7 matrix.
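Before the plots, here's the classical bound in symbols -- reconstructed from the standard statement, since the slide formula isn't in the transcript, so take the constant factors as approximate:

```latex
% Classical CG bound via a Chebyshev polynomial on [lambda_min, lambda_max],
% with kappa = lambda_max / lambda_min (the constant varies by reference):
\[
\tau_k \;=\; \frac{f(x_k)-f^\star}{f(x_0)-f^\star}
\;\le\; 4\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^{2k},
\]
% versus plain gradient descent on the same quadratic, which gives roughly
\[
\tau_k \;\le\; \left(\frac{\kappa-1}{\kappa+1}\right)^{2k},
\]
% i.e., sqrt(kappa) appears for CG where kappa appears for the gradient method.
```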
Now, I think it goes without saying that CG is not the method of choice for solving a 7 by 7 positive definite system. The time to actually solve a 7 by 7 positive definite system is definitely sub-microsecond, obviously. So this is not the right way; nevertheless, we'll just look at an example. So here are the eigenvalues of A: there's one here, I guess it's around two, a couple down here, another cluster here, and an outlier out there. After one step of CG the optimal q is this thing. By the way, it goes without saying that you don't ever produce these polynomials. If you want to keep track of them, that's fine, but you don't need to. So this shows you that one, and you can see it's kind of doing its best: it's not so small here, nor is it small here, and it's pretty big there. The quadratic is this green one, and you can see it gets small near this cluster, and it kind of splits this cluster and this cluster and goes near them, so you get things like this. The cubic, I think that might be the purple one here, and the cubic is nice because you can see three clusters and it does just what you'd think: it goes down, goes right through this thing, it's very small here, small here, small here; over here it's quite small, still small, still small, and then it kind of goes down there and it's small here. So you could expect that after three steps of this method you're gonna get a very good solution. The quartic, I think that's the red one, maybe -- I'm not sure -- is now actually picking off bits and pieces. It's doing things like hitting both sides of a cluster, so it's small on all three here, it's small here, and now it's just nailing that one. And the seventh one is this one, and that's actually an interpolating polynomial, so it's zero on all of them, and that means that at step seven you get the exact answer. Okay? Now, I'm anthropomorphizing this, obviously -- all that's actually done is the Krylov sequence is computed. But this gives you a rough idea of how this works. Actually, you'll know shortly why you need to know when CG works well, because it's gonna be your job to change coordinates to make CG work. We'll see that in a minute. Okay, so here's the convergence, and sure enough -- that's the full decrease -- you can see that after five steps you've done very well, and I guess after four you've done extremely well, and so on. So that's the picture. And here's the residual, which in fact does go down monotonically. It didn't have to, but it did in this case. Now look at -- I mean, that's a fake example. Let's look at a real one. Here's an example: a resistor network with 100,000 nodes. It's a random graph, so each node is connected to an average of ten other nodes. Some big complicated resistor network. So, again, about a million nonzeros in the matrix G. I pick one node and make that the ground. Then I'm gonna do the following: I'm gonna inject at each node a current chosen randomly, uniform on [0, 1], and the resistor values will also be uniform on [0, 1]. Okay? So that's the problem. The details don't matter.
But in this case, if you want to use a sparse Cholesky factorization -- actually, before you ever do it you'll know what happens, because you can do the symbolic factorization on G and calculate the number of fill-ins. Okay? In this case there was too much fill-in. I don't know exactly how big it would be, but maybe 50 million, 100 million nonzeros or something like that, starting from 1 million. Oh, I should mention this: if the number of nonzeros goes up by a factor of 2, you should consider yourself lucky in a sparse matrix problem, and that means you must go and make an offering to the god that controls the ordering heuristics in sparse matrix methods. Ten, you know, that's okay; you're supposed to be happy, that's typical. When you start getting fill-in factors like 100 and such, that's because you didn't go make an offering the last time sparse matrices went your way. So, all right. So this problem you can't solve with a sparse Cholesky factorization -- and I actually shouldn't say that. I should say: using the approximate minimum degree ordering produces a Cholesky factor with too much fill-in. Of course, on a big machine I could have done it and gotten the exact answer, but it would have taken a lot of time. And it might be that there's another heuristic ordering method that would work perfectly well here. I doubt it, but there are lots of them. Okay, so instead we'll use CG here. Now, in this case I do form the matrix G explicitly, and all I have to do at each step is multiply by G. But that's just a sparse matrix-vector multiply with a million nonzeros, so I'm doing like a million flops per matrix-vector multiply, and that's the dominant effort of a CG iteration. So, I don't know, how much time does that take? [Student:]About a second. A second? Man, we've got to work on you guys. This is not cool. [Student:][Inaudible]. What? [Student:]Less than a millisecond? What? [Student:]Less than a millisecond. Thank you, less than a millisecond. So let's just say a millisecond and get the order of magnitude right: a millisecond. Okay? So a matrix-vector multiply is a millisecond here, roughly. Is that right? Yeah, sounds about right. Weird, isn't that strange? Man, these computers are fast. Okay, that shocks me. All right, so it's a millisecond per matrix-vector multiply. And it might be a few milliseconds because of all sorts of issues with accessing memory and things like that, but if it's set up right and you're lucky, it's on that order of magnitude. Okay, so here's how CG runs, and this is the residual here. You can see that for ten steps the residual actually goes up, by a factor of 100 -- generally considered not good. And then it goes back down again. But the wild thing is -- the theory tells us the following. The theory says that if you run it -- what was it, 100,000? Yeah. You know, the millisecond doesn't sound right to me, but I'll have to think about that; I have to do that for each one or something? Anyway, maybe it is right. It doesn't matter. The theory tells you that if k here runs out to 10^5, we'll get the solution exactly. But the wild part is, if you're not picky about super high accuracy, you actually have a perfectly good solution in 50 steps.
Each step was a matrix-vector multiply, and if our estimate of a millisecond is right, it means you just solved a very large, you know, diffusion equation, Poisson equation, whatever you want to call it. You just solved it in a quarter second -- if we're right -- or, you know, something like a tenth of a second. Everybody see what's going on here? By the way, absolutely nothing in the theory guaranteed that this would happen, absolutely nothing. Okay? It just did. And this is very common. This is kind of the cool part about CG: in a shockingly small number of steps, what emerges is often something that looks kind of like the solution. In this case it doesn't just look like the solution; it's actually quite good. Okay? So that's the idea. So if you wanted to know at what point you've learned something new, you just did. If you just type this thing, you know, G\i or whatever, and let a sparse solver run, even if you have a big computer it's gonna take a long time and a lot of RAM. And this will get a pretty good answer in 50 steps and just be way, way shorter. Okay, so here is the CG algorithm. There are many variations on it. People seem to focus more on the algorithm itself than on what the algorithm produces. In my opinion, it's much more important what the algorithm produces, which is the Krylov sequence; there are other methods that produce the Krylov sequence, and they have different roundoff properties and things like that. Let me show you what those are. So here's one, and instead of just making up my own I followed exactly the one from a very good book on the subject by Kelley -- just to not invent new notation or anything. The only important part you need to know is something like this: you maintain the square of the residual at each step. If the residual is small enough, you quit. The quitting criterion epsilon is on the ratio -- it's what we call eta -- the ratio of the current residual norm to the norm of b. That's this thing. And what you do now is something like this: your search direction p is a combination of the current residual and the previous search direction. Then all of the effort in here -- well, not necessarily, but in most cases -- is right there, this one thing right here. This is where the multiply-by-A method is called. Everything else, if you look, is O(n) -- a BLAS level-1 call, however you want to put it. So, for example, here you update a vector, that's O(n); that's O(n); you have to calculate an inner product -- actually, if you want to parallelize this, that's the one that's really irritating. Everything else here is completely distributed. So the main effort is this call to A. This other thing also has to be collected together, so that's another one that would not distribute easily, but that's the algorithm. By the way, these calculations can be rearranged like 50 different ways, and so you get different versions of it. In exact arithmetic, all of those ways compute exactly the same sequence x_k. With roundoff errors they can be different, and you'll find people talking about one versus the other, all sorts of different flavors, people telling you one way is better than another, and all that. Yeah.
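Here's a sketch of CG in the style just described (after Kelley's template): maintain the squared residual, stop on the relative residual eta, and make exactly one call to the A-oracle per step. The function name and signature are made up for illustration:

```python
import numpy as np

def cg(mat_vec, b, x0=None, eta=1e-6, max_iters=None):
    """Conjugate gradient sketch. mat_vec(z) returns A @ z; the matrix
    A itself is never formed. Stops when ||b - A x|| <= eta * ||b||."""
    n = b.shape[0]
    max_iters = max_iters or n
    x = np.zeros(n) if x0 is None else x0.copy()
    r = b - mat_vec(x)            # residual; equals b when x0 = 0
    rho = r @ r                   # maintain ||r||^2 at each step
    b_norm = np.linalg.norm(b)
    p = r.copy()
    for k in range(max_iters):
        if np.sqrt(rho) <= eta * b_norm:
            break                 # relative residual small enough; quit
        w = mat_vec(p)            # the one call to A per iteration;
                                  # everything else is O(n), BLAS level-1
        alpha = rho / (p @ w)     # exact minimization along p
        x += alpha * p
        r -= alpha * w
        rho_new = r @ r
        p = r + (rho_new / rho) * p   # next direction: combination of the
                                      # current residual and the previous p
        rho = rho_new
    return x
```

For the resistor network, the call would just be something like `v = cg(lambda z: G @ z, i_inject)`, with one sparse multiply per step.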
[Student:]Is there any way to know, like, if it will be faster than the theoretical convergence? No, I think the theory just gives you rough guidelines. It basically says if the eigenvalues are clustered -- for example, if they're tight, if the condition number of the matrix is small, that condition number bound [inaudible] will tell you it's gonna nail it in 50 steps. Okay? If they're spread out but have clusters or something like that, it's gonna nail it, that kind of thing. In general you don't know. Kind of the worst thing you can have is a spectrum that's sort of uniformly spread all over the place, right? That's kind of the worst thing. Now, it also depends on the right-hand side. The worst thing that could happen is that b is sort of a uniform mixture of all of the eigenvectors; that would be kind of the worst case. Okay, so let's talk about -- as I said at the beginning, there's a fundamental difference between this and a direct method. In a direct method you give the matrix A, you give the coefficients to the algorithm literally: you pass an array or some data structure with the entries. In CG you do not need that. All you need is a method that multiplies by your matrix A. There's nothing else you need. Okay? In interesting cases, that can be specialized hardware; it could actually carry out an actual experiment. It could solve a whole PDE, it could do insane things -- a full-up simulation of something, run Monte Carlo, all sorts of weird stuff. The point is, it doesn't have to be a stored matrix. So here are some examples of when you can do an efficient matrix-vector multiply. This is kind of obvious if A is sparse, for example. If it's sparse plus low rank, you'd better know the factorization. And if you want some numbers here, you should think of A as, like, a million by a million -- because if A is 10,000 by 10,000 or something, this is kind of not worth it. So your mental image should be that A is a million by a million, or 10 million by 10 million. That's a good number. Or 100 million by 100 million. These are the numbers you want to think about when you think of CG and how these things work. If you have products of easy-to-multiply matrices, that works. Fast transforms come up -- FFT, wavelet, DCT, these types of things -- and things like the fast Gauss transform, which is when A looks like this and you do the multiply via fast multipole methods. This one is actually extremely interesting: here's a matrix that is dense -- well, half dense -- and it's the inverse of a lower triangular matrix, in fact, the inverse of a sparse lower triangular matrix. That's a great example. So I give you a sparse lower triangular matrix; the inverse of that is generally completely dense. I couldn't even store it -- let's make it a million by million, in fact, let's make it a million by million and banded. The Cholesky factor of a million by million banded matrix -- of course, I could solve that system directly, but anyway -- is itself gonna be banded. It's small, it's easy to compute. The inverse is gonna be a full lower triangular matrix.
You can't even store it, and it would be completely foolish to actually calculate the inverse matrix and then multiply by it each time. But if I give you a vector and ask you to multiply by the inverse, it means you have to solve a lower triangular set of equations, and that you can do by back substitution -- I'm taking the case where it's sparse. Okay, so these are just examples. It's very good to think of examples like this. Okay, now a couple of things you can do. You can also shift things around. If you have a guess x-hat of the solution x*, then what you can do is this -- this is the residual with your guess. If you solve Az equals this residual, then your solution is just the optimal z added to x-hat. What happens now is that x_k looks like this: it's x-hat plus this, and it's the argmin of this quadratic over the shifted Krylov subspace -- you shift by this x-hat. And there's nothing you have to do to make this happen; all you need to do is initialize not with 0 in the first step but with x-hat, and everything will work. Now, this is also very important because it's good for warm starts. What this means is: if you need to solve, let's say, a giant circuit equation, a giant resistor equation -- if that's part of integrating a differential equation for a circuit, you will step forward a couple of picoseconds or whatever and then solve a similar system. You can use the last thing you just calculated as your guess, and this can often give you stunningly good results. You'll see an application of that in optimization soon. Okay, but now we come to the real way it's used, and I should say this: in most cases, if you simply run CG on a problem, it won't work. The theory says it will work, but that's not relevant, for several reasons. It's not relevant, number one, because if you have a million-dimensional system you have no intention of running a million steps -- you don't have that much time. That's the first thing. So to say that it works in practice means that it works in some very modest number of steps. Here's something you hear on the street: you're doing pretty well with CG when you're doing square root of n steps. Theory says you have to do n, but the word on the street is square root of n should do the trick for you. And if you think carefully about what that rule of thumb means, it says if you have a million variables, then in a thousand steps you should have a pretty good solution. Okay? It could be slow, you know, but you shouldn't be shocked if it's on the order of square root of n. That should be the target -- so 100 million variables, 10,000 steps, that kind of thing. Okay? These are the numbers. Okay, and the other reason the theory isn't that relevant is the following: when you're doing conjugate gradient, roundoff errors actually accumulate; the thing is unstable and diverges, actually, very quickly. So by itself CG often will just fail. If you just have some problem and you run it, it's just gonna fail. Okay, so the trick in CG is actually changing coordinates and then running CG on the system in the changed coordinates. That's the trick.
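In code, with the cg sketch from above, the warm start is just the x0 argument. This is a toy demonstration; the drifting right-hand side stands in for the time-stepped circuit:

```python
import numpy as np

# Toy warm-start demonstration: A is fixed, and the right-hand side
# drifts a little each "time step", as when integrating a circuit
# equation forward a few picoseconds at a time.
rng = np.random.default_rng(0)
B = rng.standard_normal((100, 100))
A = B @ B.T + 100 * np.eye(100)     # symmetric positive definite
mat_vec = lambda z: A @ z

b = rng.standard_normal(100)
x = cg(mat_vec, b)                  # cold start at this time step

b_next = b + 0.01 * rng.standard_normal(100)   # similar system
x_next = cg(mat_vec, b_next, x0=x)  # warm start from the last solve;
                                    # often converges in far fewer steps
```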
That's called preconditioning, and now you know why you needed to know when CG works, because the key idea is to change coordinates to make the spectrum of A, for example, clustered, or confined to a small interval. So that's the key. Here's the basic idea: you're gonna change coordinates, with y the coordinates of x under some transformation T, and what's gonna happen is you're gonna solve this equation here and then at the end recover x* from y*. Now, sometimes people call T the preconditioner, but then sometimes they call T T^T the preconditioner, and you get all combinations of "preconditioner" meaning T, T^T, M, T^(-1), and T^(-T) -- and there may be some other possibilities, like M^(-1). The point is that any time anybody says preconditioner, you have to look very carefully and figure out exactly what they mean. Okay, so that's the preconditioner. And basically what happens when you rerun CG is it's actually not that bad: you have to multiply by T and T^T at each step. The other option is to multiply simply by M once, and then you don't really need the final solve, in fact. So this is called PCG. And by the way, this is exactly where -- this is why, in these iterative methods, you have room for personal expression. If you're solving dense equations, there is no room for personal expression. If you do anything other than get BLAS, run ATLAS to [inaudible] cache sizes and things like that, and then write good code, it's just not right. There is no reason under any circumstances to do anything else. You go to sparse matrices -- actually, in sparse matrix methods, is there room for personal expression? Yeah, and what is it? [Student:]Ordering. Ordering, yeah. So there is some room for personal expression in sparse matrices, in the selection of the ordering, and that's cool. You can say, "Oh, I know my problem. I know a better ordering than approximate minimum degree is finding," or something like that. Or there are exotic ordering methods -- you can go look at METIS, a whole giant repository at Minnesota that's got all sorts of partitioning methods, and some of those are rumored to work really well. Okay, but move up the scale one step higher, to iterative methods -- now, boy, is there room for personal expression, and it is mostly expressed through the preconditioner. In fact, when you get into these things you'll find everything is about finding a good preconditioner for your problem. And then you can go nuts: you can have simple preconditioners, complex ones, unbelievably complex ones, and things like that. There's a lot to do. Okay, so the choice of preconditioner is this. Here's what you want: a matrix T for which T^T A T -- or a matrix M, let's say M. I want a matrix M for which the eigenvalues of MA are clustered, for example. And here's one choice, how about this: the inverse. That'll do it, because now the eigenvalues of MA are all 1; CG takes one step. It's kind of stupid, though, because you'd actually have to, like, invert the matrix or something, so that doesn't make any sense. So here's what you really want. What you really want is a matrix M which is somehow -- well, "approximate" can be very crude here --
an approximate inverse of the matrix that is nevertheless fast to multiply by. That's what you want. Okay? So that's the essence of it. And this M can be quite crude -- you'll see "approximate" is very generous here; you can be way, way off. Here are some examples. The most famous one, and often very effective, is diagonal preconditioning. It's kind of stupid, but you should always try it immediately, first, because it often just works. You would not call the inverted diagonal of the matrix an excellent approximation of the inverse, but you have to admit it's cheap to multiply by, and it works quite well -- amazing. Here's another huge family, actually really cool: it's called incomplete or approximate Cholesky factorization. Here's how that works. What I'm gonna do is compute a Cholesky factorization not of A but of some A-hat. In fact, there's a whole field on this; it's also called ILU, incomplete LU factorization. These are preconditioners based on this. And the way it works is this: you take the matrix and you start doing a Cholesky factorization on it, but you might decide, if an entry is too small during the factorization, "Oh, screw it. Just forget it. I'm just gonna ignore it." Or if you're about to fill in in various places -- which is the death of direct methods -- you just say, "Forget it. I'll ignore it." So you end up calculating a sparse Cholesky factor, but it's definitely not a factor of the original matrix, right? These methods can also work really, really well. And here's an example: you can keep just the central band. That's a version of this, too, where you basically say: any fill-in of L outside some band, I refuse to even -- I won't even go there. So these are sort of the obvious things, and they can work really well. A good example would be something like this: suppose you have a problem where something is naturally ordered, for example in space or in time, like in signal processing or control, where you have time. Then things are connected locally -- states, transitions, signals and things like that -- and that leads to a banded system. Now, banded systems you can solve super fast; we all know that. But suppose you have a dynamic system or a signal processing problem where you have a main bandwidth of, like, 10 or 20, and then a light sprinkling of little entries all over the place everywhere else. So for some weird reason in your problem, the state here is related to -- I don't know, I'm just making this up. But the point is, you could easily make up an example like that. Everybody see what I'm saying? So it's banded plus a little light sprinkling of nonzero entries in other places, okay? This would work unbelievably well: you simply ignore all the crap off the band, do a Cholesky factorization on the banded part, and that's your preconditioner. The only thing you're working to fix is that little light sprinkling of entries outside the band. So that's the idea.
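Here's what diagonally preconditioned CG looks like as a sketch, in the same style as the cg function above. The vector d holds diag(A), so M = diag(1/d) plays the role of the crude approximate inverse; the names are made up:

```python
import numpy as np

def pcg(mat_vec, b, d, eta=1e-6, max_iters=None):
    """Diagonally preconditioned CG sketch. mat_vec(z) returns A @ z,
    d holds diag(A), and M = diag(1/d) is the preconditioner.
    Still exactly one mat_vec call per step."""
    n = b.shape[0]
    max_iters = max_iters or n
    x = np.zeros(n)
    r = b.copy()
    z = r / d                     # apply M: just an elementwise divide
    p = z.copy()
    rho = r @ z
    b_norm = np.linalg.norm(b)
    for k in range(max_iters):
        w = mat_vec(p)            # the one call to the A-oracle
        alpha = rho / (p @ w)
        x += alpha * p
        r -= alpha * w
        if np.linalg.norm(r) <= eta * b_norm:
            break
        z = r / d                 # preconditioned residual
        rho_new = r @ z
        p = z + (rho_new / rho) * p
        rho = rho_new
    return x
```

Swapping in a different preconditioner just means replacing the two `r / d` lines with whatever "apply M" means for your problem (a banded solve for incomplete Cholesky, say).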
By the way, a lot of the stuff I'm talking about -- these are whole fields. This is the basis of all scientific computing, PDEs, all of that, so there are tons of people who know all of this stuff backwards and forwards. A lot of it hasn't diffused to other groups for some reason, so these are actually just really good things to know about. Okay, so here's that same example, the 100,000-node, million-resistor circuit, now with diagonal preconditioning. It's kind of silly, because it was already working unbelievably well, but you can see what diagonal preconditioning does. Diagonal preconditioning is actually mostly useful not for this kind of speedup, but for when the residual goes like this, goes like that, and then would go back up like this, okay? Theory says it doesn't do that; that's all from roundoff error. And in fact, here's a very common CG stopping criterion used on the street: you run CG until the answer starts getting worse, then you stop, and you just say, "Well, that's it, sorry. Here's what it is." I know that's done; I know people who do that in image processing and computational astronomy and all sorts of things. They run CG until it starts getting worse -- that usually only takes a couple hundred steps, and then basically you can keep running CG, but things get worse because of the numerical errors: you've lost orthogonality and all sorts of things happen. Okay, so here's the summary of CG. Actually, the summary is all that matters. So here it is. In theory -- and that means with exact arithmetic -- it converges to a solution in n steps, period. That's not particularly interesting or relevant, because n you should think of as on the order of a million, roughly, and the whole point is you have no intention of doing a million steps. Your goal is to do it in a thousand steps, a couple hundred, whatever. If you're really lucky, 10, 20, 30 -- these are the numbers you really want. Okay, now the bad news is that actual numerical roundoff error makes this thing work much worse than you would predict. The good news is that with luck -- and that means a good spectrum of A -- you can get a good approximate solution in much less than n steps. Right? Now the other main difference with a CG-type method is this, and this is very important: you never form or need the matrix A. Anyway, you can't store a million by million dense matrix -- well, maybe you could, but the point is you don't want to. So you don't need the matrix, you don't need the coefficients the way you do for direct methods, where you pass in a data structure that's got all the coefficients in it. Okay? Here, all you need is a method to do the matrix-vector multiplication. That means you have to rethink everything you knew about linear algebra in terms of this model, where what you really have is a matrix-vector multiply oracle, and that's it. What the oracle does, how it's implemented, is none of your business. You can call it: you give it a vector and it comes back with Az. And by the way -- this would be very bad -- but you could actually get the coefficients of A from the oracle. How would you do that? [Student:]Multiplying by -- Yeah, e_i. [Student:]-- e_i. Exactly.
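In code, the trick the student is describing is just this (a toy sketch; mat_vec stands for the oracle):

```python
import numpy as np

def recover_matrix(mat_vec, n):
    """Recover all of A with n oracle calls, one unit vector at a time.
    Only sensible for small n -- the whole point of CG is to never
    do this."""
    A = np.zeros((n, n))
    for i in range(n):
        e_i = np.zeros(n)
        e_i[i] = 1.0
        A[:, i] = mat_vec(e_i)    # A e_i is the i-th column of A
    return A
```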
So you call the oracle with e_1, and what comes back is the first column of A. Then you do it for e_2, second column, and after e_n you've got all of A. Then you could pack it into an array and call LAPACK, whatever. But that's not the point here. Okay, and it's very important to fix in your mind how this is different from factor-solve methods. It's less reliable, and it's data dependent. So, for example, that circuit I showed you had a random topology. I might take another circuit that's got, like, a pinch point or something in it, and after 10,000 iterations nothing happens, or it's even worse -- it diverges. These methods are data dependent, okay? And this is, roughly speaking, not the case for direct methods. And you can think of this either as the bad way or the good way. The bad way is: these methods don't work in general unless you change coordinates with a clever preconditioner. The good way is to say that CG methods have -- I mean, the employment prospects are very positive, because you don't just call a CG method; you need somebody to sit there, figure out the problem, try a bunch of preconditioners, tune things, and find out what works. So you can think of that as the good way: there's room for considerable personal expression in CG methods. On the other hand, you can hardly complain -- this is really your only method for solving a problem with a million, 10 million, 100 million variables. So if you made the mistake of going into a field where that's needed, then that's your problem. [Student:]What -- so if you have A in operator form, it's very, very hard to get some of the preconditioners you mentioned before. Do you just use easier ones? Yes -- okay, yeah. So if A is in operator form, usually you know the operator, so it's not this thing where you don't even know what the oracle does. The theory says you could do that. Well, no -- for that you have to sneak around the oracle. It's a good point: it's not a pure oracle protocol. Just to get the diagonal, you'd send a messenger around the back of the protocol and say, "Would you mind telling me what your diagonal is?" Normally a reasonable oracle will be willing to tell you its diagonal. But that's a very good point: if you have to find your diagonal by calls to the oracle, you're screwed. Yeah, okay. Any other questions about this? Because we'll go on to the next topic, which is actually kind of obvious -- it's just applying this to optimization. Okay, so these are called truncated Newton methods. Actually, there are a lot of other names for them; we'll get to those in a bit. They're called approximate, inexact, truncated, and you can do this all the way up to interior point methods. So let's see how this works. Here's Newton's method: you want to minimize a smooth convex function with second derivatives. What you do is calculate the Newton step by solving this Newton system here -- it's symmetric positive definite, that's symmetric positive definite. And you might do that using a Cholesky factorization.
This would be the standard method for problems that are either dense with up to a couple thousand variables, or problems that go up to 10,000 or even 100,000 variables but where the sparsity is such that the Cholesky factorization can be carried out. So that's what you do here. Then you do a backtracking line search on the function value -- it could be on the function value or on the norm, either way -- and you'd stop based on the Newton decrement, that's this thing, or the norm of the gradient. That would be your typical stopping criterion. Okay, so an approximate or inexact Newton method goes like this: instead of solving that Newton system exactly, we're gonna solve it approximately. And the argument is something like this: you don't really have to get the Newton direction exactly; that's kind of silly. In fact, for convergence you only need a descent direction. Of course, the advantage of the Newton method, especially for these smooth convex functions, is that it works so unbelievably well and you get terminal quadratic convergence and things like that. So what you really want is to trade off two things. You want to do a fast and, sort of, crappy job of approximately computing the Newton step -- you want a fast delta-x. But if it's too crappy a job, the algorithm is gonna make slow progress. By the way, as to whether it makes progress at all: it always will, because, as you'll see, almost any method for approximately solving the Newton system generates, at every single step, a direction which is a descent direction. So convergence of the method, at least theoretically, is guaranteed. What you can lose is things like terminal quadratic convergence. Now, if you're solving gigantic problems you may not be interested in terminal quadratic convergence; you're interested in just solving the thing, even to modest accuracy, in an amount of time, for example, less than a human lifetime. But again, this is your problem for solving such big problems -- your fault, I should say. Okay, so this is the idea. Now here are some examples. One is to simply replace the Hessian with a diagonal, or a band of it, and use that for the Newton step, because if it's diagonal or banded, this system can be solved incredibly efficiently. So that's one method. Another method -- and I think we even explored these in 364a a bit, or maybe you did a homework problem on this; I can't remember, and if you didn't, you should have -- is called the Shamanskii method: you factor the Hessian every K iterations and then use that factorization for a while. And, of course, that requires a factor-solve method. Oh, that reminds me, I have to say something. In a dense factor-solve method there's a big difference between the factor and the solve: the factor typically goes like n cubed and the solve goes like n squared. Right? So you can say weird things. You can say things like this: "I need to solve Ax = b." And then, "No, actually I need to solve Ax_1 = b_1, Ax_2 = b_2, Ax_3 = b_3." You want to solve it multiple times with multiple right-hand sides.
In a dense factor-solve method it's essentially the same cost, because the expensive part was the factorization. Once you've factorized, you've put the investment in, and you can solve multiple versions of that problem very cheaply. That's the key behind this. Okay? Iterative methods: nothing like that, none. Want to hear the cost of solving Ax = b by an iterative method, if solving it once costs C? Then the cost of solving Ax = b along with another system A x-tilde = b-tilde is 2C. There's no speedup; you simply call the solver again. Everybody see what I'm saying? So some things you probably just got used to -- for example, that once you've factored, back solves are essentially free because they're an order lower -- do not transfer to these iterative methods. Okay, so that's the Shamanskii method; it also works quite well. Okay, the truncated Newton method is very, very simple. It does this: you run either CG or PCG and you terminate early -- and in fact, it's very important to understand that the key is not just to terminate early; the key is to terminate way early. That's kind of the hope here. If you want stunning results, it's gonna have to be something like that. Now, these also have lots of other names. They're called Newton iterative methods; it's also called limited memory Newton, also limited memory BFGS. You have to go through all these horrible equations -- everyone has their own notation system -- but in the end you find out it's sort of the same method. So these methods have lots of names. Okay, now in a situation like this, the cost per iteration is completely irrelevant. The cost is measured by the number of CG steps -- actually, by the number of calls you make to the Hessian multiply. That's the exact cost. And to make these things work well, you're gonna need to tune the CG stopping criterion: you want to use just enough steps to get a good enough search direction. If you use too many CG steps you'll get a nicer search direction, but it's gonna take you longer to get it. If you use way too many CG steps, you're basically doing Newton's method. That's great, because then, whatever it is, 12 steps and it's all over -- but each of those steps involves 4,000 CG iterations or something. At the other extreme, if you use too few steps, you're basically doing gradient descent; it's gonna jam and take a long time. Okay, now it's, of course, less reliable than Newton's method, which is essentially completely reliable. But with good tuning, a good preconditioner, a fast Hessian multiply, and some luck on your problem, you can handle very large problems. I should say something else about these methods. One can write a general purpose convex optimization solver, and you've been using several -- SDPT3, [inaudible], all these things. If you think about it, that's really very impressive: they don't know what problem is gonna be thrown at them. The problems can be scaled horribly -- I mean, you can make the scaling so horrible it's not gonna work, but they can be scaled badly; you can have weird, sick, flat feasible sets, all sorts of weird stuff people throw at these things every day, and they do a pretty good job. They're general purpose. Okay?
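Pulling the truncated Newton recipe together in one place -- a sketch only, reusing the cg function from earlier; grad, hess_vec, and the parameter names are hypothetical stand-ins, with n_max_cg and eps_pcg playing the roles of the iteration cap and tolerance discussed below:

```python
import numpy as np

def truncated_newton(grad, hess_vec, x0, n_max_cg=10, eps_pcg=0.1,
                     tol=1e-6, max_outer=50, alpha=0.01, beta=0.5):
    """Truncated Newton sketch: approximately solve H dx = -g with a few
    CG steps, then backtrack. grad(x) returns the gradient and
    hess_vec(x, v) returns H(x) @ v -- H itself is never formed."""
    x = x0.copy()
    for _ in range(max_outer):
        g = grad(x)
        g_norm = np.linalg.norm(g)
        if g_norm <= tol:
            break
        # Way-early-terminated CG on the Newton system H dx = -g.
        # Started from 0, even one CG step gives a descent direction.
        dx = cg(lambda v: hess_vec(x, v), -g,
                eta=eps_pcg, max_iters=n_max_cg)
        # Backtracking line search on the gradient norm (could equally
        # be done on the function value).
        t = 1.0
        for _ in range(30):
            if np.linalg.norm(grad(x + t * dx)) <= (1 - alpha * t) * g_norm:
                break
            t *= beta
        x += t * dx
    return x
```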
But when you get into these huge systems, it's much more ad hoc, and ad hoc means literally that: everything is built for this one thing. Although people have tried to come up with general purpose iterative methods, and they're getting close with some things, I think, in some cases -- generally speaking, you're gonna need to tune stuff. It has to be for a particular problem. Not a particular instance, by the way, but a problem class. You have to say, "I'm interested in total variation denoising," or "I'm interested in this PET estimation problem" in medical imaging or something, or "I'm interested in this aircraft flow scheduling problem." It has to be a specific problem. Now, the good news is this: once you've fixed a particular problem class, I am not aware of any case where these methods cannot be made to work. Obviously that's not a general statement -- I can't back it up with anything -- but every time I've ever looked at any problem, as long as it's a specific thing, a network flow problem, a gate sizing problem, a problem in finance or machine learning, these work, always. They don't work in 15 minutes; it takes a while. Maybe not a long while. So that's kind of the good news about these methods. And it's kind of the answer to this: maybe once a day somebody comes up to me and whines -- often by email, actually, because they're not here -- and they go, "I have my problem in CVX, I can't solve it," blah, blah, blah. And it's like, "It worked fine for 10,000 variables, it doesn't work for 100,000." And I'm like, "Well, first of all, it's MATLAB. It's made so you can rapidly prototype your problem. If you need to solve with 100,000 variables, you need to know how these things work." So this is the answer to people like that. They're like, "No, I don't want to write my own software, it's too hard," and everything like that. And I'm like, "Then don't solve problems with 100,000 variables. Just go away, don't bother me." But the bottom line is: you want to solve a problem with 10 million or 100 million variables in a specific area, like a specific little area, it's gonna work, basically. It's gonna require some luck, and it's not gonna happen in 15 minutes. Okay, so here's the truncated Newton method. You do a backtracking line search on this; you can do a backtracking line search on f as well. The typical CG termination rule stops when this -- your Newton system residual, divided by the norm of the gradient, which, by the way, is the right-hand side -- is small. So this is exactly the eta that we had before. This is an okay rule; all of these things are kind of heuristic, so exactly what you use to stop maybe doesn't matter so much. And with simple rules, you'll max out the iterations at some constant number, and you'll have this constant epsilon_pcg, which might be, like, 0.1. That would be a reasonable number there. You wouldn't want to put 0.01, and you sure as hell would not want to put 1e-4 -- 1e-4 says you're basically calculating the exact Newton direction, and there's no need for that.
Well, maybe to get the thing up and running and verify your code works, you might start with a small problem and make this 1e-4, just to make sure you're actually calculating the Newton step -- it should look exactly like a Newton trace at that point. But then you might want this to be 0.1. A more sophisticated method would adapt the maximum number of steps, or this tolerance, as the algorithm proceeds, and the argument would go something like this. Early on, you don't need to solve the system well. The whole point of Newton is that it changes coordinates correctly in the endgame, so that if you're stuck in some sort of little canyon, instead of bouncing across the canyon walls as you would in the gradient method, you're skewed towards a direct shot at the minimum. That is what Newton's method is. So you'd argue that at first, Newton's method is, in many cases, wasting your time. So at first this might start at 0.1 and then get smaller, modulated by how close you are to the solution. There's a lot of lore on this, and you can find it in books and things like that, but a lot of it is just playing with it. You'd find some theory on this too, and it's mildly interesting. You'd go find some thing from the '70s or '80s, or in some book, that would tell you, "If you do this, you're guaranteed superlinear convergence." I guess my response to that is: we're not really trying to achieve superlinear convergence; we're trying to solve problems with 10 million variables. [Student:]Right. So the line search tends to take, like, a long time. Is there -- It shouldn't. Why would it take a long time? [Student:]Because essentially, I guess, it just has to do a lot of queries to get to the, like, I guess -- It shouldn't. [Student:]It [inaudible] with your criteria for -- It should not. [Student:]It should? Yeah, a general rule of thumb is: you said beta equals a half, so every time you query, your step gets divided by two. If you do five or six of those, that's too many. The average number of line search steps you should be doing is like three, or two, no more -- often one point something. So I'm suspicious of that; I'm just telling you what people experience. And in any case you have to evaluate f -- you're gonna evaluate it at every step here. Not every CG step, right? But every outer iteration you're gonna evaluate the gradient of f, and I haven't added that into the cost here. So, yeah, that would come up. But you shouldn't be doing a lot of line search steps. I mean, first of all, Newton is best when you get into the endgame, where you're doing none -- if it's really Newton, you should be doing none, and that translates over here. When you get down into the region of quadratic convergence with a method like this, even though you don't have the exact Newton direction aiming you right at the solution, it's a good enough direction that a full step should be taken. So it's actually kind of a bad sign if you're doing more than a couple of backtracking steps. And if late in the algorithm you're doing lots of backtracking steps, something doesn't sound right.
Is that a specific problem you're thinking of? [Student:]It's just -- yeah, I guess. Okay, we'll talk about it later. Okay, now you also have the question of CG initialization. One way is to initialize with delta-x = 0. It turns out that in that case -- go back and look at the CG algorithm -- it's very simple: you're solving a system whose right-hand side is minus the gradient, and in the first step you minimize over the line spanned by that right-hand side. So after one step of CG, starting from 0, what pops out is something along the negative gradient direction. So if you take CG and set N_max = 1, meaning do only one step, it turns out the steps popped out by CG are scaled versions of the negative gradient. And in fact, you can prove things like this -- that might even be a homework problem, in which case we'll assign it to you, or if there isn't a homework problem on it, maybe there should be one. You can prove the following: when you run CG to do an approximate Newton method, every step of CG takes the angle of your search direction closer to the Newton direction. Everybody see what I'm saying? So let me show you how that works. Let's just make it quadratic -- it doesn't really matter. Here you are; you don't know it's quadratic, let's say, and you are right here, okay? Now, the negative gradient direction says: search in that direction, right? But the Newton direction says, "No, I think actually that's a better step." That's Newton, right? If it's quadratic, it nails it in one step. And what you can show is this: if you run CG starting from 0, your first step will be in the negative gradient direction, and every subsequent step is simply gonna move over, in angle, towards the Newton direction. Okay? So everybody see the idea here? You can even think of CG as, you know, modifying your step to take into account more and more of the curvature. That's a good model for what it does. Okay, now another option is this: you can actually use the previous search direction and do a warm start. One problem with that is that if you start with zero, every step of CG gives a descent direction, and this might not be the case here. But it can give you an advantage: if you're only gonna do ten steps, if that's what your tuning suggests, you can actually do really well by just using the previous direction. And a simple scheme is something like this: if the previous step is a descent direction for the current point, use it to initialize; otherwise, initialize from zero. These are just schemes. So let's look at an example. It's L2-regularized logistic regression. Here's my problem; it's gonna look like this. What this is: I am fitting the coefficients in a logistic model right here, and there's L1 regularization over here. What that means is you have a logistic model: you have a binary Boolean outcome, a vector of features, and a giant pile of data that says, "Here are the features and here's the outcome, plus or minus 1." You know, the patient died or didn't, or the stock price went up or it went down. And I give you -- let's say, oh, to make it interesting -- a million features. Okay?
And I give you a thousand samples. Now obviously that's horrendously overfit -- overfit a thousand to one, if I give you a thousand patient records and a million features. So what the L1 -- hey, wait a minute here. Sorry, I'm going to have to go back and cut out that whole discussion; I thought that was L1. All right, we'll just rewind and skip all that, sorry. We're just going to do L2-regularized. Actually, if you can do this one, you can do that one. L1-regularization would let you do it with a million features: it would run and come back with 37 features and say, "Here are the 37 out of the million that look interesting; with these I can make a logistic model that predicts these thousand medical outcomes well," or something like that. But sorry, this is not that; this is just an example.

Okay, so the Hessian and gradient look like this -- they should start looking very familiar: the Hessian equals A^T D A + 2 lambda I, where that regularization term is diagonal, and the gradient looks very similar. None of the details matter; the important point is that all you have to be able to do is multiply a vector by the Hessian. When you do that, the key is that you never form this matrix: it's easy to make up cases where I can store A but I can't store A^T D A, let alone factorize it -- and "factorize it" is a joke; in this case I can't even store it. So I multiply in stages, which means I need two methods. By the way, these often correspond to a forward model call and a reverse, or adjoint, call. So per CG step I make one forward model call and one adjoint call; there's a sketch of this multiply below.

We'll make an example here with 10,000 features and 20,000 samples. The x_i's have a random sparsity pattern, with around 10 nonzero entries each, chosen at random. None of this matters, but you end up with about half a million nonzeros, and we made the problem small enough that we could actually do the symbolic factorization, just to know what we're dealing with. The factorization gave us 30 million nonzeros in the Cholesky factor. By the way, that's something we can handle, but as you can see, this method will be orders of magnitude different.

Okay, so the methods are Newton, which we'll look at in a minute, and truncated Newton with various stopping rules; basically we're maxing out the iterations because we're asking for 0.01 percent error. Here's the convergence versus iteration -- these are outer iterations -- and Newton and the 250-step-CG version are right on top of each other. The whole problem takes, depending on your accuracy, something like 15 steps. This is exactly the kind of thing we expect from our friend Newton. Now, the cost of a Newton step versus the cost of 250 CG steps is orders of magnitude apart: the Cholesky factor had 30 million nonzeros; the original problem had something like 100,000 nonzeros. So you're way, way off, by orders of magnitude.
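Here is a minimal sketch of that Hessian multiply, assuming the form A^T D A + 2 lambda I given above with a scalar regularization weight lambda; A can be a scipy sparse matrix or anything supporting @, and all the names are hypothetical.

```python
import numpy as np

def hess_vec(A, d, lam, v):
    # One forward-model call (A @ v), a diagonal scaling by d (the
    # diagonal of D), and one adjoint call (A.T @ ...) per CG step --
    # the matrix A^T D A is never formed or stored.
    return A.T @ (d * (A @ v)) + 2.0 * lam * v
```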
The times for these -- you can see here that if you do only 50 CG steps you start losing a little bit, but not much. The ten-step version is very typical: you make excellent progress, and then down here -- that's this one -- you stall. The theory says it will keep going down, but we don't have time to wait for it. Now, the right way to look at this is convergence versus cumulative CG steps, and there you see a totally different thing. By the way, on that scale the first exact Newton step, in terms of effort, would be -- I don't know, we'd have to check, but probably over in my office, maybe a kilometer that way. Just to put these things in perspective. Now what you can see is very, very cool. The run that does ten CG steps, which is hardly going to win any prizes for approximating the Newton step, is doing amazingly well -- and then it just jams. By the way, if you're happy with a gradient of 1e-5 or so, this is ridiculous: the cumulative number of CG steps is 100, so a problem that would take a direct method -- I don't know exactly how long, but on the order of 20 or 30 minutes -- you're now solving in milliseconds. You're orders and orders of magnitude faster.

Oh, here are some times, so we can be concrete. With 50 or 250 CG steps you're basically the same. Here's Newton's method with N_max, and it jams near a gradient of 1e-6, but that's often just fine. The times: 1600 seconds -- half an hour, something like that -- as opposed to four seconds. These are pretty good factors, and worth going after. These are just much, much faster methods. Even the variant that jams is already way ahead; you're down in the sub-minute range.

If you do diagonal PCG, here's the same picture, and I think now it's simple: there's absolutely no doubt what the best thing to do is, and it's right here. You do diagonally preconditioned CG for your Newton step -- ten steps. Now that's ridiculous. If someone says, "What are you doing?" you say, "I'm doing a crappy job on Newton's method." And they say, "Really? How is it crappy?" And you go, "I'm using an iterative method to calculate the search direction." "Well, what's your iterative method?" "I do ten steps of CG and then I quit; whatever I have at that point, I pass to the algorithm to do a line search." Then they say, "Oh, do you have some theory that says that works?" And you go, "Oh yeah -- if I were to do 100,000 steps in exact arithmetic, I would have computed the Newton step." So no one can possibly justify why ten steps is enough here; you just can't do it. But these things now work unbelievably well, and that means you can scale: you can do a problem with 100 million variables, no problem. The numbers for this are almost silly: it takes 1600 seconds and then drops to 3 seconds, or 5 seconds as opposed to something like 45 minutes. These are real factors.
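Something like the following sketch captures the ten-step diagonally preconditioned scheme; the names, and the use of scipy's cg and LinearOperator, are my assumptions about an implementation, not the lecture's code.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def newton_step_pcg(hvp, diag_h, grad, n_max=10):
    # Approximate Newton step: run at most n_max iterations of CG on
    # H dx = -grad with a diagonal (Jacobi) preconditioner, then stop
    # and hand whatever we have to the line search.  hvp(v) returns
    # H @ v; diag_h is the Hessian diagonal, assumed cheap to compute.
    n = grad.size
    H = LinearOperator((n, n), matvec=hvp)
    M = LinearOperator((n, n), matvec=lambda v: v / diag_h)
    dx, _ = cg(H, -grad, maxiter=n_max, M=M)
    return dx
```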
And by the way, if you make the problem bigger, the direct method's cost goes up much faster than linearly, while this just goes up linearly. If you were to make it ten times bigger, you could still solve the problem -- this would take a minute, and the direct method would probably scale much worse than linearly. Okay, and you can use this for many, many things. You can use it for barrier and primal-dual methods, and I think what we're going to do is quit here and finish up this lecture next time. So we'll quit here.