Okay, good morning. Welcome back. Just one quick announcement for today, which is that this next discussion section as far as for the TAfs will mostly be on, sort of, a tutorial on Matlab and Octaves. So I know many of you already have program Matlab or Octave before, but in case not, and you want to, sort of, see along the tutorial on how direct terms and Matlab, please come to this next discussion section. What I want to do today is continue our discussion of Nai"ve Bayes, which is the learning algorithm that I started to discuss in the previous lecture and talk about a couple of different event models in Nai"ve Bayes, and then Ifll take a brief digression to talk about neural networks, which is something that I actually wonft spend a lot of time on, and then I want to start to talk about support vector machines, and support vector machines is the learning algorithms, the supervised learning algorithm that many people consider the most effective, off-the-shelf supervised learning algorithm. That point of view is debatable, but there are many people that hold that point of view, and wefll start discussing that today, and this will actually take us a few lectures to complete. So letfs talk about Nai"ve Bayes. To recap from the previous lecture, I started off describing spam classification as the most [inaudible] example for Nai"ve Bayes in which we would create feature vectors like these, right, that correspond to words in a dictionary. And so, you know, based on what words appear in a piece of email were represented as a feature vector with ones and zeros in the corresponding places, and Nai"ve Bayes was a generative learning algorithm, and by that I mean itfs an algorithm in which we model PFX given Y, and for Nai"ve Bayes, specifically, we modeled it as product from I equals one to N, PFXI given Y, and also we model PFY, and then we use Bayes Rule, right, to combine these two together, and so our predictions, when you give it a new piece of email you want to tell if itfs spam or not spam, you predict RFX over Y, PFY given X, which by Bayes Rule is RFX over Y, PFX given Y, times BY, okay? So this is Nai"ve Bayes, and just to draw attention to two things, one is that in this model, each of our features were zero, one, so indicating whether different words appear, and the length or the feature vector was, sort of, the length N of the feature vector was the number of words in the dictionary. So it might be on this version on the order of 50,000 words, say. What I want to do now is describe two variations on this algorithm. The first one is the simpler one, which itfs just a generalization to if XI takes on more values. So, you know, one thing thatfs commonly done is to apply Nai"ve Bayes to problems where some of these features, XI, takes on K values rather than just two values, and in that case, you actually build, sort of, a very similar model where PFX given Y is really the same thing, right, where now these are going to be multinomial probabilities rather than Bernoullifs because the XIfs can, maybe, take on up to K values. It turns out, the situation where -- one situation where this arises very commonly is if you have a feature thatfs actually continuous valued, and you choose to dispertise it, and you choose to take a continuous value feature and dispertise it into a finite set of K values, and so itfs a perfect example if you remember our very first supervised learning problem of predicting the price of houses. If you have the classification problem on these houses, so based on features of a house, and you want to predict whether or not the house will be sold in the next six months, say. Thatfs a classification problem, and once you use Nai"ve Bayes, then given a continuous value feature like the living area, you know, one pretty common thing to do would be take the continuous value living area and just dispertise it into a few -- discreet buckets, and so depending on whether the living area of the house is less than 500 square feet or between 1,000 and 1500 square feet, and so on, or whether itfs greater than 2,000 square feet, you choose the value of the corresponding feature, XI, to be one, two, three, or four, okay? So that was the first variation or generalization of Nai"ve Bayes I wanted to talk about. I should just check; are there questions about this? Okay. Cool. And so it turns out that in practice, itfs fairly common to use about ten buckets to dispertise a continuous value feature. I drew four here only to save on writing. The second and, sort of, final variation that I want to talk about for Nai"ve Bayes is a variation thatfs specific to classifying text documents, or, more generally, for classifying sequences. So the text document, like a piece of email, you can think of as a sequence of words and you can apply this, sort of, model Ifm about to describe to classifying other sequences as well, but let me just focus on text, and herefs the idea. So the Nai"ve Bayes algorithm as Ifve described it so far, right, given a piece of email, we were representing it using this binary vector value representation, and one of the things that this loses, for instance, is the number of times that different words appear, all right? So, for example, if some word appears a lot of times, and you see the word, you know, gbuyh a lot of times. You see the word gViagrah; it seems to be a common email example. You see the word Viagra a ton of times in the email, it is more likely to be spam than it appears, I guess, only once because even once, I guess, is enough. So let me just try a different, whatfs called an event model for Nai"ve Bayes that will take into account the number of times a word appears in the email, and to give this previous model a name as well this particular model for text classification is called the Multivariate Bernoulli Event Model. Itfs not a great name. Donft worry about what the name means. It refers to the fact that there are multiple Bernoulli random variables, but itfs really -- donft worry about what the name means. In contrast, what I want to do now is describe a different representation for email in terms of the feature vector, and this is called the Multinomial Event Model, and, again, there is a rationale behind the name, but itfs slightly cryptic, so donft worry about why itfs called the Multinomial Event Model; itfs just called that. And herefs what wefre gonna do, given a piece of email, Ifm going to represent my email as a feature vector, and so my IF training example, XI will be a feature vector, XI sub group one, XI sub group two, XI subscript NI where NI is equal to the number of words in this email, right? So if one of my training examples is an email with 300 words in it, then I represent this email via a feature vector with 300 elements, and each of these elements of the feature vector -- lets see. Let me just write this as X subscript J. These will be an index into my dictionary, okay? And so if my dictionary has 50,000 words, then each position in my feature vector will be a variable that takes on one of 50,000 possible values corresponding to what word appeared in the J position of my email, okay? So, in other words, Ifm gonna take all the words in my email and you have a feature vector that just says which word in my dictionary was each word in the email, okay? So a different definition for NI now, NI now varies and is different for every training example, and this XJ is now indexed into the dictionary. You know, the components of the feature vector are no longer binary random variables; theyfre these indices in the dictionary that take on a much larger set of values. And so our generative model for this will be that the joint distribution over X and Y will be that, where again N is now the length of the email, all right? So the way to think about this formula is you imagine that there was some probably distribution over emails. Therefs some random distribution that generates the emails, and that process proceeds as follows: First, Y is chosen, first the class label. Is someone gonna send you spam email or not spam emails is chosen for us. So first Y, the random variable Y, the class label of spam or not spam is generated, and then having decided whether they sent you spam or not spam, someone iterates over all 300 positions of the email, or 300 words that are going to compose them as email, and would generate words from some distribution that depends on whether they chose to send you spam or not spam. So if they sent you spam, theyfll send you words -- theyfll tend to generate words like, you know, buy, and Viagra, and whatever at discounts, sale, whatever. And if somebody chose to send you not spam, then theyfll send you, sort of, the more normal words you get in an email, okay? So, sort of, just careful, right? XI here has a very different definition from the previous event model, and N has a very different definition from the previous event model. And so the parameters of this model are -- letfs see. Phi subscript K given Y equals one, which is the probability that, you know, conditioned on someone deciding to spend you spam, whatfs the probability that the next word they choose to email you in the spam email is going to be word K, and similarly, you know, sort of, same thing -- well, Ifll just write it out, I guess -- and Phi Y and just same as before, okay? So these are the parameters of the model, and given a training set, you can work out the maximum likelihood estimates of the parameters. So the maximum likelihood estimate of the parameters will be equal to -- and now Ifm gonna write one of these, you know, big indicator function things again. Itfll be a sum over your training sets indicator whether that was spam times the sum over all the words in that email where N subscript I is the number of words in email I in your training set, times indicator XIJ, SK times that I, okay? So the numerator says sum over all your emails and take into account all the emails that had class label one, take into account only of the emails that were spam because if Y equals zero, then this is zero, and this would go away, and then times sum over all the words in your spam email, and it counts up the number of times you observed the word K in your spam emails. So, in other words, the numerator is look at all the spam emails in your training set and count up the total number of times the word K appeared in this email. The denominator then is sum over I into our training set of whenever one of your examples is spam, you know, sum up the length of that spam email, and so the denominator is the total length of all of the spam emails in your training set. And so the ratio is just out of all your spam emails, what is the fraction of words in all your spam emails that were word K, and thatfs your estimate for the probability of the next piece of spam mail generating the word K in any given position, okay? At the end of the previous lecture, I talked about LaPlace smoothing, and so when you do that as well, you add one to the numerator and K to the denominator, and this is the LaPlace smoothed estimate of this parameter, okay? And similarly, you do the same thing for -- and you can work out the estimates for the other parameters yourself, okay? So itfs very similar. Yeah? [Student:]Ifm sorry. On the right on the top, I was just wondering what the X of I is, and what the N of -- Right. So in this second event model, the definition for XI and the definition for N are different, right? So here -- well, this is for one example XY. So here, N is the number of words in a given email, right? And if itfs the I email subscripting then this N subscript I, and so N will be different for different training examples, and here XI will be, you know, these values from 1 to 50,000, and XI is essentially the identity of the Ith word in a given piece of email, okay? So thatfs why this is grouping, or this is a product over all the different words of your email of their probability the Ith word in your email, conditioned on Y. Yeah? [Student:][Off mic]. Oh, no, actually, you know what, I apologize. I just realized that overload the notation, right, and I shouldnft have used K here. Let me use a different alphabet and see if that makes sense; does that make sense? Oh, you know what, Ifm sorry. Youfre absolutely right. Thank you. All right. So in LaPlace smoothing, that shouldnft be K. This should be, you know, 50,000, if you have 50,000 words in your dictionary. Yeah, thanks. Great. I stole notation from the previous lecture and didnft translate it properly. So LaPlace smoothing, right, this is the number of possible values that the random variable XI can take on. Cool. Raise your hand if this makes sense? Okay. Some of you, are there more questions about this? Yeah. [Student:]On LaPlace smoothing, the denominator and the plus A is the number of values that Y could take? Yeah, letfs see. So LaPlace smoothing is a method to give you, sort of, hopefully, better estimates of their probability distribution over a multinomial, and so was I using X to Y in the previous lecture? So in trying to estimate the probability over a multinomial -- I think X and Y are different. I think -- was it X or Y? I think it was X, actually. Well -- oh, I see, right, right. I think I was using a different definition for the random variable Y because suppose you have a multinomial random variable, X which takes on -- letfs use a different alphabet. Suppose you have a multinomial random variable X which takes on L different values, then the maximum likelihood estimate for the probability of X, PFX equals K, will be equal to, right, the number of observations. The maximum likelihood estimate for the probability of X being equal to K will be the number of observations of X equals K divided by the total number of observations of X, okay? So thatfs the maximum likelihood estimate. And to add LaPlace smoothing to this, you, sort of, add one to the numerator, and you add L to the denominator where L was the number of possible values that X can take on. So, in this case, this is a probability that X equals K, and X can take on 50,000 values if 50,000 is the length of your dictionary; it may be something else, but thatfs why I add 50,000 to the denominator. Are there other questions? Yeah. [Student:]Is there a specific definition for a maximum likelihood estimation of a parameter? Wefve talked about it a couple times, and all the examples make sense, but I donft know what the, like, general formula for it is. I see. Yeah, right. So the definition of maximum likelihood -- so the question is whatfs the definition for maximum likelihood estimate? So actually in todayfs lecture and the previous lecture when I talk about Gaussian Discriminant Analysis I was, sort of, throwing out the maximum likelihood estimates on the board without proving them. The way to actually work this out is to actually write down the likelihood. So the way to figure out all of these maximum likelihood estimates is to write down the likelihood of the parameters, phi K given Y being zero, phi Y, right? And so given a training set, the likelihood, I guess, I should be writing log likelihood will be the log of the product of I equals one to N, PFXI, YI, you know, parameterized by these things, okay? Where PFXI, YI, right, is given by NI, PFX, YJ given YI. They are parameterized by -- well, Ifll just drop the parameters to write this more simply -- oh, I just put it in -- times PFYI, okay? So this is my log likelihood, and so the way you get the maximum likelihood estimate of the parameters is you -- so if given a fixed training set, given a set of fixed IYIfs, you maximize this in terms of these parameters, and then you get the maximum likelihood estimates that Ifve been writing out. So in a previous section of todayfs lecture I wrote out some maximum likelihood estimates for the Gaussian Discriminant Analysis model, and for Nai"ve Bayes, and then this -- I didnft prove them, but you get to, sort of, play with that yourself in the homework problem as well and for one of these models, and youfll be able to verify that when you maximize the likelihood and maximize the log likelihood that hopefully you do get the same formulas as what I was drawing up on the board, but a way is to find the way these are derived is by maximizing this, okay? Cool. All right. So that wraps up what I wanted to say about -- oh, so that, more or less, wraps up what I wanted to say about Nai"ve Bayes, and it turns out that for text classification, the Nai"ve Bayes algorithm with this second event model, the last Nai"ve Bayes model I presented with the multinomial event model, it turns out that almost always does better than the first Nai"ve Bayes model I talked about when youfre applying it to the specific case -- to the specific of text classification, and one of the reasons is hypothesized for this is that this second model, the multinomial event model, takes into account the number of times a word appears in a document, whereas the former model doesnft. I should say that in truth that actually turns out not to be completely understood why the latter model does better than the former one for text classification, and, sort of, researchers are still debating about it a little bit, but if you ever have a text classification problem, you know, Nai"ve Bayes Classify is probably not, by far, the best learning algorithm out there, but it is relatively straightforward to implement, and itfs a very good algorithm to try if you have a text classification problem, okay? Still a question? Yeah. [Student:]So the second model is still positioning a variant, right? It doesnft actually care where the words are. Yes, all right. [Student:]And, I mean, X variable, if my model like you had exclamation in, does that usually do better if you have enough data? Yeah, so the question is, sort of, the second model, right? The second model, the multinomial event model actually doesnft care about the ordering of the words. You can shuffle all the words in the email, and it does exactly the same thing. So in natural language processing, therefs actually another name; itfs called a Unique Grand Model in natural language processing, and therefs some other models like, sort of, say, higher order markup models that take into account some of the ordering of the words. It turns out that for text classification, the models like the bigram models or trigram models, I believe they do only very slightly better, if at all, but thatfs when youfre applying them to text classification, okay? All right. So the next thing I want to talk about is to start again to discussion of non-linear classifiers. So it turns out -- well, and so the very first classification algorithm we talked about was logistic regression, which had the forming form for hypothesis, and you can think of this as predicting one when this estimator probability is greater or equal to 0.5 and predicting zero, right, when this is less than 0.5, and given a training set, right? Logistic regression will maybe do grade and descends or something or use Newtonfs method to find a straight line that reasonably separates the positive and negative classes. But sometimes a data set just canft be separated by a straight line, so is there an algorithm that will let you start to learn these sorts of non-linear division boundaries? And so how do you go about getting a non-linear classifier? And, by the way, one cool result is that remember when I said -- when we talked about generative learning algorithms, I said that if you assume Y given X is exponential family, right, with parameter A, and if you build a generative learning algorithm using this, right, plus one, if this is A to one. This is exponential family with natural parameter A to zero, right. I think when we talked about Gaussian Discriminant Analysis, I said that if this holds true, then you end up with a logistic posterior. It actually turns out that a Nai"ve Bayes model actually falls into this as well. So the Nai"ve Bayes model actually falls into this exponential family as well, and, therefore, under the Nai"ve Bayes model, youfre actually using this other linear classifier as well, okay? So the question is how can you start to get non-linear classifiers? And Ifm going to talk about one method today which is -- and we started to talk about it very briefly which is taking a simpler algorithm like logistic regression and using it to build up to more complex non-linear classifiers, okay? So to motivate this discussion, Ifm going to use the little picture -- letfs see. So suppose you have features X1, X2, and X3, and so by convention, Ifm gonna follow our earlier convention that X0 is set to one, and so Ifm gonna use a little diagram like this to denote our logistic regression unit, okay? So think of a little picture like that, you know, this little circle as denoting a computation note that takes this input, you know, several features and then it outputs another number thatfs X subscript theta of X, given by a sigmoid function, and so this little computational unit -- well, will have parameters theta. Now, in order to get non-linear division boundaries, all we need to do -- well, at least one thing to do is just come up with a way to represent hypotheses that can output non-linear division boundaries, right, and so this is -- when you put a bunch of those little pictures that I drew on the previous board, you can then get whatfs called a Neural Network in which you think of having my features here and then I would feed them to say a few of these little sigmoidal units, and these together will feed into yet another sigmoidal unit, say, which will output my final output H subscript theta of X, okay? And just to give these things names, let me call the values output by these three intermediate sigmoidal units; let me call them A1, A2, A3. And let me just be completely concrete about what this formula represents, right? So each of these units in the middle will have their own associated set of parameters, and so the value A1 will be computed as G of X transpose, and then some set of parameters, which Ifll write as theta one, and similarly, A2 will be computed as G of X transpose theta two, and A3 will be G of X transpose, theta three, where G is the sigmoid function, all right? So G of Z, and then, finally, our hypothesis will output G of A transpose theta four, right? Where, you know, this A vector is a vector of A1, A2, A3. We can append another one to it at first if you want, okay? Let me just draw up here this -- Ifm sorry about the cluttered board. And so H subscript theta of X, this is a function of all the parameters theta one through theta four, and so one way to learn parameters for this model is to write down the cost function, say, J of theta equals one-half sum from Y equals one to M, YI minus H subscript theta of XI squared, say. Okay, so thatfs our familiar quadratic cost function, and so one way to learn the parameters of an algorithm like this is to just use gradient interscent to minimize J of theta as a function of theta, okay? See, in the phi gradient descent to minimize this square area, which stated differently means you use gradient descent to make the predictions of your neural network as close as possible to what you observed as the labels in your training set, okay? So it turns out green descent on this neural network is a specific name, the algorithm that implements grand descent is called back propagation, and so if you ever hear that all that means is -- it just means gradient interscent on a cost function like this or a variation of this on the neural network that looks like that, and -- well, this algorithm actually has some advantages and disadvantages, but let me actually show you. So, letfs see. One of the interesting things about the neural network is that you can look at what these intermediate notes are computing, right? So this neural network has whatfs called a hidden layer before you then have the output layer, and, more generally, you can actually have inputs feed into these computation units, feed into more layers of computation units, to even more layers, to more layers, and then finally you have an output layer at the end And one cool thing you can do is look at all of these intermediate units, look at these units and whatfs called a hidden layer of the neural network. Donft worry about why itfs called that. Look at computations of the hidden unit and ask what is the hidden unit computing the neural network? So to, maybe, get a better sense of neural networks might be doing, let me show you a video -- Ifm gonna switch to the laptop -- this is made by a friend, Yann LeCun whofs currently a professor at New York University. Can I show a video on the laptop? So let me show you a video from Yann LeCun on a neural network that he developed for Hammerton Digit Recognition. There was one other thing he did in this neural network that Ifm not gonna talk about called a Convolutional Neural Network that -- well, his system is called LeNet, and letfs see. Would you put on the laptop display? Hum, actually maybe if -- or you can just put on the screen on the side; that would work too if the big screen isnft working. Letfs see. Ifm just trying to think, okay, how do I keep you guys entertained while wefre waiting for the video to come on? Well, let me say a few more things about neural network. So it turns out that when you write a quadratic cost function like I wrote down on the chalkboard just now, it turns out that unlike logistic regression, that will almost always respond to non-convex optimization problem, and so whereas for logistic regression if you run gradient descent or Newtonfs method or whatever, you converse the global optimer. This is not true for neural networks. In general, there are lots of local optimer and, sort of, much harder optimization problem. So neural networks, if youfre, sort of, familiar with them, and youfre good at making design choices like what learning rate to use, and how many hidden units to use, and so on, you can, sort of, get them to be fairly effective, and therefs, sort of, often ongoing debates about, you know, is this learning algorithm better, or is that learning algorithm better? The vast majority of machine learning researchers today seem to perceive support vector machines, which is what Ifll talk about later, to be a much more effective off-the-shelf learning algorithm than neural networks. This point of view is contested a bit, but so neural networks are not something that I personally use a lot right now because therefs a hard optimization problem and you should do so often verge, and it actually, sort of works. It, sort of, works reasonably well. Itfs just because this is fairly complicated, therefs not an algorithm that I use commonly or that my friends use all time. Oh, cool. So but let me just go and show you an example of neural network, which was for many years, you know, the most effective learning algorithm before support vector machines were invented. So herefs Yann LeCunfs video, and -- well, therefs actually audio on this too, the soundboard. So Ifll just tell you whatfs happening. What youfre seeing is a trained neural network, and this display where my mouse pointer is pointing, right, this big three there is the input to the neural network. So youfre showing the neural network this image, and itfs trying to recognize what is this. The final answer output by the neural network is this number up here, right below where it says LeNet-5, and the neural network correctly recognizes this image as a three, and if you look to the left of this image, whatfs interesting about this is the display on the left portion of this is actually showing the intermediate computations of the neural network. In other words, itfs showing you what are the hidden layers of the neural network computing. And so, for example, if you look at this one, the third image down from the top, this seems to be computing, you know, certain edges into digits, right? Wefre just computing digits on the right-hand side of the bottom or something of the input display of the input image, okay? So let me just play this video, and you can see some of the inputs and outputs of the neural network, and those are very different fonts. Therefs this robustness to noise. All right. Multiple digits, thatfs, kind of, cool. All right. So, just for fun, let me show you one more video, which was -- letfs see. This is another video from the various CVfs, the machine that changed the world, which was produced by WGBH Television in corporation with British Foreclass Incorporation, and it was aired on PBS a few years ago, I think. I want to show you a video describing the NETtalk Neural Network, which was developed by Terry Sejnowski; hefs a researcher. And so NETtalk was actually one of the major milestones in the history of neural network, and this specific application is getting the neural network to read text. So, in other words, can you show a piece of English to a computer and have the computer read, sort of, verbally produce sounds that could respond to the reading of the text. And it turns out that in the history of AI and the history of machine learning, this video created a lot of excitement about neural networks and about machine learning. Part of the reason was that Terry Sejnowski had the foresight to choose to use, in his video, a child-like voice talking about visiting your grandmotherfs house and so on. Youfll see it in a second, and so this really created the perception of -- created the impression of the neural network being like a young child learning how to speak, and talking about going to your grandmothers, and so on. So this actually helped generate a lot of excitement within academia and outside academia on neural networks, sort of, early in the history of neural networks. Ifm just gonna show you the video. [Begin Video] Youfre going to hear first what the network sounds like at the very beginning of the training, and it wonft sound like words, but itfll sound like attempts that will get better and better with time. [Computerfs voice] The network takes the letters, say the phrase, ggrandmotherfs house,h and makes a random attempt at pronouncing it. [Computerfs voice] Grandmotherfs house. The phonetic difference between the guess and the right pronunciation is sent back through the network. [Computerfs voice] Grandmotherfs house. By adjusting the connection strengths after each attempt, the net slowly improves. And, finally, after letting it train overnight, the next morning it sounds like this: Grandmotherfs house, Ifd like to go to my grandmotherfs house. Well, because she gives us candy. Well, and we -- NETtalk understands nothing about the language. It is simply associating letters with sounds. [End Video] All right. So at the time this was done, I mean, this is an amazing piece of work. I should say today there are other text to speech systems that work better than what you just saw, and youfll also appreciate getting candy from your grandmotherfs house is a little bit less impressive than talking about the Dow Jones falling 15 points, and profit taking, whatever. So but I wanted to show that just because that was another cool, major landmark in the history of neural networks. Okay. So letfs switch back to the chalkboard, and what I want to do next is tell you about Support Vector Machines, okay? That, sort of, wraps up our discussion on neural networks. So I started off talking about neural networks by motivating it as a way to get us to output non-linear classifiers, right? I donft really approve of it. It turns out that youfd be able to come up with non-linear division boundaries using a neural network like what I drew on the chalkboard earlier. Support Vector Machines will be another learning algorithm that will give us a way to come up with non-linear classifiers. Therefs a very effective, off-the-shelf learning algorithm, but it turns out that in the discussion Ifm gonna -- in the progression and development Ifm gonna pursue, Ifm actually going to start off by describing yet another class of linear classifiers with linear division boundaries, and only be later, sort of, in probably the next lecture or the one after that, that wefll then take the support vector machine idea and, sort of, do some clever things to it to make it work very well to generate non-linear division boundaries as well, okay? But wefll actually start by talking about linear classifiers a little bit more. And to do that, I want to convey two intuitions about classification. One is you think about logistic regression; we have this logistic function that was outputting the probability that Y equals one, and it crosses this line at zero. So when you run logistic regression, I want you to think of it as an algorithm that computes theta transpose X, and then it predicts one, right, if and only if, theta transpose X is greater than zero, right? IFF stands for if and only if. It means the same thing as a double implication, and it predicts zero, if and only if, theta transpose X is less than zero, okay? So if itfs the case that theta transpose X is much greater than zero, the double greater than sign means these are much greater than, all right. So if theta transpose X is much greater than zero, then, you know, think of that as a very confident prediction that Y is equal to one, right? If theta transpose X is much greater than zero, then wefre gonna predict one then moreover wefre very confident itfs one, and the picture for that is if theta transpose X is way out here, then wefre estimating that the probability of Y being equal to one on the sigmoid function, it will be very close to one. And, in the same way, if theta transpose X is much less than zero, then wefre very confident that Y is equal to zero. So wouldnft it be nice -- so when we fit logistic regression of some of the classifiers is your training set, then so wouldnft it be nice if, right, for all I such that Y is equal to one. We have theta transpose XI is much greater than zero, and for all I such that Y is equal to zero, we have theta transpose XI is much less than zero, okay? So wouldnft it be nice if this is true? That, essentially, if our training set, we can find parameters theta so that our learning algorithm not only makes correct classifications on all the examples in a training set, but further itfs, sort of, is very confident about all of those correct classifications. This is the first intuition that I want you to have, and wefll come back to this first intuition in a second when we talk about functional margins, okay? Wefll define this later. The second intuition that I want to convey, and it turns out for the rest of todayfs lecture Ifm going to assume that a training set is linearly separable, okay? So by that I mean for the rest of todayfs lecture, Ifm going to assume that there is indeed a straight line that can separate your training set, and wefll remove this assumption later, but just to develop the algorithm, letfs take away the linearly separable training set. And so therefs a sense that out of all the straight lines that separate the training set, you know, maybe that straight line isnft such a good one, and that one actually isnft such a great one either, but maybe that line in the middle is a much better linear separator than the others, right? And one reason that when you and I look at it this one seems best is because this line is just further from the data, all right? That is separates the data with a greater distance between your positive and your negative examples and division boundary, okay? And this second intuition, wefll come back to this shortly, about this final line that I drew being, maybe, the best line this notion of distance from the training examples. This is the second intuition I want to convey, and wefll formalize it later when we talk about geometric margins of our classifiers, okay? So in order to describe support vector machine, unfortunately, Ifm gonna have to pull a notation change, and, sort of, unfortunately, it, sort of, was impossible to do logistic regression, and support vector machines, and all the other algorithms using one completely consistent notation, and so Ifm actually gonna change notations slightly for linear classifiers, and that will actually make it much easier for us -- thatfll make it much easier later today and in next weekfs lectures to actually talk about support vector machine. But the notation that Ifm gonna use for the rest of today and for most of next week will be that my B equals Y, and instead of be zero, one, theyfll be minus one and plus one, and a development of a support vector machine we will have H, have a hypothesis output values to the either plus one or minus one, and so wefll let G of Z be equal to one if Z is greater or equal to zero, and minus one otherwise, right? So just rather than zero and one, we change everything to plus one and minus one. And, finally, whereas previously I wrote G subscript theta of X equals G of theta transpose X and we had the convention that X zero is equal to one, right? And so X is an RN plus one. Ifm gonna drop this convention of letting X zero equals a one, and letting X be an RN plus one, and instead Ifm going to parameterize my linear classifier as H subscript W, B of X equals G of W transpose X plus B, okay? And so B just now plays the role of theta zero, and W now plays the role of the rest of the parameters, theta one through theta N, okay? So just by separating out the interceptor B rather than lumping it together, itfll make it easier for us to develop support vector machines. So -- yes. [Student:][Off mic]. Oh, yes. Right, yes. So W is -- right. So W is a vector in RN, and X is now a vector in RN rather than N plus one, and a lowercase b is a real number. Okay. Now, letfs formalize the notion of functional margin and germesh margin. Let me make a definition. Ifm going to say that the functional margin of the hyper plane WB with respect to a specific training example, XIYI is -- WRT stands for with respect to -- the function margin of a hyper plane WB with respect to a certain training example, XIYI has been defined as Gamma Hat I equals YI times W transpose XI plus B, okay? And so a set of parameters, W, B defines a classifier -- it, sort of, defines a linear separating boundary, and so when I say hyper plane, I just mean the decision boundary thatfs defined by the parameters W, B. You know what, if youfre confused by the hyper plane term, just ignore it. The hyper plane of a classifier with parameters W, B with respect to a training example is given by this formula, okay? And interpretation of this is that if YI is equal to one, then for each to have a large functional margin, you want W transpose XI plus B to be large, right? And if YI is equal minus one, then in order for the functional margin to be large -- we, sort of, want the functional margins to large, but in order for the function margins to be large, if YI is equal to minus one, then the only way for this to be big is if W transpose XI plus B is much less than zero, okay? So this captures the intuition that we had earlier about functional margins -- the intuition we had earlier that if YI is equal to one, we want this to be big, and if YI is equal to minus one, we want this to be small, and this, sort of, practice of two cases into one statement that wefd like the functional margin to be large. And notice this is also that so long as YI times W transpose XY plus B, so long as this is greater than zero, that means we classified it correctly, okay? And one more definition, Ifm going to say that the functional margin of a hyper plane with respect to an entire training set is going to define gamma hat to be equal to min over all your training examples of gamma hat, I, right? So if you have a training set, if you have just more than one training example, Ifm going to define the functional margin with respect to the entire training set as the worst case of all of your functional margins of the entire training set. And so for now we should think of the first function like an intuition of saying that we would like the function margin to be large, and for our purposes, for now, letfs just say we would like the worst-case functional margin to be large, okay? And wefll change this a little bit later as well. Now, it turns out that therefs one little problem with this intuition that will, sort of, edge us later, which it actually turns out to be very easy to make the functional margin large, all right? So, for example, so as I have a classifiable parameters W and B. If I take W and multiply it by two and take B and multiply it by two, then if you refer to the definition of the functional margin, I guess that was what? Gamma I, gamma hat I equals YI times W times transpose B. If I double W and B, then I can easily double my functional margin. So this goal of making the functional margin large, in and of itself, isnft so useful because itfs easy to make the functional margin arbitrarily large just by scaling other parameters. And so maybe one thing we need to do later is add a normalization condition. For example, maybe we want to add a normalization condition that de-norm, the alter-norm of the parameter W is equal to one, and wefll come back to this in a second. All right. And then so -- Okay. Now, letfs talk about -- see how much time we have, 15 minutes. Well, see, Ifm trying to decide how much to try to do in the last 15 minutes. Okay. So letfs talk about the geometric margin, and so the geometric margin of a training example -- [inaudible], right? So the division boundary of my classifier is going to be given by the plane W transpose X plus B is equal to zero, okay? Right, and these are the X1, X2 axis, say, and wefre going to draw relatively few training examples here. Letfs say Ifm drawing deliberately few training examples so that I can add things to this, okay? And so assuming we classified an example correctly, Ifm going to define the geometric margin as just a geometric distance between a point between the training example -- yeah, between the training example XI, YI and the distance given by this separating line, given by this separating hyper plane, okay? Thatfs what Ifm going to define the geometric margin to be. And so Ifm gonna do some algebra fairly quickly. In case it doesnft make sense, and read through the lecture notes more carefully for details. Sort of, by standard geometry, the normal, or in other words, the vector thatfs 90 degrees to the separating hyper plane is going to be given by W divided by the norm of W; thatfs just how planes and high dimensions work. If this stuff -- some of this you have to use, take a look t the lecture notes on the website. And so letfs say this distance is gamma I, okay? And so Ifm going to use the convention that Ifll put a hat on top where Ifm referring to functional margins, and no hat on top for geometric margins. So letfs say geometric margin, as this example, is gamma I. That means that this point here, right, is going to be XI minus gamma I times W over normal W, okay? Because W over normal W is the unit vector, is the length one vector that is normal to the separating hyper plane, and so when we subtract gamma I times the unit vector from this point, XI, or at this point here is XI. So XI minus, you know, this little vector here is going to be this point that Ifve drawn as a heavy circle, okay? So this heavy point here is XI minus this vector, and this vector is gamma I time W over norm of W, okay? And so because this heavy point is on the separating hyper plane, right, this point must satisfy W transpose times that point equals zero, right? Because all points X on the separating hyper plane satisfy the equation W transpose X plus B equals zero, and so this point is on the separating hyper plane, therefore, it must satisfy W transpose this point -- oh, excuse me. Plus B is equal to zero, okay? Raise your hand if this makes sense so far? Oh, okay. Cool, most of you, but, again, Ifm, sort of, being slightly fast in this geometry. So if youfre not quite sure why this is a normal vector, or how I subtracted this, or whatever, take a look at the details in the lecture notes. And so what Ifm going to do is Ifll just take this equation, and Ifll solve for gamma, right? So this equation I just wrote down, solve this equation for gamma or gamma I, and you find that -- you saw that previous equation from gamma I -- well, why donft I just do it? You have W transpose XI plus B equals gamma I times W transpose W over norm of W; thatfs just equal to gamma times the norm of W because W transpose W is the norm of W squared, and, therefore, gamma is just -- well, transpose X equals, okay? And, in other words, this little calculation just showed us that if you have a training example XI, then the distance between XI and the separating hyper plane defined by the parameters W and B can be computed by this formula, okay? So the last thing I want to do is actually take into account the sign of the -- the correct classification of the training example. So Ifve been assuming that wefve been classifying an example correctly. So, more generally, to find the geometric margin of a training example to be gamma I equals YI times that thing on top, okay? And so this is very similar to the functional margin, except for the normalization by the norm of W, and so as before, you know, this says that so long as -- we would like the geometric margin to be large, and all that means is that so long as wefre classifying the example correctly, we would ideally hope of the example to be as far as possible from the separating hyper plane, so long as itfs on the right side of the separating hyper plane, and thatfs what YI multiplied into this does. And so a couple of easy facts, one is if the norm of W is equal to one, then the functional margin is equal to the geometric margin, and you see that quite easily, and, more generally, the geometric margin is just equal to the functional margin divided by the norm of W, okay? Letfs see, okay. And so one final definition is so far Ifve defined the geometric margin with respect to a single training example, and so as before, Ifll define the geometric margin with respect to an entire training set as gamma equals min over I of gamma I, all right? And so the maximum margin classifier, which is a precursor to the support vector machine, is the learning algorithm that chooses the parameters W and B so as to maximize the geometric margin, and so I just write that down. The maximum margin classified poses the following optimization problem. It says choose gamma, W, and B so as to maximize the geometric margin, subject to that YI times -- well, this is just one way to write it, subject to -- actually, do I write it like that? Yeah, fine. There are several ways to write this, and one of the things wefll do next time is actually -- Ifm trying to figure out if I can do this in five minutes. Ifm guessing this could be difficult. Well, so this maximizing your classifier is the maximization problem over parameter gamma W and B, and for now, it turns out that the geometric margin doesnft change depending on the norm of W, right? Because in the definition of the geometric margin, notice that wefre dividing by the norm of W anyway. So you can actually set the norm of W to be anything you want, and you can multiply W and B by any constant; it doesnft change the geometric margin. This will actually be important, and wefll come back to this later. Notice that you can take the parameters WB, and you can impose any normalization constant to it, or you can change W and B by any scaling factor and replace them by ten W and ten B whatever, and it does not change the geometric margin, okay? And so in this first formulation, Ifm just gonna impose a constraint and say that the norm of W was one, and so the function of the geometric margins will be the same, and then wefll say maximize the geometric margins subject to -- you maximize gamma subject to that every training example must have geometric margin at least gamma, and this is a geometric margin because when the norm of W is equal to one, then the functional of the geometric margin are identical, okay? So this is the maximum margin classifier, and it turns out that if you do this, itfll run, you know, maybe about as well as a -- maybe slight -- maybe comparable to logistic regression, but it turns out that as we develop this algorithm further, there will be a clever way to allow us to change this algorithm to let it work in infinite dimensional feature spaces and come up with very efficient non-linear classifiers. So therefs a ways to go before we turn this into a support vector machine, but this is the first step. So are there questions about this? Yeah. [Student:][Off mic]. For now, letfs just say youfre given a fixed training set, and you canft -- yeah, for now, letfs just say youfre given a fixed training set, and the scaling of the training set is not something you get to play with, right? So everything Ifve said is for a fixed training set, so that you canft change the Xfs, and you canft change the Yfs. Are there other questions? Okay. So all right. Next week we will take this, and wefll talk about authorization algorithms, and work our way towards turning this into one of the most effective off-the-shelf learning algorithms, and just a final reminder again, this next discussion session will be on Matlab and Octaves. So show up for that if you want to see a tutorial. Okay. See you guys in the next class.