Okay, we're experiencing a little technical difficulty here that's going to be resolved very quickly because Jason went to get the magic piece of equipment, but in the meantime, I'll just fake it with my big personality up here. Okay, so things that are going on: Pathfinder is due this Friday. So how many people have been making good progress on Pathfinder? You have one of the algorithms already implemented and working? Anybody with one already done? Almost done? Okay, maybe a little bit of a slow start on this one, but -- and then there's two pieces that you do have to get ready coming in this Friday. Do remember that you can use a late day on it if you absolutely have to, but we don't really recommend it, because it does kinda put you behind getting ready for finals. And our final is at the very end of finals week, so on Friday. So a week and something from today in the 12:15 slot. I don't know what room we'll be scheduled in, but you can guarantee it won't be Terman, that's all I can tell you for sure. And when we do know, we'll let you know. Anything else administratively I need to tell you guys? Okay, Bishop. Student: [Inaudible] There is an extremely limited-run optional final. You can talk your way into one Thursday if you can convince us of the compelling need for said thing. So send an email to Jason. I know a couple of you told me at the beginning of the quarter you were in this bind about being able to make it to the Friday exam. And so we -- against my better judgment, I allowed myself to be talked into that. Okay, all right, I'm gonna have to just do this kinda blind, because the piece I need is not here, so I'm going to actually just talk to you about what it says on my slides and then when Jason arrives we'll plug it back in and see what's going on. Oh, there's Jason. There we go. So what I'm going to do today is actually something kinda fun. And so I'm gonna take a case study of a particular data structure design, and the one I'm looking at here is the lexicon. So how many of you actually ever decided to just open up the lexicon file, just to see what was in there at some point during the quarter? We used it a couple of times. Anybody want to tell us what they learned by opening up that file? Anything interesting? Student: It's a data structure called a DAWG. It's a data structure called a DAWG. You're like, well, that's a pretty funny name for something. And was the code particularly well-[inaudible], easy to read, easy to understand? No. Who wrote that code? She should totally be fired. Yeah, so this is what I wrote kinda a long time ago, actually, to solve this particular problem. And actually I would like to walk through kinda how it is I came to discover the DAWG and why it is the DAWG was the choice that made sense for the usage pattern we had for the lexicon. So let me just refresh what the lexicon does. It is a special case of kinda a set or map. It is a container where you're going to stick things in, and they're going to be unique, right? There's no need for any sort of duplicate. The keys are always strings. And in fact, they're English words, so that's going to be an important thing to remember, that they're not just arbitrary sequences of letters, right? They have certain patterns to them and certain sizes that might actually be something we could capitalize on. When we know something about the domain, we're able to kinda special-case or tune our data structure to solve that problem very well, in particular.
It has no associated value, so it doesn't have dictionary definitions or synonyms or anything like that, right? So it really just is exists or doesn't exist that we were interested in supporting. And in addition to being able to add words and check if a word is contained, there's also this prefixing operation that we had talked a little about in Boggle, and that came back on the midterm. Like, why was prefixing important, right? It was critical for pruning these dead-end paths, these searches on the Boggle board that otherwise would really have taken a lot of time and kinda led to nothing of any value. So what I'm gonna talk you through is some of the obvious choices based on data structures we already know about, and see where they would go. So this is a thought experiment. It's not that I wrote all these to find out. Actually, I mostly did a lot of this on paper, trying to figure out what options we had and how they would play out and what kind of trade-offs I'd have to make to make them work. So the simplest case of any kind of collection is to think about whether a vector or array can do the job for you. In this case, having a vector that's kept in sorted -- alphabetically sorted -- order seems like a pretty good, easy way to get this up and running, and how would it behave? So in order to do a contains(word) on this data structure, what algorithm are you going to use to do your searching? Binary search, right? It's in sorted order, right? We've got to take advantage of that, right? If we do a linear search and I'm looking for the word "mediocre," I'm going to be trucking through, you know, thousands of words before I find it. So definitely want to be using binary search. Equals-equals, comparing my strings, less than, greater than, dividing in half. And so it should run in logarithmic time. Logarithmic time, one of those great performers that you learned about in the PQ assignment, hopefully. It's just basically not measurable on today's computers. So even for entries of 100,000, 500,000, it's basically free. How do we do contains(prefix)? Can we also do a fast prefix search given this data structure? I hope so, right? Thinking about the same kind of binary search. In this case, though, looking at substrings, right? So if I'm looking for a three-character prefix, I'm only looking at the first three characters to decide which way to go. Once I find something that matches in the first three characters, I'm done. So again, still using the sorted property to quickly narrow in on where it could be. When it's time to add a new word to this, if someone asked you to put in the word "abalone," I gotta use that binary search to find out where it goes. So in logarithmic time I can tell you where the new word goes, but then to put it into position, there's gonna be some shuffling. And that's where this operation starts to bog down, given the idea that you might be moving half or more of your items, on average, to make that space for that one to open up. It is going to require a linear-time operation to get new words into the data structure. Okay, probably still, though, a reasonable approach, right? These are the operations that get done immense numbers of times in Boggle, right? You're doing a ton of contains(word)'s and contains(prefix)'s as you're exhaustively searching that big board. But the add's are done kinda once at the beginning. You build that big dictionary and then you subsequently use it.
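Here's a minimal sketch of that sorted-vector idea in C++ (the class and method names are mine, not the actual course code): binary search handles both kinds of lookup, and the shuffle inside add is where the linear cost shows up.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Minimal sorted-vector lexicon sketch (illustrative; not the actual course code).
class SortedVectorLexicon {
public:
    bool containsWord(const std::string& word) const {
        // Binary search over the sorted vector: O(log n) comparisons.
        auto it = std::lower_bound(words.begin(), words.end(), word);
        return it != words.end() && *it == word;
    }

    bool containsPrefix(const std::string& prefix) const {
        // The first entry >= prefix is the only candidate: if any word starts
        // with the prefix, that one does.
        auto it = std::lower_bound(words.begin(), words.end(), prefix);
        return it != words.end() && it->compare(0, prefix.size(), prefix) == 0;
    }

    void add(const std::string& word) {
        // O(log n) to find the spot, but O(n) to shuffle later entries over.
        auto it = std::lower_bound(words.begin(), words.end(), word);
        if (it == words.end() || *it != word) words.insert(it, word);
    }

private:
    std::vector<std::string> words;   // kept in alphabetical order
};
```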
And so if it took a little bit more time to build that thing, sorta an n-squared operation to kinda load it up, well, maybe that would be okay. The other thing we want to look at is space usage, and that's -- and some of this is a little bit of an interesting artifact of the time I was working on this. It turns out space was actually a limiting factor in what we had available to us, in terms of the core memory of our machines and what we could take up for our dictionary. In this case, it has per entry the size of the string, and we don't exactly know how the string is represented. It's an abstraction to us, but we can make a simplifying assumption that it's probably about the length of the string times the size of a character. And there might be a little bit more housekeeping associated with it, but that's a pretty good estimate of what it's taking. So in fact, it's the string data itself that's being represented in this vector without much other overhead. So if we say words average about eight characters, that's about eight bytes per word. So if we have 100,000 of them, we expect to have 800,000 bytes invested in this sort of vector arrangement. Excess capacity and a couple other things might tweak that a little bit, but that gives us at least a good starting point to think about how this compares to alternatives. Okay, so then we said we know that sortedness -- that binary search is the real advantage of keeping it in sorted order, but the contiguous nature of the array kinda bites us in terms of insert. If we reorganize into that binary search tree to give ourselves the dynamic flexibility of wiring in the left and right halves through pointers, as opposed to contiguous memory, then we know we can also get add to run in logarithmic time. So looking at these operations, contains(word) is a tree search that's using equals-equals, left-right, deciding which way to go, and also performs logarithmically if the tree is balanced. Contains(prefix), same deal. Keep looking at the [inaudible], working our way down, knowing which way to go. And then add can also perform in logarithmic time because it does the same search, falling off the bottom and then inserting the new word there, or finding it along the way and not having any work to do. So we can get all our operations done in logarithmic time, which we know to be kinda fast enough to actually not be worth fighting any harder for. Where did we pay for this? Nothing really comes for free, right? Overhead, right? And overhead, in this case, memory is certainly one of the places where there's going to be a pretty big cost: we now have the string data for each entry, plus a left pointer and a right pointer. And if we're doing the work to keep it balanced, which we probably ought to do if we want to avoid the degenerate cases of it kinda getting totally out of whack and degenerating into linear time, we're going to have a couple more bits, you know, bytes thrown into that factor, too. And so if we're assuming that there's ten bytes of overhead -- four bytes in each of those pointers plus another two for the balance factor -- then we've added ten bytes of overhead on top of the eight bytes for the word. Getting a little bit hefty. Something to worry about. Also in terms of code complexity, which I didn't put a bullet up for, but it's also worth kinda keeping -- weighing in your mind -- which is that building the fancier, more complicated data structure probably meant you spent more time debugging it, more time developing it, right?
You will have -- it will be harder for someone later to maintain and deal with because it's actually just more complicated code. Somebody looking at a sorted vector isn't likely to get tripped up -- like, say they want to add a new operation, let's say they add remove, so you're able to take a word out of the lexicon. The lexicon doesn't currently support that. Somebody adding that on the sorted vector implementation probably won't find themselves too tripped up. Someone who has to do a remove out of a binary search tree -- removing it, reattaching the subtrees and not destroying the balancing -- that's a pretty complicated operation. So we've made -- we've invested in this data structure that actually will keep on creating further development hassles for any modifications that are down the road for this data structure. So it is a big investment of your brainpower to keep it working. So let's think about what a hash table can do for us. A hash table is the last thing that we looked at on Friday, kinda just moving away from this whole idea of sortedness and kinda moving it out of bounds off to this idea of a hashing function. So if I have a hash table that has a certain number of buckets and that uses a linked list for chaining to handle the collisions, then what I would have is my words just scattered around my table, allowing me to have quick access by hashing to the right bucket and then working my way down the bucket to see. So contains(word) would do a standard hash table lookup. Take the word, you know, "mediate," run it through the hash function. It says oh, it's bucket three in this case. I go to bucket three. I walk down the linked list. I find it or not. I can tell you that the word "mediate," if it's in the table, had to be in bucket three. No questions asked. In that case, the expected time for that is going to be n over b, where n is the number of entries, b is the number of buckets. If we have the number of buckets set to be roughly on the order of the number of entries, then we're going to get constant-time access there. Now we want to do contains(prefix). I want to know if there are any words in this table that begin with "med." Where do I look? If I run "med" through the hash function, do I expect that I would get the same number as other words that begin with "med," like "mediate" and "median" and "mediator"? Student: Yes. You say yes. All right, let's take that to an extreme. If I expect that a prefix and any longer words built on it all hash to the same thing, then should it be the case, for example -- think about the simplest case of a prefix, "m." Should all the words that begin with "m" hash to the same place? If they did, we're in trouble, right? If all prefixes did hash to the same place, then what I'm going to end up with is a really heavily clustered table where "a," "b," "c," you know, "z" -- there's 26 buckets that have all the "z" words, all the "y" words, all the whatnot, and none of the other buckets would be used. So in fact, actually, as part of the good behavior of a hash function, we're hoping that it does jumble things up. And then a word that looks like another one but has a letter added on the end, we're hoping it goes somewhere totally different, that "median" and "medians" hopefully go to two totally disparate places. If there were patterns where words that were substrings of each other all got hashed to the same location, we would end up with heavy clustering. Because there are a lot of words that repeat parts of other words in the dictionary.
So in fact, if I want to know if there are any words that begin with "med," I have no a priori information about where to look for them. They could be anywhere. I could hash "med" and see if "med" itself was a word, right? But let's say it was something that is just the beginning of something. "St," "str" -- there's a lot of words that begin with that, but "str" itself is not a word. So hashing to that bucket and looking there would only tell me if there's exactly "str," and then occasionally there might be some "str" words also there. But other than that, we just gotta look. And where do we look? Everywhere. Everywhere, everywhere, everywhere. In order to conclude there is no word that begins with the prefix we have, the only way to be confident of that is to either find something, or to have searched the entire table and not found it. And so it does require this completely linear, exhaustive look through the entire contents of the hash table to know that something, for example, is not a prefix. You might find one quickly. So in the best case, you might be in the first couple of buckets and happen upon one. But no guarantees, overall. And certainly when it's not a prefix, it's going to require a complete look-see of everything. That's kinda a bummer. Adding a word: back to the fast behavior again. Hashing into the bucket, checking to see if it's already there, adding it if needed [inaudible]. So it gets good times on these two, but then kinda really bottoms out when it comes to supporting that prefix search. Its space usage is kinda roughly comparable to that of the binary search tree. We've got the string data itself, and we've got the four-byte pointer on the linked list. There's also the four-byte bucket pointer, which is over in this other structure here, this array full of buckets. If that number, though, is roughly equal to the number of cells, then you can kinda just count it as each entry having this 12-byte cell, plus a four-byte bucket pointer somewhere that you can kinda think of as associated on a per-entry basis, to tell you it's about eight bytes of overhead on top of the string. So 16 bytes for each entry in the table. Okay, that's what we have so far. We've got pretty good performance on contains(word) no matter what we choose. We can get either logarithmic or constant time with any of these three data structures. So perfectly acceptable performance there. We cannot get contains(prefix) to run well on a hash table. It just doesn't happen the way hashing works. And then add can be fast on the two complicated pointer-based data structures to the right, but can't be fast on the sorted vector. And then we see kinda some trade-offs here in terms of space versus time: there's very little space overhead in the sorted vector case, but it really blossoms into two times, almost three times, the storage needed in the BST and hash table cases. The code also got a lot more complicated when we moved up to those data structures. So then I'll tell you, actually, that it turns out that the one type of performance that was interesting to us when I was exploring this, but in fact more important at the time, was the memory usage. The memory usage was really the big obstacle that we had. The Official Scrabble Players Dictionary II is actually the source for the word list that we're using. It has about 128,000 words, I think, is exactly the number it has. And the average length of those is eight characters. And the file itself is over a megabyte.
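Backing up to the hash table for a second, here's a rough sketch of that trade-off with a chained table (illustrative only; the names and bucket count are my own choices, not the course implementation). contains(word) touches only its own bucket, while contains(prefix) has no choice but to scan them all.

```cpp
#include <cstddef>
#include <functional>
#include <list>
#include <string>
#include <vector>

// Rough sketch of a chained hash table lexicon (illustrative; not the course code).
class HashTableLexicon {
public:
    explicit HashTableLexicon(std::size_t numBuckets = 100000) : buckets(numBuckets) {}

    void add(const std::string& word) {
        std::list<std::string>& bucket = buckets[bucketFor(word)];
        for (const std::string& w : bucket)
            if (w == word) return;               // already present, nothing to do
        bucket.push_back(word);                  // expected O(n/b), effectively constant
    }

    bool containsWord(const std::string& word) const {
        for (const std::string& w : buckets[bucketFor(word)])
            if (w == word) return true;          // only its own bucket can hold it
        return false;
    }

    bool containsPrefix(const std::string& prefix) const {
        // A good hash scatters "med," "median," and "mediate" to unrelated buckets,
        // so the only sure way to rule a prefix out is to scan the entire table.
        for (const std::list<std::string>& bucket : buckets)
            for (const std::string& w : bucket)
                if (w.compare(0, prefix.size(), prefix) == 0) return true;
        return false;
    }

private:
    std::size_t bucketFor(const std::string& s) const {
        return std::hash<std::string>()(s) % buckets.size();
    }
    std::vector<std::list<std::string>> buckets;
};
```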
So on disk, if you were just to look at the listing of all the words, it's a megabyte -- a little over a megabyte there. If we were loading it into the sorted vector, we'd get something that was just roughly equal to that. It would take about a megabyte of space, plus a little bit of noise. The other ones double that, taking it up to two, two and a half megabytes total. At the time we were working -- and I know this is going to seem completely archaic to you, but we had these stone tablets that we worked on. And they typically had main memories of about a megabyte, maybe, like in an exciting case, four megabytes; four whole megabytes of RAM. So it was not possible that we could dedicate one or two megabytes of your main memory to just your lexicon. It just wasn't -- there just wasn't enough space to go around. So we're working in this very tight environment in terms of that. So it's a little bit more like, for example, today's world of embedded devices. If you're building something for an iPod or a cell phone, you have a lot less gargantuan memory. On your average desktop machine, having a gigabyte of RAM, now you're like, whatever, a megabyte here, ten megabytes there, free, all the memory you want. And so this seems kinda like a silly thing to focus on, but this was 1995, and it turns out, yeah, there was certainly more interest in keeping it tight to the memory. The Boggle lexicon that we have actually takes under a third of a megabyte. The memory that's used while it's working and searching and checking for words and prefixes is actually smaller than the on-disk form of the data we started with. So it actually does kind of a bit of a compression as part of its strategy for storage, which is pretty interesting to think about. So I'll tell you, actually, the truth about how we first did Boggle, because it's kinda a little interesting historical perspective on it. Boggle was given by a colleague of mine for the first time, Todd Feldman. He was the guy who came up with the idea, which is a great idea. And he said, "Oh, this is a great assignment." And when he wrote it, he said, "Oh, I need a lexicon." So as part of kinda putting the assignment together, he was like, "Oh, we have to get some lexicons." So he looked around, and I think when we first found the big word list we said, "Oh, it's just impossible. It's going to take a megabyte or more and we just don't have that kind of space. We need much smaller data files." So we had a data file that had about 20,000 words, which then takes about 200K. And he actually wrote it using a hash table. So given that he wrote it as a hash table, he just didn't support contains(prefix). He just didn't put it in, because you can't do it efficiently. So it actually took a smaller amount of space and had a smaller word list and it wouldn't do contains(prefix). So then people wrote Boggle. And as you learned from that midterm question, you can write Boggle without contains(prefix). It spends a lot more time looking down these dead ends, but it eventually bottoms out when it runs off the board and uses all the cubes. And it actually, in some ways, almost made the program a little more satisfying to write in one odd way, which is -- so it's the computer's turn, right, that uses the prefix pruning. So you would type in your words and it would be perfectly fine finding them, but it doesn't use the prefix part when it's doing the human's turn. You're typing in words; it's finding them.
You're typing in the words; it's finding them. Then you would go to do the computer's turn and you'd say, "Okay, go." And now it would take a really long time. It'd be hunting around the board. It would be working so hard your fan would be coming on. You'd feel like oh, my computer's doing something. And then it would show up with twice as many words as you found or ten times as many words as you found, whatever. And in fact, then you would say, "Well, at least it had to work hard to beat me." And then you felt like, I wrote this program that causes my processor to get busy, right? I feel good. And in fact, actually, the word list was pretty small. So in fact, it wasn't even actually as likely to beat you because it didn't know as many words. 125,000 words is an enormous number of words. The average -- I've heard different figures on this, but the average person's English vocabulary is about 20,000 to 30,000 words. It tends to be about 10,000 higher if you've gone to college, so maybe 30,000 to 40,000. The fact that it knows 80,000 more words than the average college graduate means it knows a lot of obscure words, too. So in fact, not only does it very quickly find all those words and then produce these ridiculous words that you've never heard of, it's almost like it's taunting you by doing it in a millisecond. And back then it took a while, so it was actually like, okay, it beat me, but it had to work hard. But then -- so then I took over Boggle. I taught it maybe a quarter or two later and I said, "I'm going to write a lexicon. I'm going to go off and make a little project for myself, because I don't have enough work to do." And my goal was to make one that would do prefix, but that also would be tight enough on memory that we could actually load a much bigger dictionary file, because I thought I'd actually make the game even more fun to play. All right, so here's where I started. This is a little excerpt from the middle of the dictionary. "Stra" -- yeah, that's totally the thing you notice, right, when you look at this thing. You see "straddle," "straddler," "straddlers," "straddles," "straddling," and then these words that you can't believe you're allowed to use, but apparently you can put "er" and "ier" and "ies" and "ing" on almost anything out there and it makes a valid word. So although you never really thought about the fact that "straightforward," "straightforwardly," "straightforwardness" are all there, there is a huge amount of repetition. And this is where I said we're going to use some information about the domain. These aren't just arbitrary strings. They're not just random sequences of letters that are picked out of the air; they developed over time. They're word roots. They're suffixes. There are these prefixes that mean things, and looking at this little excerpt out of the middle, there's an awful lot of words that repeat portions of other words. And that may be part of what we can capitalize on. Instead of really recording "straightaway" and "straightaways," repeating those first 11 characters and adding an "s," is there some way that I can unify the data that I have here? It's the same way with code: when you have the same passage of code repeated two or three times, we're always saying unify it, move it into a helper and call it in three places. Can we do the same thing with our data? Can we share the parts of the words that are in common and only distinguish where they are different?
Take a look at this. This is a little portion of something called a trie. A trie is spelled t-r-i-e, and it's actually from "retrieval," but sadly, it's still pronounced "try," just to confuse you. And it is a variant of a tree. In this case it's a tree that has 26 children coming out of any particular node, one for each letter of the alphabet. And so starting from the root node at the top, all the words that begin with "a" are actually down the A branch, and then there's a B branch, and there'd be a C branch and a D branch. And so the idea is that the levels of the tree are actually like -- these are all the zeroth characters of the words. And then that second level is all the characters that follow that zeroth character. So here's the "ab" words and the "ac" words and the "ax" words. Over here are the "st" words, but there's not an "sx," so there are some places that actually aren't filled out in this tree. And as you keep going down further, the path you would trace through the tree is actually tracing through the characters in sequence that make words. And so in this form, tracing out a path through this tree from the root down to a particular node tells you about prefixes. So there are words that begin with "a" and words that begin with "ac" and "ace," and then there's actually a little notation there -- shown visually by the bold circle around it -- that says, "Oh, and the path that led here from the root is a word." So "a" is a word and "ace" is a word and "aces" is a word. "Act" is a word and so is "acted" and "actor." And in this case, the "act" part is totally shared, so everything that comes off of "act" -- "acting," "acted," "actor," "actors" -- can all share that initial part. And for a word like "act," it doesn't seem that profound. But you start thinking about words like "strawberry" or "straightforward" or "stratosphere" and realize that "stratospheric" and "strawberries" and "straightforwardly" can all leverage the other ten or 12 characters that were already invested in the root word, and that "straightjacket" and "straightforward" can even share "straight." So there's actually a lot of prefixes that can be unified across the space of the dictionary. So knowing that words have these patterns helps us to do this. So let's build that. So the first form of this -- and this, again, is a thought experiment. I did not actually write it this way because it would just be absolutely astronomically expensive. But it was a good way to kinda think about where I needed to go. I designed a new kinda trie node that had the letter and a little Boolean flag of "is this a word," so whether it had the dark circle around it. And then it had a 26-member children array, so pointers to other nodes like this one. So totally recursive, starting from that root, and then the idea is to use those 26 children -- with "a" being in the first slot and "z" in the last slot -- to make it really easy to kinda just trace letter by letter down the levels of the tree. So I'd have this one root node that pointed to the initial one, and then all the words that begin with "a" are off that first slot; all the "z" words off the last slot. So when I want to know if a word is in the dictionary, what do I have to do in this data structure? How do I find it? I want to find the word "far." Where do I look? Left? Right? Down? Which way? It would be great if you would tell me. Then I would understand what my data structure is. Student: [Inaudible] Exactly.
So at the very top I say okay, "f" is my first letter. Match my "f," get me to the F node, and now it says recursively find "ar" from here. So it's very recursive. It'll be like: find the first letter and then recursively find the rest of the string from here, working down. So find the "a" out of the next node, find an "r" out of the next node, and then when you get to that node, you'll check this little "is word" Boolean, and say okay, was that path I traced marked in the dictionary as leading to something that's known to be a word? The prefix search looks exactly the same, except for that last check of the is-word Boolean. So given that nodes exist in this -- because further down there, there are some words -- I just trace out a sequence of letters, "str," and as long as there's something there, then I know that further down there are subsequent children beneath it that lead to words. And adding a word does the same operation. If I need to add the word "strawberry," then I start looking for "s," "t," "r," and at some point, I find some places where the nodes don't exist. Then I start creating them from there on down. So it is tracing what exists and then adding new nodes off the bottom, and then marking the final one as yes, this is a word. What's interesting about this is that all three of these operations don't actually make any reference to how many words are in the dictionary. They are all dependent on the length of the word you're searching for or adding into the dictionary -- whether there's ten words in there, 1,000 words in there, or a million words in there. In no way does the big O depend on how big or how loaded the data structure is. That's kinda wacky. The longest word, according to the Oxford English Dictionary? There you go. You're not going to make that on the Boggle board anytime soon because it's only got 16 cubes. Even Big Boggle can't really make that puppy. But just so you know, 45 is not so many, right? And, in fact, actually, here's a little thought for you. Not only is it not dependent on the number of words, it actually has a little bit of an inverse relationship with the number of words, especially the add operation. As the data structure becomes more loaded, there's typically less work to do in add. So if "strawberry" is in there, and you go to add "strawberries," then actually you already have "strawberr" to depend on. You just need to build off the end of it. So the more words that are in there, the more likely that some of the nodes you need are already in place, and then you can just kinda blow past them and just add your new nodes. Now, that's odd. You think of most data structures as clearly taking more work to install new things as they get heavier, not less. Student: [Inaudible]. Well, the thing is, they're organized by letter in that 26-member array. So in fact, I don't even have to look. I just go straight to the slot. If I'm looking for "strawberry," I look for "s" and go straight to the "s" slot, then the "t" slot, then the "r" slot. So in fact, if the node was already in place, I'm not searching for it. It really just is exactly in the "r" slot. That's why there's 26 of them. Student: In each array? Each array is 26 and they're A through Z, and if there's an empty slot we're using a null. So in this case, it's optimized to make it super fast. I don't even have to look through the letters to find the one that matches. There's no matching. It just goes straight to the "r" slot, the "z" slot, the "s" slot.
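A sketch of that first-cut trie in C++ might look like the following. The names are mine, and I leave the letter out of the node itself since the slot position already implies it; the point is that every operation walks one node per character, with no searching inside a node.

```cpp
#include <cctype>
#include <string>

// Sketch of the first-cut trie node (my naming). The letter is implied by
// which of the 26 slots a child occupies, so only the flag is stored here.
struct TrieNode {
    bool isWord = false;           // the path from the root to here spells a word
    TrieNode* children[26] = {};   // slot 0 is 'a', slot 25 is 'z'; null means nothing that way
};

// Word and prefix lookup walk one node per character: O(length of the word),
// independent of how many words the trie holds.
bool contains(const TrieNode* node, const std::string& s, bool asPrefix) {
    for (char ch : s) {
        node = node->children[std::tolower(ch) - 'a'];   // straight to the slot, no searching
        if (node == nullptr) return false;
    }
    return asPrefix || node->isWord;   // a prefix search skips the final is-word check
}

// Add traces the existing path and creates any missing nodes off the bottom.
void add(TrieNode* node, const std::string& s) {
    for (char ch : s) {
        TrieNode*& child = node->children[std::tolower(ch) - 'a'];
        if (child == nullptr) child = new TrieNode;
        node = child;
    }
    node->isWord = true;   // mark the end of the path as a word
}
```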
We're going to see that this has a consequence we can't tolerate, but it is, at least, the starting point for thinking about how to build this data structure. So the space usage is where this thing really, really -- well, this is an extreme example of a space-time trade-off, right? We've got this thing that optimizes beautifully for O(length of the word). So that means typically eight operations, on average, are going to be needed to add or to search for something. The space we're using here is 106 bytes per node: that's the 26-member array of four-byte pointers, plus two bytes for the character and the Boolean that are up there. One hundred and six bytes is a lot of memory, and the tree has about a quarter of a million nodes. So given the 125,000 words that I said are in the input, if you build it into a trie, they take about 250,000 nodes. That's an interesting sort of number to think about, though. You have 125,000 words. They take 250,000 nodes. That tells you, actually, that kinda on average, each word has about two unique nodes, that even words like "straightforward" and "straightforwardly," on average, are sharing enough of their common prefix parts that there's very little unique data added by any particular word, averaged across the whole thing. That's pretty neat to know. However, this is just not going to fly. So here I was telling you that my main memory had four megabytes, maybe, on a good day, and now I've actually built a data structure that's going to require six times that in terms of storage. Okay, we gotta squeeze that down. So the first thing we can do is to realize that allocating a full 26-member array, of which I only plan on using some of the slots, is pretty wasteful. So instead of committing to 26 per node, I'll say, well, how about if I have an array that actually is tight to the number of children coming out of this node? So the very first node will have 26 -- there are words that start with every letter in the alphabet. But from there, it very quickly winnows down. You have the letter "x." You do not need the full 26 children coming off of "x." There's only a few letters that follow "x" in the English language that are going to need to be fleshed out. You need an "xa," you need an "xe," but you don't need "xj," you don't need "xk," a whole bunch of things like that. So I change the structure here: instead of being a 26-member array of node pointers, I make it a pointer to a dynamic array of node pointers, and I keep track of the number of children in a second field here, so the first node's array will be allocated to 26, another's will be allocated to eight. This is going to change a little bit of our running times, because now we won't be able to go exactly to the "s" slot. We'll have to go look for it. So on each step down the tree, it'll be looking through that node's array to find the right match. But it adds a constant factor, basically, on top of O(length of the word). So the space issue of this is -- we're just not storing all those nulls. And there were a lot of nulls. The node itself is now just ten bytes, and then there's some space in this dynamic array out there, which is the number of children times the four-byte pointer; on average, a node has six children across the 250,000 nodes in the data structure that's being built from this. So really, 26 was way overkill: about 20 of the slots were null for most of the nodes.
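A sketch of that slimmed-down node, with field names of my own choosing: the 26 fixed slots are replaced by a dynamically sized array plus a count, so stepping down the tree now means a short scan instead of direct indexing.

```cpp
// Sketch of the slimmed-down node (field names are mine): only the children
// that actually exist get a slot, at the cost of a small scan to find them.
struct TrieNode {
    char letter;           // which letter this node represents
    bool isWord;           // the path from the root to here is a word
    int numChildren;       // how many child pointers are in use
    TrieNode** children;   // dynamically allocated array of numChildren pointers
};

// Stepping down the tree is now a short linear scan rather than direct
// indexing: a constant-ish factor on top of O(length of the word).
TrieNode* findChild(const TrieNode* node, char ch) {
    for (int i = 0; i < node->numChildren; i++)
        if (node->children[i]->letter == ch) return node->children[i];
    return nullptr;
}
```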
And many of them, for example, at the very bottom of the tree, have no children at all. They're all nulls. You get to a word like "straightforwardly": there's nothing after that "y," so it has zero children coming out the back of that one. And so certainly having 26 null children pointing away from that was a waste that we can tighten up. So when we get it down to this, we'll have 44 bytes -- or 34 bytes, rather -- per node. Still a quarter million nodes. Still eight and a half megabytes. So we're not winning any prizes just yet. But we are making good progress. That got us down a factor of three or four right there. So I still have the same number of nodes. All I changed was the pointers to the subsequent nodes further down the tree. The kind of letters and how the letters connect to each other -- the connectivity -- is still pretty much the same. It's just that in any particular node, how many children does it have? I was storing a whole bunch of nulls and now I'm not storing those nulls. So the number of nodes is still exactly as it was before. I'm going to do one other little squishing thing on this data structure and then I'm going to flip gears for a second. So still using kinda my CS side of things: how can I squash things down? I'm going to use a technique that we used in the heap. So in the case of the binary heap that we implemented in the PQ, you'll remember that we drew pictures in the handout that showed a heap with, oh, I have a parent with two children, which has four grandchildren and so on. But we actually compressed that down into an array, and in the array there was this pattern of: here is the root node -- so we'll call this level zero -- and then the next two nodes are level one, and the next four nodes are level two and so on. So there's one here, two here, there's four here, there's eight here. And we kinda flattened it down by generation, so looking at what was at the top and what was underneath it. When we did that, we removed all the space for pointers. Although we drew those pictures of the heap as though there were pointers there, in fact, there were not. There were no pointers being stored in the heap in this version of the PQ. And so that gets us back to kind of array-like storage requirements, but by flattening it down into this array, and still kinda using the tree conceptually to work our way through it, we're getting the advantages of that traversal that's based on height of tree, not length of vector. So if we do the same thing with the trie, it's a little bit more complicated, because there were a couple of things about the heap that made this easier. A heap always has two children, so there were these rules about how the tree was filled in that meant you could always compute parent and child indexes using that kinda divide-by-two, multiply-by-two exchange operation. In this case, what we're going to have to do is just lay it down by generation, but we don't know how big a generation's gonna be, so we're actually gonna have to keep track of it manually. So the root node will have all of A through Z. So this will be level one, which is A through Z here, and then level two is -- there'll be a bunch of slots here, which are the A's followed by some letter, and then there'd be the B's followed by some letter, and then there'd be the C's followed by some letter, and so on. And then further down is level three, which are the three-character words. So these are all the single-character paths, two-character paths, three-character paths, kinda flattened down.
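Concretely, the flattened layout might look something like this sketch (my own field names; the next part of the lecture walks through the same idea with the actual index numbers).

```cpp
#include <string>
#include <vector>

// Sketch of the flattened, generation-by-generation layout (my field names):
// no child pointers at all, just indexes into one big array of nodes.
struct FlatNode {
    char letter;       // letter on the edge leading to this node
    bool isWord;       // the path from the root to here is a word
    int firstChild;    // index where this node's children start
    int numChildren;   // how many consecutive entries belong to it
};

// Walk the array one character at a time, scanning each node's contiguous
// block of children for the matching letter.
bool containsWord(const std::vector<FlatNode>& nodes, const std::string& word) {
    int cur = 0;                                   // index 0 is the root
    for (char ch : word) {
        int next = -1;
        for (int i = 0; i < nodes[cur].numChildren; i++) {
            int child = nodes[cur].firstChild + i;
            if (nodes[child].letter == ch) { next = child; break; }
        }
        if (next == -1) return false;              // no child carries that letter
        cur = next;
    }
    return nodes[cur].isWord;
}
```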
And so if I look at what I changed here, my first node -- the root node -- says it has no letter, it's not a word, and its children begin at index one, the next location. So it's not a multiply-by-two-and-add-one or something like that. It's: okay, here's where my children begin. They're at slot one in the array. And then there are 26 of them. So that means that A, B, C, D are kinda in order there. If I look at A, "a" is a word, and it says its children begin at index 27 and there are 12 of them, so there are 12 characters out of the 26 that can follow an "a" in the sequence of a valid word. And then B's children begin at index 39, which is 27 plus 12. So that's where A's children are, sitting here at index 27 to 38, and then 39 to -- I think B has eight or so children -- and C and so on. So we flattened it all down by generation into an array. That means all the space for pointers has now been removed. One serious consequence of this, though, is that the pointers got us flexibility. Insert and remove depended on those pointers really being there, pointers buying us the ability to insert and delete without doing a shuffle. Now, once I have flattened it, I have suddenly bought back the same problems that arrays have: if I have the words here that have "ab" for abalone, and "ac" for act, and I don't happen to have any words with "ad," and I try to add the word "add," then in order to put "ad" in there, we're going to have to move everything over, and there's going to be a lot of shuffling and whatnot. So I'm starting to build a data structure that's losing some flexibility but saving space. So I'm, in some sense, focusing my attention on how I can build a data structure that's really fast to search, even if it's a little harder to build. Because it turns out the thing I most care about is making sure it can actually very, very efficiently look up words. Maybe building it once, being a little slow, isn't so painful. So the space on this gets us down to just ten bytes per node. All the pointers are gone. So there were 24 bytes of pointers that we've just eliminated. And that means we've got a two-and-a-half-megabyte memory footprint right now. So we're kinda getting close to hitting the range of the binary search tree and hash table. Now I'm going to go back to the domain again. So I worked on it kinda from the CS side, thinking about things I know about data structures and reorganizing arrays and heaps and pointers. I'm going to go back and look at that domain again and see if there's something else I can kinda take advantage of. So this is a different cross-section of the dictionary showing the words "adapted," "adaptor," "adaptors," "adapting," "adaption," "adaptions," "adapts." So eight words that are all built off the "adapt" root with a bunch of suffixes. I look over a little bit further, I've got "desert," "deserted," "deserter," "deserting," "desertings," "desertion," "desertions," "deserts." Okay, and "detect" and "insert" and "perfect" and "invent" and "interrupt," all verbs to which you can apply the past tense and the gerund form and the "ion"s that tack onto that root word to give you these suffixes. And each of these words that has shown up here has exactly the same set of suffixes that can be applied to it. So I did this thing about prefixes.
I went out of my way to make it so all these words could share that same prefix: "assert," "asserter," "assertings," "asserting." But is there some way that I could take this idea with "corrupt" and "assert" and "insert," and then once I get down to that "t," where they end, is there a way they can also share their back-end parts? Those "ing"s and "ion"s -- can they be unified, as well? Well, why not? Why not? So here is the first version of what's going to be the DAWG, the Directed Acyclic Word Graph. It takes the idea of unifying the prefixes and now applies it to the suffixes. There are a lot of words that end in the same letters, as well as start in the same letters. Why not just apply the same kind of unification and sharing on the back end? So here comes in "adapt" and "adopt" and "baste" and "port," and out the back end is "ports" and "portion" and "porting" and "porter" and "adopter" and "adopted" and "basting" and "bastion," all sharing the endings, because they have the same suffixes that are viable and the same connections between them. So they're all sharing the "er," and the "er" leading in to the "s." And that same "s," for example, is being shared from "portion" and "baste" and "adopt" and "adopter," all coming into the same ending of: oh, these things that end in "s" are words. Here's an "s" that is a word ending for them to share. So it is directed, and it is a graph. Once we remove the idea that there's a single path from the root to any node, we're no longer a tree structure. We can get to that "t" by going through "adapt" or "adopt" or "port," and so it no longer fits the definition of a tree. It's become a graph. We can get to these nodes from different places. The arcs go one way in this case; we can't go back up the graph to find the words in reverse. And there are no cycles -- so acyclic, right, you can't just loop around some portion of it as many times as you want. And there are certain nodes that are marked, in this case, the same way they were in the trie: the path from the root node to here does represent a word. So building this is a little bit complicated. I'll give you a little bit of the intuition for it, and I will not get too deep into it. But largely the way you do it is you build the trie, so that it doesn't have the unification of the suffixes; it only unifies the prefixes. And then you work backwards on the suffixes. So in some sense you look at all the words that end, for example, in "s" in the dictionary. And you find all the ones that just end in one "s" that goes nowhere, and you say look, this "s" looks like every other "s" in this thing -- all these words that are just plurals or verbs that can have an "s" tacked on. Let's share that "s." And so you take however many s's there were, 1,000 of them, and you say these are all the same "s." And then what you do is you look at all the nodes that lead into that "s." So you're kinda traversing the graph in reverse, and you say, well look, here's a bunch of nodes that lead in on an "e." That "e" might or might not itself be a word -- there might be some group where the "e" is also a word, like the word "apple" and "apples," but sometimes the "e" is not a stopping place; let's say for "parties" it wasn't. So all the ones where that "e" was a word can unify on one "e," and all the ones where it wasn't can unify on another. So you just work your way backwards from there.
So you say: what letters led into that "e"? And so, kinda from the end of the words backwards, you're looking for all the ones that can end in an "s," and end in an "e," and end in an "ies," and you unify them up, making sure that they have all the same outgoing connections. So for example, "running" has an "ing" at the end. So does "swimming," but there's "swimmingly" and there's not "runningly." So you have to be careful about unifying things: they have to have all identical properties from this point down to the bottom of the tree to be unifiable. But it's a pretty neat process. It's actually what's called DFA minimization, so if you want to learn a little bit about it, that's the term to look up. And so if I do that, and I build this as a DAWG, I'm still using the same structure definition here, but what I am doing is unifying suffixes as well as prefixes, and that, actually, is gonna change the number of nodes that I need drastically. It goes down to 80,000 from my initial 250,000. And if you think about the number of words again -- there were 125,000 words in the dictionary -- the DAWG has just 80,000 nodes. So it means, actually, once you start unifying the suffixes, many of the words don't even have nodes of their own that are unique to them whatsoever. They just exist kinda as portions of other words that have all been joined together and compressed into one package. So 80,000 nodes total, about two-thirds of the number of words. So we have ten-byte nodes and 80,000 of them. We are down to under a megabyte. So we've crossed that little interesting barrier: the data that was fed into it, those words on disk, was over a megabyte, and now, actually, we have created a representation that actually is compressed, that uses less space than the text file did in the first place. And that's really because it is finding all these ways in which words are like each other and sharing, making use of the existing structure in the DAWG to add new words without adding bulk. Now, once I've done this, I have to say that I've made the data structure even less flexible. So to be clear, each of these steps has a cost somewhere else that made the code more complicated. It also made it so that now that I have all this sharing back here, when I'm adding a new word -- so I go back to look at this picture here -- I go in to add a word and I say, oh, actually, the word "portioning" is a word. Well, "adoptioning" is not. So if I go in to add the word "portioning," I actually have to split it off of its existing unified branch, because if I just tacked the "ing" on the end of the shared "portion" ending there, it turns out that "adaptioning" and "adoptioning" would suddenly become words, too. So I actually have to be extra careful, once I've got it unified this way, that when I add new words I don't accidentally add some additional words that I didn't plan on, because these paths have been shared. So it requires a little bit more careful management. Again, the goal here, in this case, was to try to build a data structure that was kinda freeze-dried. I build it off the input data file I had, and then it operates very, very efficiently, and I wasn't planning on making a lot of runtime changes to it.
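Here's a toy sketch of that unification step, just to show the flavor of the merging; the real construction is the DFA minimization just mentioned, and this version ignores details like freeing the duplicate nodes. Two nodes can be shared when they have the same is-word flag and identical outgoing edges to already-unified children, so we canonicalize bottom-up with a signature map.

```cpp
#include <cstdint>
#include <map>
#include <string>

struct TrieNode {
    bool isWord = false;
    TrieNode* children[26] = {};
};

// Post-order canonicalization: two subtrees are interchangeable when they have
// the same is-word flag and identical (letter -> already-unified child) edges.
// The registry maps a signature string to the one node kept for that shape.
// (Duplicates are simply abandoned here; a real version would reclaim them.)
TrieNode* unify(TrieNode* node, std::map<std::string, TrieNode*>& registry) {
    if (node == nullptr) return nullptr;
    std::string signature = node->isWord ? "W" : "-";
    for (int i = 0; i < 26; i++) {
        node->children[i] = unify(node->children[i], registry);  // children first
        if (node->children[i] != nullptr) {
            signature += static_cast<char>('a' + i);             // which letter
            signature += std::to_string(                         // which shared child
                reinterpret_cast<std::uintptr_t>(node->children[i]));
            signature += ',';
        }
    }
    auto it = registry.find(signature);
    if (it != registry.end()) return it->second;   // an equivalent node already exists
    registry[signature] = node;                    // this node becomes the shared one
    return node;
}
```

Calling `root = unify(root, registry);` on a freshly built trie collapses identical suffix subtrees into shared nodes, which is what drops the node count so dramatically.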
I get it down to this, and then the last thing I'm going to do is go back to kinda the CS side of it and see if I can squeeze it down in terms of just the data structure, what's being stored per node. So I have 80,000 of these. Typically, when you look at a structure, it's very rare that changing something from an int to a char -- a two-byte thing into a one-byte thing -- is going to have any profound impact on the total size needed. But if you happen to know you're going to make a lot of these structures, tens of thousands, almost 100,000 of them, then it turns out every byte counts. If I can get this thing down from, in this case, ten bytes to eight bytes, or to six bytes or four bytes, then it will cut my memory by quite a factor. So what I did in the final version of it is I used a little technique called bit packing, which is to actually stash stuff in the absolute minimum possible space that I can fit it into, taking advantage of things like, for example, the fact that there are only 26 letters in the alphabet. A character is actually a full eight-bit value, which can hold numbers from zero to 255, and I just don't need that much space. I'm not using the yen sign, I'm not using the parentheses, I'm not using the digits. So having, in some sense, the capacity to store all those distinct characters is actually a cost I'm paying on every character that I don't need to. So in fact, I squeezed it down by dividing things up at the sub-byte level. This is something that's talked about a little bit in Chapter 15 if you're at all curious. It's not what I consider core material for 106B, but I'm just sorta mentioning it to kinda complete the picture. With five bits -- given that a bit is either zero or one, if I have five of them connected together, then I have 2^5 different patterns that I can represent in that amount of capacity, and that's 32 different patterns, which is enough for my characters, plus a little slack. I use exactly one bit for is-word, rather than a full Boolean. I also changed the way I marked the last child of a generation: rather than having A say, "there are 12 children here," I just said, go to index 27 and start reading, and one of them will tell you it's the last. So each of them says, "No, I'm not the last," "No, I'm not the last," and then when you get to the last one, it will be marked as "yes," and that's how you'll know where A's children end and the next one's begin. And so in this way I've taken what was a ten-byte structure down to four. And it turns out there's often a reason why trying to get it down to a multiple of two is actually a critical barrier: if it were five bytes, it's actually likely to be eight, just because of what's called alignment and padding restrictions. So if I were at seven and I squeezed down to six, it might not make any change. And if I squeezed down to five, it still might not make any change. It might be, actually, artificially decided that each of the structures is going to be eight behind the scenes. But getting it down to four actually tends to be another one of those critical barriers where, okay, four actually will be smaller than something that was ten. And so I take my 80,000 nodes, I get them to be four bytes each, and then I now have a data structure that's about a third of a megabyte in memory, as opposed to the over a megabyte on disk in the text form. And so that's how it does what it does.
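One way to sketch that packing in C++ is with bit-fields; the exact field widths here are my guess at the flavor of it rather than the shipped lexicon's precise layout, but they show how 5 + 1 + 1 + 25 bits land in a single four-byte word.

```cpp
#include <cstdint>

// Sketch of the packed node using C++ bit-fields; field widths are assumptions
// for illustration, not necessarily the shipped lexicon's exact layout.
struct PackedNode {
    uint32_t letter      : 5;   // 2^5 = 32 patterns, enough for 'a'..'z' plus slack
    uint32_t isWord      : 1;   // the path from the root to here is a word
    uint32_t isLastChild : 1;   // last entry in its parent's run of children
    uint32_t firstChild  : 25;  // index of the first child; 2^25 easily covers 80,000 nodes
};

// On typical compilers the 32 bits of fields pack into one four-byte word.
static_assert(sizeof(PackedNode) == 4, "expected the node to pack into four bytes");
```

That four-byte size is also what keeps alignment padding from quietly undoing the savings.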
So if you're going to look at the code, you'll see that there is this complicated bit-packed structure, and that it does lookups using a recursive character-by-character strategy: start at the top, find the matching character, then go to the children index, and then work your way down the children to find the next matching character, and so on to find its way through the data structure. So [inaudible] is the child index -- for example, the idea that A says my children start at 27 or B's start at 39 -- and so I just need a number there, but I'm using something slightly smaller than an integer, just to make them all fit very tightly. So I'll tell you a couple of interesting things about it, actually, before I close this off, which is that the lexicon.dat file actually just is an in-memory representation of what the DAWG data structure looks like. And so in fact, it's very easy to take the whole data structure -- it doesn't have any pointers in it. Pointers, actually, tend to make data structures harder to rebuild and to dehydrate to disk, because they have to kinda be wired up in memory. You don't know what memory you're going to get from new and what the addresses are. This one, actually, doesn't have any pointers. Everything is actually kinda self-contained. It's just one big array of these blocks. And all the information in there is relative: it says go to index 27, index 45. So you can take that whole block and just write it to disk, 300-some kilobytes' worth, read it back in, and it kinda is rehydrated without any extra effort. So that's actually what the lexicon.dat file is: it's just a dehydrated form of the DAWG having been built in memory. That makes it very, very fast to load. At that point, though, it's super inflexible. If you want to add a word to it, the way the structure is built, you'd have to kinda add nodes and make sure it wasn't glomming on to some existing words that were nearby, and so it would be a lot of work to actually edit the DAWG in place. In fact, when you ask it to add a word, it doesn't edit that one at all. You say, I'd like to put a new word in. It says, I'm just -- this one's so perfect the way it is. I've built this one, I love this one, I'm very happy with it. If you want to put some other words into it, it just keeps a second data structure on the side, an auxiliary one over here, where it sticks the additional words you ask it to put in. I have to say that seems like a very obvious thing to do, but in fact, I didn't even think of that for years. We had the lexicon. I said you can't use the lexicon to make new word lists or edit them. You can only load this one, beautiful, perfect one. You can't do anything else. And at some point people kept saying, "I wish I could put some words in the lexicon. I wish I could use it for the word list to keep track of the human and computer players." I'm like, "Well, you just can't do it. You don't understand. It's very complicated. You just don't understand my pain." Whatever. Whatever. I had excuses. And then, really, after some number of years of just being a baby about it, I'm like, oh, of course, why don't I just keep another structure off to the side? It's an abstraction. It's a word list. You don't need to know what I do. The fact that I keep a DAWG for some of the words and just an ordinary set for some of the other words is really completely irrelevant.
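A sketch of both of those ideas, the pointer-free dehydrate/rehydrate and the side structure for added words, might look like this (my names throughout; dawgContains stands in for the packed-array walk sketched earlier and is assumed, not shown).

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <set>
#include <string>
#include <vector>

using PackedNode = uint32_t;   // each element is one packed node, as sketched above

// Assumed helper (not shown): walks the packed array the same way the earlier
// flattened-trie sketch did, decoding letters and child indexes from the bits.
bool dawgContains(const std::vector<PackedNode>& dawg, const std::string& word);

// Dehydrate: no pointers, only relative indexes, so the raw bytes are the file.
void save(const std::vector<PackedNode>& nodes, const std::string& filename) {
    std::ofstream out(filename, std::ios::binary);
    out.write(reinterpret_cast<const char*>(nodes.data()),
              nodes.size() * sizeof(PackedNode));
}

// Rehydrate: read the bytes straight back in; no pointer fix-up is needed.
std::vector<PackedNode> load(const std::string& filename) {
    std::ifstream in(filename, std::ios::binary | std::ios::ate);
    std::size_t fileBytes = static_cast<std::size_t>(in.tellg());
    std::vector<PackedNode> nodes(fileBytes / sizeof(PackedNode));
    in.seekg(0);
    in.read(reinterpret_cast<char*>(nodes.data()), fileBytes);
    return nodes;
}

// The frozen DAWG never changes; words added at runtime live in a plain set
// on the side, and a lookup simply consults both.
bool containsWord(const std::vector<PackedNode>& dawg,
                  const std::set<std::string>& extraWords,
                  const std::string& word) {
    return dawgContains(dawg, word) || extraWords.count(word) > 0;
}
```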
And as long as, when you ask for words, I look in both and I tell you, "Yes, it's there," why do you need to know where I keep it? Okay, so it took me a while to buy into abstraction, even though I teach it all the time. It just wasn't totally making sense. So a neat little thing, just kinda keeping both behind the abstraction. And it turns out that I learned about this, actually, from a paper that I read in the early '90s about somebody building a Scrabble player, and they wanted a dictionary that supported kinda finding plays on the board and being able to extend words. And so knowing things about root words and how they extend and prefixes and suffixes was kinda the original motivation. And this kind of data structure is actually really great for any kind of word playing. So if you have anagrams that you want to rearrange or words that you want to extend by suffix and prefix, this kind of data structure is really optimized for doing that kind of traversal and kinda leveraging the existing patterns in words to save space while doing so. So it was a really fun little project to do. It was kinda neat to think about. Even though today, if you needed a lexicon, you could just use a set. It's fast enough. It takes a ton of memory, but you know what, memory's cheap. So this is not the kind of thing you would need to do today except when you were in some kinda embedded, low-memory, very, very tight processor environment, but it still makes for kinda an interesting artifact to think about: how you get there and what kind of things you can use. So Wednesday we will talk about some big-picture design, some kinda wrap-up stuff, and I will see you then.