So in the kickoff post of my series on data structures and algorithms I'd like to begin with a relatively simple but handy little data structure: the trie. If you want to jump ahead and look at a very simplistic implementation of a trie data structure (only the insert and dump operations have been completed), I've put my experimental code up on GitHub here.
A clever little play on the word re*trie*val (though I, and many others, insist on pronouncing it "try"… suck it etymology), a trie is a key-value store represented as an n-ary tree, except that unlike a typical key-value store, no one node stores the key itself. Instead, the path to the value within the tree is defined by the characters/bits/what-have-you that define the key itself. Yeah, that's pretty abstract, why don't we just look at an example:
In this construction I've chosen the following set of keys:
As you can see, each character in the key is used to label an edge in the tree, while the nodes store the values associated with that key (note, in this example I've chosen to use the keys as values as well… this is entirely artificial, and a bit confusing. Just remember, those values could be absolutely anything.)¹ Typically these keys are strings, as depicted here, although it's entirely possible to build a bit-wise trie that can be keyed off of arbitrary strings of bits. To find the value for a key, you take each character and, starting with the root node, transition through the graph until the target node is found. Or, as pseudo-code:
find(root_node, key):
    current_node = root_node
    current_key = key

    while (current_key.length > 0):
        # consume the next character of the key
        character = current_key[0]
        current_key = current_key[1:]

        if current_node.has_edge_for(character):
            current_node = current_node.get_edge_for(character).endpoint
        else
            throw "ERMAGERD"

    return current_node.value
Strangely, a very similar algorithm can be used for both inserts and deletes.
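To make that concrete, here's a minimal Python sketch of insert and delete sitting next to find. This isn't the code from my GitHub repo; the names (TrieNode, edges, has_value) are made up for illustration, and I'm assuming each node keeps its edges in a plain dict. Notice that all three operations are the same walk down the tree; they only differ in what happens when an edge is missing and what they do on arrival.

```python
class TrieNode:
    def __init__(self):
        self.edges = {}         # character -> child TrieNode
        self.value = None
        self.has_value = False  # distinguishes "no value" from a stored None


def insert(root, key, value):
    node = root
    for character in key:
        # Same walk as find, except a missing edge is created, not fatal.
        if character not in node.edges:
            node.edges[character] = TrieNode()
        node = node.edges[character]
    node.value = value
    node.has_value = True


def find(root, key):
    node = root
    for character in key:
        if character not in node.edges:
            raise KeyError(key)
        node = node.edges[character]
    if not node.has_value:
        raise KeyError(key)
    return node.value


def delete(root, key):
    # Same walk again, remembering the path so dead branches can be pruned.
    path = []
    node = root
    for character in key:
        if character not in node.edges:
            raise KeyError(key)
        path.append((node, character))
        node = node.edges[character]
    if not node.has_value:
        raise KeyError(key)
    node.value = None
    node.has_value = False
    # Walk back up, removing nodes that no longer lead to any value.
    for parent, character in reversed(path):
        child = parent.edges[character]
        if child.edges or child.has_value:
            break
        del parent.edges[character]
```

The pruning loop in delete is optional; without it the trie still behaves correctly, it just holds on to empty branches.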
Some Interesting Properties
The trie offers a number of interesting advantages over traditional key-value stores such as hash tables and binary search trees:
- As mentioned previously, they have the peculiar feature that inserts, deletes, and lookups use very similar codepaths, and thus have very similar performance characteristics. As such, in applications where these operations are performed with equal frequency, the trie can provide better overall performance than other more traditional key-value stores.
- Lookup performance is a function of key length as opposed to key distribution or dataset size. As such, for lookups they often outperform both hash tables and BSTs.
- They are quite space efficient for short keys, as key prefixes are shared between edges, resulting in compression of the graph.
- They enable longest-prefix matching. Given a candidate key, a trie can be used to perform a closest fit search with the same performance as an exact search.
- Pre-order traversal of the graph, visiting each node's edges in sorted order, generates an ordered list of the keys (in fact, this implementation is a form of radix sort).
- Unlike hashes, there's no need to design a hash function, and collisions can only occur if identical keys are inserted multiple times.
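Two of those properties, longest-prefix matching and ordered traversal, are easy to sketch in Python. As before, this is an illustration under my own assumptions rather than a reference implementation: TrieNode, longest_prefix and sorted_keys are names I've invented here, and edges are stored in a dict, so the traversal has to sort them at each node.

```python
class TrieNode:
    def __init__(self):
        self.edges = {}         # character -> child TrieNode
        self.value = None
        self.has_value = False


def insert(root, key, value):
    node = root
    for character in key:
        if character not in node.edges:
            node.edges[character] = TrieNode()
        node = node.edges[character]
    node.value = value
    node.has_value = True


def longest_prefix(root, candidate):
    """Return (prefix, value) for the longest stored key that is a
    prefix of candidate, or None if no stored key matches."""
    node = root
    best = ("", node.value) if node.has_value else None
    for depth, character in enumerate(candidate, start=1):
        node = node.edges.get(character)
        if node is None:
            break  # ran off the trie; keep the best match so far
        if node.has_value:
            best = (candidate[:depth], node.value)
    return best


def sorted_keys(node, prefix=""):
    """Pre-order traversal, visiting edges in sorted order."""
    if node.has_value:
        yield prefix
    for character in sorted(node.edges):
        yield from sorted_keys(node.edges[character], prefix + character)


root = TrieNode()
for word in ["to", "tea", "ten", "inn", "in"]:
    insert(root, word, word.upper())

print(longest_prefix(root, "tennis"))  # ('ten', 'TEN')
print(list(sorted_keys(root)))         # ['in', 'inn', 'tea', 'ten', 'to']
```

The longest-prefix search does the same amount of work as an exact lookup: one step per character of the candidate key, stopping early if it falls off the tree.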
Because tries are well-suited to fuzzy matching algorithms, they often see use in spell checking implementations or other areas involving fuzzy matching against a dictionary. In addition, the trie forms the core of Radix/PATRICIA and Suffix Trees, both of which are interesting enough to warrant separate posts of their own. Stay tuned!
1. Interestingly, if you looked at this example graph, you'd be forgiven for assuming it was an illustration of a finite state machine, with the characters in the key triggering transitions to deeper levels of the graph.
So, generally speaking, I've typically adhered to the rule that those who develop software should be aware of various classes of algorithms and data structures, but should avoid implementing them if at all possible. The reasoning here is pretty simple, and I think pretty common:
- You're reinventing the wheel. Stop that, we have enough wheels.
- You're probably reinventing it badly.
So just go find yourself the appropriate wheel to solve your problem and move on.
Ah, but there's a gotcha, here: Speaking for myself, I never truly understand an algorithm or data structure, both theoretically (i.e., how it works in the abstract, complexity, etc.) and practically (i.e., how you'd actually implement the thing) until I try to implement it. After all, these things in the abstract can be tricky to grok, and when actually implemented you discover there are all kinds of details and edge cases that you need to deal with.
Now, I've spent a lot of my free time learning about programming languages (the tools of our trade that we use to express our ideas), and about software architecture and design, the "blueprints", if you will. But if languages are the tools and the architecture and design are the blueprints, algorithms and data structures are akin to the templates carpenters use for building doors, windows, etc. That is, they provide a general framework for solving various classes of problems that we as developers encounter day-to-day.
And, like a framer, day-to-day we may very well make use of various prefabbed components to get our jobs done more quickly and efficiently. But without understanding how and why those components are built the way they are, it can be very easy to misuse or abuse them. Plus, it can't hurt if, when someone comes along and asks you to show off your mad skillz, you can demonstrate your ability to build one of those components from scratch.
Consequently, I plan to kick off a round of posts wherein I explore various interesting algorithms and data structures that happen to catch my attention. So far I have a couple on the list that look interesting, either because I don't know them, or because it's been so long that I've forgotten them…
- Skip list
- Fibonacci heap
- Red-Black tree
- Radix/PATRICIA Tries
- Suffix Tries
- Bloom filter
- Various streaming algorithms (computations over read-once streams of data):
- Heavy hitters (finding elements that appear more often than a prescribed frequency)
- Counting distinct elements
- Computing entropy
- Topological sort
And I guarantee there's more that belong on this list, but this is just an initial roadmap… assuming I follow through, anyway.
Using Git to push changes upstream to servers is incredibly handy. In essence, you set up a bare repository on the target server, configure git to use the production application path as the git working directory, and then set up hooks to automatically update the working directory when changes are pushed into the repository. The result is dead easy code deployment, as you can simply push from your repository to the remote on the server.
But making this work when the Git repository is being hosted on Windows is a bit tricky. Normally ssh is the default transport for git, but making that work on Windows is an enormous pain. As such, this little writeup assumes the use of HTTP as the transport protocol.
So, first up we need to install a couple components:
Note: When installing msysgit, make sure to select the option that installs git in your path! After installation the system path should include the following¹:
C:\Program Files\Git\cmd;C:\Program Files\Git\bin;C:\Program Files\Git\libexec\git-core
Now, in addition, we'll be using git-http-backend to serve up our repository, and it turns out the msysgit installation of this tool is broken such that one of its required DLLs is not in the directory where it's installed. As such, you need to copy:
Once you have the software installed, create your bare repository by firing up Git Bash and running something like:
$ mkdir -p /c/git/project.git
$ cd /c/git/project.git
$ git init --bare
$ git config core.worktree c:/path/to/webroot
$ git config http.receivepack true
$ touch git-daemon-export-ok
Those last three commands are vital and will ensure that we can push to the repository, and that the repository uses our web root as the working tree.
Next up, add the following lines to your httpd.conf:
SetEnv GIT_PROJECT_ROOT c:/git/
ScriptAlias /git/ "C:/Program Files/Git/libexec/git-core/git-http-backend.exe/"
<Directory "C:/Program Files/Git/libexec/git-core/">
    Options +ExecCGI +FollowSymLinks
    Allow From All
</Directory>
Note, I've omitted any security, here. You'll probably want to enable some form of HTTP authentication.
In addition, in order to make hooks work, you need to reconfigure the Apache daemon to run as a normal user. Obviously this user should have permissions to read from/write to the git repository folder and web root.
Oh, and last but not least, don't forget to restart Apache at this point.
Pushing the Base Repository
So, we now have our repository exposed, let's try to push to it. Assuming you have an already established repository ready to go and it's our master branch we want to publish, we just need to do a:
git remote add server http://myserver/git/project.git
git push server master
In theory, anyway.
Note: After the initial push, in at least one instance I've found that "logs/refs" wasn't present in the server bare repository. This breaks, among other things, git stash. To remedy this I simply created that folder manually.
Lastly, you can pop over to your server, fire up Git Bash, and:
$ cd /c/git/project.git
$ git checkout master
So, about those hooks. I use two, one that triggers before a new update comes to stash any local changes, and then another after a pack is applied to update the working tree and then unstash those local changes. The first is a pre-receive hook:
cd `git config --get core.worktree`
git stash save --include-untracked
The second is a post-update hook:
cd `git config --get core.worktree`
git checkout -f
git reset --hard HEAD
git stash pop
Obviously you can do whatever you want, here. This is just something I slapped together for a test server I was working with.
1. Obviously any paths, here, would need to be tweaked on a 64-bit server with a 32-bit Git.
So, out of a certain level of idle curiosity, a few months back I decided to contact my community league¹ to find out what would be involved in getting a community garden started in my area. Community gardens are, to me, an intriguing concept: get access to some land (either city property or donated private property), get members of the local community together, and then grow food! Of course, it's particularly interesting to me as a guy who's always lived in a small house with little to no room for a garden, leaving a community garden as the only option I'd have to get access to a decent sized plot of land. And I suspect, deep down, I'm actually a closet hippy yearning for a commune…
Of course, there's no shortage of community gardens in the city, but gaining access to them can be tough, and none are particularly close to my home. Meanwhile, I live along a rather large hydro corridor, which means a ton of seemingly under-utilized greenspace, in a neighbourhood dominated by small homes with tiny yards, or high density residential in the form of three-story condo blocks which, needless to say, have no yard at all. So it would seem like the kind of area where a community garden would flourish.
And so I emailed my local community league, and then promptly put the whole idea out of my mind. I tend to have a short attention span like that. So colour me surprised when a few weeks later I received a reply from the current league webmaster indicating that she'd be very happy to bring the idea to the league board… she just had one question: would I be willing to take point on this project?
And it may be totally crazy, but… I said yes. So, she'll be bringing the topic up to the board this week, and all signs indicate that they'll provide their support, which means the ball may actually start rolling on this.
1. Fun fact: community leagues in Edmonton are quite powerful compared to similar organizations in other cities (Edmonton was also the first city in Canada to adopt these kinds of organizations). If you want to have an influence on politics in your area, the two most important things you could possibly do are a) vote for your city councillor, and b) get involved in your community league, as they typically handle park development (including skating rinks, playgrounds, and so forth), manage local community programs, and get involved in land use and transportation issues.
So, a couple years back I started doing some subcontracting work for a buddy of mine who runs a little ColdFusion consultancy. As part of that work, I took ownership of one of the projects another sub had built for one of his clients, and the experience has been… interesting.
See, like PHP and Perl, ColdFusion has the wonderful property of making it very easy for middling developers to write truly awful code that, ultimately, gets the job done. And so it is with this project. My predecessor was, to be complimentary, one of those middling developers. The codebase, itself, is a total mess. Like, if there was a digital version of Hoarders, this code might be on it. But, it does get the job done, and ultimately, when it comes to customers, that's what matters (well, until the bugs start rolling in).
Of course, as a self-respecting(-ish) developer, this is a nightmare. In the beginning, I dreaded modifying the code. Duplication is rampant, meaning a fix in one place may need to be done in many. Side effects are ubiquitous, so it's difficult to predict the results of a change. Even simple things like consistent indentation are nowhere to be found. And don't even dream of anything like automated regression tests.
Worse, feeling no ownership of the code, my strategy was to minimally disturb the code as it existed while implementing new features or bug fixes, which meant the status quo remained. Fortunately, around a year ago I finally got over this last hump and made the decision to gradually start modernizing the code. And that's where things got fun.
One of the biggest problems with this code is that data access and business logic are littered throughout the code, with absolutely no separation between data and views. And, remember, it's duplicated. Often. So the first order of business? Build a real data access layer, and do it such that the new code could live beside the old. Of course, this last requirement was fairly easy since there was no pre-existing data access layer to live beside…
So, in the last year, I've built at least a dozen CFCs that, slowly but surely, are beginning to encompass large portions of the (thankfully fairly simple) data model and attendant business logic. Then, as I've implemented new features or fixed bugs, I've migrated old business logic into the new data access layer and then updated old code to use the new object layer. Gradually, the old code is eroding away. Very gradually.
Finally, after a year of this, after chipping away and chipping away, finally, while there's still loads of legacy code kicking around (including a surprising amount of simply dead code… apparently my predecessor didn't understand how version control systems work--if you want to remove code, remove it, don't comment it out!), the tide is slowly starting to turn. More and more often, bugs that need to be fixed are getting fixed in one place. New features are able to leverage the object layer, cutting down development time and bugs. And some major new features coming down the pipe will be substantially easier to build with this new infrastructure in place. It's really incredibly satisfying, in a god-damn-this-is-how-it-should-be sort of way.
The funny thing is, this kind of approach goes very much against my natural instincts. Conservative by nature, I'm often the last person to start rewriting code. However, if there's one thing this project has taught me (along with a couple wonderfully excited, eager co-workers), it's that sometimes you really do have to gut the basement to fix the cracks in the foundation. And sometimes, you just gotta tear the whole house down.
Well, yet another long blogging hiatus. So what's so important that I would take the time to author yet another scintillating installment? Why, a knitting project, of course!
Some good friends of ours are expecting, and as I often do, another baby blanket is thus in the queue. This one, however, is a bit unique, in that the mother has a very specific, and I think fairly awesome, request: she wants a Tux blanket.
Of course, with some video game knitting experience behind me, throwing together a pattern for this is pretty straightforward:
- Knit a swatch to determine row and stitch gauge. This is really important, as this determines our pixel aspect ratio.
- Based on desired measurements, calculate the size of our canvas by multiplying the stitch and row gauges by the target width and height respectively (in this case, 36 x 48 inches for 162 x 288 pixels).
- Find decent source image.
- Scale image to fit into desired canvas and layout, making sure to take into account our aspect ratio!
- Apply "posterize" filter to limit to the number of colours in our knitting palette.
- Scale back up by 6-8 times, and use the Gimp's grid generator plugin to transform into a pattern.
Then, I just split the image into three pieces and printed out in portrait mode. Voila! Pattern complete!
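For the curious, the canvas arithmetic from the steps above looks something like this in Python. The gauge numbers (4.5 stitches and 6 rows per inch) are assumptions I've back-calculated from the 36 x 48 inch, 162 x 288 pixel figures rather than measurements from the actual swatch, and the function names are just for illustration.

```python
def canvas_size(width_in, height_in, stitches_per_in, rows_per_in):
    """Canvas size in knitted 'pixels': (stitches across, rows tall)."""
    stitches = round(width_in * stitches_per_in)
    rows = round(height_in * rows_per_in)
    return stitches, rows


def pixel_aspect_ratio(stitches_per_in, rows_per_in):
    """Width of one stitch divided by the height of one row."""
    return rows_per_in / stitches_per_in


def rows_for_image(stitches, src_width, src_height, aspect_ratio):
    """Rows needed so a square-pixel source image keeps its proportions
    when it is knitted 'stitches' wide."""
    return round(stitches * (src_height / src_width) * aspect_ratio)


# Assumed gauge: 4.5 stitches/inch and 6 rows/inch (hypothetical values).
STITCH_GAUGE = 4.5
ROW_GAUGE = 6

print(canvas_size(36, 48, STITCH_GAUGE, ROW_GAUGE))  # (162, 288)

par = pixel_aspect_ratio(STITCH_GAUGE, ROW_GAUGE)
print(rows_for_image(100, 100, 100, par))            # 133
```

With this gauge each stitch is about a third wider than it is tall, so a square source image that's 100 stitches wide needs roughly 133 rows to still look square once knitted.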
Next step, actually knitting the thing (I've already got materials picked out).
As even the most basic Vim user knows, Vim, upon startup, sources a file in the user's home directory called .vimrc (sometimes _vimrc or other variants, depending on the platform). Traditionally, this file is used to store user customizations to Vim. We're talking things like settings, user-defined functions, and a whole raft of other stuff. But it has a rather annoying limitation: it's global. Of course, it's possible to condition a lot of settings based on filetype and so forth, but ultimately this isn't useful if you want to be able to specify settings at a project-level (for example, build settings, search paths, fold settings, etc). And this is where the localvimrc plugin comes in.
With localvimrc loaded, Vim, upon opening a new file or switching buffers, will search up through the directory hierarchy, starting from the location of the file, to find a file named .lvimrc. If such a file is found, the file is sourced just like any other vimrc file. So now, you can place those project-specific configuration items in a .lvimrc in the top-level folder of your project, and voila!, you're good to go.
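The upward search is simple enough to sketch. The snippet below is Python rather than Vim script, and it is only my illustration of the idea, not the plugin's actual code: find_local_rc is a hypothetical name, and this version stops at the first match, which may not be exactly how the plugin behaves.

```python
from pathlib import Path


def find_local_rc(file_path, rc_name=".lvimrc"):
    """Walk up from the file's directory and return the nearest rc file,
    or None if no directory up to the filesystem root contains one."""
    start = Path(file_path).resolve().parent
    for directory in [start, *start.parents]:
        candidate = directory / rc_name
        if candidate.is_file():
            return candidate
    return None
```

For example, calling find_local_rc on a file a couple of folders deep in a project would return the .lvimrc in the project's top-level folder, assuming one exists there and nothing closer shadows it.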
Of course, this alone is pretty damned useful, but there's another somewhat less obvious but handy feature of localvimrc files: they make it possible to find the root directory of your project. And that is exceptionally useful for a few purposes:
- You can set up CommandT to always search from the top of your project, regardless of where you invoked Vim. I love this because I tend to navigate around in the shell and then edit files willy-nilly.
- You can configure project-level build commands which understand how to jump to the top of your project to run them.
- You can set up the Vim search path so that gf always works.
- You can load up project-wide cscope and ctags files.
- Probably lots of other stuff.
So, how does this work? Well, below is a sample of one of my lvimrc files:
if (!exists("g:loaded_lvimrc"))
  let g:loaded_lvimrc = 1
  let s:rcpath = expand("<sfile>:p:h")

  exec "map <leader>t :CommandTFlush<cr>\\|:CommandT " . s:rcpath . "<cr>"
  exec "set path=" . s:rcpath . "/**"
endif
The first thing you'll notice is the guard. The lvimrc file is loaded whenever a buffer switch occurs, so this allows you to control which things are evaluated every time the file is sourced and which are only executed once.
Anyway, the real magic is in the subsequent lines. First, we use the expand() function to get the full path of the directory containing the file being sourced (remember, this is our .lvimrc file, so this will be the top-level directory of our project). Then, we use that information to remap the CommandT command to run from the top-level project directory. Nice!
And, as I mentioned, you can do a lot of other things here. You can see the second line sets up our search path so we can gf to files in the project, as an example. Personally, I actually created a function in my .vimrc file called LocalVimRCLoadedHook(rcpath) that contains a lot of standardized logic for handling projects. You can find samples of all that on my Vim-Config GitHub Project.