Here’s another solution to the Secret Santa code kata, this time a scouting mission into functional programming with F#.
To recap:
My first solution, written in PowerShell, relied on selecting random pairs of people and using a “hill-climbing” algorithm to avoid getting stuck.
My second solution I constrained to be deterministic — no randomness.
This one was more about trying to write something meaningful in F# using a problem I’m by now familiar with. Take a look at the code:
As you can see, I left myself a few to-do items. I time-boxed the effort at three hours, which is why I went ahead and borrowed the shuffle logic from the internet.
The first hour was spent learning to speak F# again, especially the method signatures. I had played with F# when it was in beta, so I didn’t go into the kata stone cold; even so, it took me a while to troubleshoot why my methods didn’t have the signatures I expected.
I spent the most time writing the pairings function. Once I had the insight of using the match operator, the algorithm came together very quickly. I decided that I wasn’t happy with it being deterministic. I found that if I provided a file that was already in an appropriate order, pairings regurgitated the list. Dissatisfied, I went about trying to shuffle the array. Not being familiar with unit testing in F#, I used the F# Interactive console. I got frustrated with trying to implement the shuffle algorithm, so I decided to study a working copy.
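The original function isn’t reproduced here, but a match-based pairings has roughly the following shape. This is a minimal sketch (the names are mine), in which each person gives to the next in the list and the last gives back to the first:

let pairings (people: string list) =
    let rec loop first remaining =
        match remaining with
        | giver :: receiver :: rest -> (giver, receiver) :: loop first (receiver :: rest)
        | [ giver ] -> [ (giver, first) ]   // the last person gives to the first
        | [] -> []
    match people with
    | [] -> []
    | first :: _ -> loop first people

Written this way, the output follows the input order exactly, which is why an already-ordered file came back unchanged.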
The hard part came in integrating it with my code. I resorted to using a while loop, which I know from what I’ve read about functional programming means that I’m not thinking the problem through. I believe I couldn’t convert this into a recursive function because shuffle returns unit (that is, it does its work in place). If it returned an array, I bet I’d have more luck. I’ll have to experiment further.
My attempts to rewrite the loop badly broke the program. I used comments as a poor man’s source control, and I got lucky. I really should have checked the first working version into my local Subversion repository.
Some time passed. I managed to look at the do-while problem again, and after about 90 minutes, I solved both it and another issue that bugged me.
Here’s the second revision:
The do-while loop was solved by making the shuffling function return the shuffled contents instead of unit (void in C# parlance).
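In sketch form (this is illustrative, not the original code, and isValid stands in for whatever constraint check the program needs), the difference looks like this:

// Mutates in place and returns unit: the caller is forced into a loop.
let shuffleInPlace (rand: System.Random) (a: 'a array) =
    for i in (a.Length - 1) .. -1 .. 1 do
        let j = rand.Next(i + 1)
        let tmp = a.[i]
        a.[i] <- a.[j]
        a.[j] <- tmp

// Returning the array instead lets retry-until-valid be recursive.
let rec shuffleUntil (rand: System.Random) isValid (a: 'a array) =
    shuffleInPlace rand a
    if isValid a then a else shuffleUntil rand isValid a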
The second change was to get away from arrays. Arrays in F# are mutable, and I wanted to use immutable data structures as much as possible.
The shuffler was the main barrier. Once I hit upon the idea of associating each person with external data, in this case a random weighting, the rest fell into place. You may also notice some language-usage improvements, such as using a.Head instead of a.[0].
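A sketch of the weighting idea (again illustrative rather than the original code): pair each person with a random key, sort on the key, and throw the key away. No mutation is involved:

let shuffle (rand: System.Random) people =
    people
    |> List.map (fun person -> rand.Next(), person)   // attach a random weight
    |> List.sortBy fst                                // order by the weight
    |> List.map snd                                   // discard the weight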
Here’s another solution to the Secret Santa code kata, this time in my “native” programming language of C#. This is the second solution I’ve written this week to this problem. Code katas are about experimenting with different approaches to simple but not trivial problems.
My first solution, written in PowerShell, relied on selecting random pairs of people and using a “hill-climbing” algorithm to avoid getting stuck. This time, I gave myself the constraint that the solution had to be deterministic — no randomness. I had been toying with the idea of using an Abelian ring. A couple of common examples are the hours on a clock or the modulus (remainder) operator in math. But I couldn’t decide how I’d handle iterating through the members of that ring without duplicates. I determined that I’d need to re-sort the list.
I wrote this solution using test-driven development (TDD), meaning I wrote my tests before implementation — I won’t show all of those tests today. As it turned out, I didn’t need to code a Ring class. I find when I do TDD, I often don’t end up writing half the code I thought I’d need!
Unlike my PowerShell solution which just used strings, I knew I wanted to create a data object to store the incoming rows of data in, which I named the Participant class:
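The class itself isn’t shown here; a minimal sketch of its likely shape (the property names are my assumption) looks like this:

// Sketch only; assumes input rows like "First Last <email>".
public class Participant
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
    public string Email { get; set; }

    public override string ToString()
    {
        return string.Format("{0} {1} <{2}>", FirstName, LastName, Email);
    }
}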
For a while, I thought I might need to implement IEquatable<Participant>, Equals(), and GetHashCode() in order to compare Participants, but again, TDD proved me wrong.
The plan for this approach was to:
1. Parse the incoming list
2. Resort the list
3. Display the results
The act of resorting took the majority of the development time. I created a ListResorter class to do the work.
I started by writing tests…
These are the first two tests I wrote. I decided the simplest operation I needed to resort a list was the ability to swap two items in that list. Once I had that working, I picked the next piece, the ability to decide if two things can adjoin (“to be close to or in contact with”), and I proceeded from there.
Notice that the ListResorter<T> constructor takes a list to operate against and a function that compares two instances of type T and determines whether they can be adjoining in the list. In my unit tests, I used (x,y) => x % 2 != y % 2, which is a fancy way of saying that two odds or two evens can’t be next to each other in the list. I deliberately used a different type (int) than the one in my real use case, to make sure I didn’t bake any assumptions about the nature of type T into the class. The odd/even comparison was simply the first one for two numbers that came to mind.
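For illustration, a first pair of tests in that spirit could look like this (NUnit and all of the method names here are my assumptions, not the original tests):

using System.Collections.Generic;
using NUnit.Framework;

[TestFixture]
public class ListResorterTests
{
    [Test]
    public void Swap_ExchangesTwoItems()
    {
        var list = new List<int> { 1, 2, 3 };
        var resorter = new ListResorter<int>(list, (x, y) => x % 2 != y % 2);
        resorter.Swap(0, 2);
        CollectionAssert.AreEqual(new[] { 3, 2, 1 }, list);
    }

    [Test]
    public void CanAdjoin_IsFalseForTwoOdds()
    {
        var resorter = new ListResorter<int>(new List<int>(), (x, y) => x % 2 != y % 2);
        Assert.IsFalse(resorter.CanAdjoin(3, 5));
    }
}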
Each time I needed functionality out of the ListResorter, I wrote a test. I watched it fail, then I made all the tests pass. If I saw opportunities to refactor, I took them whenever all my tests were green (passing). By the time I was done, I had about a dozen tests and this class:
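The original listing isn’t reproduced here; the sketch below shows the shape described, with method bodies that are my assumptions rather than the original implementation:

using System;
using System.Collections.Generic;

public class ListResorter<T>
{
    private readonly IList<T> _list;
    private readonly Func<T, T, bool> _canAdjoin;

    public ListResorter(IList<T> list, Func<T, T, bool> canAdjoin)
    {
        _list = list;
        _canAdjoin = canAdjoin;
    }

    // The list is a ring: the item after the last is the first.
    public int GetNextIndex(int index)
    {
        return (index + 1) % _list.Count;
    }

    // Walk the ring; when two neighbors can't adjoin, look ahead for
    // an item that can, and swap it into place.
    public void Resort()
    {
        for (int i = 0; i < _list.Count; i++)
        {
            int next = GetNextIndex(i);
            if (CanAdjoin(_list[i], _list[next])) continue;
            for (int j = GetNextIndex(next); j != i; j = GetNextIndex(j))
            {
                if (CanAdjoin(_list[i], _list[j]))
                {
                    Swap(next, j);
                    break;
                }
            }
        }
    }

    internal bool CanAdjoin(T left, T right)
    {
        return _canAdjoin(left, right);
    }

    internal void Swap(int x, int y)
    {
        ThrowIfIndexesAreInvalid(x, y);
        T temp = _list[x];
        _list[x] = _list[y];
        _list[y] = temp;
    }

    internal void ThrowIfIndexesAreInvalid(params int[] indexes)
    {
        foreach (int index in indexes)
        {
            if (index < 0 || index >= _list.Count)
                throw new ArgumentOutOfRangeException("indexes");
        }
    }
}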
This class has two public methods, GetNextIndex() and Resort(). That Abelian ring idea still lives in the GetNextIndex() method, which says that the next member after the last in the list is the first, making the list circular. Resort() does what you would expect.
The other methods in the class are marked internal so that my TDD tests can access them via the [assembly: InternalsVisibleTo()] attribute in the Assembly.cs code file. After the design settles, I would consider rewriting the test methods that talk to internal methods so that everything is exercised through the public methods; I don’t want my unit tests to break should someone decide to change the internal implementation of these methods. You can see a bit of this thinking in the ThrowIfIndexesAreInvalid() method, which I pulled out to avoid duplicating code during refactoring, where I already had unit tests in place and thus didn’t need to write new ones.
Once I had ListResorter working, writing a console wrapper was easy:
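A sketch of what that wrapper can look like (the parsing details and helper names are my assumptions):

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class Program
{
    public static void Main(string[] args)
    {
        List<Participant> participants = File.ReadAllLines(args[0])
            .Select(Parse)
            .ToList();

        // A giver and receiver may not share a surname.
        var resorter = new ListResorter<Participant>(
            participants, (a, b) => a.LastName != b.LastName);
        resorter.Resort();

        DisplayParticipants(participants, resorter);
    }

    // Assumes rows like "First Last <email>".
    static Participant Parse(string line)
    {
        string[] parts = line.Split(' ');
        return new Participant
        {
            FirstName = parts[0],
            LastName = parts[1],
            Email = parts[2].Trim('<', '>')
        };
    }

    static void DisplayParticipants(IList<Participant> people, ListResorter<Participant> resorter)
    {
        for (int i = 0; i < people.Count; i++)
        {
            Console.WriteLine("{0} gives a gift to {1}",
                people[i], people[resorter.GetNextIndex(i)]);
        }
    }
}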
Most of the work done in the console app is formatting the output for display. The heavy lifting is done by the ListResorter class.
This program outputs data like:
PS C:\temp> .\SantaKataApp.exe .\names_santa.txt
Luke Skywalker <luke@theforce.net> gives a gift to Toula Portokalos <toula@manhunter.org>
Toula Portokalos <toula@manhunter.org> gives a gift to Leia Skywalker <leia@therebellion.org>
Leia Skywalker <leia@therebellion.org> gives a gift to Gus Portokalos <gus@weareallfruit.net>
...
I met my goal: the output is the same every time given the same input file.
I time-boxed the effort at two hours and came close to reaching my limit. I began by writing an iterator, but abandoned it when I realized that I only wanted to traverse the collection once.
There is more that could be done to this program, of course. For example, I’m not happy that DisplayParticipants() uses GetNextIndex(): the only reason it needs a ListResorter at all is that ring-like method, which should be extracted into a class of its own.
The team at work is doing a Code Kata, the Secret Santa kata from Ruby Quiz. In it, you are given a list of names and emails and need to pair givers with receivers, with the only constraint being that a giver and receiver can’t have the same surname.
I wrote my solution in PowerShell. I wanted to use native PowerShell functionality, so I didn’t import any .NET libraries. I know this post is brief, but I wanted to get my solution posted with a little commentary while it was still fresh in my mind.
When run like this:
PS> .\santa.ps1 .\names.txt
It gives output like:
Giver                           Receiver
-----                           --------
John Smith <jsmith@example.com> Jane Doe <jdoe@example.com>
...
Here’s the code:
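The original script isn’t reproduced here; the sketch below follows the description in the next few paragraphs. The names and details are my assumptions, and ConvertTo-Array is a minimal stand-in for the helper mentioned later:

# Sketch only; assumes input lines like "First Last <email>".
param([string]$Path = ".\names.txt")

$none = "(none)"

function Get-Surname($person) { $person.Split(" ")[1] }

# Minimal stand-in for the ConvertTo-Array helper: relabels the
# hash table's Name/Value columns as Giver/Receiver.
function ConvertTo-Array($table) {
    $table.GetEnumerator() | ForEach-Object {
        New-Object PSObject -Property @{ Giver = $_.Key; Receiver = $_.Value }
    }
}

# Hash table of giver -> receiver; everyone starts unmatched.
$pairs = @{}
Get-Content $Path | ForEach-Object { $pairs[$_] = $none }

while (@($pairs.Keys | Where-Object { $pairs[$_] -eq $none }).Count -gt 0) {
    $from = @($pairs.Keys | Where-Object { $pairs[$_] -eq $none }) | Get-Random
    $taken = @($pairs.Values)
    $candidates = @($pairs.Keys | Where-Object {
        $taken -notcontains $_ -and (Get-Surname $_) -ne (Get-Surname $from)
    })
    if ($candidates.Count -gt 0) {
        $to = $candidates | Get-Random
        $pairs[$from] = $to
    }
    else {
        # Hill climbing: steal the receiver from a random pair that
        # would have matched, throwing that giver back into the pool.
        $victim = @($pairs.Keys | Where-Object {
            $pairs[$_] -ne $none -and (Get-Surname $pairs[$_]) -ne (Get-Surname $from)
        }) | Get-Random
        $pairs[$from] = $pairs[$victim]
        $pairs[$victim] = $none
    }
}

ConvertTo-Array $pairs | Format-Table Giver, Receiver -AutoSize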
It reads the contents of the file and writes each line to a hash table, where every receiver starts as $none (a special value I created and used like $null).
As long as there are unmatched people, the algorithm selects a random gift giver who doesn’t yet have a receiver ($from) and a receiver who doesn’t yet have a giver ($to). The split(" ")[1] yields a person’s surname.
I used a variant on the “Hill Climbing” algorithm mentioned on the site. If I run into a situation where there are no more matches, I take a random person who could have been a match and throw them back into the pool of candidates. In this way, the algorithm never gets stuck.
At the end, I call my ConvertTo-Array function to beautify the output. Without it, the hash table columns are labeled “Name” and “Value”, and that didn’t satisfy me. (I couldn’t get Format-Table to help.)
I added a progress bar to see how often the hill-climbing code is invoked, and it gets called once or twice per run on average. The script itself has decent performance: it takes a second or so to match 40 names.
Please let me know what you think or ask questions. I’m happy to go into more detail.
“You don’t need more time, you just need to decide” - Seth Godin
At work, there are a lot of things that I’ve identified for improvement.
Where I work is largely a Java shop, but there are a couple of large applications that are .NET and a growing group of .NET developers. And there are a lot of amazing opportunities here.
The company has a good track record of agile delivery using Scrum. However, I also see areas for improvement. The role of scrum master has become institutionalized as little more than meeting facilitation; the scrum master’s responsibility as team coach was lost in translation from the training done here a few years ago.
Our current scrum master wanted the team to attend introductory scrum training. As a certified scrum master, I was anxious to hear what the agile coach on staff would say, and I was gratified to find that I agreed with his curriculum. What our group lacks is practice.
Fortunately, I am well suited to help move the team toward a steady cadence of delivery. We need to get our product owner out of firefighting mode and get him and our customers what they need within the confines of a fixed sprint backlog. Having a strong scrum master will help.
And we need to get beyond our large estimates for small projects due to fear of the unknown with this application — turnover has left most of the delivery team new to the code base.
There’s a desire for a .NET community of practice. I’ve started the ball rolling with a book club. The first book we’ll read together is Working Effectively with Legacy Code by Michael Feathers, which is all about getting pre-written systems under unit test. We’re also starting our first code kata this week, an old Ruby Quiz about Secret Santas.
Both the book club and katas are exercises to start talking about software design in a simpler context. These .NET applications grew organically and were written by developers who were more comfortable with SQL than .NET. The application has shortcomings that you might expect: lots of logic in stored procedures, poor separation of concerns between SQL and .NET, and a presentation layer (ASP.NET web forms) that’s highly coupled to the database.
And the application has a lot of room to grow. It is ripe for becoming a data warehouse solution. I have some mixed experience setting up data warehousing from a previous position, but this time, the team is blessed with a talented database developer onboard to guide us through the tough parts. I can’t wait!
The concern I have is being spread too thin. As you can see, there are a number of vectors to pursue. I need to choose one, maybe two, of these and drive them to completion before taking on others. And I haven’t even touched on the nine months of architecture changes needed for the application to scale to the desired demand, or the expanded features wanted for it this year.
If I have learned anything over the past couple of years, it is that I cannot be all things to all people. I need to pace myself and be ready for some false starts. I also need to find a way to hand off these projects when (if?) they get off the ground. Here’s where having a good manager will help me, and I am blessed to have one.
I feel energized and excited about the possibilities!
At work, I needed to parse a large pipe-delimited file and count how many records’ 5th column meets my criteria and how many don’t.
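The one-liner isn’t reproduced here; reconstructed from the explanation that follows (the file name and the exact HEAD filter are assumptions), it ran along these lines:

gc .\records.txt -ReadCount 1000 |
    ? { $_ -notmatch "^HEAD" } |
    % { foreach ($line in $_) { $line.Split("|")[4] } } |
    group { 256 -le $_ } -NoElement |
    ft -AutoSize

Putting 256 on the left of -le makes PowerShell coerce the string column to a number for the comparison.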
This command does what I want, returning output like this:
Count Name
----- ----
1129339 True
2013703 False
Here’s some explanation in English, for those of you who don’t know PowerShell.
The first command is gc (an alias for Get-Content), which reads the file 1000 lines at a time via the -ReadCount parameter.
The second command is ? (Where-Object), which filters out the HEAD row.
The next command, %, is an alias for Foreach-Object; the object in this case is a 1000-line chunk. The inner loop is a foreach statement, which differs from Foreach-Object in ways that are unimportant here. The point is that nesting % blocks doesn’t work well, because the inner $_ would hide the outer one. The body of the foreach loop splits each line on the pipe delimiter and returns just the 5th column (the first column is numbered 0).
The next command in the chain is group, an alias for Group-Object. Here we group by a calculated property: whether the value from the previous command is greater than or equal to 256. By saying -NoElement, I’m saying I don’t need an enumerated list of the grouped values, which are unimportant here.
Finally, we get to ft (Format-Table). It is necessary because the Count column may exceed 99999, in which case the value would be truncated; the -AutoSize option makes PowerShell size the columns to fit instead.
However, for a 500 MB test file, this command takes about 5.5 minutes to run as measured by Measure-Command. A typical file is over 2 GB, where waiting 20+ minutes is undesirably long.
I posted a query to StackOverflow for some ideas.
While I waited, I discovered that 2500 was the optimum value for -ReadCount, getting the command execution time down to about 3.5 minutes.
Within minutes, I got a helpful hint from Gisli to look into using the .NET StreamReader. Here’s what that Show-SourceCounts script looks like:
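A sketch of that script’s approach (the parameter and variable names are my assumptions):

# Counts lines whose 5th pipe-delimited column is >= 256 versus below.
param([string]$Path)

$reader = New-Object System.IO.StreamReader $Path
$trueCount = 0
$falseCount = 0
try {
    while (($line = $reader.ReadLine()) -ne $null) {
        if ($line -match "^HEAD") { continue }
        if (256 -le $line.Split("|")[4]) { $trueCount++ }
        else { $falseCount++ }
    }
}
finally {
    $reader.Close()
}

New-Object PSObject -Property @{ True = $trueCount; False = $falseCount }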
This script yields the same results as the first command, but in about a minute. Quite an improvement!