Converting a PowerShell script into a pipeline processor


Ever had a situation where you write a program or script, but it takes forever to execute because of the size of the data set? That happened to me recently.

At work, we deal with fixed-width files on occasion. These files represent all the relevant contract data for a company, so they can grow quite large. The one I’m working with today is 1.4 GB in size.

I already had a PowerShell script that would parse the file, but run against this one, it reduced my machine to a crawl, eating up memory until it finally finished executing 2.5 hours later.

    param($file = (Read-Host "Header file"))

    $lines = @()

    # Read every line after the header row and parse the fixed-width fields.
    gc $file | select -skip 1 | % {
        $contractId  = $_.substring(6,6).trim()
        $endDate     = $_.substring(12,10).trim()
        $startDate   = $_.substring(22,10).trim()
        $description = $_.substring(156,80).trim()

        $data = New-Object PSObject -Property @{
            ContractID  = $contractId
            StartDate   = $startDate
            EndDate     = $endDate
            Description = $description
        }

        # Accumulate every object in memory -- this is what kills performance.
        $lines += $data
    }

    $lines

Here’s the command line I used:

PS> .\Parse-ContractHeader.ps1 HEADER.file

As you can see, this script reads the file line by line (after skipping the header row), creates an object for each line, puts the object into an array, and then returns the array. This approach works when the data set is small, but it breaks down as the data grows.

What I needed was a script that would read from and write to the pipeline, so I’d get the results one line at a time instead of all at once.

In PowerShell, there’s a BEGIN-PROCESS-END structure you can use to handle pipeline input and output. Since my script doesn’t need to do anything before or after processing the stream, all I need is a PROCESS block:

    PROCESS {
        $mfgId       = $_.substring(0,6).trim()
        $contractId  = $_.substring(6,6).trim()
        $endDate     = $_.substring(12,10).trim()
        $startDate   = $_.substring(22,10).trim()
        $cc          = $_.substring(127,2).trim()
        $description = $_.substring(156,80).trim()

        $data = New-Object PSObject -Property @{
            ContractID  = $contractId
            StartDate   = $startDate
            EndDate     = $endDate
            Description = $description
        }

        # Emit each object to the pipeline as soon as it's built,
        # instead of accumulating everything in an array.
        $data
    }

The command line is a little different now. I could have put the file reading and header skipping into a BEGIN block, but I wanted the additional versatility of being able to change the input stream.

PS> gc HEADER.file | select -skip 1 | .\Parse-ContractHeader.ps1

By the way, I also recently learned that if you have a script like Parse-ContractHeader.ps1 in your path, you can invoke it by name, just like a built-in cmdlet:

PS> gc HEADER.file | select -skip 1 | Parse-ContractHeader

After I wrote the second version of this script, I found a PowerShell pipeline script template, which offers the added advantage of being able to elegantly handle both pipeline and parameter input. If I need to do more with this script, I’ll give the template a shot.


Steps To Improve Testability


Last time, I talked about testability and why it’s important. Today, I continue the conversation with techniques to improve testability. 

Steps Forward

Though I’ve used different technologies throughout my career to achieve these benefits, I’ll draw the examples here from my current team.

Learn the application and the business

All but one of us is new to the team, so we’ve had to learn how contract pricing works and how our application implements it. We take every opportunity to talk to support personnel, trainers, product owners and customers to learn more about the domain.

As we do so, we imagine new ways to test the application. We’re starting to ask better questions, and that is leading to more meaningful tests.

Define a standard test data set

The current automated test suite does not offer adequate scenario coverage. For example, there are manual steps required to set up simulated customer data feeds. The team has been working on a comprehensive gold standard data set that can be used for tests, as well as automated setup and teardown of test data.
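
To make that concrete, here’s a minimal NUnit sketch of the direction we’re heading. GoldData is a hypothetical stand-in for whatever mechanism loads the gold standard data set, not a real library:

    using NUnit.Framework;

    // Hypothetical stand-in: e.g., truncate the affected tables, then
    // bulk-load known rows from version-controlled files.
    public static class GoldData
    {
        public static void Load() { /* ... */ }
    }

    [TestFixture]
    public class ContractPricingTests
    {
        [SetUp]
        public void ResetDatabaseToKnownState()
        {
            // Every test starts from the same data, so a failure points
            // at the code under test instead of leftover data.
            GoldData.Load();
        }

        [Test]
        public void QuoteUsesTheMostSpecificContract()
        {
            // ... exercise the pricing service against the known data ...
        }
    }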

The thinking is that if the database is in a known state prior to testing, we’ll have fewer test failures due to aberrant data conditions. For this application, given the nature of its data and the number of existing tests, I feel confident that we’ll continue to use this technique. However, there are other ways to inject data into tests, discussed below.

Write tests for new features and defect fixes

This almost goes without saying. Even though there is a speed cost in doing so, we avoid incurring additional test debt by making sure our new code is tested. These tests conform to our latest thinking around how to test the application.

Begin to form a regression test suite

The system uses standard data file formats to import data. Today, we take a scrubbed copy of the production database, look for a customer with a similar data scenario, and write the tests to use that customer.

This strategy fails whenever we refresh from production. Sometimes, a customer’s business model changes and their data no longer fits the profile a test scenario needs. In other cases, the contracts have been renewed or have expired, so we need to find new data scenarios and retrofit the tests to use the new data.

To lessen our dependence on external data imports for current records, the team has invested in tools to assist in the creation of simulated data import files. For example, it will be invaluable to be able to take one customer’s data set, tweak it, and then import it as a different customer. Another tool to update the effective dates of some contracts in bulk according to some data scenario rules will similarly prove its worth.

Use the best testing technologies for the testing scenario

NUnit may not be the best vehicle for testing data-intensive applications. Rather than use NUnit as a database script automation engine, the team has researched frameworks like dbUnit, which allow tests to be written more directly in SQL, a more suitable language for the job.

We’ve also been looking into using FitNesse to automate business logic tests. This will allow QA and Product to help developers improve our automated test coverage through writing test specifications on the Wiki.

Use a continuous integration solution to automate test execution

We’ve been expanding the role of TeamCity in our development environment. Beyond running just the unit tests, the team has added builds that:

  • deploy database revisions to our shared development environment

  • deploy the application to shared development

  • run code coverage using dotCover, which is bundled with TeamCity

  • run a code duplication detection tool

  • run the FitNesse test suite periodically

If you haven’t looked into the automation capabilities of a continuous integration tool, I suggest you do so.

Use transactions in unit tests to avoid causing side effects

As mentioned before, a number of our unit tests do not restore the database to its initial state, which creates ordering dependencies in our test suite. Database transactions can solve this problem: if a test’s operations are batched inside a transaction that is never committed, the changes are never written to the database.

A database testing framework like dbUnit does this automatically. I’ve done this manually in some cases, though I’ve found that writing tests by hand that read a value, update a value, then read it again in a single transaction can be tricky.
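
When the manual approach does fit, the shape is roughly this: .NET’s TransactionScope opens an ambient transaction in setup and rolls it back in teardown (the test body is elided):

    using System.Transactions;
    using NUnit.Framework;

    [TestFixture]
    public class TransactionalDatabaseTests
    {
        private TransactionScope _scope;

        [SetUp]
        public void OpenTransaction()
        {
            // Connections opened inside the scope enlist in this
            // ambient transaction automatically.
            _scope = new TransactionScope();
        }

        [TearDown]
        public void RollBack()
        {
            // Disposing without calling Complete() rolls everything
            // back, so nothing the test wrote survives.
            _scope.Dispose();
        }

        [Test]
        public void UpdatesDoNotLeakOutOfTheTest()
        {
            // ... read a value, update it, read it back: all three
            // operations share one transaction, and the update
            // vanishes when the scope is disposed ...
        }
    }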

To avoid failures from bad data conditions, some of our older tests wipe the entire database clean and load seed data from scratch. While effective, in a suite of a thousand tests, this is time-consuming and often overkill! Worse, other tests don’t concern themselves with data setup at all and just expect the database to be in a certain state by the time they run!

When we encounter errant tests, we’ve been ensuring that they set up their own data and call appropriate clean-up scripts. This includes getting subsequent tests to do their own data setup instead of relying upon the side effects of previous tests. Although it is slow, at least the tests are independent of each other. We’re also considering bundling the data setup of tests into feature sets, to reduce the number of times the data in the tables needs to be wiped and reloaded. If we decide to adopt dbUnit, we’ll use the list of poorest-performing tests as a place to start.

Use a separate database for unit testing to avoid causing side effects

As you might expect, constantly setting up and tearing down a shared database would make the user interface almost unusable. Since we also need to do exploratory testing, and other groups need to integrate with us, we set up an independent database that the unit tests run against.

With multiple databases in multiple environments, change management is a concern. We write database change scripts in SQL, in both forward and backward versions. Each database has a table that contains its version information. We have a C# module that knows how to upgrade or downgrade a database to a target version number. For the most part, this strategy works very well.
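
Here’s a stripped-down sketch of the idea. The names are hypothetical, and our real module handles more detail (script discovery, batching, error handling):

    using System;
    using System.Data.SqlClient;

    // Hypothetical: scripts come in numbered forward/backward pairs.
    public interface IScriptStore
    {
        string Forward(int version);   // SQL that upgrades the database TO this version
        string Backward(int version);  // SQL that undoes this version
    }

    public class DatabaseMigrator
    {
        private readonly string _connectionString;
        private readonly IScriptStore _scripts;

        public DatabaseMigrator(string connectionString, IScriptStore scripts)
        {
            _connectionString = connectionString;
            _scripts = scripts;
        }

        public void MigrateTo(int target)
        {
            using (var connection = new SqlConnection(_connectionString))
            {
                connection.Open();
                int current = CurrentVersion(connection);

                while (current < target)
                    current = Run(connection, _scripts.Forward(current + 1), current + 1);
                while (current > target)
                    current = Run(connection, _scripts.Backward(current), current - 1);
            }
        }

        private static int CurrentVersion(SqlConnection connection)
        {
            // Each database carries its own version in a one-row table.
            using (var cmd = new SqlCommand("SELECT Version FROM DatabaseVersion", connection))
                return Convert.ToInt32(cmd.ExecuteScalar());
        }

        private static int Run(SqlConnection connection, string script, int newVersion)
        {
            using (var cmd = new SqlCommand(script, connection))
                cmd.ExecuteNonQuery();

            using (var cmd = new SqlCommand("UPDATE DatabaseVersion SET Version = @v", connection))
            {
                cmd.Parameters.AddWithValue("@v", newVersion);
                cmd.ExecuteNonQuery();
            }
            return newVersion;
        }
    }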

Having the application itself keep track of the different database contexts is complicated. In some test suites I’ve seen, a base class is used to provide configuration and test data for a suite of tests. Because C# does not allow multiple inheritance, this design can be a drawback when testing different parts of the application under the same data conditions. Tests for the contract expiration algorithm might appear in multiple test suites, for example.

To combat this, we use Ninject as an IoC container to separate configuration from usage. Inversion of control is a powerful technique that deserves its own article, so I’ll delve into this topic later.
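
I’ll save the details for that article, but here’s the flavor of it, using Ninject with hypothetical types. The point is that configuration lives in the container, not in the classes that use it:

    using Ninject;

    // Hypothetical types for illustration.
    public interface IContractRepository { /* data access operations */ }
    public class SqlContractRepository : IContractRepository { }

    public class PricingService
    {
        private readonly IContractRepository _contracts;
        public PricingService(IContractRepository contracts) { _contracts = contracts; }
    }

    public static class Bootstrapper
    {
        public static PricingService Create()
        {
            // Configuration lives here, not in the classes that use it.
            var kernel = new StandardKernel();
            kernel.Bind<IContractRepository>().To<SqlContractRepository>();

            // Ninject builds PricingService and supplies its constructor
            // dependencies; a test can Rebind the interface to a fake
            // without touching PricingService at all.
            return kernel.Get<PricingService>();
        }
    }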

Mock the data access layer to remove the database dependency

For logic that lives in the C# code atop the database, we can mock the data access components rather than set up data and run an integration test all the way down into the database. We’re using Moq as our mocking framework. Mostly through constructor injection, but also through other Ninject injection patterns, we provide our domain objects with mock data access layers.
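
A sketch of the pattern, with a hypothetical repository interface and domain object:

    using Moq;
    using NUnit.Framework;

    // Hypothetical data access interface and domain object.
    public interface IContractRepository
    {
        decimal GetUnitPrice(string contractId, string itemId);
    }

    public class Quote
    {
        private readonly IContractRepository _contracts;

        public Quote(IContractRepository contracts)   // constructor injection
        {
            _contracts = contracts;
        }

        public decimal PriceFor(string contractId, string itemId, int quantity)
        {
            return _contracts.GetUnitPrice(contractId, itemId) * quantity;
        }
    }

    [TestFixture]
    public class QuoteTests
    {
        [Test]
        public void MultipliesUnitPriceByQuantity()
        {
            // The mock plays the part of the data access layer; no
            // database, no setup scripts, no cleanup.
            var contracts = new Mock<IContractRepository>();
            contracts.Setup(c => c.GetUnitPrice("C-1234", "GUM-17")).Returns(0.25m);

            var quote = new Quote(contracts.Object);

            Assert.AreEqual(4.25m, quote.PriceFor("C-1234", "GUM-17", 17));
        }
    }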

A growing fraction of our tests don’t need to talk to the database at all. Most of our newly authored tests follow this doctrine, though adoption has been slow because we need to be cognizant of the risks of refactoring for testability without tests in place to validate the refactoring itself.

Move business logic into the domain layer where practical

We have also found several cases where business logic was placed in the database because SQL was the language previous teams were more comfortable with, not because it was the most natural place for the logic to reside.

This manifests in our code as several lookup queries that are repeated throughout stored procedures. These queries gather data for what should be domain objects. Then, the stored procedures modify the data as methods on an object would. The insidious part of this design is that the stored procedures modify the data in subtly different ways that should be logically equivalent, except where there’s a genuine business need for variance. As a result, subtle bugs emerge on occasion.

I believe in Command-Query Separation (CQS). I think that commands, with their complicated decision trees, are often better expressed in an object-oriented language like C# or Java than in a set-based relational query language like SQL. Conversely, I think queries and other set-based operations are best expressed in SQL – this is what database engines are optimized for.

With the advent of LINQ, though, I find myself making more use of database views for filtering and transforming data instead of stored procedures. I reserve SQL stored procedures for more advanced situations, especially where performance is a concern.
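
As a sketch of what that looks like, assume a hypothetical ActiveContracts view has been mapped to an entity (through LINQ to SQL or Entity Framework):

    using System;
    using System.Linq;

    // Hypothetical entity assumed to be mapped to a database view
    // (dbo.ActiveContracts) rather than a table.
    public class ActiveContract
    {
        public string ContractId { get; set; }
        public DateTime EndDate { get; set; }
    }

    public static class ExpirationQueries
    {
        // The filter and sort compose in C#; against a mapped view the
        // provider translates them to SQL at execution time, and in a
        // unit test the same code runs over an in-memory list via
        // AsQueryable().
        public static IQueryable<ActiveContract> ExpiringBefore(
            IQueryable<ActiveContract> activeContracts, DateTime cutoff)
        {
            return activeContracts
                .Where(c => c.EndDate < cutoff)
                .OrderBy(c => c.EndDate);
        }
    }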

By consolidating our business algorithms into the domain layer of the application rather than the database, we gain multiple benefits. Separating commands from queries makes the algorithms less complex. We lessen the number of lines of code we support through elimination of duplication. Where variations are necessary, we highlight the differences and the resulting algorithms become clearer. And, through isolation of algorithms into independent objects, we make the algorithms themselves easier to test.

Conclusion

I’ve discussed how lack of testability can affect the ability of teams to move their product forward, along with some of the challenges this application has faced throughout its life. Then, I used a job experience to illustrate strategies teams can use to get their application under test.

In the past, these strategies have been effective in taming the savage untested beast. For further reading, I recommend Working Effectively with Legacy Code by Michael Feathers or [Emergent Design: The Evolutionary Nature of Professional Software Development](http://www.amazon.com/Emergent-Design-Evolutionary-Professional-Development/dp/0321509366/) by Scott Bain.

What strategies have you used? Which were effective?


Testability in Data-Driven Applications


At work, there’s a renewed interest in testability. (I’ve taken experiences from a few positions and combined them into one narrative.)

Testability

I’m a member of the third team to work on our project. The original team has left, the people who replaced them have moved on, and now we maintain and enhance this application. Often we find ourselves doing forensic analysis, trying to divine the reasoning behind some design choices.

For example, there was a case recently where we were looking into a defect and found that the logic causing the defective behavior was quite intentional. No one on the team – nor the current product owner – understands why anyone would ever have wanted the system to behave this way. When faced with a situation like this, tests are a saving grace.

Sadly, the previous teams made some different choices. This product began in a smaller, less structured organization. What we now think of as cowboy coding was de rigueur back when this product was designed and first implemented. The second team, both to evolve with the company – and for their own sanity, I’m sure – implemented some automated testing and were faced with a challenge.

The application imports contracts and provides accurate price quotes. It is architected as an n-tier web application, but its purpose has evolved into that of a data warehouse. More and more business logic has moved into the database, yet the unit tests are written in NUnit and C#. As you may have guessed, many of these tests are not unit tests in the common sense of the word. They are integration tests and rely heavily on having test data in certain conditions.

Consequently, the tests go through a Byzantine setup and teardown process that makes them very slow and brittle. A number of tests don’t clean up properly after themselves, so the suite needs to be run in a certain order to ensure success. In short, adding new tests or changing the data access layer is more time-consuming and complicated than it should be.

So, here we are, the third team. We experience a dynamic tension between two forces. One is Product, who wants to see competitive market features commensurate with a data warehouse solution – faster processing of contract changes, better analytic reporting, and the like. The other is Development Management, who wants to see industry-standard rigor around programming – fewer defects, curtailed production access, and automated testing that can plug into an end-to-end testing regimen for our integrated suite of products. Both groups want to see more predictable and faster feature delivery as we, the third team, learn about the software and the domain. Today, our most conservative estimates turn out to be optimistic.

The Crux of the Matter

Looking at the situation, one can rightly say that the application’s biggest issue is that it’s hard to test. The application calculates correct pricing for supplies based on a number of applicable contracts. As you can imagine, the intersection of these contracts can be complicated, so there are a variety of factors and adjustments that need to be considered. Each one of these data scenarios should be tested before releasing a new feature.

Let’s take unit of measure for example. Let’s say I’m buying gum. Gum could be sold in individual pieces (known in UOM terms as an “each”), in packs which I’ll call packages, and packages are often shrink-wrapped and bundled into what I’ll call sets. As a glimpse into how varied item packaging can be, please consider looking at the 55-page GS1-compliant brand label placement guide PDF for consumable goods found in grocery stores.

Let’s say as a buyer, I want to know the least expensive way I can get 17 sticks of Yummy-Yummy Bubble-Gummy Gum. I have a few avenues to pursue. I can get Yummy-Yummy:

  • directly from the manufacturer, Gobstobbers International,

  • from the Save-A-Lot wholesaler,

  • from the convenience store Quik-Bite, or

  • from my local grocery store, Groceries eXtreme.

Some of them stock 3-pack sets of gum. Others stock 5-packs. At Quik-Bite, I can get individual packages. And, it turns out Gobstobbers International has a deal where they will create custom-size and branded packages like 17 Yummy-Yummy sticks to the package, but only if you order a large enough quantity of them.

To further complicate matters, depending on how many 17-stick units I need, I might be better off buying a box or a case and splitting it myself. And, now that we’ve discussed quantities, it’s worth mentioning that all of these outlets sometimes offer sales, volume pricing, and special discounts to select customers.

Imagine writing an application that decides who to order gum from. There are a lot of data scenarios to consider!
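
To see why, here’s a deliberately toy sketch of just the pack-size slice of the problem. Everything in it is hypothetical, and it ignores mixed sellers, custom packs, and every kind of discount:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    // Toy model of one slice of the problem: each outlet offers packs
    // of a fixed size at a fixed price.
    public class Offer
    {
        public string Seller;
        public int SticksPerPack;
        public decimal PricePerPack;
    }

    public static class GumPricer
    {
        // Cheapest single-seller way to get at least `sticksNeeded`
        // sticks, allowing overbuy; mixing sellers and splitting
        // boxes or cases are left out entirely.
        public static string Cheapest(IEnumerable<Offer> offers, int sticksNeeded)
        {
            var best = offers
                .Select(o =>
                {
                    int packs = (int)Math.Ceiling(sticksNeeded / (double)o.SticksPerPack);
                    return new { o.Seller, Packs = packs, Total = packs * o.PricePerPack };
                })
                .OrderBy(x => x.Total)
                .First();

            return string.Format("{0}: {1} pack(s) at {2:C} total",
                best.Seller, best.Packs, best.Total);
        }
    }

Even this toy version has to decide whether overbuying is acceptable. Add volume pricing, sales, and customer-specific discounts, and every new rule multiplies the data scenarios a test suite has to cover.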

It’s like that with our application. We need a way to run a myriad of data scenarios through our pricing service to make sure we get the expected results.

Next time, I’ll talk about techniques I’ve used to improve the testability of applications.


Group Coding Exercises: the Code Dojo


Lately, I’ve talked about book clubs and code katas. While talking about code katas, I said:

For a long time, I thought that code katas were puzzles that people solved together. I’ll describe a few of those exercises in another article. Group coding exercises are fun but have different goals.

This post explores group coding exercises.


Group Coding versus a Kata

Katas are about refining technique through repetition. Group coding exercises offer no repetition. Once the objective is reached, few participants revisit the problem, certainly not over and over. And, a group coding exercise is not about refining a technique. The act of refining implies removing impurities or improving distinctiveness and accuracy.

If not these things, what is the value in a group coding exercise?

Dojo

The word dojo simply means “room” or “hall”. In taekwondo, we call it a do-jang. It’s a place we get together with fellow martial artists and practice. Some of us are white belts, some are colored belts, and some are black belts.

There’s this idea in a do-jang that we will all practice to our abilities and we will each learn lessons appropriate to our rank. White belts get to see where taekwondo can lead them. Many won’t like what they see, and they leave, and that’s okay. Some find something they like and choose to stay, and that’s okay too.

Colored belts not only get to see their progression, but many seminal moments in their taekwondo journey will happen in a do-jang. There will be a time when they succeed in performing a kata in front of the class, when they get to demonstrate a technique for everyone. There will also be a time when they slip and fall in front of the class, when they misstep and hurt a classmate a little while sparring. Both kinds of experiences are crucial to growth, and a good do-jang offers them in spades.

People learn best when their chances of success and failure are about the same. Conditions where you succeed a lot tend to be monotonous, ultimately leading to hubris and stagnation. Conditions where you fail often are perplexing, infuriating, then discouraging – maybe to the point of apathy – as nothing you try yields the results you want.

Black belts in a do-jang get hands-on experience in teaching others. As teachers have known for a long time, one of the best ways to learn something is to teach it to someone else. Gaps in knowledge get exposed, and it can be unnerving to people who have risen through the ranks. This need to teach lower ranks can cause the most coordinated black belt to despair, because physical and mental strength is no longer enough. A good black belt has social strength and a sturdy character, which is something that can’t be taught but can be grown.

At the same time, black belts are still students. They are exposed to a wider variety of material than they ever were in colored belt ranks. Some find this overwhelming. My teacher, Ms. Moormeier, often used to say that to her, getting a first degree black belt was like graduating high school, and that once you earn your black belt, you know the basics and the real learning can begin.

Code Dojo

Group coding exercises are also known as code dojos. A code dojo is meant to be a safe place to try new programming ideas. Where a code kata hones skills you already have, a code dojo is a place to learn new skills from each other.

As with a code kata, a code dojo starts with a group of programmers and a problem to solve. Like a code kata, the problem is usually simple but not trivial to solve. Unlike a code kata, it often takes longer to solve. Often, it has nothing to do with the daily work, because it’s valuable to spend time [barking up the wrong tree](http://www.bakadesuyo.com/what-does-it-take-to-become-an-expert-at-anyt) on the way to expertise.

In a code dojo, the group codes up different solutions to the same problem. Sometimes, the people take turns in front of a projector working on an implementation. Other times, they go off on their own or in pairs and work on a solution. In my groups, people often use different languages, but the group may decide to all use one programming language.

The important part comes afterward. The group comes back together and discusses the problem and the presented implementations. As each person presents, the group talks about each solution’s relative merits. Inevitably, people find an approach fascinating. Sometimes, it’s an approach they have never seen or wouldn’t have thought of. Other times, it’s because someone took the same approach and there are deep similarities between that solution and their own.

White belt programmers get validation that they do indeed know how to code something. They also learn how more experienced programmers deal with criticism and debate about their creations. Colored belt programmers get exposed to a variety of techniques, and their careers begin to branch out in unexpected and fruitful ways. Black belt programmers learn to mentor other programmers.

The Value of Repetition

While it’s true that a group doesn’t often repeat the same exercise over and over, I have found it instructive to try the same problem more than once with different techniques.

For example, I’ve tried the Secret Santa problem in PowerShell, C#, and F#. The F# post describes the differences in approach each time. It’s true: there is no single solution to a computing problem!

Feedback

Have you participated in a code dojo? What has worked for you? What hasn’t? What’s your favorite group coding exercise?


Code Katas


In my last post, I talked about book clubs. In it, I touched on code katas.

For a long time, I thought that code katas were puzzles that people solved together. I’ll describe a few of those exercises in another article. Group coding exercises are fun but have different goals.


Katas

Katas are a concept from martial arts. I’ll speak from a taekwondo perspective, since I have a second degree black belt. As a white belt, you start off learning some of the Korean terms we use in the do-jang (gym) and a couple of basic stances. Next, we teach you how to punch correctly and how to block a front kick – a common kick that’s also taught to white belts. With these three moves, we teach a beginning exercise called four-direction punch.

It’s not too exciting to watch, except as a parent of an enthusiastic white belt. (Or, watching a master do it.) These are the moves: you punch, you turn in stance, you block, you step forward and punch, and so on, until you’ve faced all four directions.

At first glance, it may not seem very useful to learn this. You can’t compete in tournaments with it, and it won’t save you in an altercation on the street. Alone.

Alone, it won’t do those things. But, practiced regularly, an exercise like this will help hone your taekwondo. To do it correctly, you need to punch just so, turn just so, land the stance just so, and block just so – and you need to mean it. Crisp, precise, powerful technique is the demand.

Katas, with Code

Code katas are the same. They involve simple goals, implemented the same way, over and over. Jim Hood is the one who taught me the true meaning of a code kata.

Every morning, at a prescribed time, in a prescribed conference room, Jim leads people through the same kata. They work on speed, but without rushing. They work on test-driven development, not because they don’t know how to implement the algorithm, but precisely because they do know how to write a prime number generator.

The Benefits

Programmers should practice code katas for the same reason that martial artists practice their forms. The form drills the moves into your brain and into your muscle memory. When you are tired after the initial flurry of blows while sparring, or when you’re distracted and caught unaware in a serious situation, you will not spend precious heartbeats thinking about how to respond. You will do what you’ve trained to do.

When you, the programmer, are stuck on how to implement the complicated business rules your product owner just explained to you, you want to implement well-designed, test-driven code. A code kata is the foundation for solid code implementations.

Attributes of a Good Code Kata

I’ve already mentioned one: the prime number generator, sketched below. It:

  • has a very small scope (one function)
  • stands alone, requiring no infrastructure
  • can be covered by a handful of tests
  • can be implemented in 10-15 minutes
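
Here’s one shape the kata can take, sketched in C# with NUnit. The code itself isn’t the point – the daily repetition of test-driving it is:

    using System.Collections.Generic;
    using NUnit.Framework;

    public static class Primes
    {
        // Trial division is plenty for a kata; the exercise is in the
        // repetition, not the algorithm.
        public static IEnumerable<int> UpTo(int limit)
        {
            for (int candidate = 2; candidate <= limit; candidate++)
            {
                bool isPrime = true;
                for (int divisor = 2; divisor * divisor <= candidate; divisor++)
                {
                    if (candidate % divisor == 0) { isPrime = false; break; }
                }
                if (isPrime) yield return candidate;
            }
        }
    }

    [TestFixture]
    public class PrimesTests
    {
        [Test]
        public void GeneratesPrimesUpToTwenty()
        {
            CollectionAssert.AreEqual(
                new[] { 2, 3, 5, 7, 11, 13, 17, 19 },
                Primes.UpTo(20));
        }
    }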

Some Examples

The Fizz Buzz Test is a staple of programming interviews, and it meets all our criteria.

The Project Euler problems contain a number of candidates. Jim may have gotten his prime number generator kata from there. Prime factors, palindromic numbers, and Pythagorean triplets might also make good ones.

The kata catalog on Coding Dojo has a number of good problems to solve. The Potter kata and Roman numerals kata look promising.

These 15 exercises to know a programming language are good for learning a language, but I think they might be a little long for a code kata. If you have the time, though, these exercises bring up important language concepts.

Peter Provost has a good blog post about katas, in which he talks about them as a vehicle to learn TDD.

Do you have a favorite kata? Please share it in the comments.