Saturday, July 31, 2010

Business Process Modeling (BPM) with .NET Workflow Foundation (WF) 4.0

I'm currently reading the book Pro WF: Windows Workflow in .NET 4. As a short interim verdict: it's a great resource! I'll write a complete review when I'm done with the rest of the book, and I'll also follow up with a post giving a complete overview of Workflow Foundation (WF) 4.0.

This post provides a simple example of how WF can be used by our clients to do real Business Process Modeling (BPM).

Suppose we are developing a sales order system. Depending on an order's net amount, the order gets a discount. The calculation of this discount is usually best done in source code (at least in my opinion). So here is a class that represents our default implementation.
using System.Activities;

namespace WfActivityLibrary1 {
   public sealed class DefaultDiscount : CodeActivity {
      [RequiredArgument]
      public InArgument<Order> Order { get; set; }

      protected override void Execute(CodeActivityContext context) {
         Order order = Order.Get(context);

         if (order.NetAmount >= 1000)
            order.Discount = order.NetAmount * 0.2m;
         else if (order.NetAmount >= 500)
            order.Discount = order.NetAmount * 0.1m;
         else if (order.NetAmount >= 100)
            order.Discount = order.NetAmount * 0.05m;
         else
            order.Discount = 0m;
      }
   }
}
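The Order class itself is not part of WF. A minimal sketch, assuming only the two properties the activity above touches (a real order entity would of course carry much more data), could look like this:

```csharp
namespace WfActivityLibrary1 {
   // Minimal assumed entity for this sample; property names match
   // the usage in the DefaultDiscount activity above.
   public class Order {
      public decimal NetAmount { get; set; }
      public decimal Discount { get; set; }
   }
}
```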
The base class CodeActivity is one of the Workflow Foundation base classes for implementing custom workflow activities. Usually I'd move the business logic into a class within the Business Logic Layer (BLL) and consume that class from the WF activity. But that would add extra layers and complexity to this sample without any real benefit for working with WF, which is the topic of this post.

The problem with discounting is that operational business is very dynamic, and the client usually needs to be able to offer sales campaigns to her/his customers. A campaign can be "20% for everything" or "50% off for premium customers".

Until now, this was the point where we started to create dozens of mapping tables and many more classes, containing hundreds of if-statements, to cover the different types of client discounts. The client got some new masks within the application to manage all this new master data or (worse) a way to import externally maintained Excel sheets.

However, what if the client wants to offer a new discount based on information that is not yet covered by the "discount calculation" subsystem of our software? We started to add more master data tables, more classes, more if-statements and more master data masks within the application.
Humans are creative. Especially sales people, when they try to find new ways to sell their products. (Which is a good thing!) So requirements like discount calculation for selling-out campaigns are often spirals that never end.

Business Process Modeling

Often clients do not ask for complicated master data table relations or huge complexity within their system (which is often expensive to realize). They don't understand why it is so complicated to combine discount rates dynamically in different ways. And from a workflow-based point of view, they are right.

Instead of all those master data tables and masks, the client would often prefer some way to really dynamically define their discounts with a few simple components. That's the point where an application that supports those features enters the space of Business Process Modeling (BPM). BPM-enabled software represents a huge benefit to the client, since (s)he no longer needs to file a new change request with the software vendor whenever the business changes. Instead of immutable behavior, the software can be adapted to changing business processes.

BPM with WF

Based on our initial sample, let's take our already existing "Default Discount" workflow activity and one additional activity called "Fix Rate Discount". The "Fix Rate Discount" gets two arguments:
  • Order: The order to be discounted
  • Percentage: The percentage rate to apply to the order
I've omitted the source code of the "Fix Rate Discount" activity, since it is really simple and does not relate to business modeling at all.
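For completeness, here is a rough sketch of what such an activity could look like. The argument names follow the list above; the implementation details (adding the rate on top of any existing discount) are my assumption, not code from this post.

```csharp
using System.Activities;

namespace WfActivityLibrary1 {
   // Hypothetical implementation sketch; the post omits this code.
   public sealed class FixRateDiscount : CodeActivity {
      [RequiredArgument]
      public InArgument<Order> Order { get; set; }

      [RequiredArgument]
      public InArgument<decimal> Percentage { get; set; }

      protected override void Execute(CodeActivityContext context) {
         Order order = Order.Get(context);
         decimal percentage = Percentage.Get(context);

         // Add the fixed rate on top of any already applied discount.
         order.Discount += order.NetAmount * (percentage / 100m);
      }
   }
}
```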
Armed with those two (notice, only two) workflow activities, it's time to revisit our two selling-out campaigns.
  • 20% for everything
  • 50% off for premium customers
WF makes it possible to define new workflows not only in Visual Studio but in any kind of .NET client application, simply by hosting a special WF designer control. So we can include this control in our application (or an external BPM module) and give the client the ability to model their business using any defined activities. Even better, each new workflow represents an activity too and can be reused within any other workflow.
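As a rough sketch of such a rehosting (assuming a WPF window with two `Border` placeholders named `designerPlaceholder` and `propertiesPlaceholder`, and a hypothetical workflow file name; toolbox setup and error handling are omitted):

```csharp
using System.Activities.Core.Presentation;
using System.Activities.Presentation;

public partial class BpmWindow : System.Windows.Window {
   public BpmWindow() {
      InitializeComponent();

      // register the designers for the built-in WF activities
      new DesignerMetadata().Register();

      WorkflowDesigner designer = new WorkflowDesigner();
      designer.Load(@"CustomerDiscount.xaml"); // client-defined workflow

      // designer.View and designer.PropertyInspectorView are WPF
      // controls that can be placed anywhere in the window layout
      designerPlaceholder.Child = designer.View;
      propertiesPlaceholder.Child = designer.PropertyInspectorView;
   }
}
```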

WF provides two different kinds of diagrams. Sequence diagrams represent a sequence of activities, executed top down. Flowchart diagrams are the kind of well-known flow chart familiar from common BPM tools, Microsoft Visio, and most other modeling tools. Both diagram types can easily be combined with each other, and both support activities like "if"-decisions and several different loops.

I've chosen a sequence diagram for our "20% for everything" campaign.


Here you see an excerpt of the WF workflow designer and its related toolboxes. The main component is certainly the diagram designer. Activities can be dragged into the designer and configured. The left toolbox (in my configuration) shows all workflow activities that are currently available. The properties window on the right side can be used to configure individual activities within a diagram and to exchange data. The tool window at the bottom is used to define input and output arguments of the current workflow. For instance, the current workflow expects an input argument of type "Order" called "OrderToDiscount". These arguments are provided at runtime when the system calls the workflow. Within the diagram, the "DefaultDiscount" activity represents the system's implemented default discount calculation. The "Add 20% Discount" activity is a "FixRateDiscount" activity whose name I changed to be more meaningful within the diagram.
For the "50% off for premium customers" workflow, I've chosen a flowchart diagram. Not because it wouldn't be possible to do this with a sequence diagram, but to show the other diagram type.



The "Decision" activity represents an "if"-decision that determines whether the customer's category is "Premium". For sure, in a real system this evaluation should be encapsulated in a custom activity. The "DefaultDiscount" represents our system's default discount calculation. The "50% Discount" activity is another instance of our "FixRateDiscount" activity.

Instrumentation Instead of Development

Nevertheless, be careful with too much logic within those workflows. With some very powerful activities like "Assign" (to assign any value to any variable or property of an object) or "Invoke" (to invoke any method on an object instance), you can use WF to develop almost anything. In my opinion, this is the wrong way to use workflow engines. Those tasks are usually better placed in source code.

I see a huge strength in workflows for enabling clients to change the processing behavior of a system. I don't see value in letting the client develop her/his own system. Usually they neither have the skills to do this, nor are they interested in tiny low-level tasks within their models.

Monday, July 19, 2010

If You Cannot Use Identities; Cache Your Ids

Sometimes it is not possible to use IDENTITY columns in database tables. Unfortunately, one of the most common approaches to generating new IDs for non-IDENTITY columns is working with MAX(). I just saw this yet again in a blog post I don't want to link here. In my opinion, the generation of unique ID values is quite basic but should be done correctly, so this blog post is all about that topic.

If you are able to use IDENTITY columns in all tables, you are fine and you can stop reading here. :-)

Reasons Not To Work With Identity Columns

One reason might be that the framework in use does not support them, like Entity Framework in combination with SQL Server Compact. SQL Server Compact does not support batch execution, so every query allows only one SQL statement, and EF does not support issuing a second query for each INSERT to determine the last generated IDENTITY value.

Another reason not to work with IDENTITY columns are scenarios where computed IDs are required, as in distributed systems. I used them to guarantee a unique PRIMARY KEY over all databases (all over the world), where the first part of the primary key was an incrementing BIGINT value and the second part was a SMALLINT describing the database where the table row was created. Some of you might argue that I could have used a Guid instead; however, I still don't like them in databases since they are way too large and too slow.
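Such a two-part key could be declared like this (table and column names are, of course, just an illustration):

```sql
CREATE TABLE DistributedOrders (
   Id BIGINT NOT NULL          -- incrementing value, unique per site
   ,SiteId SMALLINT NOT NULL   -- identifies the originating database
   ,OrderDate DATETIME2
   ,PRIMARY KEY CLUSTERED (Id, SiteId)
);
```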

The MAX() Approach

A very common approach to generating new ID values is to select the current maximum value of the table into which new rows have to be inserted.

Here is a plain SQL sample.
-- ======================================================================
CREATE TABLE Orders (
   Id INT NOT NULL PRIMARY KEY CLUSTERED
   ,OrderDate DATETIME2
);

-- ======================================================================
-- some existing rows
INSERT INTO Orders VALUES (1, SYSDATETIME());

-- ======================================================================
-- generate new ID values, using MAX
DECLARE @newId INT;

-- select current maximal ID + 1
SELECT @newId = ISNULL(MAX(Id), 0) + 1 FROM Orders;

INSERT INTO Orders VALUES(@newId, SYSDATETIME());
Unfortunately, this approach has some issues. First, let's have a look at the performance. Each call of MAX() causes an index scan of the primary key, as shown in the execution plan below. SQL Server 2008 does a great job of avoiding a scan of the whole index; however, this scan for each new row should still be avoided.


The second, more important issue is concurrency. If we are working with SQL Server Compact, this issue does not exist, since that DBMS supports only one connection at a time ;-). In every other environment, where we have more than one concurrent user connection, we might (and occasionally will) get an error when two concurrent inserts both reach the point between the SELECT MAX query and the INSERT in the above sample script at the same time. Both will get the same MAX ID value from the table, but only one of them can use the newly calculated ID for its insert. The second one will receive a primary key violation.

Using an ID Table

To avoid both of the issues noted above, we can use an ID table, containing the name of the table to create a new ID for and the next available ID.
-- ======================================================================
-- id table
CREATE TABLE Ids (
   Name VARCHAR(30) PRIMARY KEY CLUSTERED
   ,NextId BIGINT NOT NULL
);
-- ======================================================================
-- init ids for Orders table
INSERT INTO Ids VALUES('Orders', 1);
GO
-- ======================================================================
DECLARE @id BIGINT;
-- get the next available "Orders" id from Ids table
UPDATE Ids SET
   @id = NextId
   ,NextId += 1
WHERE Name = 'Orders';

INSERT INTO Orders VALUES(@id, SYSDATETIME());
Usually we would move the ID allocation into a procedure, which I've left out for the sake of brevity. This already looks like a much better approach, doesn't it?

However, this solution brings up another issue. Now the whole system has to access (with a write lock) the ID table for every single row inserted into any other table. This usually causes a system-wide performance issue when the system is under load. The solution for this problem is to cache IDs, as explained below.

ID Caching

To avoid this contention on the ID table, clients can use a caching mechanism. This means each client does not allocate new IDs one by one, but always allocates a larger batch of IDs in one step and uses them for the next new rows to be inserted.

The count of IDs to allocate in one step depends on the client and the number of rows it usually creates. A GUI like a web or Windows application is often fine with an ID cache size of 10, while a data loader process might need a cache size of 1000.

For sure, this means some IDs are never used. E.g. if a web application allocates 10 new IDs but the user enters only one new row, 9 IDs are lost. However, this doesn't really matter, since primary key IDs are only needed for uniqueness and should never be presented to the client as a gapless sequence.

The easiest way to allocate new IDs is a simple procedure which gets the name of the table to allocate new IDs for and the count of IDs to allocate, and returns the next available ID to be used.
CREATE PROCEDURE AllocateIds
   @nextId BIGINT OUTPUT
   ,@name VARCHAR(30)
   ,@count INT
AS
SET NOCOUNT ON;

UPDATE Ids SET
   @nextId = NextId
   ,NextId += @count
WHERE Name = @name;
GO
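A caller receives the first ID of a now reserved range. For example, to reserve ten "Orders" IDs in a single round trip:

```sql
DECLARE @firstId BIGINT;

-- reserve ten IDs with one ID-table access
EXECUTE AllocateIds @firstId OUTPUT, 'Orders', 10;

-- @firstId up to @firstId + 9 now belong exclusively to this caller
SELECT @firstId AS FirstReservedId;
```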

Client Side Caching

Caching IDs in a client application is usually quite easy, as long as the client is stateful, like usual web or Windows applications and most web services.

This IdGenerator is a very simple sample of how to cache IDs in the client, utilizing our previously shown stored procedure.
using System;
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

static class IdGenerator {
   class IdEntry {
      public long NextId;
      public long MaxId;
   }
   // table dependent cache
   static IDictionary<string, IdEntry> _cache = new Dictionary<string, IdEntry>();
   // connection and allocation information
   static string _connectionString;
   static int _allocationSize;
   // initializes the connection and allocation information
   public static void Init(string connectionString, int allocationSize) {
      _connectionString = connectionString;
      _allocationSize = allocationSize;
   }
   // public interface to allocate new ids
   public static long NextId(string tableName) {
      IdEntry entry = GetEntry(tableName);
      if (entry.NextId > entry.MaxId)
         AllocateNewIds(entry, tableName);
      return entry.NextId++;
   }
   // get an id entry for a specified table
   private static IdEntry GetEntry(string tableName) {
      IdEntry entry;
      if (!_cache.TryGetValue(tableName, out entry)) {
         entry = new IdEntry { NextId = 1 };
         _cache.Add(tableName, entry);
      }
      return entry;
   }
   // allocate new ids from database
   private static void AllocateNewIds(IdEntry entry, string tableName) {
      Console.WriteLine("Allocating new ids from database");
      using (SqlConnection connection = CreateConnection())
      using (SqlCommand command = new SqlCommand("AllocateIds", connection)) {
         command.CommandType = CommandType.StoredProcedure;

         SqlParameter nextId = command.Parameters.Add("@nextId", SqlDbType.BigInt);
         nextId.Direction = ParameterDirection.Output;
         command.Parameters.Add("@name", SqlDbType.VarChar, 30).Value = tableName;
         command.Parameters.Add("@count", SqlDbType.Int).Value = _allocationSize;
         command.ExecuteNonQuery();
         entry.NextId = (long)nextId.Value;
         entry.MaxId = entry.NextId + _allocationSize - 1;
      }
   }
   // create a new, open database connection
   private static SqlConnection CreateConnection() {
      SqlConnection connection = new SqlConnection(_connectionString);
      connection.Open();
      return connection;
   }
}
... and a sample of how to use the ID generator ...
IdGenerator.Init("Server=.;Database=Sandbox;Integrated Security=sspi;", 10);
for (int i = 0; i < 100; i++) {
   long id = IdGenerator.NextId("Orders");
   Console.WriteLine(id);
}

ID Tables And T-SQL

Well, while ID tables and caching are a good way to go in client applications, they are a bit tricky in T-SQL (and probably in most other SQL dialects). The problem is that T-SQL is stateless, and because of that it does not support caching; we don't have a "static" place where we can store our cached IDs.

One workaround is to create a temp table that mirrors the stateful ID table and serves as our "own" cache. But due to the scope handling of SQL Server, it is not possible to move the temp table creation into a procedure, so this workaround is quite awkward to maintain, since all scripts/procedures that want client-like caching always have to create the temp table on their own.

However, T-SQL scripts should usually never try to work like clients, in a row-based manner, so they should not really need an ID cache. Whenever new rows have to be inserted into a table, the executing procedure/script should determine the full count of needed rows, allocate the required range of IDs and insert all rows in one operation.
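As a sketch of that pattern (the temp table #NewOrders is hypothetical): allocate the whole range with one call, then insert all rows in one set-based statement.

```sql
DECLARE @firstId BIGINT, @count INT;

-- determine how many rows are about to be inserted
SELECT @count = COUNT(*) FROM #NewOrders;

-- reserve the complete ID range with a single ID-table access
EXECUTE AllocateIds @firstId OUTPUT, 'Orders', @count;

-- insert all rows in one set-based operation
INSERT INTO Orders (Id, OrderDate)
SELECT @firstId + ROW_NUMBER() OVER (ORDER BY OrderDate) - 1
   ,OrderDate
FROM #NewOrders;
```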

One thing that changes when working with ID tables: IDs for single-row inserts should now be provided by the client. Why? Because of the missing ability to cache IDs in T-SQL. Where we usually had insert procedures like this...
CREATE PROCEDURE InsertOrder
   @id INT OUTPUT
   ,@orderDate DATETIME2
AS
   INSERT INTO Orders (OrderDate)
      VALUES (@orderDate);
   SELECT @id = SCOPE_IDENTITY();
... we would have to act like this...
CREATE PROCEDURE InsertOrder
   @id INT OUTPUT
   ,@orderDate DATETIME2
AS
   EXECUTE AllocateIds @id OUTPUT, 'Orders', 1;
   INSERT INTO Orders (Id, OrderDate)
      VALUES (@id, @orderDate);
... but this would cause huge traffic on our ID table. So it is usually more productive to get the IDs from the client, which can easily handle the caching.
CREATE PROCEDURE InsertOrder
   @id INT
   ,@orderDate DATETIME2
AS
   INSERT INTO Orders (Id, OrderDate)
      VALUES (@id, @orderDate);
Some might say that they don't want to put clients in charge of generating the new primary key values. Though, SQL Server still enforces uniqueness for you ;-), and the price to pay is quite cheap compared to the other option.

Stateless Clients

Stateless clients have almost the same issues as T-SQL procedures and scripts. Due to their statelessness, they are usually not able to cache their IDs. The only additional option we have with stateless clients is to move the ID caching into an external resource like a WCF service. If a system has several stateful and several stateless components, it is an option to keep the stateful applications fast by providing a separate ID cache for those that are stateless.
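The contract of such a service could be as small as a single operation. This is only a sketch (the interface name and semantics are my assumptions), mirroring the AllocateIds stored procedure shown earlier:

```csharp
using System.ServiceModel;

// Hypothetical contract for a central ID allocation service.
[ServiceContract]
public interface IIdAllocationService {
   // Returns the first ID of a newly reserved, exclusive range
   // [result, result + count - 1] for the given table.
   [OperationContract]
   long AllocateIds(string tableName, int count);
}
```

An implementation could simply call the AllocateIds procedure, and the service itself may additionally cache ranges to reduce round trips to the ID table.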

For sure, you could consume this service from T-SQL too (quite simple with SQLCLR, as I showed here), but using SQLCLR to access web services is still rarely done, and the overhead is usually far larger than directly accessing the ID table. So the service should only be used from the client side, keeping SQL out of the ID generation.

Friday, July 9, 2010

Defend Your Code

Ever explained a current bug or dirty implementation with one of those sentences?
  • "The (project) manager/marketer said we have to meet the deadline."
  • "The (project) manager/product owner said a quick and dirty solution is okay for now."
  • "XYZ said we NEED this new feature, no matter how it works behind the scenes."
Never? Well, I did. Not recently, but I did.

Who was responsible for the problem? Me. I prioritized non-code requirements over system quality. I did not write the unit test which would have found the problem. I did not invest the time in refactoring to ensure a clean implementation. You might argue that somebody else made the decision. Nevertheless, it's my, and only my, task to keep the system clean, free of bugs and maintainable.

I see software projects like a play, and we, all the stakeholders, are the actors/actresses. The part of the marketer is to push for new features. The part of the project manager, managers and product owners is to obsess over the schedule. And it's our part to defend the code!

What about "If I didn't implement this feature in time, I'd get fired"? Probably not. Most managers don't want buggy software or software that is hard to maintain for future requirements, even if they don't say so very clearly. Managers, and all other non-developing stakeholders, just cannot rate the importance of one or more unit tests and have no clue about refactoring (even if they say they do).

If I were an electrician, the building owner and the lead worker would always insist on the deadline. However, if there were no power when the building is done, I'd be the one whose neck is on the line; not the building owner, not the lead worker. It would be their part to push forward, and it would be my part to make (and keep!) it working.

Wednesday, July 7, 2010

We Are Authors!

I just read this statement in Robert C. Martin's book Clean Code, where he primarily wrote about how to write code. However, I see it as a more general thing.

We are all authors, and we are writing almost all day. Everything we write will be read by somebody, now or in months, so we really should think about the reader while we are writing. Reading can make people annoyed with the author or grateful to him/her, and we should always aim for the latter.

Source Code

Did you ever see a so-called "very flexible" solution like this?
// get data returns all new users to be created
object[] rows = GetData();
foreach (object[] row in rows) {


    // ix 0 represents the new user id
    // ix 1 is not needed
    // ix 2 is the name of the new user
    int i = (int)row[0];
    string n = (string)row[2];
    CreateUser(i, n);
}
Imagine the author is sitting right next to you and you are about to add some new features. He tells you that this approach, using untyped object arrays, is very flexible because you don't have to change the interface when something new comes in.

I call this completely inflexible. It is nearly impossible to change anything in the user structure, since it is almost impossible to find the places where the structure is used.

How should I know that GetData returns users? Does it always return users? Why an array of object as rows, instead of an ICollection&lt;T&gt; that lets the reader know what's inside the collection? Why another array of object as a row, instead of a strongly typed NewUser type? Why all the noise comments, instead of self-describing variables? What about "ix 1 is not needed"? Is it always empty for now? Is it reserved for future use? Am I just not allowed to look at it? Why the two empty lines at the beginning of the loop?

How about this?
ICollection<NewUser> newUsers = GetNewUsers();
foreach (NewUser user in newUsers) {
    CreateUser(user.Id, user.UserName);
}
This code does exactly the same as the previous one, but we don't need any comments. The reader immediately knows what's going on, and we are able to refactor NewUser whenever we want, because we can find all the places where it is used.
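The post doesn't show NewUser or GetNewUsers; a minimal sketch of what they might look like (names and shapes are assumptions):

```csharp
using System.Collections.Generic;

// Hypothetical strongly typed replacement for the object[] rows.
public class NewUser {
   public int Id { get; set; }
   public string UserName { get; set; }
}

public interface IUserSource {
   // Maps whatever GetData read (database rows, file lines, ...)
   // into NewUser instances in one single, findable place.
   ICollection<NewUser> GetNewUsers();
}
```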

If I ask myself who will ever read my code: the most faithful reader of my code is me. When we have to change an implementation we wrote days, months or years before, we start by reading the existing code. We jump back and forth to find out what happens, which objects are used and how they are composed to hit the target.

Writing clean source code means to be kind to your readers; especially you.

T-SQL

For sure, T-SQL is source code too. However, it felt important to me to give it its own section. Why? Because the ugliest code I've ever seen was written in T-SQL.

Ever seen a procedure/script like this?
insert Into TestData100K SELECT Top(10) id, BitVal, FLOATVAL from TestData10K WHERE 
ID > 99
and intval != 1
Unfortunately (yes, unfortunately) T-SQL is not case sensitive, so the upper/lower casing of statements, or even of words within statements, often depends only on the occasional twitches of the writer's little finger. What about line breaks? What about indentation?

How about this?
INSERT INTO TestData100K
SELECT TOP(10) 
   Id
   ,BitVal
   ,FloatVal
FROM TestData10K 
WHERE ID > 99
   AND IntVal != 1;
I'd never claim that my way of writing T-SQL is the best, but you can be sure that all my T-SQL looks like this. It doesn't matter whether you prefer upper-case or lower-case keywords like SELECT; just make your decision and stick to it. Decide how and what to indent and where to place the commas.

The worst T-SQL I've seen was mostly written by non-database developers who "had to do some ugly T-SQL". T-SQL, like other programming languages, is always as ugly as we write it. (Well, T-SQL provides more features for writing bad code than most other languages.) So it's up to us to write pretty procedures.

Emails

Emails? Why bother? In my opinion, emails are a kind of comedown of human writing. In the good old days, when we wrote letters on paper, we all knew some formatting rules, and we followed them. Today, with emails, many people seem to have forgotten all kinds of formatting.

Did you ever get a mail like this?
Subject: todo
Content:
Please implement NewOrder till friday. WE NEED TO GO ONLINE ON MONDAY!!!
thx

What does this subject tell me, except "somebody has to do something"? Who is responsible for implementing the new feature? I have to look at the email header to see whether I am the only person among the "TO" recipients. If there is more than one "TO" recipient, the recipients have to guess who is responsible. Why capitalize the production schedule? Capitalized text always feels like screaming. Does it mean I might have to work on the weekend to meet the deadline? Why those three exclamation marks? Do they mean I have to stay on Friday, even if it becomes Saturday, until I'm done? And after this harsh task assignment, a sloppy "thx"? Are you kidding me?

How about this?
Subject: Feature NewOrder (until Friday)
Content:
Hi Flo

Please implement the “NewOrder” until next Friday. It's important that we meet the deadline due to a presentation on Monday. Please let me know if you need some additional resources.

Thanks
John
Sure, this is much more text to write and read. However, I'm not a machine, and I don't need (or want) my tasks optimized for fast processing (reading). It's a kind of respect to invest some time in assigning a task in a graceful form.

I don't mean emails to a friend asking her out for lunch tomorrow. I tend to write those sloppily, because the content is not really important and I'd probably even speak in slang if I were standing in front of her.

Conclusion

I'd really appreciate it if more people would feel like an author whenever they write. Keeping the reader in mind whenever we write is a kind of respect that is always worth the effort.

Though, sometimes I wish I'd follow this commitment more consistently than I actually do. However, I know I'm not perfect, I know (at least some of) my personal issues, and I'm still working on them.

Tuesday, July 6, 2010

Table-Valued Parameters

This post was initiated by Ken Simmons (Blog | Twitter). He asked people to write a blog post describing their favorite feature of SQL Server. Ken's post here is not only interesting due to the MSDN package :-) but also got me thinking about what I have changed in my database solutions since SQL Server 2008.

For sure, this is not my first post about the SQL Server 2008 feature User-Defined Table-Valued types. I wouldn't call them the greatest feature of SQL Server in general; however, they are one of the greatest new features in version 2008.

Bulk Load by Ids

Often it is required to load many rows of one table by their IDs or any other attribute. One example is reloading already known data.

One solution is to send one query for each item to be (re-)loaded. However, this causes many small queries, stressing server and network.

Another approach is to create one dynamic SQL query containing an IN statement for all IDs to be reloaded. This, though, has several issues. Dynamic creation of SQL statements is always a potential open door for SQL injection. IN statements make it impossible for SQL Server to reuse existing execution plans. IN statements with many items require many resources on the server side and become slow. And this works only if there is one unique column to filter on.

The best solution before SQL Server 2008, in my opinion, was to create a concatenated string of all IDs to load, send the string to the server, split it into a temp table and join this table to the real table to return the requested rows. I wrote a huge post about SQL split functions here. Though, as cool as this approach is, it remains a bit of a hack. When working with more than one column, it becomes awkward, since you have to handle hierarchical data (lines and items) within the string.

User-defined table-valued types are a new and clean way to bulk load rows from the database. Let's start with the definition of a new type.
CREATE TYPE IntIdTable AS TABLE (
   Id INT NOT NULL PRIMARY KEY CLUSTERED
)
Now we are able to write a stored procedure which takes this type as a parameter. (I don't show the referenced table "TestData100K" because it doesn't really matter; among other columns, it contains one called "Id". And please, no bashing for the "SELECT *"; it's a sample.)
CREATE PROCEDURE GetTestDataByIds
   @ids IntIdTable READONLY
AS
   SELECT td.*
   FROM TestData100K td
      JOIN @ids ids ON td.Id = ids.Id;

Okay, we're done on server side, let's switch to the client.
The first thing we need in order to address user-defined table types is a class implementing IEnumerable&lt;SqlDataRecord&gt;. (Another way would be a DataTable, but I don't like this class due to its huge overhead.)
class IntRecordEnumerator : IEnumerable<SqlDataRecord> {
   public IntRecordEnumerator(IEnumerable<int> values) {
      _values = values;
   }

   IEnumerable<int> _values;

   public IEnumerator<SqlDataRecord> GetEnumerator() {
      SqlMetaData metaData = new SqlMetaData("Id", SqlDbType.Int);

      foreach (int id in _values) {
         SqlDataRecord record = new SqlDataRecord(metaData);
         record.SetInt32(0, id);
         yield return record;
      }
   }

   System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator() {
      return this.GetEnumerator();
   }
}
Now we can use this class as the value of a SqlParameter. (Notice that the parameter requires SqlDbType.Structured, and TypeName must be set to the name of the table type.)
[TestMethod]
public void SearchIds() {
   string sql = "GetTestDataByIds";
   using (SqlConnection connection = CreateOpenConnection())
   using (SqlCommand command = new SqlCommand(sql, connection)) {
      command.CommandType = CommandType.StoredProcedure;
      SqlParameter param = command.Parameters.Add("@Ids", SqlDbType.Structured);
      param.TypeName = "IntIdTable";

      int[] idsToLoad = new int[] { 1, 2, 3, 4, 5 };
      param.Value = new IntRecordEnumerator(idsToLoad);

      using (SqlDataReader reader = command.ExecuteReader()) {
         while (reader.Read()) {
            int id = (int)reader["Id"];
            Console.Write(id);
            Assert.IsTrue(idsToLoad.Contains(id));
         }
      }
   }
}

Streaming Data To The Server

In distributed systems it is important to synchronize data between different databases.

If it is possible to use database features like BCP or SSIS, they are often the best solution. However, sometimes additional work containing business logic is required, and those solutions would force a duplication of business logic into the database. In addition, if there is data to be updated (not only inserted), the data has to be packed into a staging table before it can be handled by a procedure. Staging tables are a nice temporary storage but can become tricky when there are concurrent synchronization jobs.

The classic and, as far as I know, still most common solution is to use single INSERT statements to load bulk data row by row. As noted above, this becomes a little stress test for the database server and the network.

Another, slightly esoteric, approach is to use SQL Server 2005's XML functionality to create one huge XML package containing all data to be written to the database server. The problem is that XML is quite a chatty format and causes huge memory usage on both client and server.

Here, User-Defined Table-Valued types can be used to push all data in one step into a procedure. Business logic can be applied while streaming through all data to be synchronized. The procedure finally updates and inserts the data into their destination table. I first wrote about this approach here.

Again, we start with a table type.
CREATE TYPE ImportSampleTable AS TABLE (
   FirstName VARCHAR(30)
   ,LastName VARCHAR(30)
);
A simple procedure gets the type as a parameter and inserts the rows into a table.
CREATE PROCEDURE ImportSampleFile
   @importData ImportSampleTable READONLY
AS
   INSERT INTO ImportSample
   SELECT *
   FROM @importData;

Heading over to the client...
Here is an implementation of IEnumerable<SqlDataRecord> that streams the file line by line.
class FileToLoad : IEnumerable<SqlDataRecord> {
   public FileToLoad(string fileName, SqlConnection connection) {
      _fileName = fileName;
      _connection = connection;
   }

   string _fileName;
   SqlConnection _connection;
   SqlMetaData[] _metaData;

   public IEnumerator<SqlDataRecord> GetEnumerator() {
      CreateMetaData();

      using (StreamReader reader = new StreamReader(_fileName)) {
         while (!reader.EndOfStream) {
            SqlDataRecord record = CreateRecord(reader);
            yield return record;
         }
      }
   }

   private void CreateMetaData() {
      _metaData = new SqlMetaData[] {
         new SqlMetaData("FirstName", SqlDbType.VarChar, 30),
         new SqlMetaData("LastName", SqlDbType.VarChar, 30),
      };
   }

   private SqlDataRecord CreateRecord(StreamReader reader) {
      string line = reader.ReadLine();
      SqlDataRecord record = new SqlDataRecord(_metaData);
      string[] lineItems = line.Split('\t');
      record.SetString(0, lineItems[0]);
      record.SetString(1, lineItems[1]);
      return record;
   }

   System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator() {
      return this.GetEnumerator();
   }
}
And again, we are ready to use the table-valued type as a parameter in another test.
[TestMethod]
public void StreamingLoad() {
   using (SqlConnection connection = CreateOpenConnection())
   using (SqlCommand command = new SqlCommand("ImportSampleFile", connection)) {
      command.CommandType = CommandType.StoredProcedure;
      SqlParameter param = command.Parameters.Add("@importData", SqlDbType.Structured);
      param.TypeName = "ImportSampleTable";
      // get a sample file
      string fileName = GetType().Assembly.Location;
      fileName = Path.GetDirectoryName(fileName);
      fileName = Path.Combine(fileName, "FileToLoad.txt");
      
      param.Value = new FileToLoad(fileName, connection);
      command.ExecuteNonQuery();
   }
}

Conclusion

User-defined table-valued types are a nice new feature of SQL Server 2008! I did not find many reasons to use them in plain T-SQL, but they provide native support for set-based solutions between client and server, and that is a really great enhancement of SQL Server.

Friday, June 11, 2010

Centralized Error Handling With Lambda

For many reasons it is often useful, and important, to catch exceptions in the UI or other boundary classes like web services. The most important reasons are usually security, to avoid showing the details of an internal system error to the client, and traceability, to log the error for later investigation.

The problem is that exception handling is quite complicated code and often highly redundant, which makes it awkward to keep in sync.

The Common Approach For Exception Handling

As a simple sample, suppose we are developing a web service method that returns a Country object from data storage by a specified ISO alpha-3 code.
[OperationContract]
public Country GetByIsoAlpha3_Simple(string alpha3)
{
    Logger.Trace("Entered GetByIsoAlpha3");
    try
    {
        ThrowHelper.CheckNullArg(alpha3);
        CountrySource source = new CountrySource();
        return source.GetByIsoAlpha3(alpha3);
    }
    catch (ArgumentException ex)
    {
        throw new FaultException(ex.Message);
    }
    catch (DbException ex)
    {
        Logger.LogEx(ex);
        throw new FaultException("There was an error with the data storage.");
    }
    catch (Exception ex)
    {
        Logger.LogEx(ex);
        throw new FaultException("An internal error occurred.");
    }
    finally
    {
        Logger.Trace("Leaving GetByIsoAlpha3");
    }
}

As we can see, the real functionality is only two lines of code; the rest of the 25 lines is standard exception handling and input validation.

The problem with this method is that it is easy to forget one of the required standard exception handling tasks. Does it handle all standard exceptions we usually handle? Might it be useful to log the ArgumentException? Do we really want to tell the caller that we had an error with the data storage? How can we ensure that all web methods do the same standard exception handling?

Extract The Exception Handling

A good way to ensure consistent standard exception handling is to move it out into a static helper class.
static class WsHelper
{
    public static T ExecuteWebMethod<T>(string methodName, Func<T> method)
    {
        Logger.Trace("Entered " + methodName);
        try
        {
            return method();
        }
        catch (ArgumentException ex)
        {
            throw new FaultException(ex.Message);
        }
        catch (DbException ex)
        {
            Logger.LogEx(ex);
            throw new FaultException("There was an error with the data storage.");
        }
        catch (Exception ex)
        {
            Logger.LogEx(ex);
            throw new FaultException("An internal error occurred.");
        }
        finally
        {
            Logger.Trace("Leaving " + methodName);
        }
    }
}

Using Lambda Expressions To Utilize The Helper

Now, we can utilize the ExecuteWebMethod method with a simple Lambda expression.
[OperationContract]
public Country GetByIsoAlpha3_Lambda(string alpha3)
{
    return WsHelper.ExecuteWebMethod("GetByIsoAlpha3_Lambda", () =>
    {
        ThrowHelper.CheckNullArg(alpha3);
        CountrySource source = new CountrySource();
        return source.GetByIsoAlpha3(alpha3);
    });
}
As we can see, the code of our web method is reduced to six lines: three lines for our method-specific work, two lines with braces, and one line calling the helper method, which receives a Func<T> that represents the implementation of the web method.

Possible Extensions

The exceptions handled in this sample are only exemplary. If you have other common standard exceptions, like a ValidationException or a general BusinessException, feel free to extend ExecuteWebMethod accordingly.
You surely noticed the methodName parameter provided to the ExecuteWebMethod operation. To get rid of this parameter, we can use the StackTrace class to determine the calling method. For web services and other boundaries this is a good approach. Nevertheless, be careful with StackTrace in highly utilized methods, since stack trace evaluation is an expensive task.
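A minimal sketch of that idea, with Console standing in for the Logger class and a shortened try/finally; the class name is illustrative, and note that JIT inlining may cause GetFrame(1) to skip the direct caller:

```csharp
using System;
using System.Diagnostics;

static class TracingWsHelper
{
    public static T ExecuteWebMethod<T>(Func<T> method)
    {
        // frame 0 is ExecuteWebMethod itself; frame 1 is the calling web method
        string methodName = new StackTrace().GetFrame(1).GetMethod().Name;
        Console.WriteLine("Entered " + methodName);
        try
        {
            return method();
        }
        finally
        {
            Console.WriteLine("Leaving " + methodName);
        }
    }
}
```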

Restrictions

The helper method should only be responsible for standard exception handling. Don't try to move use-case-specific exception handling into the external helper method.

Conclusion

Lambda expressions are generally a powerful feature for injecting external functionality into standard methods. This blog post showed only one of many ways to decorate a specific implementation with additional functionality.

Sunday, May 30, 2010

Streaming Serialization with XmlSerializer

As is well known, .NET provides several different ways to serialize and de-serialize objects. One of those serialization techniques is the XmlSerializer class. Generally, this class provides a nice, straightforward approach. However, there is one problem with this (and most other) serialization classes: it does not support streaming. If you want to (de-)serialize really large numbers of objects, you might run into physical memory problems.

One solution is to implement the IXmlSerializable interface, which exposes the ReadXml(XmlReader) and WriteXml(XmlWriter) operations. However, this ends in quite a bunch of complicated code.
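To illustrate, a minimal IXmlSerializable skeleton for a hypothetical Person class might look like the following; even this sketch omits the element navigation and error handling a robust implementation needs:

```csharp
using System.Xml;
using System.Xml.Schema;
using System.Xml.Serialization;

public class Person : IXmlSerializable
{
    public int Id { get; set; }
    public string Name { get; set; }

    // XmlSerializer ignores this method; returning null is the convention
    public XmlSchema GetSchema() { return null; }

    public void WriteXml(XmlWriter writer)
    {
        // the serializer has already opened the wrapper element
        writer.WriteAttributeString("Id", Id.ToString());
        writer.WriteElementString("Name", Name);
    }

    public void ReadXml(XmlReader reader)
    {
        // reader is positioned on the wrapper element
        Id = int.Parse(reader.GetAttribute("Id"));
        reader.ReadStartElement();
        Name = reader.ReadElementContentAsString();
        reader.ReadEndElement();
    }
}
```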

To handle the memory problem when using the XmlSerializer class, here are two simple wrapper classes which provide streaming functionality.

XmlStreamingSerializer

The XmlStreamingSerializer creates an internal instance of an XmlSerializer and uses an XmlWriter for the persistence management. To avoid repeating the "xsi" and "xsd" namespaces for each object to be serialized, it simply provides an empty XmlSerializerNamespaces instance. I found that trick in one of the comments on Scott Hanselman's blog post about XmlFragmentWriter.
public class XmlStreamingSerializer<T> {
   // ------------------------------------------------------------------
   static XmlStreamingSerializer() {
      _ns = new XmlSerializerNamespaces();
      _ns.Add("", "");
   }
   // ------------------------------------------------------------------
   private XmlStreamingSerializer() {
      _serializer = new XmlSerializer(typeof(T));
   }
   // ------------------------------------------------------------------
   public XmlStreamingSerializer(TextWriter w)
      : this(XmlWriter.Create(w)) {
   }
   // ------------------------------------------------------------------
   public XmlStreamingSerializer(XmlWriter writer) : this() {
      _writer = writer;
      _writer.WriteStartDocument();
      _writer.WriteStartElement("ArrayOf" + typeof(T).Name);
   }
   // ==================================================================
   static XmlSerializerNamespaces _ns;
   XmlSerializer _serializer;
   XmlWriter _writer;
   bool _finished;
   // ==================================================================
   public void Finish() {
      _writer.WriteEndDocument();
      _writer.Flush();
      _finished = true;
   }
   // ------------------------------------------------------------------
   public void Close() {
      if (!_finished)
         Finish();
      _writer.Close();
   }
   // ------------------------------------------------------------------
   public void Serialize(T item) {
      _serializer.Serialize(_writer, item, _ns);
   }
}

XmlStreamingDeserializer

Like the serializer, the XmlStreamingDeserializer class wraps an instance of the .NET XmlSerializer. It uses an XmlReader to provide the streaming functionality and utilizes the XmlReader.ReadSubtree method to feed the current serialized item into the XmlSerializer.
public class XmlStreamingDeserializer<T> {
   // ------------------------------------------------------------------
   static XmlStreamingDeserializer() {
      _ns = new XmlSerializerNamespaces();
      _ns.Add("", "");
   }
   // ------------------------------------------------------------------
   private XmlStreamingDeserializer() {
      _serializer = new XmlSerializer(typeof(T));
   }
   public XmlStreamingDeserializer(TextReader reader)
      : this(XmlReader.Create(reader)) {
   }
   public XmlStreamingDeserializer(XmlReader reader) : this() {
      _reader = reader;
   }
   // ==================================================================
   static XmlSerializerNamespaces _ns;
   XmlSerializer _serializer;
   XmlReader _reader;
   // ==================================================================
   public void Close() {
      _reader.Close();
   }
   // ------------------------------------------------------------------
   public T Deserialize() {
      while (_reader.Read()) {
         if (_reader.NodeType == XmlNodeType.Element
               && _reader.Depth == 1
               && _reader.Name == typeof(T).Name) {
            XmlReader reader = _reader.ReadSubtree();
            return (T)_serializer.Deserialize(reader);
         }
      }
      return default(T);
   }
}
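Here is a short usage sketch that combines both classes; the Item class and the file name are illustrative:

```csharp
using System;
using System.IO;

public class Item
{
    public int Id { get; set; }
}

class StreamingDemo
{
    static void Main()
    {
        // write items one by one; the full list never has to fit in memory
        using (StreamWriter writer = new StreamWriter("items.xml"))
        {
            XmlStreamingSerializer<Item> serializer =
                new XmlStreamingSerializer<Item>(writer);
            for (int i = 1; i <= 3; i++)
                serializer.Serialize(new Item { Id = i });
            serializer.Close();
        }

        // read them back one by one; Deserialize returns null when done
        using (StreamReader reader = new StreamReader("items.xml"))
        {
            XmlStreamingDeserializer<Item> deserializer =
                new XmlStreamingDeserializer<Item>(reader);
            Item item;
            while ((item = deserializer.Deserialize()) != null)
                Console.WriteLine(item.Id);
        }
    }
}
```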

Some Tests

Here you can find some performance test results.

I serialized a very simple object structure shown below.
public class Foo {
   [XmlAttribute]
   public int Id { get; set; }
   [XmlAttribute]
   public string Bar { get; set; }
   public List<Foo> SubFoos { get; set; }
}
I serialized 10,000 instances, each with 100 sub items. Maybe you don't have to serialize so many objects, but the depth of serialized objects is often far deeper than two levels.
Action                                           Duration (ms)   RAM (MB)
Serialization with XmlSerializer                          2954       134
Serialization with XmlStreamingSerializer                 2391        13
De-serialization with XmlSerializer                       3662       150
De-serialization with XmlStreamingDeserializer            2953        13
These test results are not representative for any purpose. Especially the apparently faster processing of the streaming classes seems a bit curious to me. Nevertheless, the more important result of this test is the far lower memory requirement (about 10% here).

Possible Extensions

Both classes shown above are quite simple, and there are several possible extensions to make them more powerful.
  • One class for both. If you prefer a single class for both serialization and de-serialization, feel free to merge them into one. I preferred the two-class solution to keep each of them more focused.
  • Configurable document element name. Until now, the serializer creates a hard-coded document element name called "ArrayOfMyType". This matches the XmlSerializer behavior when serializing an IEnumerable<T>. Feel free to make this name configurable.
  • XML namespaces. The above serialization classes do not support XML namespaces. If you need namespaces, just add something like a public XmlSerializerNamespaces property.
  • Different objects. XML files often contain more than one type of object. Support for different object types could easily be added with a dictionary of serializers.
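The last extension idea could start from a small registry like the following sketch; all names here are illustrative:

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Serialization;

// Illustrative sketch: one XmlSerializer per supported type,
// resolved by the XML element name while reading.
class SerializerRegistry
{
    readonly Dictionary<string, XmlSerializer> _byElementName =
        new Dictionary<string, XmlSerializer>();

    public void Register(Type type)
    {
        // element name convention: the plain type name,
        // as used by the streaming classes above
        _byElementName[type.Name] = new XmlSerializer(type);
    }

    public XmlSerializer Resolve(string elementName)
    {
        XmlSerializer serializer;
        _byElementName.TryGetValue(elementName, out serializer);
        return serializer; // null if the element name is unknown
    }
}
```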
The intention of this blog post was not to present a fully dressed streaming XML framework. It is a proof of concept for handling large numbers of objects in combination with the .NET XmlSerializer class.

Attachments

Here you can find the XmlStreamingSerializer and XmlStreamingDeserializer classes as well as a simple console application I used for my tests.

Friday, May 14, 2010

Layers, Boundaries And How To Exchange Data

One of the most popular architectural patterns is the Three-Layer design. Unfortunately, this pattern is often confused with the Three-Tier pattern. I follow the definition of authors like Martin Fowler, Thomas Erl and Craig Larman, which says:
  • The Three-Tier pattern describes a software deployment into three physical tiers. A common solution for this pattern is a database server, a web server and a web client (browser).
  • The Three-Layer pattern describes how to split software concerns into three different layers: a data access layer, a business logic layer and a presentation layer. The layers might also be deployed into different tiers, though this is optional.

The below picture shows the three layers and their positions.


The intention of this post is not to describe how to implement the Three-Layer pattern; there are already millions of web resources which do this. This post focuses on the data exchange between the data access layer (DAL) and the business logic layer (BLL).

The Architectural Challenge

Most sources describing the Three-Layer pattern advise separating the DAL and the BLL into two different assemblies. The reason is to avoid mixing different concerns, which is a good thing. This brings up an interesting issue: how do we create BLL objects (like Customer or Order) from the DAL?

Why is this an issue? Because each layer should only communicate with the layer directly below it; a layer should never communicate with a layer above. So the BLL communicates with the DAL, but the DAL never (actively) communicates with the BLL. However, the business objects (aka entities, domain objects, ...) are defined within the BLL, so how can the DAL create objects it does not know?

The following sections describe different solutions for getting data from the DAL into the BLL and back.

DataTables

DataTables are one option to transport data from the DAL to the BLL. The DAL gets the data from the data store, transforms them into a DataTable and returns the table to the calling BLL. The BLL takes the DataTable and transforms the data into business objects. For saving changes, the BLL transforms the business objects back into a DataTable and calls a distinct save method of the DAL.

The advantages of DataTable objects are that they support rich features like complex sorting, searching and data binding. They are serializable and can be used for data exchange over the network.

The disadvantages are that with DataTables (especially when not using typed DataSets) the contained data are not guaranteed to be valid by type. Another issue is the overhead, since DataTable objects are very rich objects. Last but not least, this approach requires a huge amount of quite simple mapping code (= unhappy developer) within the BLL to map a DataTable into business objects and back into a DataTable.

// ====================================================================
// DAL
public DataTable GetCustomers(object criteria) {
   using (IDataReader reader = CreateReader(criteria)) {
      // create results DataTable
      DataTable result = new DataTable();
      result.Columns.Add("Id", typeof(int));
      result.Columns.Add("Name", typeof(string));
      // fill the table
      while (reader.Read()) {
         DataRow row = result.NewRow();
         row["Id"] = reader["Id"];
         row["Name"] = reader["Name"];
         result.Rows.Add(row);
      }

      return result;
   }
}
// --------------------------------------------------------------------
public void SaveCustomers(DataTable data) { /* do save*/ }
// ====================================================================
// BLL
public void DoFoo() {
   // get the DataTable from DAL
   DataTable data = DAL.GetCustomers(null);
   // convert the DataTable into business objects
   IList<Customer> customers = ConvertDataTable(data);
   // --------------------------------------------
   // do business stuff
   // --------------------------------------------
   // convert the business objects into a DataTable to return to DAL
   data = ConvertCustomers(customers);
   DAL.SaveCustomers(data);
}
// --------------------------------------------------------------------
private IList<Customer> ConvertDataTable(DataTable data) {
   // convert DataTable into business objects
   List<Customer> customers = new List<Customer>();
   foreach (var row in data.AsEnumerable()) {
      Customer cust = new Customer();
      cust.Id = (int)row["Id"];
      cust.Name = (string)row["Name"];
      customers.Add(cust);
   }
   return customers;
}
// --------------------------------------------------------------------
private DataTable ConvertCustomers(IEnumerable<Customer> customers) {
   // convert business objects into DataTable
   DataTable data = new DataTable();
   data.Columns.Add("Id", typeof(int));
   data.Columns.Add("Name", typeof(string));
   foreach (var cust in customers) {
      DataRow row = data.NewRow();
      row["Id"] = cust.Id;
      row["Name"] = cust.Name;
      data.Rows.Add(row);
   }
   return data;
}

Reflection

Another way to skin the cat is to use .NET (or Java) reflection features to automate the mapping. The DAL gets the business object types from metadata (like a config file) and handles all the business object creation and field mapping dynamically.

The advantage of using reflection is that there is very little code to be written, and it can be reused across many projects.

A disadvantage of reflection is worse runtime performance. Customizations, like differences between the data source and business object structures, are difficult to implement, and error investigation becomes awkward.

// ====================================================================
// DAL
// the type of the destination business object (usually, from config)
private Type EntityType { get { return typeof(Customer); } }
// --------------------------------------------------------------------
public IList GetCustomers(object criteria) {
   List<object> result = new List<object>();
   using (IDataReader reader = CreateReader(criteria)) {
      while (reader.Read()) {
         // get empty constructor
         object entity = EntityType.GetConstructor(Type.EmptyTypes).Invoke(new object[0]);
         // fill all DB data into object properties (with same name)
         for (int i = 0; i < reader.FieldCount; i++) {
            string name = reader.GetName(i);
            PropertyInfo prop = EntityType.GetProperty(name);
            prop.SetValue(entity, reader.GetValue(i), null);
         }
         result.Add(entity);
      }
   }
   return result;
}
// --------------------------------------------------------------------
public void SaveCustomers(IList data) { /* do save*/ }
// ====================================================================
// BLL
public void DoFoo() {
   // get result from DAL
   IList untyped = DAL.GetCustomers(null);
   // convert into typed list
   List<Customer> customers = untyped.Cast<Customer>().ToList();
   // --------------------------------------------
   // do business stuff
   // --------------------------------------------
   DAL.SaveCustomers(customers);
}

Data Transfer Objects

Data Transfer Objects (DTOs) are a very clean and type-safe solution for data exchange between the layers. They are usually defined within the DAL, filled there, and mapped into the destination business objects in the BLL.

The advantages of DTOs are the strong typing and a good maintainability. Any kind of mapping customization is simple to implement.

The disadvantage is the huge amount of simple mapping code (= unhappy developer) within the BLL to convert received DTOs into business objects and back into DTOs for saving. The best way to avoid writing all the mapping code by hand is some kind of source code generation, like a CASE tool.

// ====================================================================
// DAL
// a customer DTO object for data exchange
public class CustomerDTO {
   public int Id { get; set; }
   public string Name { get; set; }
}
// --------------------------------------------------------------------
public IList<CustomerDTO> GetCustomers(object criteria) {
   // create list of DTOs to exchange result data
   IList<CustomerDTO> result = new List<CustomerDTO>();
   using (IDataReader reader = CreateReader(criteria)) {
      while (reader.Read()) {
         // create DTOs
         CustomerDTO dto = new CustomerDTO();
         dto.Id = (int)reader["Id"];
         dto.Name = (string)reader["Name"];
         result.Add(dto);
      }
   }
   return result;
}
// --------------------------------------------------------------------
public void SaveCustomers(IEnumerable<CustomerDTO> dtos) { /* do save*/ }
// ====================================================================
// BLL
public void DoFoo() {
   // get DTOs
   IList<CustomerDTO> dtos = DAL.GetCustomers(null);
   // convert into business objects
   IList<Customer> customers = ConvertDTOs(dtos);
   // --------------------------------------------
   // do business stuff
   // --------------------------------------------
   // convert back into DTOs to exchange with DAL
   dtos = ConvertCustomers(customers);
   DAL.SaveCustomers(dtos);
}
// --------------------------------------------------------------------
private IList<Customer> ConvertDTOs(IList<CustomerDTO> dtos) {
   // convert DTOs into business objects
   List<Customer> customers = new List<Customer>();
   foreach (var dto in dtos) {
      Customer cust = new Customer();
      cust.Id = dto.Id;
      cust.Name = dto.Name;
      customers.Add(cust);
   }
   return customers;
}
// --------------------------------------------------------------------
private IList<CustomerDTO> ConvertCustomers(IEnumerable<Customer> customers) {
   // convert business objects into DTOs
   IList<CustomerDTO> dtos = new List<CustomerDTO>();
   foreach (var cust in customers) {
      CustomerDTO dto = new CustomerDTO();
      dto.Id = cust.Id;
      dto.Name = cust.Name;
      dtos.Add(dto);
   }
   return dtos;
}

Interfaces as DTOs

A last solution (which is my personal favorite) is to create an additional assembly (usually called Project.Core.dll) which contains only the interfaces that describe the business objects' data properties. This library is referenced from both layers, DAL and BLL. It enables the DAL to assign all data directly to the destination objects, which are implemented within the BLL. To create new object instances, you can use any kind of Dependency Injection (DI) framework, like the Unity application block from Microsoft Enterprise Library (MEL), or something like an IFactory<T> interface, defined in the Core library, which is used as an Abstract Factory and implemented within the BLL. To inject the DAL with the abstract factory, you can use DI or static code.

The advantages of this approach are all the advantages of DTOs. In addition, there is far less source code to write, since no additional mapping into business objects is needed; the DTO role is covered by the interfaces, which are implemented by the business objects.

The disadvantages of this solution are the additional library containing the interfaces and the additional architectural complexity due to the DI framework and/or the factory class(es).

// ====================================================================
// Core library
// interface for Customer to exchange data
public interface ICustomer {
   int Id { get; set; }
   string Name { get; set; }
}
// --------------------------------------------------------------------
// factory interface to create new business objects from DAL
public interface IFactory<T> {
   T Create();
}
// ====================================================================
// DAL
public class CustomerMapper<T> : SamplesMapperBase
    where T : ICustomer {
   // the factory; injected at runtime by BLL or any kind of DI
   public IFactory<T> Factory { get; set; }

   public IList<T> GetCustomers(object criteria) {
      IList<T> result = new List<T>();
      using (IDataReader reader = CreateReader(criteria)) {
         while (reader.Read()) {
            // create business object instance from factory
            T customer = Factory.Create();
            customer.Id = (int)reader["Id"];
            customer.Name = (string)reader["Name"];
            result.Add(customer);
         }
      }
      return result;
   }

   public void SaveCustomers(IList<T> data) { /* do save*/ }
}
// ====================================================================
// BLL
// simple sample implementation for customer factory
class CustomerFactory : IFactory<Customer> {
   public Customer Create() {
      return new Customer();
   }
}
// --------------------------------------------------------------------
// instance of customer mapper. The generic parameter specifies the
// business objects type
public CustomerMapper<Customer> DAL { get; set; }
// --------------------------------------------------------------------
public void DoFoo() {
   // directly get the business objects from DAL
   IList<Customer> customers = DAL.GetCustomers(null);
   // --------------------------------------------
   // do business stuff
   // --------------------------------------------
   DAL.SaveCustomers(customers);
}

Other Solutions

There might be a myriad of other solutions to exchange data between these two layers, like XML or untyped object arrays, though in this post I tried to show the (in my opinion) most popular and efficient ones.

<off topic>It has been a long time since my last post... It's good to find some time for writing.</off topic>

Monday, February 1, 2010

Assert Helper Class

Today I'll share one of my simple but handy helper classes. Unit testing frameworks like Visual Studio Unit Tests or NUnit provide a validation class, usually called Assert, with a large set of static methods for test result validation. Without those frameworks, the standard .NET way for runtime validation in debug sessions is the System.Diagnostics.Debug.Assert(bool) method. Generally, this method works fine, but I prefer the handier methods of the unit testing frameworks, like AreEqual(arg1, arg2) or IsNotNull(arg). Therefore, I created a simple static helper class called Assert which covers most of those test framework methods.

The class does nothing special; it just maps the standard Debug.Assert and Debug.Fail methods to a more sophisticated interface. Like the .NET Debug.Assert method, all methods of the class are marked with System.Diagnostics.ConditionalAttribute and the condition "DEBUG", so all calls in project code are removed when compiling in release mode.

Here's a little sample which shows how to use the class compared with the Debug.Assert method.
// input parameter validation in private methods
Assert.NullArg(someParam, "someParam");
Debug.Assert(someParam != null, "someParam cannot be null");

// not null validation
List<int> list = new List<int>();
Assert.IsNotNull(list);
Debug.Assert(list != null);

// null validation
object nullValue = null;
Assert.IsNull(nullValue);
Debug.Assert(nullValue == null);

// equality validation of reference types (which usually
// don't overwrite the "==" operator)
string s1 = "foo", s2 = "foo";
Assert.AreEqual(s1, s2);
Debug.Assert(
   (s1 == null && s2 == null)
   || (s1 != null && s1.Equals(s2))
   );

// type validation
object o1 = "foo";
Assert.IsOfType(o1, typeof(string));
Debug.Assert(
   o1 != null 
   && typeof(string).IsAssignableFrom(o1.GetType()));

// not equal validation with error text
int num1 = 1, num2 = 2;
Assert.AreNotEqual(num1, num2, "{0} cannot be {1}", num1, num2);
Debug.Assert(num1 != num2,
             string.Format("{0} cannot be {1}", num1, num2));


And here's the complete code of the class.
using System;
using System.Diagnostics;
using System.Globalization;

namespace ClassLibrary1.Diagnostics {
   /// <summary>
   /// Helper class for debug assertions.
   /// </summary>
   /// <remarks>
   /// This class represents a wrapper of 
   /// <see cref="System.Diagnostics.Debug"/> class which provides a 
   /// richer interface and enables a simplified debug assertion.
   /// <p/>
   /// All methods are marked with 
   /// <see cref="System.Diagnostics.ConditionalAttribute"/> with 
   /// "DEBUG" condition string. This causes the method calls to be 
   /// removed from production source code when compiling in "Release" 
   /// mode.
   /// </remarks>
   public static class Assert {
      /// <summary>
      /// Asserts that two specified arguments are equal to each other.
      /// </summary>
      /// <param name="arg1">The first argument to be compared for 
      /// equality.</param>
      /// <param name="arg2">The second argument to be compared for 
      /// equality.</param>
      [Conditional("DEBUG")]
      public static void AreEqual(object arg1, object arg2) {
         if (arg1 == null && arg2 == null)
            return;
         if (arg1 == null) {
            // only one argument is null => not equal
            Debug.Assert(false);
            return;
         }
         Debug.Assert(arg1.Equals(arg2));
      }

      /// <summary>
      /// Asserts two specified argiments to be equal to each other.
      /// </summary>
      /// <param name="arg1">The first argument to be compared for 
      /// equality.</param>
      /// <param name="arg2">The first argument to be compared for 
      /// equality.</param>
      /// <param name="message">The message to be shown when assertion 
      /// fails.</param>
      /// <param name="args">Optional formatting arguments for specified 
      /// message.</param>
      [Conditional("DEBUG")]
      public static void AreEqual(object arg1, object arg2, string message, 
                                  params object[] args) {
         if (arg1 == null && arg2 == null)
            return;
         if (arg1 == null) {
            // arg2 is not null here, so the arguments cannot be equal
            Debug.Assert(false, GetMessage(message, args));
            return;
         }
         Debug.Assert(arg1.Equals(arg2), GetMessage(message, args));
      }

      /// <summary>
      /// Asserts two specified arguments to be equal to each other.
      /// </summary>
      /// <param name="arg1">The first argument to be compared for 
      /// equality.</param>
      /// <param name="arg2">The second argument to be compared for 
      /// equality.</param>
      [Conditional("DEBUG")]
      public static void AreEqual<T>(T arg1, T arg2) {
         if (arg1 == null && arg2 == null)
            return;
         if (arg1 == null) {
            // arg2 is not null here, so the arguments cannot be equal
            Debug.Assert(false);
            return;
         }
         Debug.Assert(arg1.Equals(arg2));
      }

      /// <summary>
      /// Asserts two specified arguments to be equal to each other.
      /// </summary>
      /// <param name="arg1">The first argument to be compared for 
      /// equality.</param>
      /// <param name="arg2">The second argument to be compared for 
      /// equality.</param>
      /// <param name="message">The message to be shown when assertion 
      /// fails.</param>
      /// <param name="args">Optional formatting arguments for specified 
      /// message.</param>
      [Conditional("DEBUG")]
      public static void AreEqual<T>(T arg1, T arg2, string message, 
                                     params object[] args) {
         if (arg1 == null && arg2 == null)
            return;
         if (arg1 == null) {
            // arg2 is not null here, so the arguments cannot be equal
            Debug.Assert(false, GetMessage(message, args));
            return;
         }
         Debug.Assert(arg1.Equals(arg2), GetMessage(message, args));
      }

      /// <summary>
      /// Asserts two specified arguments not to be equal to each other.
      /// </summary>
      /// <param name="arg1">The first argument to be compared for not 
      /// being equal.</param>
      /// <param name="arg2">The second argument to be compared for not 
      /// being equal.</param>
      [Conditional("DEBUG")]
      public static void AreNotEqual(object arg1, object arg2) {
         if (arg1 == null && arg2 == null)
            Debug.Assert(false);
         if (arg1 == null)
            return;
         Debug.Assert(!arg1.Equals(arg2));
      }

      /// <summary>
      /// Asserts two specified arguments not to be equal to each other.
      /// </summary>
      /// <param name="arg1">The first argument to be compared for not 
      /// being equal.</param>
      /// <param name="arg2">The second argument to be compared for not 
      /// being equal.</param>
      /// <param name="message">The message to be shown when assertion 
      /// fails.</param>
      /// <param name="args">Optional formatting arguments for specified 
      /// message.</param>
      [Conditional("DEBUG")]
      public static void AreNotEqual(object arg1, object arg2, string message, 
                                     params object[] args) {
         if (arg1 == null && arg2 == null)
            Debug.Assert(false, GetMessage(message, args));
         if (arg1 == null)
            return;
         Debug.Assert(!arg1.Equals(arg2), GetMessage(message, args));
      }

      /// <summary>
      /// Asserts two specified arguments not to be equal to each other.
      /// </summary>
      /// <param name="arg1">The first argument to be compared for not 
      /// being equal.</param>
      /// <param name="arg2">The second argument to be compared for not 
      /// being equal.</param>
      [Conditional("DEBUG")]
      public static void AreNotEqual<T>(T arg1, T arg2) {
         if (arg1 == null && arg2 == null)
            Debug.Assert(false);
         if (arg1 == null)
            return;
         Debug.Assert(!arg1.Equals(arg2));
      }

      /// <summary>
      /// Asserts two specified arguments not to be equal to each other.
      /// </summary>
      /// <param name="arg1">The first argument to be compared for not 
      /// being equal.</param>
      /// <param name="arg2">The second argument to be compared for not 
      /// being equal.</param>
      /// <param name="message">The message to be shown when assertion 
      /// fails.</param>
      /// <param name="args">Optional formatting arguments for specified 
      /// message.</param>
      [Conditional("DEBUG")]
      public static void AreNotEqual<T>(T arg1, T arg2, string message, 
                                        params object[] args) {
         if (arg1 == null && arg2 == null)
            Debug.Assert(false, GetMessage(message, args));
         if (arg1 == null)
            return;
         Debug.Assert(!arg1.Equals(arg2), GetMessage(message, args));
      }

      /// <summary>
      /// Generates a failing assertion.
      /// </summary>
      [Conditional("DEBUG")]
      public static void Fail() {
         Debug.Assert(false);
      }

      /// <summary>
      /// Generates a failing assertion.
      /// </summary>
      /// <param name="message">The message to be shown when assertion 
      /// fails.</param>
      /// <param name="args">Optional formatting arguments for specified 
      /// message.</param>
      [Conditional("DEBUG")]
      public static void Fail(string message, params object[] args) {
         Debug.Assert(false, GetMessage(message, args));
      }

      /// <summary>
      /// Asserts a specified value to be null.
      /// </summary>
      /// <param name="value">The value to be asserted to be null.
      /// </param>
      [Conditional("DEBUG")]
      public static void IsNull(object value) {
         Debug.Assert(value == null);
      }

      /// <summary>
      /// Asserts a specified value to be null.
      /// </summary>
      /// <param name="value">The value to be asserted to be null.
      /// </param>
      /// <param name="message">The message to be shown when assertion 
      /// fails.</param>
      /// <param name="args">Optional formatting arguments for specified 
      /// message.</param>
      [Conditional("DEBUG")]
      public static void IsNull(object value, string message, 
                                params object[] args) {
         Debug.Assert(value == null, GetMessage(message, args));
      }

      /// <summary>
      /// Asserts a specified value not to be null.
      /// </summary>
      /// <param name="value">The value to be asserted not to be null.
      /// </param>
      [Conditional("DEBUG")]
      public static void IsNotNull(object value) {
         Debug.Assert(value != null);
      }

      /// <summary>
      /// Asserts a specified value not to be null.
      /// </summary>
      /// <param name="value">The value to be asserted not to be null.
      /// </param>
      /// <param name="message">The message to be shown when assertion 
      /// fails.</param>
      /// <param name="args">Optional formatting arguments for 
      /// specified message.</param>
      [Conditional("DEBUG")]
      public static void IsNotNull(object value, string message, 
                                   params object[] args) {
         Debug.Assert(value != null, GetMessage(message, args));
      }

      /// <summary>
      /// Asserts a specified parameter not to be null.
      /// </summary>
      /// <param name="value">The parameter to be asserted not to be null.
      /// </param>
      /// <param name="paramName">The name of the parameter not to be 
      /// null.</param>
      /// <remarks>
      /// The difference between <see cref="NullArg"/> and IsNotNull
      /// is that <see cref="NullArg"/> should be used for parameters 
      /// provided to a method. IsNotNull should be used for other use 
      /// cases like method return values or conditional variable 
      /// initializations.
      /// </remarks>
      [Conditional("DEBUG")]
      public static void NullArg(object value, string paramName) {
         if (value == null)
            Debug.Fail(string.Concat(paramName, " cannot be null"));
      }

      /// <summary>
      /// Asserts a specified value to be assignable by a specified 
      /// type.
      /// </summary>
      /// <param name="value">The value to be asserted.</param>
      /// <param name="expectedType">The type to validate the specified 
      /// value to be assignable from.</param>
      [Conditional("DEBUG")]
      public static void IsOfType(object value, Type expectedType) {
         if (value == null) {
            // a null reference has no type; fail and skip the type check
            Debug.Assert(false);
            return;
         }
         Debug.Assert(expectedType.IsAssignableFrom(value.GetType()));
      }

      /// <summary>
      /// Asserts a specified value to be assignable by a specified type.
      /// </summary>
      /// <param name="value">The value to be asserted.</param>
      /// <param name="expectedType">The type to validate the specified 
      /// value to be assignable from.</param>
      /// <param name="message">The message to be shown when assertion 
      /// fails.</param>
      /// <param name="args">Optional formatting arguments for specified 
      /// message.</param>
      [Conditional("DEBUG")]
      public static void IsOfType(object value, Type expectedType, 
                                  string message, params object[] args) {
         if (value == null) {
            // a null reference has no type; fail and skip the type check
            Debug.Assert(false, GetMessage(message, args));
            return;
         }
         Debug.Assert(expectedType.IsAssignableFrom(value.GetType()), 
                      GetMessage(message, args));
      }

      /// <summary>
      /// Asserts a specified value not to be assignable by a specified 
      /// type.
      /// </summary>
      /// <param name="value">The value to be asserted.</param>
      /// <param name="expectedType">The type to validate the specified 
      /// value not to be assignable from.</param>
      [Conditional("DEBUG")]
      public static void IsNotOfType(object value, Type expectedType) {
         if (value == null) {
            // a null reference has no type; fail and skip the type check
            Debug.Assert(false);
            return;
         }
         Debug.Assert(!expectedType.IsAssignableFrom(value.GetType()));
      }

      /// <summary>
      /// Asserts a specified value not to be assignable by a specified 
      /// type.
      /// </summary>
      /// <param name="value">The value to be asserted.</param>
      /// <param name="expectedType">The type to validate the specified 
      /// value not to be assignable from.</param>
      /// <param name="message">The message to be shown when assertion 
      /// fails.</param>
      /// <param name="args">Optional formatting arguments for specified 
      /// message.</param>
      [Conditional("DEBUG")]
      public static void IsNotOfType(object value, Type expectedType, 
                                     string message, params object[] args) {
         if (value == null) {
            // a null reference has no type; fail and skip the type check
            Debug.Assert(false, GetMessage(message, args));
            return;
         }
         Debug.Assert(!expectedType.IsAssignableFrom(value.GetType()), 
                      GetMessage(message, args));
      }

      /// <summary>
      /// Asserts a target type to be assignable by an expected type.
      /// </summary>
      /// <param name="targetType">The target type to be assignable.
      /// </param>
      /// <param name="expectedType">The expected type to be assignable 
      /// by specified target type.</param>
      [Conditional("DEBUG")]
      public static void IsTypeOf(Type targetType, Type expectedType) {
         Debug.Assert(expectedType.IsAssignableFrom(targetType));
      }

      /// <summary>
      /// Asserts a target type to be assignable by an expected type.
      /// </summary>
      /// <param name="targetType">The target type to be assignable.
      /// </param>
      /// <param name="expectedType">The expected type to be assignable 
      /// by specified target type.</param>
      /// <param name="message">The message to be shown when assertion 
      /// fails.</param>
      /// <param name="args">Optional formatting arguments for specified 
      /// message.</param>
      [Conditional("DEBUG")]
      public static void IsTypeOf(Type targetType, Type expectedType, 
                                  string message, params object[] args) {
         Debug.Assert(expectedType.IsAssignableFrom(targetType), 
                      GetMessage(message, args));
      }

      /// <summary>
      /// Asserts a specified condition to be true.
      /// </summary>
      /// <param name="condition">The condition to be asserted for being 
      /// true.</param>
      [Conditional("DEBUG")]
      public static void IsTrue(bool condition) {
         Debug.Assert(condition);
      }

      /// <summary>
      /// Asserts a specified condition to be true.
      /// </summary>
      /// <param name="condition">The condition to be asserted for being 
      /// true.</param>
      /// <param name="message">The message to be shown when assertion 
      /// fails.</param>
      /// <param name="args">Optional formatting arguments for specified 
      /// message.</param>
      [Conditional("DEBUG")]
      public static void IsTrue(bool condition, string message, 
                                params object[] args) {
         Debug.Assert(condition, GetMessage(message, args));
      }

      /// <summary>
      /// Asserts a specified condition not to be true.
      /// </summary>
      /// <param name="condition">The condition to be asserted for not 
      /// being true.</param>
      [Conditional("DEBUG")]
      public static void IsFalse(bool condition) {
         Debug.Assert(!condition);
      }

      /// <summary>
      /// Asserts a specified condition not to be true.
      /// </summary>
      /// <param name="condition">The condition to be asserted for not 
      /// being true.</param>
      /// <param name="message">The message to be shown when assertion 
      /// fails.</param>
      /// <param name="args">Optional formatting arguments for specified 
      /// message.</param>
      [Conditional("DEBUG")]
      public static void IsFalse(bool condition, string message, 
                                 params object[] args) {
         Debug.Assert(!condition, GetMessage(message, args));
      }

      private static string GetMessage(string message, 
                                       params object[] args) {
         if (args != null && args.Length != 0)
            return string.Format(CultureInfo.InvariantCulture, message, args);
         return message;
      }
   }
}
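To round the class off, here's a short, hypothetical usage sketch (the OrderValidator type and its values are made up for illustration). Since every method carries [Conditional("DEBUG")], these calls compile away entirely in Release builds.

```csharp
using ClassLibrary1.Diagnostics;

namespace ConsoleApplication1 {
   class OrderValidator {
      public decimal ApplyDiscount(decimal netAmount, decimal rate) {
         // guard the inputs - these checks only exist in Debug builds
         Assert.IsTrue(netAmount >= 0m,
                       "net amount {0} cannot be negative", netAmount);
         Assert.IsTrue(rate >= 0m && rate <= 1m,
                       "rate {0} must be between 0 and 1", rate);

         decimal discounted = netAmount - (netAmount * rate);

         // validate the result before handing it back
         Assert.IsFalse(discounted > netAmount,
                        "discounted amount {0} cannot exceed net amount {1}",
                        discounted, netAmount);
         return discounted;
      }
   }
}
```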

Saturday, January 23, 2010

.NET 4.0 Lazy<T> Class

.NET 4.0 and Visual Studio 2010 are raring to go. The current release date is somewhere in May 2010.

One of the new features of the new framework is the Lazy<T> class, which supports lazy initialization of objects.

As I wrote in some of my previous blogs, I'm no great friend of (wrongly used) Lazy Loading. However, there are some use cases where this is a very powerful feature. Therefore I use the phrase "Lazy Initialization" instead of "Lazy Loading" in the Lazy<T> samples below. "Lazy Loading" usually implies automatic loading of reference data in a Data Access Layer. When I speak about "Lazy Initialization", I mean the lazy initialization of some (rarely needed) object properties which require many or unmanaged system resources.

The following sample class shows a common case where lazy initialization becomes a very useful and often required implementation.
sealed class Product1 : IProduct {
   // -------------------------------------------------------------------
   public Product1(string name, Stream picture) {
      Name = name;
      Picture = picture;
   }
   // -------------------------------------------------------------------
   public string Name { get; set; }
   public Stream Picture { get; set; }
}
The "Picture" property of the "Product" class is usually needed very rarely in the whole system, and it works with (probably) unmanaged resources which might cause very high memory allocation. Always initializing the picture of a product, which might only be required in a web front-end or a sales application, would cause huge I/O traffic which is not needed in system areas like billing or call-center order management.

Lazy<T> Class

The Lazy<T> class represents a wrapper class to handle those kind of property initialization.

As a quote from MSDN:
Although you can write your own code to perform lazy initialization, we recommend that you use Lazy instead. Lazy and its related types also support thread-safety and provide a consistent exception propagation policy.
If a system already works with its own custom-implemented lazy initialization, I'd advise staying with it. If lazy initialization is a new requirement, we shouldn't fall into the evil Not Invented Here software project anti-pattern ;-) and should use the existing implementation.

Here's a list of methods and constructors provided by Lazy<T> class (descriptions from MSDN):
  • ctor(Func<T> valueFactory): Initializes a new instance of the Lazy(T) class. When lazy initialization occurs, the specified initialization function is used.
  • ctor(Func<T> valueFactory, bool isThreadSafe): Initializes a new instance of the Lazy(T) class. When lazy initialization occurs, the specified initialization function and initialization mode are used.
  • Value: Gets the lazily initialized value of the current Lazy(T) instance.
  • IsValueCreated: Gets a value that indicates whether lazy initialization has occurred for this Lazy(T) instance.
The Lazy<T> class provides some more methods and constructors, but the above seem to be the most important ones in my opinion.
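As a minimal sketch of these members (the greeting value is made up), the factory delegate runs only on the first access of Value:

```csharp
using System;

class LazySample {
   static void Main() {
      // the factory delegate is not executed here
      Lazy<string> greeting = new Lazy<string>(() => {
         Console.WriteLine("## initializing ##");
         return "Hello, Lazy World";
      });

      Console.WriteLine(greeting.IsValueCreated);  // False - nothing created yet
      Console.WriteLine(greeting.Value);           // triggers the factory once
      Console.WriteLine(greeting.IsValueCreated);  // True
      Console.WriteLine(greeting.Value);           // cached; factory not called again
   }
}
```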

As a side note for isThreadSafe parameter, a second quote from MSDN:
Some Lazy constructor overloads have a Boolean parameter named isThreadSafe that is used to specify whether the Value() property will be accessed from multiple threads. If you intend to access the property from just one thread, pass in false to obtain a modest performance benefit. If you intend to access the property from multiple threads, pass in true to instruct the Lazy instance to correctly handle race conditions in which one thread throws an exception at initialization time. If you use a constructor that does not have the isThreadSafe parameter, the value defaults to true.
If an application never works with multiple threads, passing false for the isThreadSafe parameter gives faster execution because there is no thread-locking overhead.

How to Use

First, the "Product" class needs to be modified to support lazy initialization. I moved the stream creation into an external factory class.
// ======================================================================
sealed class Product2 : IProduct {
   // -------------------------------------------------------------------
   public Product2(string name, string fileName) {
      Name = name;
      _picture = new Lazy<Stream>(() => PictureFactory.GetPicture(fileName), true);
   }
   // -------------------------------------------------------------------
   public string Name { get; set; }
   public Stream Picture {
      get { return _picture.Value; }
      set { _picture = new Lazy<Stream>(() => value, true); }
   }
   Lazy<Stream> _picture;
}
// ======================================================================
static class PictureFactory {
   // -------------------------------------------------------------------
   // get picture from disc
   public static Stream GetPicture(string fileName) {
      Console.WriteLine("## Lazy Initializing Image ##");
      if (string.IsNullOrEmpty(fileName))
         fileName = @"C:\Temp\not_available.jpg";
      return new FileStream(fileName, FileMode.Open, FileAccess.Read);
   }
}
Instead of a direct member-field mapping of the "Picture" property, the "Product" class now works with a Lazy<T> wrapper which initializes the resource only when it is requested. This implementation uses the isThreadSafe parameter with value true since we will reuse it below in a multi-threaded sample.

The "fileName" parameter in the constructor and the factory class that initializes the picture stream are just an example. If no file name is provided, the factory returns the stream of a "not_available.jpg" as the default picture (see the Special Case pattern).

Notice the Console.WriteLine within the factory class. This will show when the data actually becomes initialized.

The following sample creates some products and simulates a rendering of product pictures into console.
// ======================================================================
class Program {
   // -------------------------------------------------------------------
   static void Main(string[] args) {
      Console.WriteLine("Creating products");
      List<IProduct> products = new List<IProduct>{
         new Product2("Lord of the Rings", null),
         new Product2("Avatar", @"C:\Temp\avatar.jpg"),
      };

      Console.WriteLine("Loop through products and print picture");
      products.ForEach(p => ConsolePictureRenderer.PrintPicture(p));
      products.ForEach(p => ConsolePictureRenderer.PrintPicture(p));
      Console.ReadKey();
   }
}
// ======================================================================
static class ConsolePictureRenderer {
   // -------------------------------------------------------------------
   public static void PrintPicture(IProduct product) {
      Console.WriteLine();
      // simulate picture rendering by a simple console output
      Console.WriteLine("Picture of {0}", product.Name);
   }
}

As shown in the console output, the picture data is initialized only when it is first accessed. In addition, the picture factory is called only once for each instance of a product.



Multi Threaded

As written above, the Lazy<T> class represents a thread-safe implementation of lazy initialization. That means the picture factory class does not need to implement custom thread locking; it's automatically handled by the Lazy<T> class.
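For comparison, here's roughly what Lazy<T> saves us from writing by hand — a hedged sketch of a manual, double-checked lazy initialization (this is an illustration, not the actual Lazy<T> implementation):

```csharp
using System;
using System.IO;

sealed class ManualLazyPicture {
   private readonly object _sync = new object();
   private readonly Func<Stream> _factory;
   private Stream _picture;
   private volatile bool _created;

   public ManualLazyPicture(Func<Stream> factory) {
      _factory = factory;
   }

   public Stream Value {
      get {
         // double-checked locking: take the lock only while uninitialized
         if (!_created) {
            lock (_sync) {
               if (!_created) {
                  _picture = _factory();
                  _created = true;
               }
            }
         }
         return _picture;
      }
   }
}
```

All of this boilerplate (plus the exception-propagation handling Lazy<T> adds on top) collapses into a single `new Lazy<Stream>(factory, true)`.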

As a last snippet, we reuse the "Product", "PictureFactory" and "ConsolePictureRenderer" classes from the previous sample and change the "Main" method to access the product pictures in a multi-threaded environment.
// ======================================================================
class Program {
   // -------------------------------------------------------------------
   static void Main(string[] args) {
      Console.WriteLine("Creating products");
      List<IProduct> products = new List<IProduct>{
         new Product2("Lord of the Rings", null),
         new Product2("Avatar", @"C:\Temp\test.jpg"),
      };

      Console.WriteLine("Loop through products and print picture");
      for (int i = 0; i < 4; i++) {
         ThreadPool.QueueUserWorkItem(foo =>
            products.ForEach(p =>
               ConsolePictureRenderer.PrintPicture(p)));
      }
      Console.ReadKey();
   }
}

Now the products and their pictures are accessed by four parallel threads. A second look into the console output shows that all pictures are still loaded only once, since the "Product" class initializes the Lazy<T> class with the isThreadSafe parameter set to true.



Conclusion

The .NET 4.0 Lazy<T> class is a nice helper when working with large and/or unmanaged resources. Especially its thread-safe implementation increases its usefulness in today's multi-processor environments.

Monday, January 18, 2010

Concatenation of Text and Binary Data in T-SQL

Some months ago, I blogged about "String Split Functions" in SQL Server; today I'll write about the inverse of that topic: the concatenation of text and binary data from several rows, using T-SQL (and the available extensions). As we'll see, just like string splitting, there are several ways to skin this cat.

Topics covered in this blog entry:
  • Classical, recursive @variable assignment within an SELECT statement
  • FOR XML
  • A WHILE loop and utilizing the BLOB data types WRITE() method
  • SQLCLR
Topics not covered in this blog entry:
  • Recursive @variable assignment within a CURSOR/WHILE loop. This produces the same memory management as the SELECT-based version, but in a non-set-based manner.
  • SQLCLR as an aggregate function. The reason will be described later.

Test Data
Since we'll speak about performance in the topics below, here are two simple tables containing two different kinds of data to be concatenated.

The below table contains 200 rows with 180,000 characters and bytes per row. This means 36,000,000 characters/bytes will be concatenated.
CREATE TABLE LargeData (
   Id INT NOT NULL IDENTITY(1,1)
      PRIMARY KEY CLUSTERED
   ,Txt VARCHAR(MAX)
   ,Bin VARBINARY(MAX)
);

-- fill the table with 200 rows 180,000 characters per row
WITH Numbers AS (
   SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1)) Num
   FROM master.sys.all_columns
)
INSERT INTO LargeData
   SELECT TOP(200)
      -- text data
      REPLICATE(CONVERT(VARCHAR(MAX), NEWID()), 5000)
      -- binary data
      ,CONVERT(VARBINARY(MAX), REPLICATE(CONVERT(VARCHAR(MAX), NEWID()), 5000))
   FROM Numbers;

As a second kind of source data we'll work with a table containing 50,000 rows but only 16 characters and bytes per row. This means 800,000 characters/bytes to be concatenated.
CREATE TABLE ManyData (
   Id INT NOT NULL IDENTITY(1,1)
      PRIMARY KEY CLUSTERED
   ,Txt VARCHAR(16)
   ,Bin BINARY(16)
);

-- fill the table with 50,000 rows, 16 characters per row
WITH Numbers AS (
   SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1)) Num
   FROM master.sys.all_columns c1, master.sys.all_columns c2
)
INSERT INTO ManyData
   SELECT TOP(50000)
      -- text data
      LEFT(CONVERT(VARCHAR(36), NEWID()), 16)
      -- binary data
      ,CONVERT(VARBINARY(MAX), NEWID())
   FROM Numbers;

For the sake of brevity, I'll provide only one sample statement per topic. You can find the complete SQL script attached at the end of this blog entry.

Classical @variable Assignment
The classical approach to concatenate data from rows into one column (or a variable, in our case) is a SELECT statement with a recursive variable assignment.
DECLARE 
   @txt VARCHAR(MAX) = ''
   ,@bin VARBINARY(MAX) = 0x;

SELECT 
      @txt += TextColumn
      ,@bin += BinaryColumn
   FROM AnyTable;
(The “+=”-Operator is a new feature of SQL Server 2008. If you work on SQL Server 2005, just replace it with a variable reassignment like “@txt = @txt + TextColumn”. The “+=”-Operator does exactly the same; it’s just a simplified way to write T-SQL.)

This is generally a good solution due to three facts: it's easy to use, it's a set-based solution, and it's fast for small amounts of data. On the other hand, this kind of concatenation runs into problems if you have to handle many rows or large data, because all the data is copied several times.

A look into Profiler shows huge resource usage, especially when concatenating many rows. A concatenation of only 800,000 bytes or characters caused over 180,000,000 read and 1,400,000 write operations.



FOR XML
This is a way of tricking another built-in feature of SQL Server 2005 (and later versions). FOR XML is generally made to create XML output from a SELECT statement. However, it can also be used to concatenate data without any XML markup.

I learned this trick on SQLServerCentral.com. One article that shows the usage is Jeff Moden's article about Performance Tuning: Concatenation Functions and Some Tuning Myths. (Special thanks to Bluedog67 for this link :-) .)

Another good resource on the web for T-SQL is Mladen Prajdić's page, where this trick is shown in "SQL Server 2005: Concat values XML Style".

DECLARE @txt VARCHAR(MAX);
SELECT @txt = (SELECT '' + Txt FROM LargeData FOR XML PATH(''));
The text provided within PATH('') normally specifies the name of a parent XML element for the output data. If you provide an empty string, you get the concatenated text of all data returned by the query.

This is a very powerful solution to concatenate text data. Unfortunately, it has to be used carefully if the source data is unknown. Since this method works with XML, functional XML characters like ">" or "&" become escaped (">" turns into "&gt;"). Binary data cannot be concatenated by XML PATH().
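A quick sketch of that escaping caveat, using made-up inline values instead of our test tables:

```sql
DECLARE @txt VARCHAR(MAX);
SELECT @txt = (SELECT '' + Txt
               FROM (SELECT 'a > b' AS Txt
                     UNION ALL SELECT 'c & d') t
               ORDER BY Txt
               FOR XML PATH(''));
-- @txt now contains 'a &gt; bc &amp; d' - the XML characters were escaped
SELECT @txt;
```

If escaped characters are unacceptable, the concatenated result would have to be unescaped afterwards, or another method from this entry used instead.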

A look into Profiler shows very good (i.e., low) resource usage and duration.



WRITE Method
SQL Server 2005 introduced the new BLOB data types VARCHAR(MAX), NVARCHAR(MAX) and VARBINARY(MAX). Along with the previous BLOB data types (TEXT, NTEXT and IMAGE), the old BLOB functions like TEXTPTR or UPDATETEXT were marked as deprecated. As a replacement for those functions, SQL Server 2005 introduced a new method called WRITE(), which is a data-type method, not a usual T-SQL function. This means it does not take a BLOB as a parameter but is called on a BLOB data type, like methods in an object-oriented programming language.

DECLARE 
   @all VARCHAR(MAX) = ''
   ,@buffer VARCHAR(MAX)
   ,@position INT = 0
   ,@len INT
   ,@i INT = 0
   ,@max INT = (SELECT MAX(Id) FROM LargeData);

WHILE (@i < @max)
BEGIN
   -- get the next row value
   SELECT TOP(1)
         @i = Id
         ,@buffer = Txt
         ,@len = LEN(Txt)
      FROM LargeData
      WHERE Id > @i
      ORDER BY Id;

   -- add the text to the existing text
   SET @all.WRITE(@buffer, @position, @len);
   SET @position += @len;
END;
Unfortunately, there seems to be no way (that I know of) to use the WRITE method in a set-based manner. It is possible to use it within an UPDATE statement, but there it is called only once and not for each row of the source data.

A look into Profiler shows that the WRITE method is much faster than the classical approach of variable reassignment. It's two times faster with ten times less resource usage for large data, and it's 60% faster with about four times less resource usage when many rows are concatenated.



However, is there a way to make the WRITE method even faster? As I already wrote above, it can also be used in an UPDATE statement, so it should also be tried with a temp table.
-- Large Binary
DECLARE 
   @all VARBINARY(MAX) = 0x
   ,@i INT = 0
   ,@max INT = (SELECT MAX(Id) FROM LargeData);

DECLARE @buffer TABLE (Bin VARBINARY(MAX));

INSERT INTO @buffer
   SELECT 0x;

WHILE (@i < @max)
BEGIN
   -- get the next row value
   WITH cte AS (
      SELECT TOP(1)
         Id
         ,Bin
      FROM LargeData
      WHERE Id > @i
      ORDER BY Id
   )
   UPDATE b SET
         -- get the next id
         @i = cte.Id
         -- write into temp table
         ,Bin.WRITE(cte.Bin, DATALENGTH(b.Bin), DATALENGTH(cte.Bin))
      FROM @buffer b CROSS JOIN cte
END;
SELECT @all = Bin FROM @buffer;

Sure, the SQL statement is a bit more complicated since we have to use a common table expression to get the next (ORDERED) row of source data, but what about performance?

A look into Profiler shows that execution for many small rows didn't really change. On the other hand, concatenation of large data became about 80% faster with only 30% of the resource usage. So this (more complicated) way to use the WRITE method might be a good investment if large data manipulation is needed.



SQLCLR
Well, if you would never consider enabling CLR in SQL Server, this blog entry ends here :-).

Generally, SQLCLR provides two different ways to perform this concatenation.

The first way would be a user-defined aggregate function which gets the text or binary data to be concatenated. This solution would provide a very handy way to use it:
SELECT CONCATENATE_TEXT(Txt) FROM ManyData;
Unfortunately, this solution would require the resulting assembly to be set to unsafe mode and the server to be configured as TRUSTWORTHY. Why? All SQLCLR functions (procedures, functions, ...) are static methods on a wrapper .NET class, and aggregate functions are called again and again for each row to be processed. Within this function the CLR only gets the current value to be handled, which means it would have to remember the previous data somewhere in a static (shared) variable. Due to concurrent calls, this would require custom thread-locking to store and read that information. I find SQLCLR provides some powerful extensions, but an unsafe assembly tastes bad. Due to this fact, I'll not provide this solution for now. (If anybody is interested in it, please leave a short note and I'll provide it in another blog entry.)

The second way to handle the concatenation is a simple user-defined scalar function that receives the SQL query to be executed. Using the function requires a dynamic SQL approach, but it is inherently thread-safe because it doesn't need any class member variables.

Here's a C# method to concatenate text data. Both methods below need the following namespaces:
using System.Data.SqlClient;
using System.Data.SqlTypes;
using System.IO;
using System.Text;
using Microsoft.SqlServer.Server;

[Microsoft.SqlServer.Server.SqlFunction(
   DataAccess=DataAccessKind.Read,
   IsDeterministic=false,
   IsPrecise=true,
   SystemDataAccess=SystemDataAccessKind.None
   )]
public static SqlChars ConcatText (
   // the sql statement to be executed
   [SqlFacet(IsFixedLength = false, IsNullable = false, MaxSize = -1)]
   SqlString sqlIn
   ) {

   // create a System.Text.StringBuilder to push all text data into.
   StringBuilder buffer = new StringBuilder();

   // connect to database and execute the provided SQL statement
   using (SqlConnection cn = new SqlConnection("Context Connection=true")) {
      cn.Open();
      using (SqlCommand cmd = new SqlCommand(sqlIn.Value, cn))
      using (SqlDataReader reader = cmd.ExecuteReader()) {
         while (reader.Read()) {
            // get the next string value from reader and append it to buffer
            string str = (string)reader[0];
            buffer.Append(str);
         }
      }
   }

   // return the concatenated string
   return new SqlChars(buffer.ToString());
}

Here's a C# method to concatenate binary data.
[Microsoft.SqlServer.Server.SqlFunction(
   DataAccess=DataAccessKind.Read,
   // the function reads from the database, so it must not be marked deterministic
   IsDeterministic=false,
   IsPrecise=true,
   SystemDataAccess=SystemDataAccessKind.None
   )]
public static SqlBytes ConcatBinary(
   // the sql statement to be executed
   [SqlFacet(IsFixedLength = false, IsNullable = false, MaxSize = -1)]
   SqlString sqlIn
   ) {

   // create a System.IO.MemoryStream to push all binary data into
   MemoryStream stream = new MemoryStream();

   // connect to database and execute the provided SQL statement
   using (SqlConnection cn = new SqlConnection("Context Connection=true")) {
      cn.Open();
      using (SqlCommand cmd = new SqlCommand(sqlIn.Value, cn))
      using (SqlDataReader reader = cmd.ExecuteReader()) {
         while (reader.Read()) {
            // get the next binary value from reader and append it to the stream
            byte[] data = (byte[])reader[0];
            stream.Write(data, 0, data.Length);
         }
      }
   }
   // important! reset the stream position to the beginning
   stream.Seek(0, SeekOrigin.Begin);

   return new SqlBytes(stream);
}

From T-SQL, the function is called like any other scalar function.
DECLARE @bin VARBINARY(MAX);
SELECT @bin = dbo.ConcatBinary('SELECT Bin FROM LargeData');
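The text function is called the same way. A minimal sketch, assuming the ManyData table with its Txt column from the example above:

```sql
DECLARE @txt NVARCHAR(MAX);
SELECT @txt = dbo.ConcatText('SELECT Txt FROM ManyData');
```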

Since SQLCLR (.NET in general) does not know ANSI text like VARCHAR, all non-Unicode data has to be translated into Unicode and back into ANSI. Therefore I extended the tests with two additional executions into NVARCHAR variables instead of VARCHAR.
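For illustration, the only difference between the two test variants is the type of the target variable (table and column names as in the example above):

```sql
DECLARE @ansiTxt VARCHAR(MAX);   -- result is converted from Unicode back to ANSI
DECLARE @uniTxt  NVARCHAR(MAX);  -- no conversion needed

SELECT @ansiTxt = dbo.ConcatText('SELECT Txt FROM ManyData');
SELECT @uniTxt  = dbo.ConcatText('SELECT Txt FROM ManyData');
```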

A look into Profiler shows that only the XML transformation is slightly faster than SQLCLR, while SQLCLR comes without any restrictions on special characters and fully supports binary concatenation. In the case of concatenating large text data into an NVARCHAR variable, SQLCLR is even slightly faster than XML.



Conclusion
Classical concatenation by iterative variable assignment is a nice way to handle a few small values. For larger or many values it is not the best solution.

The new WRITE method is a nice extension to SQL Server, especially when working with larger data.

FOR XML PATH('') is a very powerful trick for text data concatenation. Use it (really!), as long as you keep in mind that it is a kind of hack and should be applied carefully.
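As a reminder of the technique, here's a minimal sketch, assuming a ManyData table with a Txt column as in the SQLCLR example; the TYPE/.value() part keeps special characters like '<' and '&' from being entitized:

```sql
DECLARE @txt NVARCHAR(MAX);
SELECT @txt = (
   SELECT Txt AS [text()]
   FROM ManyData
   FOR XML PATH(''), TYPE
   ).value('.', 'NVARCHAR(MAX)');
```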

We all (including me!) know there are only a few really useful applications for SQLCLR. But concatenation of data, like string splitting, is one of them and might be considered if this is a common task in a SQL Server 2005/2008 environment.

Post Scriptum
Thanks again to Bluedog67, who showed me another blog about string concatenation by Anith Sen at:
Concatenating Row Values in Transact-SQL
It also covers the CLR aggregate function.

... didn't know this blog entry before...