It all started with serialization

I am starting this series of posts about software design topics with one of the oldest problems in computer history: serialization.
My intention is primarily to discuss rather general-purpose topics and technologies in software design. I want to compare typical business requirements with the available technologies, and see how close or far these technologies are from meeting those requirements and being good tools for the programmer's job. The goal is a single one: through discussion and sharing, to find out how we can use the tools and technologies at their best, or whether we need to evolve them. So, as a first exercise, I am practicing this pattern on a relatively simple topic.
The number one purpose of computers and servers has been, for a very long time, primarily the storage and retrieval of data. To build databases, you need at least a couple of things: searching and sorting algorithms, and storage of data. The glue that keeps it all together is the transformation between the storage format and the runtime format, each one optimized to suit different needs.

In general, all large applications deal with large amounts of data. In the past, the database was a separate, unique application with the specific goal of storage and search. Modern large software structures include layers that often nest databases inside them. Layers are groupings of related components, separated from other layers by some form of data contract. Large applications can thus be seen, from a certain distance, as boxes (layers) that transform data between different formats: database records, web requests, business objects, UI form fields. Computer programs are usually not “creative” by themselves; computers are not required to be original, funny or anything like that. Keeping this bird's eye view on large software structures makes a common pattern emerge: what goes in comes out. No randomness and no creativity means that the sum of inputs is always equal to the sum of outputs, unless your software is leaky. In my mind, this tells us that the number one concern of software design is exactly this copying of data between different formats, which is also known as serialization. So software is all about serialization; on top of that you add some domain specialists and voilà, you have a software product. Software is the art of copying data. Given that nothing is added or removed while copying, this line of reasoning says that software is mostly a burden. It's a bit like bureaucracy: a lot of forms to fill in, a lot of travelling between different offices. The larger the application gets, the longer the corridors and the more time spent on paperwork. But as with bureaucracy, there are certainly reasons for this.

why do we serialize?

The first, more easily justifiable reason for all this copying around lies in technology limitations. Large software ecosystems are developed with a plethora of different technologies.
Often data has to travel across different processes running on the same machine, or maybe jump between machines. When developers are lucky, they are tasked with the development of a product using one set of tools and one language, but most often software is composed of different parts that don't like to talk to each other in plain simple English. Making JavaScript talk to C++, for example, is not really like calling a function, right? Data coming out of databases is also famous for not being object-oriented friendly. Much has been discussed about the “impedance” or “friction” between databases and OOP.
All these boundaries between machines, processes, languages, databases and object orientation lead to the challenge of finding a way to ease the long journey of the bit from disk to screen. There's always a man in the middle at each of these boundaries, acting as the interpreter. Between databases and object orientation you get the ORM guy (object-relational mapper). Between processes and machines there's RPC, and between C++ and JavaScript there are web services. All these men in the middle use a specific description of the data, in a format which is easy to translate to each end of the communication.
Another instance of the cross-machine boundary lies inside most modern devices, in the form of the boundary between CPU and GPU. Finally, for those of you who know the environment in a large software company, there are boundaries between departments and between rooms. There are walls, firewalls and communication issues. All these are usually reflected in software architectures via the over-layering syndrome and the need for massive serialization infrastructures. Large bureaucratic companies tend to produce serialization-heavy code, I would say.

The second reason for serialization stems from the complexity of large-scale software. I will not discuss the paradoxical aspect of this. Software gets large either because of feature bloat, which is understandable as the customer gets more features, or for endogenous reasons. That is, as a living being, software develops structures to sustain more complexity, and those structures in turn need more scaffolding, and in the end you've built bureaucracy into the software. But going back to the second valid reason for serialization, or transformation of data: different layers in the application focus on different concerns, and thus require a different data format to be built and/or to operate efficiently.
The business logic in an application requires domain-specific models, the storage layer mostly requires fields and storage keys to retrieve data, and the UI needs views (viewmodels in recent jargon) that collect data in a sexy way to be shown on screen. All these transformations are serializations, and the deeper you travel down the layer structure, the more opaque, obscure and inscrutable the serialized models look, that is, the closer they get to the old binary way of doing serialization.

Other reasons include versioning and immutability. Even when monolithic applications store data, without crossing thick layered architectures, it is most often required that data saved with version X will be usable also on version Y. So even sitting in the quiet and safe pond of monolithic desktop applications, your data will have to travel time and space, and thus will not escape the fate of all data: serialization and deserialization. Versioning is definitely a reason to design specific data models. And finally you have immutability. Data often changes during the application lifetime. Taking a snapshot of the data allows processing that crystallized piece of data without consistency concerns. Immutable data is very good for parallel processing, as no access control is required across threads.
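
To make the versioning point concrete, here is a minimal sketch in C# (the contract and its fields are invented for illustration) of a data contract that evolved between versions without breaking old data:

using System.Runtime.Serialization;

[DataContract]
public class CustomerRecord {
    [DataMember]
    public string Name { get; set; }

    // Added in version 2. IsRequired defaults to false anyway; spelling it
    // out documents that XML saved by version 1, which has no Email element,
    // must still deserialize fine here.
    [DataMember(IsRequired = false)]
    public string Email { get; set; }
}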

There are thus good reasons to serialize, or we could say good reasons to write software, which is the same. But as with any engineering feat, the devil lies in the details. What are the devilish details and concerns in this case?

Serialization concerns

Whose job is serialization?

We realized we need to serialize, because we have different data formats. But who's going to take the pain? If A speaks a different language than B, should A or B go to school and learn a new language/format, or etiquette of communication? If you put the question in human terms, the answer becomes immediately apparent. If I need to ask my boss something, should I expect him/her to understand my nerdish technical blabbering, or should I rather explain it in human, boss-readable terms? If one of the parties is clearly leading, the choice of language and format is not really a choice, and the other party adapts and does the translation/serialization. Between busy and creative peers like us, it is instead convenient to find a common language. Since we are peers, we'll both take the time to learn and do our own part of the serialization. Finally, if I need to communicate with a client on the other side of the language spectrum, I'll probably ask for help either from Google or from a human intermediary. The same stories hold for humans and software components.

roundtripping

This is a sneaky issue in serialization. To keep sailing the human-relationships metaphor, it's like me communicating with a native of some paradise island who has never seen bad weather. My message might be foggy, rainy and contain a lot of slippery questions. I might still be able to get the core of the message through, but many details will be lost. Let's assume I have chosen this native as an intermediary to talk to an English speaker in the same time zone as the native. My foggy and rainy message will get translated into a shiny sunny story. When this story reaches the intended end of the communication, that is, the native English speaker, the message will be very different from my original English text. Now, the English speaker will answer my message and attach it for reference to the communication thread. When the thread gets back to me, I will not even be able to recognize the source. This is the roundtrip issue. When serialization formats are not perfectly compatible, some data might be lost in translation. As a more software-specific example, consider this: I am sending more data to the database than it expects. For example, there are more fields in the records I am trying to store. Since I took precautions server-side, the program will not choke; it will happily accept the data but discard the extra fields. Ignorance of the extra fields is excusable, as the database is an older version, so the client should be tolerant enough and accept that it will not be able to query the new fields. But what is not acceptable is the loss of data. Even if the server does not know how to handle the extra unknown data, it should be able to return that extra data untouched.
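
In the .NET data contract world, this roundtrip requirement is covered by the IExtensibleDataObject interface; a minimal sketch (CustomerRecord is a made-up contract):

using System.Runtime.Serialization;

[DataContract]
public class CustomerRecord : IExtensibleDataObject {
    [DataMember]
    public string Name { get; set; }

    // Elements the deserializer does not recognize are parked here and
    // written back out on re-serialization, so an older party returns
    // fields it has never heard of instead of silently dropping them.
    public ExtensionDataObject ExtensionData { get; set; }
}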

polymorphism/extensibility

The roundtrip concern is a special case of extensibility. More generally, serialization meets the challenge of extensibility when the data formats change but one of the two parties is slow to adapt, or cannot adapt, to the new format. As with the roundtrip case, the main requirement is not to lose any piece of data. In case the extension comes in the form of object-oriented inheritance, a nice-to-have feature is also automated support for adding derived classes to the serialization process. Say, for example, two components are communicating via a serialized version of the Fruit class. The Banana and Apple classes derive from Fruit, and since their support was envisioned right from the start, full details about these fruits can be communicated around. If I add Papaya, I would like the system to handle this fruit as well, since it's just another fruit, maybe with an extra property or two, but hey, nothing revolutionary in business terms. At the very minimum, if I store and retrieve a Papaya, I should be able to get back all the data. Maybe I won't be able to search all Papayas in the system, but at least I know that Papayas are safe in the storage room. A good system might also give me the chance of at least treating the Papaya as a Fruit. That is, at least I will be able to use generic fruit-processing methods after serializing/deserializing. Ideally, object-oriented languages should handle this case in very elegant fashion. Object orientation was born as a better way of handling data, so it should also offer nice and flexible object serialization support. This is not always the case in OOP languages and frameworks though.
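
In C# data contract terms, that “envisioned right from the start” support is spelled out with [KnownType] attributes; a minimal sketch of the fruit example:

using System.Runtime.Serialization;

[DataContract]
[KnownType(typeof(Banana))]
[KnownType(typeof(Apple))]
// Papaya is not listed: serializing one through a Fruit-typed reference
// makes DataContractSerializer throw, which is exactly the less-than-elegant
// extensibility story described above.
public class Fruit {
    [DataMember]
    public string Name { get; set; }
}

[DataContract]
public class Banana : Fruit {
    [DataMember]
    public double Curvature { get; set; }
}

[DataContract]
public class Apple : Fruit {
    [DataMember]
    public string Variety { get; set; }
}
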
Now that we have laid out the requirements, the motivations and the potential pitfalls of serialization, let's see how languages tackle the problem and whether we need to do anything on top of native support. From a technical standpoint I will be referring mostly to C#, but conceptually the story is the same in other languages.

Solutions

Fully opaque serialization

Serialization is fully opaque when it's like obscure bureaucracy, that is, when it's impossible to understand what is serialized. It's like me scribbling sticky notes to keep track of software bugs, as opposed to writing down human-readable stories in Bugzilla. For me, the advantage of my obscure stickies is that they are very efficient. Of course, I am the only one who can understand them. Going back to software terms:
PRO: each class can write/read its internals without exposing them to anybody else.
CON: ties serialization into the data model implementation. If there is one single serialization model in the application, it might be fine. Often, though, there are multiple serialization formats: database records vs. JSON for web services vs. some internal file format for disk storage. Since the data models are the contracts between application layers, they should not carry too many technology dependencies, and thus opaque serialization is not practical in cases of growing complexity. On the other hand, if the serialized format does not cross application layers, it might be perfectly fine to serialize using, e.g., binary formats. So, here's an advantage of monolithic applications: the use of opaque serialization simplifies structuring code.
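
As an illustration of the opaque style, here is a minimal sketch using .NET's classic BinaryFormatter (the type and its fields are made up):

using System;
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;

[Serializable]
class CacheEntry {
    public string Key;
    public byte[] Payload;
}

static class OpaqueStore {
    // The resulting bytes are fast to produce and fully private, but they are
    // tied to the exact type layout: only this application, and essentially
    // this version of it, can read them back.
    public static byte[] Save(CacheEntry entry) {
        using (var stream = new MemoryStream()) {
            new BinaryFormatter().Serialize(stream, entry);
            return stream.ToArray();
        }
    }
}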

Aspect-based programming

It is argued that serialization is a cross-cutting concern, and thus is better tackled by aspect-oriented programming.
Serialization done the aspect-oriented way uses attributes. The .NET Framework exposes the DataContractSerializer API to do serialization the aspect way. In order to enable a class for serialization using DataContractSerializer, decorations are required on top of the plain data models.
A class with a single property will change from this:

class A {
    public int SomeProperty { get; set; }
}

to this, when decorated for serialization:

[DataContract]
class A {
    [DataMember]
    public int SomeProperty { get; set; }
}
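
Once decorated, the framework does the reading and writing for us; a quick usage sketch (the ToXml helper is mine, not part of the framework):

using System.IO;
using System.Runtime.Serialization;
using System.Text;

static string ToXml(A instance) {
    var serializer = new DataContractSerializer(typeof(A));
    using (var stream = new MemoryStream()) {
        serializer.WriteObject(stream, instance);
        // yields e.g. <A xmlns="..."><SomeProperty>42</SomeProperty></A>
        // for an instance whose SomeProperty is 42
        return Encoding.UTF8.GetString(stream.ToArray());
    }
}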

PRO: serialization by attributes does not require explicit serialization and deserialization/parsing code to be written. The framework can do most of the work for us. Given that attributes can be inspected via reflection, the framework or our code can extract the “contract” of the data model we are serializing. Standard data contracts, e.g. SOAP ones, can be used to reconstruct the shape of data in a format native to each language that supports that type of protocol.
CON: pollutes the data model with serialization information. The real problem is that complex data models might not be very well suited for serialization. E.g. recursive references, or references to large context objects from within child models, might be convenient for runtime usage. When serializing, though, you don't want to re-serialize large objects referenced from multiple places. In the case of complex models/large data, you might end up creating a set of alternative data models just for serialization purposes, so that your runtime models can still provide the level of usability you expect. That's where “smart” support for serialization from the framework side really ends up serving little purpose. If I write a model just to be serialized, it had better be entirely serialized without any attribute (opt-out none).

Serialization data model

Instead of using the opaque or aspect-oriented serialization support provided by the language, we could opt to create our own model for serialization. We would need to design the following characteristics into it:
– must be agnostic of source or destination formats. This model will always be the man in the middle.
– must be possible to store and retrieve the serialized format using different storage services, like databases, XML files or web services.
PRO: this model covers only the serialization/persistence use cases. No eventing is required, no tracking of changes, as the model is always just a snapshot. Most languages refer to such simple objects as “plain old language objects”: POJOs for Java, POCOs for the C family, etc. The basic implementation is in fact a property bag, or lookup/dictionary of key/value pairs. By carrying explicit information, the serialization data model allows for easy further transformation to other serialization formats. The first example is the transition to the relational database world. A collection of field name/attribute value pairs is basically the description of what a SQL record is, and the role of an ORM (object-relational mapper) such as (N)Hibernate is just to ease this transition between the relational and the object-oriented world. Data access layers such as Microsoft's ADO employ generic property bag objects (records). These serialization models live in the object-oriented world, but in fact look a lot more like old C structures than C++ classes, as they do not encapsulate logic. Having one single well-defined scope is usually a good quality of an object-oriented class, though it essentially means that the encapsulation of logic and data needs to be violated here. In general I think that object-oriented languages require patterns, practices and rules that go well beyond their syntax to be used effectively. It almost feels like object orientation is a way of defining meta-languages which need to be further specified for practical usage. This is one such area: a class can be data, logic, or a combination of both, without any lexical or syntactical difference but with vastly different usage patterns.
CON: the dreaded property bag pattern might lead to a set of problems of its own. Keys need to be maintained, and mappings of fields to databases become areas of concern. If persistence, transfer of data, and querying and sorting data are primary concerns of your application, you are probably already aware that a data layer is where all such generic data-centric behavior will/does reside. A reasonable way of splitting away the data layer of such a structured application is to have data-specific objects on the boundaries of the layers. If data management is a secondary concern in your domain, choosing the “data object approach” will not be easy to justify from a business use case perspective. I have already argued that most interesting applications will have to deal with a lot of data, but before your users are confronted with a lot of data, your app needs to become a successful product generating revenue. On the other hand, adding a data layer and structures after an application (or maybe even a family of applications) has become a success is more difficult than baking a data mindset into the architecture right from the start.
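
Before moving on, a small illustration of how naturally a property bag maps to the relational world mentioned above; a hedged sketch where the table and field names are invented:

using System.Collections.Generic;
using System.Linq;

static class BagToSql {
    // A record is just field names mapped to values, so building the SQL
    // skeleton is a straight walk over the key/value pairs.
    public static string ToInsert(string table, Dictionary<string, object> bag) {
        return string.Format("INSERT INTO {0} ({1}) VALUES ({2})",
            table,
            string.Join(", ", bag.Keys),
            string.Join(", ", bag.Keys.Select(key => "@" + key)));
    }
}

Given a bag with Name and Age keys, BagToSql.ToInsert("People", bag) yields INSERT INTO People (Name, Age) VALUES (@Name, @Age).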

Choices, choices

Choice 1, trying to be opaque, or trying to write as little code as possible

If your only need is truly temporary serialization/deserialization, use the simplest serialization tools provided by the language. In the C# case, I believe this boils down to the choice between XmlSerializer and DataContractSerializer, which will easily serialize your objects to an XML string. The first one does not mandate additional attributes such as [DataMember], and will just serialize everything. The second one requires quite a bit of attribute decoration. Both these tools choke on extensibility though. If you need to serialize derived classes, you will need to use special extensions to the serialization mechanism. In essence, you need to specify all the types that the serializer can expect. If you go down this route, watch out for the extensibility bumps.
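
To show the kind of bump I mean: with XmlSerializer, every derived type has to be announced up front, either as an “extra type” or via [XmlInclude] on the base class. A sketch reusing the fruit types from earlier (and assuming a Papaya class deriving from Fruit):

using System.IO;
using System.Xml.Serialization;

static string SerializeFruit(Fruit fruit) {
    // Forgetting to list a derived type here (or via [XmlInclude]) means an
    // InvalidOperationException the first time one travels through a
    // Fruit-typed reference.
    var serializer = new XmlSerializer(typeof(Fruit),
        new[] { typeof(Banana), typeof(Apple), typeof(Papaya) });
    using (var writer = new StringWriter()) {
        serializer.Serialize(writer, fruit);
        return writer.ToString();
    }
}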

Choice 2, create your own serialization model

With this option you create two classes: one for your data model, and another one for your serialization data model. The serialization-specific data model is typically not used outside data persistence code; the application uses a more meaningful and application-specific representation of the data. This approach is definitely not a DRY one. A data-centric application will simply take the pain and organize around the needs of structured data access. If you can't afford not to be DRY, there's a solution to avoid repeating yourself over and over again: derive the application data models from the serialization models and use some language trickery to ease access to the serialized properties. Purely as an example, I am showing here the implementation of such an approach in C#. What follows is a simplified version of a small configuration framework I wrote. This configuration framework starts from the requirement of generic access to the configuration properties of application modules, easy browsing and sorting, and centralized handling of copies. These common data-centric requirements support the creation of a generic configuration object. On the other hand, individual application modules define their own specific properties in their configuration data.

The implementation

Everything here is based on a generic, easily serializable property bag. What gets serialized in the end is just a dictionary of string keys to object values. Since property-bag style programming is not nice, the base property bag class takes care of converting standard class properties to and from the property bag mappings. The two methods that do the trick are OnSerializing, which reads the properties of the class via reflection, and InitFrom, a sort of copy constructor that fills in the actual property values from the property bag, again via reflection trickery.

using System.Collections.Generic;
using System.Linq;
using System.Reflection;
using System.Runtime.Serialization;

[DataContract]
public class PropertyBag {
    /// <summary>The actual property bag, filled at serialization time</summary>
    [DataMember(Name = "Properties")]
    private Dictionary<string, object> serializedProperties =
        new Dictionary<string, object>();

    public Dictionary<string, object> SerializedProperties {
        get { return serializedProperties; }
    }

    /// <summary>Copy properties from the serialized property bag to the current object</summary>
    public void InitFrom(PropertyBag other) {
        var publicProperties = GetType().GetProperties(
            BindingFlags.Instance | BindingFlags.Public | BindingFlags.FlattenHierarchy
        ).ToDictionary(propInfo => propInfo.Name);
        foreach (var nameValuePair in other.serializedProperties) {
            PropertyInfo clrProp;
            // Keys present in the bag but unknown to this type are simply skipped.
            if (publicProperties.TryGetValue(nameValuePair.Key, out clrProp)) {
                clrProp.SetValue(this, nameValuePair.Value);
            }
        }
    }

    // Called by DataContractSerializer just before writing: snapshots all public
    // properties of the concrete type into the dictionary that gets serialized.
    [OnSerializing]
    internal void OnSerializing(StreamingContext context) {
        var publicProperties = GetType().GetProperties(
            BindingFlags.Instance | BindingFlags.Public | BindingFlags.FlattenHierarchy
        );
        foreach (var property in publicProperties) {
            if (property.Name == "SerializedProperties") continue;
            serializedProperties[property.Name] = property.GetValue(this);
        }
    }
}

The property bag can be serialized to XML, JSON or any other format very easily, as it's not opaque and has a very simple format. Because of this quality, I have kept the actual serialization code outside the class and implemented a PropertyBagSerializer. This way, one can easily swap the XML DataContract serialization style for a more compact one such as JSON. The serializer here tackles another of the tricky aspects of serialization, that is, extensibility through inheritance. All classes derived from PropertyBag are serialized as PropertyBag, while preserving all the data stored in their properties. In the .NET APIs, this is achieved by means of a DataContractResolver that maps all derived types to the base PropertyBag type. Please note that it would be possible to reconstruct the proper type in this custom serializer by means of an additional property in the serialized property bag and by using InitFrom. Left as an exercise…

using System;
using System.IO;
using System.Runtime.Serialization;
using System.Text;
using System.Xml;

public class PropertyBagSerializer {
    private DataContractSerializer serializer;

    public string Serialize(PropertyBag bag) {
        MemoryStream memStream = new MemoryStream();
        Serializer.WriteObject(memStream, bag);
        // DataContractSerializer writes UTF-8 by default, so decode accordingly.
        return Encoding.UTF8.GetString(memStream.GetBuffer(), 0, (int)memStream.Length);
    }

    public PropertyBag Deserialize(string text) {
        MemoryStream memStream = new MemoryStream(Encoding.UTF8.GetBytes(text));
        XmlReader reader = XmlReader.Create(memStream);
        return Serializer.ReadObject(reader) as PropertyBag;
    }

    // Maps every type derived from PropertyBag back to PropertyBag itself, so
    // derived classes roundtrip as plain bags without losing any of their data.
    private class DeserializeAsBaseResolver : DataContractResolver {
        public override bool TryResolveType(Type type, Type declaredType,
                DataContractResolver knownTypeResolver,
                out XmlDictionaryString typeName, out XmlDictionaryString typeNamespace) {
            bool result = true;
            if (typeof(PropertyBag).IsAssignableFrom(type)) {
                XmlDictionary dictionary = new XmlDictionary();
                typeName = dictionary.Add(typeof(PropertyBag).Name);
                typeNamespace = dictionary.Add(typeof(PropertyBag).Namespace);
            } else {
                result = knownTypeResolver.TryResolveType(type, declaredType, null, out typeName, out typeNamespace);
            }
            return result;
        }

        public override Type ResolveName(string typeName, string typeNamespace,
                Type declaredType, DataContractResolver knownTypeResolver) {
            return knownTypeResolver.ResolveName(typeName, typeNamespace, declaredType, null) ?? declaredType;
        }
    }

    private DataContractSerializer Serializer {
        get {
            if (serializer == null) {
                serializer = new DataContractSerializer(
                    typeof(PropertyBag), null, Int32.MaxValue, true, false, null,
                    new DeserializeAsBaseResolver()
                );
            }
            return serializer;
        }
    }
}

Finally, here’s a sample data model that derives from PropertyBag, and code snippets that show how to use the serializer/deserializer.

public class SimpleData : PropertyBag {
    public string Name { get; set; }
    public int Age { get; set; }
}

/// <summary>
/// Interaction logic for MainWindow.xaml
/// </summary>
public partial class MainWindow : Window {
    public MainWindow() {
        InitializeComponent();
    }

    private void OnSerializeClick(object sender, RoutedEventArgs e) {
        SimpleData data = new SimpleData { Name = SourceName.Text, Age = int.Parse(SourceAge.Text) };
        PropertyBagSerializer serializer = new PropertyBagSerializer();
        Serialized.Text = serializer.Serialize(data);
    }

    private void OnDeserializeClick(object sender, RoutedEventArgs e) {
        PropertyBagSerializer serializer = new PropertyBagSerializer();
        var data = serializer.Deserialize(Serialized.Text);

        // generic property access
        TargetName.Text = data.SerializedProperties["Name"].ToString();
        TargetAge.Text = data.SerializedProperties["Age"].ToString();

        // recreate actual type
        SimpleData simpleData = new SimpleData();
        simpleData.InitFrom(data);
        TargetName.Text = simpleData.Name;
        TargetAge.Text = simpleData.Age.ToString();
    }
}

Conclusion

As old as serialization is among computer science problems, the solutions provided by new languages are still evolving. There is no single, ideal, catch-all solution, exactly because software is all about different data formats and transformations. This article explained some of the reasons for this complexity and some of the challenges to watch out for, and proposed one possible solution as an example of a concrete implementation of one set of serialization requirements. The solution addresses flexibility, extensibility and genericity. By paying the performance penalty of reflection, we get much simpler code that does not need to worry about serialization at all. To really conclude, I think that although framework support (and this holds true for C++ even more than .NET) does not completely remove the friction of serialization, by adding some APIs and some structure ourselves we can get close to the ideal solution in many circumstances. In the next posts, we'll keep tackling problems only partially solved by frameworks, and building our own APIs and tools to close the circle.

Hello, world

Here we go, first blog post. This place, in my mind, is where we'll discuss ideas about software design, architecture and tooling. I will pick examples from different technologies to show how close the current state of the art is to where we, software designers, need to be. I have simple, quite generic topics in mind, and also some more niche technology challenges. My goal is to collect the best ideas from different people using different frameworks.

The tools and APIs we use are always the result of a compromise between technological requirements and commercial needs. The software giants who have the resources to build platforms, operating systems and tools try to tie consumers, developers and end users alike, to their ecosystems. Even Google, once a big supporter of open source, is now in a more awkward position due to the popular Android platform. The best world for software engineers would be one where a software service I develop can be consumed by as many people as possible in the simplest possible manner. If it were possible to write once and run anywhere, how great that would be! In practice, though, nobody can drive big initiatives and invest in this direction at this point in history. Java as a desktop framework has failed. It might be considered a success in the mobile space thanks to Google, but it's an “Android version” of Java, and thus surely not intended to run anywhere. Silverlight was a really nice attempt, but it was born at the wrong time, just as the bubble was about to burst and HTML5 and dynamic languages were fully taking off.

Now the evolution of the software environment is much more fragmented. Large platforms are still there, but developers are often confronted with collections of smaller frameworks that need to be put together and glued with a bit of art and craft. Tools are rich, but less consistent, especially in the web space. Even in the more limited desktop application world, it's hard to find a technology we could bet on for more than a few years from now. So we need to be pragmatic and use the best for our software businesses today. As software engineers, though, we are also naturally attracted to the beauty of abstractions, which make attacking large problems possible. It's a universe of micro-apps, but also a world of big data services that keep people together. This blog will ultimately be about the ideal development environment, intended as a combination of tools, APIs and patterns. Ideal, but no less concrete, as we will try to find solutions that get us as close as possible to this IDE.