Hierarchical Data Lookups with Ansible

A new lookup plugin for Ansible that uses Jerakia to do hierarchical data lookups right from your playbook.

Jerakia is an open source, highly flexible data lookup tool for performing hierarchical data lookups from numerous data sources. It’s aim is to be a stand alone tool that can integrate with a wide variety of tools. In this post I will examine how to integrate it with Ansible to be able to perform powerful hierarchical lookups right from your playbook.

Hierarchical lookups

I recently wrote an in-depth look at Hierarchical data lookups. What we mean by hierarchical lookups is essentially a key/value data lookup that traverses a hierarchy of queries until it finds an answer. This enables us to define data at a global level and then override the values at different points of our hierarchy depending on the scope (ansible facts) returned from the node, this model of lookup is particularly suited to infrastructure configuration. The hierarchy is configurable based on whatever is appropriate for your environment, you may for example define variables at the global level and then override them depending on the operational environment, location or role of the requestor.

The difference between this and traditional lookups, or dynamic inventories, is that you do not have to collate and organise the data itself against each host prior to using it. For example if you are overriding a setting based on environment and location you do not have to build a data set that defines which hosts are in which environment or location, that data is already available in the facts that Ansible gathers at runtime, and is used dynamically when performing a lookup against the hierarchy tree.

A simple Ansible playbook

Let’s start with a simple Ansible playbook to manage NTP. I’m not sure why everyone uses NTP as the go-to configuration management example, but who am I to argue with such a well established convention. In this example we’re going create a playbook that manages the servers defined in /etc/ntp.conf.

Here is our playbook:

And our corresponding template:

If I run this, all my hosts get an identical ntp.conf:

Overriding the behaviour

Depending on certain characteristics of the node we are configuring, we may want to configure different NTP servers. In this example I want to retain these defaults but easily override these values depending on the operating environment that a host is in, and further more to be able to further override that for a specific host if I counter a one off edge case.

Ansible facts

For this example, we are going to work with the ansible fact ansible_nodename and a custom fact called environment that is available on the nodes in facts.d;

Ansible lookups

Ansible has several sources of runtime data, the integration with Jerakia for performing hierarchical lookups is currently possible by a lookup plugin (ansible-jerakia). Lookup plugins can be used within Ansible playbooks to call out to an external data source, in this case Jerakia, to populate variables with data. With Jerakia, we pass on information about the node in the lookup request, this information is gathered dynamically from the nodes facts. We refer to this collection of information as the scope. Rather than saying lookup the value of this key, we are now saying Lookup the value of this key in the context of this scope

Configuring Jerakia

I’m going to assume you already have a Jerakia instance running, and for simplicity it is on localhost and the port is default.

When a lookup request for a key is received by Jerakia, it uses one or more lookups contained in a policy to search for the data, policy files are Ruby based and live in the configured policy.d folder. Let’s start with a fairly basic Jerakia policy that defines a global layer and then overrides on environment and the specific hostname of the machine. Something like

Using this configuration, when Jerakia receives a lookup request, it will first check to see if there is any matching data at the hostname level specific to that node, if not it will fall down to the next level and check if there is any matching data in the corresponding environment that the node belongs to before finally falling down to the “global” layer.

Finally, you should create a token so that Ansible can authenticate against the Jerakia API

Integrating Ansible with Jerakia

Once you’ve installed the lookup plugin into the correct location, you should create a jerakia.yaml file at the root of your playbook. In this file we specify the Jerakia server with authentication details, and also pass on the scope values we are going to need to perform lookups.

If you refer back to the Jerakia policy file above, you’ll see that in the scope section of our configuration here we are mapping Ansible facts to keys that will be available in the scope from within the Jerakia policy. Nested fact values can be referenced using the dot-notation as in the above example

Add some data to Jerakia

We’ll start off by adding our default NTP servers to Jerakia, at the global level so it applies to everything. We will create the key servers in the namespace ntp so under the data directory we need a directory called global and a YAML document inside that directory containing the keys within the ntp namespace.

We can now validate that this is correct by performing a Jerakia lookup on the command line

Look up data from the Ansible playbook

Now we have Jerakia running and the lookup plugin in place we can modify our playbook to lookup the value for the NTP servers array from Jerakia rather than hard coding it in the playbook. The Jerakia lookup plugin takes an argument of the namespace and the lookup key separated by a ‘/’.

If we run this playbook now, things should stay as they are, but the value of the NTP servers array is now coming from Jerakia.

Overriding data in the hierarchy

Now if we wish to keep these defaults for NTP servers across our estate, but we want to override these in the production environment we can do this simply by adding more data to the Jerakia hierarchy in the environment level. Jerakia will search for data within the environments directory under the sub-directory corresponding to the environment of the node. It will do this before hitting the global level, so any data defined here will win.

Now we can see on the command line that we’re able to get different results from Jerakia by feeding it a different environment

Now if I re-run my Ansible playbook from before, I should see that servers in the production environment (in this case ansible3) should be configured differently.


In our example we override based on environment and then hostname, but in your organisation this could be completely different, maybe it makes more sense to override on role, location, datacenter..etc. Building the right hierarchy to suit your infrastructure needs is crucially important.

This post touches the surface of Jerakia. It has many more advanced features, such as cascading lookups that traverse through a hierarchy and return a consolidated set of data, Vault integration for secrets management and a very powerful flexible configuration. Also, it’s pluggable so you can use the same interface to retrieve data from a variety of sources, including file based hierarchies, HTTP APIs and databases.

Jerakia is pretty established already but the integration with Ansible is still fairly new, any feedback on how to improve the integration would be greatly appreciated. More information on Jerakia can be found here and the Ansible lookup plugin can be downloaded from GitHub

Follow and share if you liked this

Understanding Hierarchical Data Lookups

An in-depth explanation of hierarchical data lookups and how it relates to infrastructure configuration

In this post I’m going to explain a concept called hierarchical data lookups. That’s a term that could mean different things in different contexts, so to be clear, although this is a pretty general concept I’m going to discuss hierarchical lookups how they relate to Jerakia, an open source data lookup tool. I’ll start by looking at what we mean by hierarchical lookups and then move on to how that is relevant to infrastructure management and other uses.

Defining hierarchical lookups

In essence, Jerakia is a tool that can be used to lookup a key value pair, that is to say, given a key it will give back the appropriate value.  This is certainly nothing special or new, but the crucial difference here is the way in which the data is looked up.  Rather than just querying a flat data source and returning the value for a requested key, when doing a hierarchical lookup we perform multiple queries against a configured hierarchy, transcending down to the next layer in the hierarchy until we find an answer.    The end result is that we can define key value pairs on a global basis but then override them under certain conditions based on the hierarchical resolution by placing that key value pair further up the hierarchy for a particular condition. Let’s look at a fairly simple example, you need to determine from a data lookup what currency you need to bill a user coming into your site.  You already have data that tells you which country and continent that your user is based in.  You determine that to start with you will bill everyone in USD regardless of where they come from.  So you store the key currency with a value of USD in a data source somewhere, and whenever a user starts a transaction, you look up that key, and they get billed in USD.

Now comes the fun part.  You decide that you would like to now start billing customers from European countries in EUR.  Since you already know the continent your user is coming from you could add another key to your data store and then use conditional logic within your code to determine which key to look up, but now we’re adding complexity within the code implementing conditional logic to determine how to resolve the correct value.  This is the very thing that hierarchical lookups aim to solve, to structure the data and perform the lookups in such a way that is transparent to the program requesting the data.

Lets add another layer of complexity, you’ve agreed to use EUR for all users based in Europe, but you must now account for the UK and Switzerland which deal in GBP and CHF respectively, and potentially more.  Now the demands for conditional logic on the program requesting the data are getting more complicated.  To avoid lots of very convoluted conditional logic in your code you could simply map every country in the world to a currency and look up one key, that would be the cleanest method right now.  But remember that we generally want to use USD for everyone and only care about changing this default under certain circumstances.  If we think about this carefully, we have a hierarchy of importance.  The country (in the case of UK or Switzerland), the continent in the case of Europe and then the rest of the world.  This is where a hierarchical lookup simplifies the management of this data.  The hierarchy we need to search here is quite simple;

When we’re dealing with storing data for hierarchical searches, we end up with a tiered hierarchy that looks something like this.  A lookup request to Jerakia contains two things, they key that is being looked up, in this case “currency”, and some data about the context of the request which Jerakia refers to as the scope.  In this instance the scope contains the country and continent of the user. The scope and the key are used together, so when we are talking about hierarchical lookups, rather than just saying “return the value for this key” we are saying “return the value for this key in the context of this scope”.  That’s the main distinction between a normal key value lookup and a hierarchical lookup.  If you’re thinking of ways to do this in a structured query language (SQL) or some other database API, you might be ok to solve this problem – but this is a stripped down example looking up one value, now imagine we throw in tax parameters, shipping costs, and other fun things into the mix – this becomes a complex beast – but not when we think of this as a simple hierarchy.

With a hierarchical data source we can declare a key value pair at the bottom level, in this case Worldwide.  We can set that to USD, at this point any lookup request for the currency will return USD.  But a hierarchical data source allows us to add a different value for the key “currency” at a different level of the hierarchy, for example we can add a value of EUR at the continent level that will only return that value if the continent is Europe.  We can then add separate entries right at the top of the hierarchy for the UK and Switzerland, for requests where the country meets that criteria.

From our program we are still making one lookup request for the data, but that data is looked up using a series of queries behind the scenes to resolve the right data.   Essentially the lookup will trigger up to three queries.  If one query doesn’t return an answer (because there is nothing specific configured in that level of the hierarchy) then it will fall back to the next level, and keep going until it hits an answer, eventually landing at the last level, Worldwide in our example.   So a typical lookup for the currency of a user would be handled as;

What is the value for currency for this specific country?
What is the value for currency for this specific continent?
What is the value for currency for everything worldwide?

Whichever level of the hierarchy responds first will win, meaning that if a user from China will get a value of USD – because we haven’t specified anything for Asia or China on the continent or country levels of the hierarchy so the lookup will fall through to our default set at the “worldwide” level.  However, at the continent level of the hierarchy we specified an override of EUR for requests where the continent of the requestor is Europe, so users from Germany, France and Spain would get EUR.  This wouldn’t be the case for the UK or Switzerland though because we’ve specifically overridden this at the country level, which is higher in the hierarchy so will win over the continent that the country belongs to.

So hierarchical lookups are generally about defining a value at the widest possible catchment (eg: worldwide) and moving up the hierarchy overriding that value at the right level.

What is key here is that rather than implementing three levels of conditional logic in our code, or mapping the lowest common denominator (country) one to one with currencies for every country in the world (remember in some cases we may not be able to identify the lowest common denominator) we have found a way to express the data in a simple way and provide one simple path to looking up the data.  Our program still makes one request for the key currency, the logic involved in resolving the correct value is completely transparent.

In this case, we had a scope (the country and continent of the requestor) and a hierarchy to search against that uses both elements of the scope and then falls back to a common catch all.

Applying this to infrastructure management

Jerakia is standalone and can be used for any number of applications that can make use of a hierarchical type of data lookup, but it was originally built with configuration management in mind.  Infrastructure data lends itself incredibly well to this model of data lookup.  Infrastructure data tends to consist of any number of configurable attributes that are used to drive your infrastructure.  These could be DNS resolvers, server hostnames, IP addresses, ports, API endpoints…. there is a ton of stuff that we configure on our infrastructures, but most of it is hierarchical.  Generally speaking a lot of infrastructure data starts off with a baseline default, for example, what DNS resolver to use.  That could be a default value thats used across the whole of your company and you add that as a key value pair to a datastore.  Then you find yourself having to override that value for systems configured in your development environment because that environment can’t connect to the production resolvers on your network, you then may deploy your production environment out to a second data centre and you need that location to be different.  But we are still dealing with simple hierarchies, so rather than programming conditionals to determine the resolution path of a DNS resolver we could build a simple hierarchy that best represents our infrastructure, such as;

When dealing with a hierarchy like this, a data lookup must give us a key to lookup and  contain a scope that tells us the hostname, environment and location of the request. Using the same principles as before our lookup will make up to 4 queries;

What is the DNS resolver for my particular hostname?
What is the DNS resolver for machines that are in my environment?
What is the DNS resolver for machines that are in my location?
What is the DNS resolver for everyone else?

Again, this is hierarchical search pattern that will stop at the first query to return an answer and return that value.  We can set our global parameters and then override them at the areas we care about.  We’ve even got a top level hierarchy entry for that one edge case special snowflake server that is different from everything else on the network, but the lookup method is identical and transparent to the application requesting the data.


I’ve tried to give a generic overview of hierarchical lookups, but in particular as they relate to Jerakia.  Jerakia has way more features that build on top of this principle, like cascading lookups which don’t stop at the first result and will build a combined data structure (HashMap or Array) from all levels of the hierarchy and return a unified result based on the route taken through the hierarchy, and I’ll cover those in a follow up post.  It’s also built to be extremely flexible and pluggable allowing you to source your data from pretty much anywhere and ships with an HTTP API meaning you can integrate Jerakia with any tool regardless of the underlying language.

Our focus has been very much in the configuration management space particularly integration with Puppet and Hiera, and more recently a lookup plugin for Ansible.  But Jerakia could be used for any number of applications where data needs to be organised hierarchically without introducing logic into the code.

It’s an open source project, please feel free to contribute or give feedback on the GitHub site

Follow and share if you liked this

Designing modules in a Puppet 4 world

Puppet 4.0 has been around for a while now, and most of it’s language features have been around even longer using the future parser option in later versions of the 3.x series. Good module design has been the subject of many talks, blog posts and IRC discussions for years and although best practice has evolved with experience, fundamentally things haven’t changed that much for a long time… Until now.

Puppet 4.x introduces some of the biggest changes to the Puppet language and module functionality in a single release. Some of these new features dramatically change how you should be designging Puppet module to leverage the the most power out of Puppet. After diving in deep with Puppet 4 recently I wanted to write about some things that I see as the most fundamental and beneficial changes in how modules are now written in the new age.

Type Casting

Is it a bird?, is it a plane?… no it’s a string. The issue of type casting has been a long standing irritation with Puppet. Theres a lot of insanity to be found with type casting (or lack thereof) within Puppet that I won’t go over here, but Stephen Johnson covered this really well in his talk at Puppet Camp Melbourne a while back. Puppet’s inability to natively enforce basic data types was confusing enough within the scope of Puppet itself, but when you add in ERB templates and functions these issues bleed over to the Ruby world which is much less forgiving of Puppets whimsical attitude to what a variable actually is and there have been countless problems as a result. Fortunately, Puppet 4.0 has finally come to address this shortcoming and this really is a great leap forward for the Puppet language.

Previously we would create a class with parameters, random variables that might be any number of types. And then, if we were being careful we could use the validate_* functions from stdlib to check that the provided values met the required type, or we could just blindly hope for the best. Now Puppet 4.0 has an in-built way of doing this by being able to define the types of variables that a parameterized class accepts. Eg:

For the most part I’ve found this to be an excellent addition, it’s much easier to define your data types when writing classes and seeing what types a class supports has become a whole lot more readable and removes the need to use external functions to validate your data. There still seems to be some quirky behaviour with it though that makes me wonder how solid these data types are under the hood, the difference between a string and an integer isn’t as well enforced as you might expect in other languages, take this example;

[Edit: see comments]. But quirks aside, using data types in your class will help with validation and readability of your module and you should always use them


Back in the early days of Puppet managing your site specific data was a nightmare involving hard coded variables and elaborate coding patterns (a.k.a nasty hacks) that lead to non-reusable un-sharable modules, each organisation maintaining their own copy of modules to manage software. Then in 2011, Hiera was born and all that changed. The ability to maintain data separate from Puppet code gave way to a new generation of Puppet modules that could be shared and used without modifying the code and the Puppet Forge flourished as a result with dependable and maintainable modules. It wasn’t long before Hiera was incorporated officially into Puppet core and Puppet released the data binding features which automatically lookup class parameters, so by simply declaring include foo any parameters of the foo class will be automatically looked up from Hiera. This lead to a design pattern that became very popular with module authors to take advantage of the data binding features of hiera and give the module author some degree of flexibility with setting dynamic default values. The pattern is commonly known as the params pattern. In short, it works by a base class inheriting a class called params and setting all of the base classes parameter defaults to variables defined within the params class, eg:

This widely adopted pattern gives the implementor of the module the ability to override settings directly from hiera by setting foo::some_setting but also gives the module author more flexibility to dynamically set intelligent defaults. I don’t think it’s a terrible pattern, and it’s certainly the cleanest way of designing a module in Puppet 3, but it can get complicated with data nested in deep conditionals in your class. We already have a much better proven way of handling data in Puppet using Hiera, so why can’t we adopt this same approach with module defaults? This was an idea first proposed by R.I Pienaar, original author of Hiera, who went on to release a POC for it. The idea was solid and made a lot of sense, and now Puppet have adopted this approach natively in 4.3. Puppet modules can now ship with their own self-contained Hiera data in the module, which Puppet will use as a fall back if no other user-defined data is found (eg: your regular Hiera data). So using data in modules our class now looks like:

We can then enable a module-specific Hiera configuration to read the variable defaults from ./data

Class parameter defaults can now be easily read in a familiar Hiera layout

I’m really only touching on the surface of the new data in modules additions here, there are plenty more really cool features available, including using Puppet functions to provide data defaults. R.I Pienaar wrote an excellent article covering this in a bit more detail and the offical puppet documentation explains things in depth.

The main advantage of using data in modules over params.pp is the ability to store your module parameter defaults in an easy-to-read style and maintains a degree of consistency between how module data and other site data is handled and avoids the need for cumbersome in-code setting of data, such as the params.pp pattern. I love the fact I can download a third party module off the forge and easily look through all of the configured defaults in a nicely formatted YAML hierarchy rather than trawling through a mish mash of conditional logic trying to understand how a variable gets set. This feature is a major plus.


The final enhancement to the Puppet language I want to focus on in this post is iteration. The idea of iteration in Puppet was something that historically I used to be against. People have been asking for loops in Puppet since it first got released, I always felt that it wasn’t necessary and detracted from the declarative principles of Puppet. I like to think that I was half right though, most people wanted iteration because they didn’t understand the declarative aspects of Puppet and if they thought differently about their problem they would realise that 99% of the time defined resources were a perfectly fitting solution. Puppet, and how people use it, has changed since I forged those opinions. In the new age of data separation where we try and avoid hard coding site specific data inside Puppet code, that data now lives in a structured format inside Hiera and we need a way to take that data model and manage Puppet resources from it.

The current well adopted method of doing this is to use the create_resources() function in Puppet. The idea is simple enough, the function takes a resource type as an argument followed by a hash where the keys represent resource titles and the values are nested hashes containing the attributes for those resources and voila, it creates Puppet resources in your catalog

Is a more dynamic, data driven way of declaring;

I have a love/hate relationship with create_resources(). It feels like a sticking plaster designed to hammer Puppet into doing something it fundamentally wasn’t designed to do, but on the other hand, in the absence of any other solution I couldn’t have survived without it. It’s also fairly restrictive solution, in the ideal world of data separation I should be able to model my data in the best possible way to represent the data, which may look quite different from how they are represented as Puppet resources. The create_resources() pattern offers no way to filter data or to munge the structure of a hash for example. Take the following example;

If I want to model the above modified data structure as user resources with the UID’s corresponding to the value of each element, I cannot do this with create_resources() directly managing the user resources since the function expects the hash to be representative of resource titles and attributes. I could parse this with a function to munge the data first, or I could write an elaborate but messy defined resource type to do it, but neither of these seems ideal. Examples like this prove that Puppet desperately needs more, and now (some would say FINALLY) it’s arrived in the form of iterators and loops natively included in the language. Using iteration, we now have a lot more flexibility in scenarios such as this and I can easily solve the above dilema with a very simple iterator;

Despite my initial reluctance about iteration some years ago, it’s obvious now that this is a huge improvement to the language and bridges the gap between modelling data and declaring resources. I don’t however agree with some camps that say defined resources are obsolete, I think they have a genuine place in the world. If you are managing a bunch of resources from a hash it is still going to be cleaner in some circumstances to have a defined resource type to model that behaviour rather than declaring lots of resources within an iterator loop, but having iterators gives authors the ability to chose which method best achieves their objectives in the cleanest way.


I have touched on three major changes that will change how people write modules going forward, and there are plenty more great features in 4.x worthy of discussion that I will go into in future posts. But I feel the above changes are the most relevant when it comes to module design patterns and they offer a huge improvement in the module quality and functionality. In a nutshell;

  • validate functions (for basic types) are dead. Validate your parameters using native types
  • params.pp is dead. Use data in modules
  • create_resources() is dead. Use iterators
Follow and share if you liked this

Using data schemas with Jerakia 0.5


In my previous posts I introduced and talked about a new data lookup tool called Jerakia.   I also recently gave a talk at Config Management Camp in February about it.

This week saw the release of version 0.5.0.  You can view the release notes here.  In this post I am going to focus on one new feature that has been added to this release, data schemas.

What are data schemas?

In short, schemas provide a definition layer to control lookup behaviour for particular keys and namespaces.  So, what do we mean by lookup behaviour?  Jerakia is a hierarchical based lookup tool and there are a couple of different ways that it can perform a search.  The simplest form of lookup will walk through the hierarchy and return the first result found.  A Jerakia lookup may also request a cascading lookup by setting the cascade option to true.  For a cascading lookup, Jerakia will continue to walk through the hierarchy and search for all instances of of the requested key and combine them together  into a hash or an array.   When performing a cascading lookup, the requestor can select the desired merge behaviour using the merge flag of the lookup.  Supported options for this parameter are array, hash or deep_hash.   When performing a cascading hash lookup, hashes are merged and key conflicts at the first level of the hash are overwritten according to their priority, whereas deep_hash attempts to merge the hash at all nested levels.

These concepts will be familiar to Hiera users as the legacy functions hiera_array() and hiera_hash().

Why use schemas?

The ability to do hash or array type lookups is a very useful and popular feature, and has always existed in Hiera.  For a long time the problem was where should/could you declare that a particular lookup key should be looked up this way.   Initially people used the hiera_hash() and hiera_array() directly in their modules.  This has several drawbacks though.  Most notably the incompatibility with Puppets’ data binding feature meant hash and array lookups had to be dealt with separately outside of the classes parameters, and  hard coding Hiera functions within modules is not best practice as it makes assumptions that the implementor of the module is using Hiera in the first place.

Later versions of Puppet and Hiera have made this nicer.  The new lookup functionality in Puppet 4 provides a lookup() function that takes a lookup strategy as an argument.  This is certainly nicer than the legacy hiera functions as it is provider agnostic and you can swap out Hiera for a different data lookup tool, such as Jerakia, transparently, which makes it more acceptable to use this function in a module.  And there is the new data in modules feature which allows for Puppet modules to determine the lookup behaviour of the parameters it’s classes contain

I think this approach is definitely going in the right direction, and it solves the problem of overriding behaviour for parameterised classes.

Jerakia takes a new approach and provides a new layer of logic called schemas.  When Jerakia receives a lookup request for a key, it first performs a search within the schema for the key and if found, will override lookup and merge behaviour based on what is defined in the schema.

The advantage of using schemas is that a user can download a module from the forge and override the lookup behaviour of the keys without modifying any of the Puppet code or adding anything Puppet specific to the data source.

How schemas work

Controlling lookup behaviour

When a request for a lookup key is recieved by Jerakia, it first performs a lookup against the schema.   It currently uses the in built file datasource to perform a separate lookup, but the source of data read by this lookup is different to the main lookup.  By default, Jerakia will search for a JSON (or YAML) file with a name corresponding to the namespace of the request in /var/lib/jerakia/schema.  Within this JSON document it searches for the key corresponding to the lookup key requested.  The data returned can override the lookup behaviour.  For example;

The above example will override a lookup for the key sysadmins in the namespace accounts (accounts::sysadmins) to be a cascading search merging the results into an array (just like hiera_array())

The big advantage here is that this data is separated from our actual configuration data, which could be in a YAML file structure, database, REST API endpoint…etc.

Using schema aliases

Another feature of schemas is the ability to create pseudo keys and namespaces that can be looked up and mapped to other keys within the data.   Schemas have the ability to override the namespace and key part of a request on the fly.  As a very hypothetical example, let’s say you have an array of domains in your data defined as webserver::domains eg:

If you need the same data to populate the vhosts parameter of a class called apache you could simply alias this in the schema rather than declaring the data twice or performing lookups from within Puppet…

The above schema entry will mean that lookups for apache::vhosts and webserver::domains will return the same data set.

You can also use a combination of aliases and lookup overrides to declare a  psuedo key to lookup data in a different way.

Here we have created two pseudo keys, security::firewall_rules and security::all_firewall_rules, both of which alias to the firewalld::rich_rules data set but will be looked up in different ways.  The security namespace itself may not even exist in the actual data set.

Future plans for schemas

The current implementation of schemas is fairly basic.  I see this as being quite a fundamental part of Jerakia in the future and it’s an area that could see functionality such as sub-lookups, views and even light “stored procedure” type functions to add some powerful functionality to data lookups whilst keeping the actual source of data in it’s purest form thus not stifling innovation of data source back ends.

Although currently limited to searching JSON or YAML files, schema searches are actually done with Jerakia lookups, the same functionality that does a regular lookup, so it should be trivial to allow users a lot more flexibility in how schema searches are done by using custom data sources and policies in future releases.

Want to know more?

Check out the Jerakia docs for how to configure the behaviour of schemas and more on how to use them.

Next up…

Puppet 4.0 delivered some great new functionality around data lookups, including environment data providers and the internal lookup functions that I feel will go really well with Jerakia.  I’m currently working on integration examples and a new environment data provider for Jerakia that will be available soon.


Follow and share if you liked this

Extending Jerakia with lookup plugins


In my last post, I introduced Jerakia as a data lookup engine for Puppet, and other tools. We looked out how to use lookup policies to get around complex requirements and edge cases including providing different lookup hierarchies to different parts of an organisation. In this post we are going to look at extending the core functionality of Jerakia. Jerakia is very plugguble, from data sources to output filters, and I’ll cover all of them in the coming days, but today we are going to cover plugins.

Lookup plugins

Last week we looked at Jerakia polcies, which are containers for lookups. A lookup, at the very least contains a name and a datasource. A classic lookup would be;

Within a lookup we have access to both the request and all scope data sent by the requestor. Having access to read and modify these gives us a great deal of flexibility. Jerakia policies are written in Ruby DSL so there is nothing stopping you from putting any amount of functionality directly in the lookup. However, that makes for rather long and complex policies and isn’t easy to share or re-use. The recommended way therefore to add extra functionality to a lookup is to use the plugin mechanism.

As an example, let’s look at how Jerakia differs from a standard Hiera installation in terms of data structure and filesystem layout between Hieras’ YAML backend and Jerakias’ file datasource. Puppet data mappings are requested from Hiera as modulename::key and are searched from the entries in the configured hierarchy. Jerakia has the concept of a namespace and a key, and when requesting data from Puppet, the namespace is mapped to the module name initiating the request. Jerakia looks for a filename matching the namespace and the variable name as the key. Take this example;

A standard hiera filesystem would contain something like;

In Jerakia, by default this would be something like;

The difference is subtle enough, and if we wanted to use Jerakia against a Hiera-style file layout with keys formatted as module::key we could manipulate the request to add the first element of request.namespace to the key, separated by ::, and then drop the namespace completely. You could implement this directly in the lookup, but a better way is to use a plugin keeping the functionality modular and shareable. Jerakia ships with a plugin to do just this, it’s called, unsuprisingly, hiera.

Using lookup plugins

To use a plugin in a lookup it must be loaded using the :use parameter to the lookup block, eg:

If you want to use more than one plugin, the argument to :use can also be an array

Once a plugin is loaded into the lookup, it exposes it’s methods in the plugin.name namespace. For example, the hiera plugin has a method called rewrite_lookup which rewrites the lookup key and drops the namespace from the request, as described above. So to implement this functionality we would call the method using the plugin mechanism;

Writing plugins

Lookup plugins are loaded as jerakia/lookup/plugin/pluginname from the ruby load path, meaning they can be shipped as a rubygem or placed under

jerakia/lookup/plugin relative to the plugindir option in the configuration file. The boilerplate template for a plugin is formed by creating a module with a name corresponding to your plugin name in the Jerakia::Lookup::Plugin class… in reality that looks a lot simpler than it sounds

We can now define methods inside this plugin that will be exposed to our lookups in the plugin.mystuff namespace. For this example we are going to generate a dynamic hierarchy based on a top level variable role. The variable contains a colon delimited string, and starting with the deepest level construct a hierarchy to the top. For example, if the role variable is set to web::frontend::application_foo we want to generate a search hierarchy of;

To do this, we will write a method in our plugin class called role_hierarchy and then use it in our lookup. First, let’s add the method;

We can now use this within our module by loading the mystuff plugin and calling our method as plugins.mystuff.role_hierarchy. Here is the final lookup policy using our new plugin;


My example here is pretty simple, but it demonstrates the flexibility of Jerakia to create a dynamic search hierarchy. With access to the request object and the requestors scope, lookup plugins can be very powerful tools to get around the most complex of edge cases. Being able to write Jerakia policies in native Ruby DSL is great for flexibility, but runs the risk of having excessive amount of code complicating your policy files, the plugin mechanism offers a way to keep extended lookup functionality separate, and make it shareable and reusable.

Up next…

We’re still not done, Jerakia offers numerous extension points. In my next post we will look at output filters to parse the data returned by the backend data source. We will look first at what I consider the most useful of output filters, encryption which uses hiera-eyaml to decrypt data strings in your returned data, no matter what datasource is used, and look at how easy it is to write your own output filters. After that we will look at extending Jerakia to add further data sources, so stay tuned!

Follow and share if you liked this

I’m joining Puppet Labs

Back in September last year I flew over to the US for PuppetConf 2011, and more recently took a trip to Edinburgh for PuppetCamp. Both times when I returned my wife asked me how it went, and both times my answer was simple; “I’ve gotta work for these guys!” I said. So after recently returning from a short excursion to Portland, Oregon I am very thrilled and honoured to announce that I’ve accepted a full time position with Puppet Labs.

I’ve been an IT contractor for many years now, and started working with Puppet in 2008. Since then I’ve worked more and more with Puppet at numerous companies including most recently at the BBC. I’ve also become increasingly more involved in the Puppet user community at large over the past 4 years. I have a real passion for working with the product, I love the Puppet community and now I’m really looking forward to joining the company that made it all happen and being a part of all the awesome things to come.

I’ll be joining Puppet Labs later this month as a Professional Services Engineer and look forward to maintaining and building upon the many relationships I have with various Puppet users as well as engaging with a wider section of the user community in my new professional capacity too. I would like to thank the various people at Puppet Labs involved in my interviews for the opportunity, with special thanks extended to Aimee Fahey (@PuppetRecruiter) for all her assistance during the hiring process.

Follow and share if you liked this

Designing Puppet – Roles and Profiles.

Update, Feb 15th.

Since writing this post some of the concepts have become quite popular and have generated quite a lot of comments and questions in the community. I recently did I talk at Puppet Camp Stockholm on this subject and hopefully I might have explained it a bit better there than I did below :-). The slides are available here and a YouTube video will be uploaded shortly.


So you’ve installed Puppet, downloaded some forge modules, probably written a few yourself too. So, now what? You start applying those module to your nodes and you’re well on your way to super-awesomeness of automated deployments. Fast forward a year or so and your infrastructure has grown considerably in size, your organisations business requirements have become diverse and complex and your architects have designed technical solutions to solve business problems with little regard for how they might actually be implemented. They look great in the diagrams but you’ve got to fit in to Puppet. From personal experience, this often leads to a spell of fighting with square pegs and round holes, and the if statement starts becoming your go-to guy because you just can’t do it any other way. You’re probably now thinking its time to tear down what you’ve got and re-factor. Time to think about higher level design models to ease the pain.

There is a lot of very useful guidance in the community surrounding Puppet design patterns for modules, managing configurable data and class structure but I still see people struggling with tying all the components of their Puppet manifests together. This seems to me to be an issue with a lack of higher level code base design. This post tries to explain one such design model that I refer to as “Roles/Profiles” that has worked quite well for me in solving some off the more common issues encountered when your infrastructure grows in size and complexity, and as such, the requirements of good code base design become paramount.

The design model laid out here is by no means my suggestion on how people should design Puppet, it’s an example of a model that I’ve used with success before. I’ve seen many varied designs, some good and some bad, this is just one of them – I’m very interested in hearing other design models too. The point of this post is to demonstrate the benefits of adding an abstraction layer before your modules

What are we trying to solve

I’ve spent a lot of time trying to come up with what I see as the most common design flaws in Puppet code bases. One source of problems is that users spend a lot of time designing great modules, then include those modules directly to the node. This may work but when dealing with large and complex infrastructures this becomes cumbersome and you end up with a lot of node level logic in your manifests.

Consider a network consisting of multiple different server types. They will all share some common configuration, some subsets of servers will also share configuration while other configuration will be applicable only to that server type. In this very simple example we have three server types. A development webserver (www1) that requires a local mysql instance and PHP logging set to debug, a live webserver (www2) that doesn’t use a local mysql, requires memcache and has standard PHP logging, and a mail server (smtp1). If you have a flat node/module relationship with no level of abstraction then your nodes file starts to look like this:

Note: if you’re already thinking about ENC’s this will be covered later

As you can see, the networking and users modules are universal across all our boxes, Apache, Tomcat and JDK is used for all webservers, some webservers have mysql and PHP logging options vary depending on what type of webserver it is.

At this point most people try and simplify their manifests by using node inheritance. In this very simple example that might be sufficient, but it’s only workable up to a point. If you’re environment grows to hundreds or even thousands of servers, made up over 20 or 30 different types of server, some with shared attributes and subtle differences, spread out over multiple environments, you will likely end up with an unmanagable tangled web of node inheritance. Nodes also can inherit only one other node, which will be restrictive in some edge cases.

Adding higher level abstraction

One way I have found to minimise the complexity of node definitions and make handling nuances between different server types and edge case scenarios a lot easier is to add a layer (or in this case, two layers) of seperation between my nodes and the modules they end up calling. I refer to these as roles and profiles.

Consider for a moment how you would represent these servers if you weren’t writing a Puppet manifest. You wouldn’t say “www1 is a server that has mysql, tomcat, apache, PHP with debug logging, networking and users” on a high level network diagram. You would more likely say “www1 is a dev web server” so really this is all the information I want to be applying directly to my node.

So after analysing all our nodes we’ve come up with three distinct definitions of what a server can be. A development webserver, a live webserver and a mailserver. These are your server roles, they describe what the server represents in the real world. In this design model a node can only ever have one role, it cant be two things simultaneously. If your business now has an edge case for QA webservers to be the same as live servers, but incorporate some extra software for performance testing, then you’ve just defined another role, a QA Webserver.

Now we look at what a role should contain. If you were describing the role “Development webserver” you would likely say “A development webserver has a Tomcat application stack, a webserver and a local database server”. At this level we start defining profiles.

Unlike roles, which are named in a more human representation of the server function, a profile incorporates individual components to represent a logical technology stack. In the above example, the profile “Tomcat application stack” is made up of the Tomcat and JDK components, whereas the webserver profile is made up of the httpd, memcache and php components. In Puppet, these lower level components are represented by your modules.



Now our nodes definitions look a lot simpler and are representitive of their real world roles…

Roles are simply collections of profiles that provide a sensible mapping between human logic and technology logic. In this scenario our roles may look something like:

Whether or not you choose to use inherited classes in the way I have done is up to you of course, some people stay clear of inheritence completely, others over use it. Personally I think it works for the purposes of laying out roles and profiles to minimise duplication.

The profiles included above would look something like the following

In summary the “rules” surrounding my design can be simplified as;

  • A node includes one role, and one only.
  • A role includes one or more profiles to define the type of server
  • A profile includes and manages modules to define a logical technical stack
  • Modules manage resources
  • Modules should only be responsible for managing aspects of the component they are written for

Let’s just clarify what we mean by “modules”

I’ve talked about profiles and roles like they are some special case and modules being something else. In reality, all of these classes can be, and should be modularised. I make a logical distinction between the profile and role modules, and everything else (e.g.: modules that provide resources).

Other useful stuff to do with profiles.

So far I’ve demonstrated using profiles as collections of modules, but it has other uses too. As a rule of thumb, I don’t define any resources directly from roles or profiles, that is the job for my modules. However, I do realise virtualised resources and occasionally do resource chaining in profiles which can solve problems that otherwise would have meant editing modules and other functionality that doesn’t quite fit in the scope of an individual module. Adding some of this functionality at the modular level will reduce the re-usability and portability of your module.

Hypothetically lets say I have a module, let’s call it foo for originalities sake. The foo module provides a service type called foo, in my implementation I have another module called mounts that declares some mount resource types. If I want all mount resource types to be initiated before the foo service is started as without the filesystems mounted the foo service will fail. I’ll go even further and say that foo is a Forge module that I really don’t want to (and shouldn’t have to) edit, so where do I put this configuration? This is where having the profiles level of abstraction is handy. The foo module is coded perfectly, it’s the use case determined from my own technology stack that is requiring that my mount points exists before the foo service, so since my stack is defined in the profile, this is where I should specify it. e.g.:

It’s widely known that good modules are modules that you don’t need to edit. Quite often I see people reluctant to use Forge modules because their set up requires some peripheral set up or dependancies not included in the module. Modules exist to manage resources directly related to what they were written for. For example, someone may choose to edit a forge mysql module because their set up has dependancies on MMM being installed after MySQL (purely hypothetical). The mysql module is not the place to do this, mysql and mmm are separate entities and should be configured and contained within their own modules, tying the two together is something you’ve defined in your stack, so again, this is where you’re profiles come in…

This approach is also potentially helpful for those using Hiera. Although Hiera and Puppet are to become much more fused in Puppet 3.0, at the moment people writing forge modules have to make them work with Hiera or not, and people running Hiera have to edit the modules that aren’t enabled. Take a hypothetical module from the Forge called fooserver. This module exposes a paramaterized class that has an option for port, I want to source this variable from Hiera but the module doesn’t support it. I can add this functionality into the profile without the need for editing the module.

What about using an ENC?

So you’re probaby wondering why I haven’t mentioned using an ENC (External Node Classifier). The examples above don’t use any kind of ENC, but the logic behind adding a layer of separation between your nodes and your modules is still the same. You could decide to use an ENC to determine which role to include to a node, or you could build/configure an ENC to perform all the logic and return the list of components (modules) to include. I prefer using an ENC in place of nodes definitions to determine what role to include and keep the actual roles and profiles logic within Puppet. My main reason for this is that I get far greater control of things such as resource chaining, class overrides and integration with things like Hiera at the profile level and this helps overcome some tricky edge cases and complex requirements.


None of the above is set in stone, what I hope I’ve demonstrated though is that adding a layer of abstraction in your Puppet code base design can have some significant benefits that will avoid pitfalls when you start dealing with extremely complex, diverse and large scale set ups. These include

  • Reducing complexity of configuration at a node level
  • Real-world terminology of roles improves “at-a-glance” visibility of what a server does
  • Definition of logical technology stacks (profiles) gives greater flexibility for edge cases
  • Profiles provide an area to add cross-module functionality such as resource chaining
  • Modules can be granular and secular and tied together in profiles, thus reducing the need to edit modules directly
  • Reduced code duplication

I use Hiera to handle all of my environment configuration data, which I won’t go into detail about in this post. So, at a high level my Puppet design can be represented as;



As I said previously, this is not a the way to design Puppet, but an example of one such way. The purpose of this post is to explore higher level code base design for larger and more complex implementations of Puppet, I would love to hear other design models that people have used either successfully or not and what problems it solved for you (or introduced :)) so please get in touch with your own examples.

Follow and share if you liked this

Introducing hiera-mysql MySQL Backend for Hiera


Some time ago I started looking at Hiera, a configuration datastore with pluggable back ends that also plugs seamlessly into Puppet for managing variables. When I wrote hiera-gpg a few months ago I realised how easy extending Hiera was and the potential for really useful backends that can consolidate all your configuration options from a variety of systems and locations into one streamlined process that systems like Puppet and other tools can hook into. This, fuelled by a desire to learn more Ruby, lead to hiera-mysql, a MySQL Backend for Hiera.


hiera-mysql is available as a ruby gem and can be installed with:

Note: this depends on the Ruby mysql gem, so you’ll need gcc, ruby-devel and mysql-devel packages installed. Alternativley the source can be Downloaded here

MySQL database

To demonstrate hiera-mysql, here’s a simple MySQL database some sample data;

Configuring Hiera

In this example we’re going to pass the variable “env” in the scope. hiera-mysql will interpret any scope variables defined in the query option, and also has a special case for %{key}. Example:

Running Hiera

With the above example, I want to find the value of the variable colour in the scope of live

If I add more rows to the database that match the criteria, and use Hiera’s array search function by passing -a I can make Hiera return all the rows

Hiera’s pluggable nature means that you can use this back end alongside other back ends such as YAML or JSON and configure your search order accordingly.


Currently hiera-mysql will only return the first element of a row, or an array of first elements, so you can’t do things like SELECT foo,bar FROM table. I intend to introduce this feature by implementing Hiera’s hash search in a future release. Also, the module could do with slightly better exception handling around the mysql stuff. Please let me know if theres anything else that would improve it.


And of course, because Hiera is completely transparent, accessing these variables from Puppet couldn’t be easier!


  • Github homepage for hiera-mysql
  • Official Hiera Project Homepage
  • Hiera – A pluggable hierarchical data store
    Follow and share if you liked this
  • Secret variables in Puppet with Hiera and GPG

    Last week I wrote an article on Puppet configuration variables and Hiera. This almost sorted out all my configuration variable requirements, bar one; what do I do with sensitive data like database passwords, hashed user passwords…etc that I don’t want to store in my VCS repo as plaintext.

    Hiera allows you to quite easily add new backends, so I came up with hiera-gpg, a backend plugin for Hiera that will GPG decrypt a YAML file on the fly. It’s quite minimal and there is some stuff I’d like to do better – for instance it currently shells out to the GPG command, hopefully someone has some code they can contribute that’ll use the GPGME gem instead to do the encryption bit.

    Once you’re up and running with Hiera, you can get the hiera-gpg backend from Rubygems…

    We run several Puppetmasters, so for each one I create a GPG key and add the public key to a public keyring that’s kept in my VCS repo. For security reasons I maintain a dev and a live keyring so only live Puppetmasters can see live data.

    Currently hiera-gpg doesn’t support key passwords, I’ll probably add this feature in soon but it would mean having the password stored in /etc/puppet/hiera.yaml as plaintext anyway, so I don’t see that as adding much in the way of security.

    So I have my GPG secret key set up in roots homedir:

    Next I add my GPG public key to the keyring for live puppetmasters (in my set up, /etc/puppet/keyrings/live is a subversion checkout)

    Now I can create a YAML file in my hieradata folder and encrypt it for the servers in my live keyring.

    If like me you have more than one puppetmaster in your live keyrings, multiple -r entries can be specified on the command line for gpg, you should encrypt your file for all the puppet master keys that are allowed to decrypt it.

    Now you just need to tell Hiera about the GPG backend, My previous Hiera configuration now becomes:

    Here we’re telling Hiera to behave exactly as it used to when we just had the YAML back end, and if it doesn’t find the value you are requesting from YAML it will query the GPG back end which will pick up on your %{calling_module}.gpg.

    Now I can query Hiera on the command line to find my live MySQL root password with:

    In Puppet, I reference my variables in exactly the same way as any other variable

    Theres probably lots of stuff I can improve here, but I have the basics of what I need, a transparent method of data storage using GPG encryption and no sensitive data stored in my VCS repo as plain text.

    Follow and share if you liked this

    Puppet, Parameterized classes .vs. Definitions

    Firstly, a little background on this topic. During PuppetConf this year I attended a very interesting talk by Digant C. Kasundra about the Puppet implementation at Stanford University. At one point he asked, “Who uses definitions?”, and I raised my hand. The next question was, “Who uses parameterized classes?”, and I also raised my hand, this was followed by “Who uses both?”, and I was one of a small minority of people who raised a hand again. The next question was, “Who knows what the difference between a parameterized class and a definition is?”. I took a second or two to think about this and the talk moved on after no-one in the audience raised their hand, and I didn’t get a chance to answer this in the Q&A due to running out of time, but I’ve been thinking about it since. Digant’s view was that the two are very similar and he advocates the use of definitions, which is certainly not a bad point of view, but I don’t think you should be using one or the other, but rather, use either one appropriately, and given the fact that no-one could answer Digant’s question in his talk, I felt it worth expanding on the issue in a blog post and would really welcome any feedback.

    So, what is the core differences in how they are used? Well, the answer is, as Digant rightly pointed out, not a lot, but the small difference between them that does exist is very important, and should easily dictate which one you use in a given situation. Firstly, let’s look at what they actually are;

    A class, parameterized or not, is a grouping of resources (including resources provided by definitions) that can be included into a manifest with one statement. If you make your classs parameterized then you can include that class with dynamic parameters that you can override depending on how your catalog is compiled. A definition is a template that defines what effectively is no different from any other resource type and gives you a boiler plate solution for applying a series of resources in a certain way.

    So now it’s fairly clear that these two things are actually quite different, but you may be thinking, “sure, they’re different but there is nothing that a parameterized class does that you can’t do with a definition, right?” – well, yes, that’s right, but there is plenty that a definition does that you may not want it to do, mainly, puppet allowing it to be instantiated multiple times.

    As an example of this, at the BBC we have a core application, let’s call it acmeapp for semantics sake. The core application takes almost 50 different parameters for configuration and can get deployed differently depending on what type of server profile we are deploying to. We only ever want acmeapp defined once as it is responsible for some core resources such as creating the base install directories, therefore we use a parameterized class for this and sub classes can inherit of this and override any number of variables. Either way, I’m only ever going to apply the acmeapp class once in my catalog. The end result looks something like;

    Whilst the above is possible with a definition, it would allow it to be defined multiple times, which means the resources within the definition would end up duplicated and my catalog failing. I would rather Puppet fails because I’ve tried to use my class twice, rather than fail because I’ve duplicated a resource, such as the file type that creates the installation directory. It makes it very clear to anyone working with my Puppet code that this class should only be applied once.

    Now, lets look at another example. At multiple points in our manifests we need to set up MySQL grants – to initiate an Exec for every grant would be bad practice, as we’ll end up in a lot of code re-use. This is where definitions come in. At the BBC we have a mysql class that not only provides the MySQL packages and services, but also exposes some useful functions to manage your databases and grants through a series of definitions, this is the code to the definition that controls MySQL grants….

    As you can see, this is very different to our usage or a parameterized class, here we’ve templated a significant amount of functionality into one easy-to-use defined resource type. We can re-use this functionality as many times as we need, in the above example we set up two grants for the acmeapp application, one for a VIP IP address and one for an application IP range by specifying something like:

    Hopefully this gives you a good idea of the subtle differences between parameterized classes and definitions, but also that they are both very independent features of Puppet that have their own uses.

    Follow and share if you liked this