Holistic Engineering

A random assortment of shit with sprinkles.

Gem Activation and You, Part 3: System Packaging


The short of it — packaging ruby gems is hard.

System packaging is a touchy subject to begin with; it’s arguably one of the more prominent cultural divides I’ve encountered coming to ops from a developer background. The notion of packaging something custom before it gets installed, versus installing it with an automation tool such as chef or puppet, is a constant source of debate.

This is a rationale on when you should and should not package a ruby gem as a system package, and it provides alternative solutions for those situations where it is not a good idea. You should probably read my first two articles before continuing with this one, as they cover background for some of the topics here.

The Fundamental Problem with System Packaging

The problem is very simple: gem activation does not care about your package manager, and your package manager doesn’t care about ruby gems.

Gem activation works very similarly to the dynamic linker on your system. The app itself will pick out the library it needs, based on what’s available on the system. You can see this in your packaging system: you’ll have libmysqlclient10 and libmysqlclient15, with mysql and some_other_mysql_program using one or the other. Your system picks these libraries based on what the programs are linked against. RubyGems works similarly, but your system package manager knows nothing about this. Technically it knows nothing about your dynamic linker either, but it’s been architected to handle it.

As of this writing, there is likely no system packager in existence that can handle rubygems properly with regard to multiple concurrent gem versions. I’m speaking from plenty of experience here.

This is because rubygems can handle multiple versions of a gem, and your system packager only cares about one of them. You have libdbi-ruby in debian and it has one version — 0.4.0. If 0.5.0 is released, either debian releases a new package and deprecates the old one, or they don’t move. Either way, rubygems is given one option.

Ruby programs don’t work like this. The other articles go into detail as to why.

And this is all before the problem of packaging the right ruby installation for your application.

Either way, here are some scenarios you may run into while considering system packaging.

Scenario 1: Many apps with a common set of gems

This is the time to system package. However, this is rare in practice — most apps have uncommon dependencies with each other and this all needs to be accounted for, especially with regard to deep dependencies of common libraries such as the json gem. A 1.8.x versus 1.5.x is going to give one or both of those apps hives if you’ve only system packaged one of them. Strong advice: use a tool like fpm to manage this problem and test regularly.

Scenario 2: Many apps with a diverse set of gems

This is the best opportunity to use bundler and bundle exec if you can. This doesn’t always make sense however — if you need a “bare” ruby, consider omnibus-ruby if you can invest the effort. The point is, you will save yourself endless hell by completely isolating your dependencies. Rails apps definitely fit into this ballpark as well.

Scenario 3: One app with a very specific set of gems

omnibus-ruby. omnibus-ruby. omnibus-ruby.

Keep off your system ruby entirely if you can manage it. It will be cleaner, safer, and less hassle in the long run.

Packaging is cool, yo

You may have been burned by system packages in the past, or system rubies, etc. This series tries to cover the issues in detail so you can understand the fundamental problems and wield bundler, fpm and omnibus-ruby to glory while mastering the problem in a constructive, less frustrating way.

I hope it’s become a valuable resource.

Gem Activation and You, Part 2: Bundler and Bin Stubs


Most of this article will expect some basic familiarity with Bundler, and with Gem Specifications, Activation, and Dependencies.

There’s been a bit of discussion in the past about Bundler, when it should be used, when it should not be used. Yehuda Katz has written extensive commentary on the topic, and you should read that, but this will discuss how Bundler works and what it’s good at solving for you.

I’m going to boil it down, and will elaborate implicitly later:

  • If you have a standalone executable for others to use, Bundler can make dependency problems less obvious by hiding dependencies which are valid according to your gemspec, but broken. You should at least ensure it’s operating without bundler before releasing it.
  • If you have a project directory, use Bundler extensively, and set everything you use in the Gemfile. Use bundle exec extensively.
    • Consequently, if you’re developing a gem, that is your project directory. Just don’t check in the Gemfile.lock, but still use bundle exec like it’s going out of style. Routinely bundle update or set arbitrary hard dependencies in your Gemfile to ensure your gem plays well with others.
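
To illustrate that last point, a gem’s project Gemfile can stay tiny by deferring to the gemspec; the gemspec directive below is standard Bundler, and foo.gemspec is just a placeholder name:

Gemfile
source 'https://rubygems.org'

# Pull the gem's dependencies from the adjacent .gemspec (e.g. foo.gemspec),
# so the gemspec stays the single source of truth for requirements.
gemspec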

Why all this?

Because even though we’d been using Semantic Versioning long before Tom Preston-Werner wrote his treatise on the subject, you still have to play ball with a lot of people. A lot of people don’t use Semantic Versioning.

Ivory Towers are for people who never get dirty; ignore the real world at your own peril. Bundler is a tool for assisting you with dealing with the real world. Just as you have things like the CGI specification and HTTP, and Rails on top of them providing XSS and CSRF protection (things you need for modern web programming), RubyGems is the basics and Bundler is the cherry on top to assist with real world application problems.

That said, using bundler liberally can hide certain classes of problems, or empower you to discover them.

How does Bundler work?

Let’s start with a quick note on what Gem Requirements are first. A Gem Requirement is a specification of a version, such as >= 0, which matches any version (in practice, the latest one installed), or ~> 1.2.3, which means anything >= 1.2.3 but less than 1.3.0. Gem Requirements have a few operators which have basic code mappings. You should read them.
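
If you want to poke at this directly, RubyGems exposes requirements as Gem::Requirement objects; here’s a quick irb-style sketch of the ~> behavior described above:

req = Gem::Requirement.new('~> 1.2.3')
req.satisfied_by?(Gem::Version.new('1.2.3'))  # => true
req.satisfied_by?(Gem::Version.new('1.2.9'))  # => true
req.satisfied_by?(Gem::Version.new('1.3.0'))  # => false, outside the pessimistic range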

How Bundler works, in a nutshell: For a given Gemfile, Bundler will use the latest version of everything that fits the default Gem Requirement (the default requirement being >= 0), and given any conflicts, slowly reduces the value of each Gem’s version until it violates the Gemfile's Requirement or the Specification’s Requirements. Presuming it’s able to solve the formula, it spits out a Gemfile.lock which contains what conclusion it came to. If not, it tells you where the conflict lies.

Let’s see this in action

As mentioned in the previous article, both the chef and vagrant 1.0.x gems do not play nicely together on a dependency level. However, if you’re willing to accept chef 10.18.2 instead of the latest hotness, 11.4.4, you can use it with vagrant 1.0.

Here’s an example Gemfile to play with:

Gemfile.rb
gem 'chef'
gem 'vagrant', '= 1.0.7'

Put that in a directory and run bundle. You should see something like this:

Using archive-tar-minitar (0.5.2) 
Using bunny (0.7.9) 
Using erubis (2.7.0) 
Using highline (1.6.18) 
Using json (1.5.4) 
Using mixlib-log (1.6.0) 
Using mixlib-authentication (1.3.0) 
Using mixlib-cli (1.3.0) 
Using mixlib-config (1.1.2) 
Using mixlib-shellout (1.1.0) 
Using moneta (0.6.0) 
Using net-ssh (2.2.2) 
Using net-ssh-gateway (1.1.0) 
Using net-ssh-multi (1.1) 
Using ipaddress (0.8.0) 
Using systemu (2.5.2) 
Using yajl-ruby (1.1.0) 
Using ohai (6.16.0) 
Using mime-types (1.23) 
Using rest-client (1.6.7) 
Using polyglot (0.3.3) 
Using treetop (1.4.12) 
Using uuidtools (2.1.4) 
Using chef (10.18.2) 
Using ffi (1.8.1) 
Using childprocess (0.3.9) 
Using i18n (0.6.4) 
Using log4r (1.1.10) 
Using net-scp (1.0.4) 
Using vagrant (1.0.7) 
Using bundler (1.3.5) 
Your bundle is complete!
Use `bundle show [gemname]` to see where a bundled gem is installed.

Notice how we’ve specified the latest version of chef, but we got 10.18.2 instead? This is because of the net-ssh dependencies they share — 10.18.2 depends on a version of net-ssh that vagrant is ok with, so Bundler, to solve the formula, rolls our chef back.

Change your Gemfile to look like this:

Gemfile.rb
gem 'chef', '~> 11.0'
gem 'vagrant', '= 1.0.7'

This sets chef to a minimum version of 11.0, but less than 12.0. Run bundle update. You will see this:

Resolving dependencies...
Bundler could not find compatible versions for gem "net-ssh":
  In Gemfile:
    chef (~> 11.0) ruby depends on
      net-ssh (~> 2.6) ruby

    vagrant (= 1.0.7) ruby depends on
      net-ssh (2.2.2)

Voila! We have a constraint violation on net-ssh: chef depends on 2.6 or better, and vagrant just isn’t going to let that happen. If you read the first article, you’ll notice this is the same constraint violation we saw before.

Bin Stubs, or how those command-line tools get run.

Now that we understand how Bundler works, let’s have fun with tools like rake or thor or gist. These are tools you commonly would run outside of a bundled environment, but still have consequences within the RubyGems system.

This is because they correspond to activated gems, and what gems are activated largely depends on what gets installed. The scripts you actually run are called “bin stubs”, or little scripts that look a lot like this (this one’s for rake):

rake.binstub.rb
[15] erikh@speyside ~/tmp% cat `which rake`
#!/usr/bin/env ruby
#
# This file was generated by RubyGems.
#
# The application 'rake' is installed as part of a gem, and
# this file is here to facilitate running it.
#

require 'rubygems'

version = ">= 0"

if ARGV.first
  str = ARGV.first
  str = str.dup.force_encoding("BINARY") if str.respond_to? :force_encoding
  if str =~ /\A_(.*)_\z/
    version = $1
    ARGV.shift
  end
end

gem 'rake', version
load Gem.bin_path('rake', 'rake', version)

There’s that gem call again! If you notice, it’s parsing the version out from _version_, and activating that version, or the latest version if omitted (because version stays at its default of >= 0). This is just like the example from the first article.

This means if you have rake 0.9.6 and rake 10.0.0, by default, rake 10.0.0 will be run. However, if you do this:

rake _0.9.6_ my_target

0.9.6 will be run instead. The point is, the script is there to facilitate this, and gem activation in general. These notions will be important for our next part…

Why bundle exec is really really really really important for your bundled projects

Type bundle gem foo — this will create a project skeleton for a gem called foo. It will generate in the foo directory a few files, including a Gemfile, a Rakefile, and a foo.gemspec.

Let’s add something to that Rakefile. How about this at the end?

Rakefile.rb
require 'json'
p JSON::VERSION

And this to the foo.gemspec in the right spot:

foo.gemspec
spec.add_dependency 'json', '= 1.5.4'

Then gem install json to get the latest version, then bundle install.

If we type the command to get the list of tasks, rake -T, we should see something like this:

"1.7.7"
rake build    # Build foo-0.0.1.gem into the pkg directory.
rake install  # Build and install foo-0.0.1.gem into system gems.
rake release  # Create tag v0.0.1 and build and push foo-0.0.1.gem to Rubygems

What? We just told bundler to use 1.5.4! Bundler never got considered here. The tool bundle exec was created to ensure that all activations happen under the watchful eye of bundler.

Type bundle exec rake -T and see how this changes:

"1.5.6"
rake build    # Build foo-0.0.1.gem into the pkg directory.
rake install  # Build and install foo-0.0.1.gem into system gems.
rake release  # Create tag v0.0.1 and build and push foo-0.0.1.gem to Rubygems

Now, if there are conflicting gems on your machine that you want to require, or you just want to make sure you have the right version, running without bundle exec makes that possible. This is a great thing for one-off commandline tools, but not so great for applications, or projects in general. If you develop commandline tools, you should test with and without bundler to ensure the behavior in the presence of other dependencies is desired.

RubyGems 2.0 can use Gemfiles

Bundler can solve a whole host of constraint problems, but RubyGems 2.0 now considers Gemfiles as well; this actually made the above example a lot harder to do than it was before, as bundle exec is not nearly as necessary anymore. Still, to be on the safe side, you should use it for now.

Conclusion

Bundler, Bin Stubs and RubyGems all work together to create a smart system at the cost of a little cognitive dissonance — the expectation that there should be one source of truth is honored, but it is evaluated amongst many truths in relationship to its own requirements. When you don’t care, it’s great. When you do, you have this article to help you figure out what to do. :)

Stay tuned for Part 3, where we discuss packaging RubyGems with other packaging systems.

Gem Activation and You: Part I


Based on recent readings and conversations, I think the process that kicks in when someone requires a file in a ruby script (what gems are selected and, more importantly, how they’re selected) is a confusing topic for many.

The how is called Gem Activation and is a critical point for:

  • Requiring your libraries from gems
  • Covered in Part 2:
    • Using and understanding how Bundler works
    • Understanding how ruby scripts installed from gems (called “binstubs”) work
    • Why bundle exec is important for projects managed by Bundler, even for things you have elsewhere.
  • Will be covered in Part 3:
    • Why packaging independent gems with system packagers like rpm and dpkg may be more trouble than it’s worth, or a horrible idea. You pick your preferred caustic phrase here.

Depending on the level of interest in the first post and my schedule, I will try to cover all of these topics.

We will be covering RubyGems 2.0 as a baseline for most things — I will try and note when things differ in the 1.8 series especially, but be aware that I may miss a few things, and some of this behavior may be surprising to RubyGems 1.8 users.

In short, upgrade already.

Some Philosophy And Why This Is Important

It’s best to understand the tool you’re using. Let’s be honest — RubyGems is definitely not a popular platform for many, especially in the ops world. However, between RubyGems and Bundler, many (tens of thousands) of developers are able to not care about a lot of problems which plague developers of other systems such as Perl and Python, like multi-tenancy of a given library for different projects (we’re going to talk about this a lot!) and hosting binary builds on a platform-specific basis. Hi, Windows and Java users!

RubyGems Is Not Going Away

RubyGems has run by default since the first production release of the 1.9 series, 1.9.2. It is commonly used on 1.8.7, and obviously the default behavior has continued in the 2.0 series, just with RubyGems 2.0 to support it. You can turn it off with this flag on all 1.9.2 and greater rubies:

foo.sh
$ ruby --disable-gems my_program.rb

Which is a horrible idea — you’re basically throwing away the ability to run anything from rubygems.org without a lot of manual effort. This is necessary for some projects (logstash, I know, has real, unresolvable problems with rubygems inside jar files related to emulated disk performance), but these problems are more the exception than the rule.

And let’s be blunt: there have been more than a few attempts to replace RubyGems. Do you remember any of them?

Knowledge Is Power

Want to know how to package your new gem? When you should use Bundler and when you shouldn’t for your next project? How to configure your dependencies? Hopefully this will give you some insight as to how you might accomplish that.

Even if you hate (or still hate) RubyGems after reading all this, you should know your tools, damnit.

Getting Started: The Basics

A Gem Specification is embedded in each gem; it is unpacked when you install the gem and put in a special place where rubygems can refer to it later.

When you start ruby, the variable $LOAD_PATH contains all the paths for requiring items in the Standard Library, rubygems being one of those things. The parts we’re discussing will already be loaded by the time your ruby starts executing your code, unless you execute with --disable-gems as mentioned above.

When you require something, this would normally refer to Kernel.require and search this $LOAD_PATH.

RubyGems intervenes — you can actually see this here. What is happening is that it overrides Kernel.require with its own method which first considers RubyGems, then the standard library. If anything is found in gems which meets the required path and other activation requirements, the specification is activated and so are its dependencies. Otherwise, it fails with a LoadError which is a standard ruby exception class.

What this means is that the specific versions of the gems that are needed are added to the $LOAD_PATH. Again, this happens at require time, not boot.
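
Roughly, the overriding require does something like the following on every call. This is a simplified, illustrative sketch using real RubyGems API calls, not the actual RubyGems source:

# What a gem-aware require does, conceptually:
def gem_aware_require(path)
  # Does any installed gem own a file matching this require path?
  spec = Gem::Specification.find_by_path(path)

  # If so, activate the best matching version, which appends its lib dir
  # (and those of its dependencies) to $LOAD_PATH.
  spec.activate if spec && !spec.activated?

  # Then fall through to a plain require, which now finds the file,
  # or raises LoadError if nothing provides it.
  require path
end

gem_aware_require 'chef/config'   # same effect as the require in the irb session below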

Time For Some Action

A really good gem that has a fair amount of dependencies (and one I think a lot of people reading this will be using) is chef.

Go ahead and gem install chef if you don’t have it installed already. Then, start up irb and follow along (note that the first .dup statement is very important):

[22] erikh@speyside ~mine/blog% irb

irb(main):001:0> orig = $LOAD_PATH.dup
=> [ ... standard ruby stuff (a lot of it) ... ]

irb(main):002:0> require 'chef/config'
=> true

irb(main):003:0> $LOAD_PATH - orig
=> [
"/Users/erikh/.gem/ruby/2.0.0/gems/mixlib-config-1.1.2/lib",
"/Users/erikh/.gem/ruby/2.0.0/gems/mixlib-cli-1.3.0/lib",
"/Users/erikh/.gem/ruby/2.0.0/gems/mixlib-log-1.6.0/lib",
"/Users/erikh/.gem/ruby/2.0.0/gems/mixlib-authentication-1.3.0/lib",
"/Users/erikh/.gem/ruby/2.0.0/gems/mixlib-shellout-1.1.0/lib",
"/Users/erikh/.gem/ruby/2.0.0/gems/systemu-2.5.2/lib",
"/Users/erikh/.gem/ruby/2.0.0/gems/yajl-ruby-1.1.0/lib",
"/Users/erikh/.gem/ruby/2.0.0/gems/ipaddress-0.8.0/lib",
"/Users/erikh/.gem/ruby/2.0.0/gems/ohai-6.16.0/lib",
"/Users/erikh/.gem/ruby/2.0.0/gems/rest-client-1.6.7/lib",
"/Users/erikh/.gem/ruby/2.0.0/gems/net-ssh-2.6.7/lib",
"/Users/erikh/.gem/ruby/2.0.0/gems/net-ssh-multi-1.1/lib",
"/Users/erikh/.gem/ruby/2.0.0/gems/erubis-2.7.0/lib",
"/Users/erikh/.gem/ruby/2.0.0/gems/chef-11.4.4/lib"
]

If you view the dependencies for chef on rubygems.org, you’ll see these match the dependencies of chef version 11.4.4 exactly, which at the time of this writing is the latest version of chef. This is very important to remember: as we’ll see later, we can manipulate what “latest” means, which has very much to do with how Bundler works, and is also related to how activation breakage occurs.

Note that I also have multiple versions of all these gems installed:

[1] erikh@speyside mine/blog% gem list --local | grep net-ssh
net-ssh (2.6.7, 2.2.2)
net-ssh-gateway (1.2.0, 1.1.0)
net-ssh-multi (1.1)

[2] erikh@speyside mine/blog% gem list --local | grep chef
chef (11.4.4, 11.4.0, 10.24.4, 10.18.2)

As you can see, I have 4 versions of chef, a few versions of net-ssh, etc.

What Just Happened

RubyGems is runtime activated and recursive, and consults the specifications of gems to determine what else to activate. Barring any restriction, when a require happens, it activates the latest version on the system that satisfies the require. If nothing in gems matches, it falls through to Kernel.require, which may use already activated gems or the standard library.

In our case, we required chef/config, which is registered with the chef gem, and it picked version 11.4.4 because that’s the latest version on my system.

It also activated net-ssh version 2.6.7 and the other dependencies that the 11.4.4 version of chef requires. It did not require anything; it just determined what will be required when chef (or your program itself) requires those things. Should 2.6.8 of net-ssh come out, the ~> 2.6 requirement in the 11.4.4 gem means that if it were installed, it would take precedence over 2.6.7 because it’s the latest. This can change, but we’ll talk about it more later.

If we were to require 'net/ssh' above, the require would fall through to the already activated 2.6.7 version, since a net-ssh gem has already been activated.
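
You can confirm what ended up activated from the same irb session. Gem.loaded_specs is a standard RubyGems hash of activated specifications; the values here simply reflect the session above:

Gem.loaded_specs['chef'].version.to_s     # => "11.4.4"
Gem.loaded_specs['net-ssh'].version.to_s  # => "2.6.7", activated as a dependency
Gem.loaded_specs.key?('rake')             # => false; nothing has activated it yet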

Therefore, for a single require at runtime:

  • The gem is located that matches the require.
  • It is activated, meaning it is added to the $LOAD_PATH.
  • Dependencies in the gem are also activated at this time, and added to the $LOAD_PATH.
  • Kernel.require is executed now that all the things in the $LOAD_PATH that needed to be activated are.
  • Further requires for activated gems fall through to Kernel.require.

RubyGems Is A Runtime System

I know I’ve said this a few times, but I cannot express it enough — at ruby startup, nothing is activated. Only when a require is executed is anything activated, and only if something is not already activated that meets the requirement.

Multiple Dependencies On The Same Gem

Occasionally you may see something similar to this coming from your programs, usually at startup:

Unable to activate chef-10.24.4, because net-ssh-2.2.2 conflicts with net-ssh (~> 2.6)

This happens when two gems depend on the same thing, but there is a conflict on what version they depend on. In this case, it’s vagrant 1.0.7 and chef 10.24.4, and they depend on different versions of net-ssh (2.2.2 and ~> 2.6 respectively).

Let’s act this out with a little exercise that shows off our little love triangle here:

  • gem install vagrant -v 1.0.7
  • gem install chef -v 10.24.4
  • Start irb again:
[32] erikh@speyside ~mine/blog% irb

irb(main):001:0> gem 'vagrant', '1.0.7'
=> true

irb(main):002:0> gem 'chef', '10.24.4'
Gem::LoadError: Unable to activate chef-10.24.4, because net-ssh-2.2.2
                conflicts with net-ssh (~> 2.6)
   (... stack trace here ...)

So, in the rubygems toolkit there’s a Kernel-level method called gem which lets you activate things manually, and as you can see here, it simulates an activation that would break.

This is not quite the same gem as provided by Bundler. It however is very similar in goal, and worth remembering for later.

Note again that while we didn’t require anything, requiring does this before it requires the actual file. So we’ve just simulated the part that matters here, not the whole standard pipeline.
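
Outside of irb, that same Kernel-level gem method is how a standalone script can pin a version before requiring anything. A small sketch (net-ssh is just an example gem here):

#!/usr/bin/env ruby
require 'rubygems'       # implicit on 1.9+, explicit here for clarity

gem 'net-ssh', '~> 2.6'  # activate a matching version, or raise Gem::LoadError
require 'net/ssh'        # falls through to the version activated above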

The way to fix this activation is to relax the requirements or match them. I’ve done this in vagrant-fixed-ssh for my own needs, but most people will likely be happier using Chef 11 and the newer Vagrant 1.1+ series.

What Do?

We’ll cover some solutions in our next article. Thanks for reading!

The Chef Resource Run Queue


I always cringe a little when I hear the phrase “chef scripts”, largely because it’s rather incorrect and the source of much confusion from even advanced chef users. This is an especially hard notion to defuse with consumers of chef-solo because of its very non-dynamic nature. Chef’s recipes are a way of programming a queue of actions to be run, and why this matters, I hope to make apparent over the next few pages.

The Chef term for the compiled queue is the “Resource Collection”, and it ends up being processed as a linear queue with some occasional state machine tomfoolery, as we’ll see in a minute here. It’s very similar to run queues in unix kernels, especially how they relate to syscalls, with the main difference that chef is both the source of the call (the compile phase) and the executor of the request (the converge phase).

And just to be really clear, this is how it has always been in chef, and probably always will be, so this topic applies to those of you still stuck on chef 0.6, all the way up to those doing the latest hotness on Chef 11.

The compile and converge phase…

A very simple chef recipe:

recipe.rb
execute "echo he's a bad mutha..."

execute "echo shut yo mouth!" do
  only_if { ::File.exist?("/proc/mouth") }
end

execute "echo Just talkin' bout chef"

execute "echo we can dig it." do
  only_if { ::File.exist?("/proc/mouth") }
end

During the compile phase, these resources (Chef::Resource::Execute) will get arranged in a queue in order of appearance, with their default action, :run. When chef is finished compiling all the recipes (order determined by the node’s run_list), convergence happens.

During convergence, each queue item is iterated through and has its provider (in this case, the rather unsurprising Chef::Provider::Execute) applied to it with the action, after primitive predicates are checked — things like not_if and only_if, as we used up there. Presuming the predicates yield a true result and /proc/mouth exists, we will see four echo statements executed, and their output in the chef log.

So, more illustratively, here’s how this recipe turns into four echo statements:

  • recipe.rb is evaluated by chef-client or chef-solo
  • when each execute statement is encountered:
    • a Chef::Resource::Execute object is created.
      • this object has defaults applied to it, such as the :run action and the Chef::Provider::Execute provider
      • the body of the statement (the bits between do/end) is applied to the resource via some ruby evaluation magic called instance_eval
    • This object is then added to the end of the ResourceCollection’s queue.
  • after all recipes are evaluated, the compile phase has ended, and the convergence phase begins.
    • chef goes through the ResourceCollection and evaluates each resource in the queue, shifting it off as it encounters it.
      • chef determines if it can apply the provider to the resource by checking the action, and built-in predicates like not_if and only_if.
      • presuming it can apply it, it executes the action’s method in the provider, and the provider communicates back by altering the resource’s state if it did anything. This method is called updated_by_last_action and you’ll want to use it in LWRPs if you’re a good citizen (there’s a small sketch of this just after this list).
      • at this point, any notifications or subscriptions are processed if the resource was told by the provider anything was changed.
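
To make the updated_by_last_action handshake concrete, here’s a minimal sketch of a hypothetical LWRP provider action (the resource and its path/content attributes are made up for illustration, not taken from any real cookbook):

providers/marker.rb
action :create do
  # Only touch the filesystem if the file is missing.
  unless ::File.exist?(new_resource.path)
    ::File.open(new_resource.path, "w") { |f| f.write(new_resource.content) }

    # Tell Chef something actually changed, so notifications and
    # subscriptions pointed at this resource will fire.
    new_resource.updated_by_last_action(true)
  end
end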

The important parts

  • After the compile phase, the recipe no longer matters. It’s not even consulted and actually doesn’t even have to exist any more.
  • This is not a script in the traditional sense — the execution is not top to bottom, it’s two phase, and the second phase is largely responsible for what is executed, and the order it’s executed in.

Lisp vs. C macros, a digression.

You may already be familiar with how C and Lisp macros work and how they are different from each other. I’ll relate these to recipe compilation in a second, but first an explanation is needed.

Before any compiler runs, C executes cpp, or the C pre-processor, (or a derivation thereof depending on the compiler suite) to process macros. It then takes the pre-processed output and compiles that instead.

Example:

macros.c
#include <stdio.h>

#define PRINT_COOL_STUFF(x) printf("not %s again!", (x))

int main(int argc, char **argv)
{
  PRINT_COOL_STUFF("meatloaf");
  return(0);
}

After running through the C pre-processor:

macros-pp.c
/* HUUUUUUGE block of shit that stdio.h put here that doesn't matter */
int main(int argc, char **argv)
{
  printf("not %s again!", ("meatloaf"));
  return(0);
}

Aside, not kidding about that huge block of shit — stdio.h includes other stuff, has its own macros, sometimes even printf is a macro!

The important point though is that C always deals with the result of the C pre-processor, and no C is compiled until the C pre-processor is done. The pre-processor language is its own, non-C thing and has its own quirks and ultimately ends up being a text replacement, the result of which is compiled.

Lisp macros are different. Lisp, the language, is more or less the syntax tree — while C will be reduced to a parsed form and then manipulated, there is no separate parsed form of Lisp in the same sense, because what you type into your editor essentially is the parsed form.

Lisp macros are the manipulation of the syntax tree itself, not a text file. Lisp macros are also just lisp with some special additional syntax. They also happen at compile time, but have very different implications.

Example (forgive me for form here, it’s been a while since I seriously lisped):

macro.cl
(defmacro print_cool_stuff (x)
  (let ((y (concat x " is cool!")))
    `(print ,y)))
(print_cool_stuff "chef")

The result is not very surprising, but how it gets there is a lot different:

macro-unwound.cl
(print "chef is cool!")

At compile time, it actually executed the lisp expression:

(concat "chef" " is cool!")

and yielded another lisp form with that substituted, and that is executed. No variables being passed around, no text files being edited. This is what the compiler gets to see, not us.

If we had run (print_cool_stuff (concat "three" "cool" "things")), the compiler would see this:

(print "three cool things is cool!")

That’s because this happened: (concat (concat "three" "cool" "things") " is cool!")

So, when you see a sweaty CS dork raving about lisp and how it’s the best language ever designed, this is usually what they’re excited about. Lisp macros really don’t exist in many other systems, because they are almost impossible to do without bringing in the code-as-syntax-tree feature of lisp. Ruby just happens to emulate them pretty well due to some properties it has as a language.

Recipes are a DSL for adding Resources to a Queue

Just like with the lisp macro above, your recipe is compiled, and the result put into the queue. Convergence could be rewritten as “evaluating and re-compiling the queue” and not be that far from the truth, but we’ll discuss that in a minute.

Here’s an example of an abuse of this feature in recipes:

looper-recipe.rb
(0..10).each do |x|
  if x % 2 == 0
    execute "echo #{x}"
  end
end

What will end up existing in the queue is:

  • echo 0
  • echo 2
  • echo 4

… and so on up to 10. The odd execute resources were never created, and thus, never added.

This gets less obvious when you use a node attribute:

looper-recipe.rb
(0..10).each do |x|
  if x % 2 == node["echo_modulo"]
    execute "echo #{x}"
  end
end

What gets added here? It depends entirely on the value of node["echo_modulo"] at this point. While this example may seem trivial, consider something that uses a case over node["platform"] and how that might affect the queue.
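
For instance, a recipe fragment like this hypothetical one compiles a completely different set of package resources into the collection depending on node["platform"]:

platform-recipe.rb
case node["platform"]
when "debian", "ubuntu"
  package "apache2"   # only this resource ever enters the collection here
else
  package "httpd"     # ...or only this one, on everything else
end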

Chef’s dirty secret: Convergence is also Compilation

Notifications and Subscriptions are great examples of a secondary compilation step that happens during convergence.

This recipe:

resource_notifications.rb
execute "echo foo" do
  action :nothing
end

execute "echo bar" do
  action :nothing
  subscribes :run, "execute[echo foo]", :delayed
  notifies :run, "execute[echo quux]", :immediately
end

execute "echo quux" do
  action :nothing
  notifies :run, "execute[echo baz]", :immediately
end

execute "echo baz" do
  notifies :run, "execute[echo foo]", :immediately
end

Looks like this when run:

baz
foo
bar
quux

But after compile, the queue looks like this and is evaluated in this order:

foo
bar
quux
baz

What happened here? Two things:

  • An item in the queue is a constructed object with two parts:
    • The resource
    • The action
  • notifies and subscribes modify the queue, and the position is dependent on the third argument.
    • :delayed adds to the end of the queue, so it is the last thing executed.
    • :immediately adds to the head of the queue, so it is the next thing executed.

The first bit there is really important — if it were not the case, the queue would have no idea that echo baz had already run, and would run it again as soon as it was notified to do so from echo bar.

Additionally, we learn here that the ResourceCollection is not only a queue of things to act upon, but also a registry of resources (and their states) that can be referred to later, at convergence time, with the result being more compilation of the queue.

Ok, so what can I do with all this?

Anything you want, really. The dynamic nature of recipes is what makes them more powerful than puppet manifests or ansible playbooks; otherwise, they are not that much different.

I have a silly project I banged out over an afternoon that takes it to its logical absurd extreme: Tyler Perry’s Chefception. It’s a REST service that stores JSON blobs in a sqlite database, then uses a provider to evaluate those into resources and run them as a part of the run queue. The idea being that you could do ad hoc resource management with curl, or your favorite HTTP library, without having to care too much about cookbooks. Of course, this is probably a pretty horrible idea and shouldn’t be used by anyone for anything, but it was fun to write.

There’s a lot you can do, but just remember that a recipe and script are two very different things!

Chef-Workflow 0.2.0 Released


I’m pleased to announce the next major release of Chef-Workflow, 0.2.0.

Chef-Workflow is a toolkit for unifying your infrastructure management with Chef. It aims to provide the tools you need to define your own workflow and testing needs with powerful, sensible defaults, with minimal dogma. The system is built as an “advanced tool for people with advanced needs”, giving you what you need to coordinate an operations team around a set of in-house formal practices.

It’s split into 2 major components with a unifying toolkit library.

  • Chef-Workflow Tasklib is a suite of rake tasks and supporting rake toolkit to do common things you’d need to do with a chef server. The tasks are malleable so you can compose your own method of working within your team, and bring in optional included tasks to enhance your workflow, already integrated with the system.

  • Chef-Workflow Testlib is an integration testing system that fully orchestrates a network of chef managed machines and allows you to test interactions between built systems.

All components are in our github organization, and it is strongly recommended you read the documentation on the wiki.

Changes

Version 0.2.0 brings several improvements. If you want extended detail, changelogs have been added to each of the repositories.

  • Copious documentation and an example repository for those interested in getting their feet wet.
  • State management has been rebuilt from scratch to improve reliability and minimize code complexity. If you’re using the system currently, you will want to clean your testing environment with bundle exec rake chef:clean before migrating.
  • Writing tests is now provisioner independent. Machine provisioning is fully controlled by a configure_general option called machine_provisioner, and all integration tests now inherit from MiniTest::Unit::ProvisionedTestCase. The existing classes are still there, but will disappear at some point.
  • New tasks chef:build and chef:converge allow you to create and interact with one-off machines, which are fully integrated into the state management system (clean them up with a single command, depend on them in tests, etc).
  • chef:info is a task namespace for interrogating the state database and configuration.
  • More refactors than a hackathon full of Grady Booch clones.

Problems

Unfortunately, Chef 11 support is not available yet. 0.2.0 was nearly two months in the making and Chef 11 support was not as critical when these changes were started.

Additionally, support for Chef 10.20+ is not available. It will not bundle with other dependencies, unfortunately, and that makes it impossible to use — there is no technical limitation beyond that. Similar projects are having similar issues related to conflicting dependencies, and I don’t think it’s worth the drama to detail them fully. What I will say is that I spent most of yesterday trying to resolve them before I released, and didn’t succeed.

What I can say with a high degree of confidence is that resolving both of these issues is a first-class roadmap milestone for 0.3, which I don’t think will take terribly long.

Additionally, if you are looking for RHEL/CentOS support, you may need to wait a week or so, but patches supplied to Fletcher Nichol’s knife-server project (which is what we use to build chef servers) are in the pipe, and we’re shaking out any final bugs before we release it. Huge props to Fletcher and Dan Ryan for making this a reality.

The Future

Version 0.3.0 will primarily continue to build on code infrastructure, making the underpinnings even more rock solid. Chef 11 support is a priority, and making provisioning lightning fast, rock solid and easy to extend is a principal goal.

Feedback

Tell me what you love (or hate) about the system, documentation, etc. I really do want to hear what you have to say. Github issues works, comments here work, or you can email me.

Integration Test Your Systems With Chef Workflow


At the Opscode Summit in October, there was a lot of talk of testing. One thing I wanted to discuss at length was the concept of Integration Testing. A lot of energy and words have been spent on the testing of individual components, or Unit Testing which is great — these conversations needed to happen. Another discussion has been about Acceptance Testing and its relation to Integration Testing.

Altogether these were healthy discussions, but I do feel that quite a bit of focus was spent on “omg! testing!” without actually expending focus on “what is testing accomplishing for us?” There are a lot of words out there on what kind of testing is useful and what is not.

While these things were briefly discussed, I don’t think they were adequately explored; after all, as Chef users we’re happy to have anything right now, and as tools mature and our expectations refine, I think we’ll have a better idea of what fits better in the general case. That’s not a knock on any tool, anyone or any thing. I’d say as a group we’re a lot better off than other groups of a similar nature. I just think more exploration, especially with regard to how we view testing, is important.

I started working on a project at the time which implemented a workflow (which will be the subject of another post) and an integration testing system. You can find all the products here, but this article is largely about why integration testing is more important than we give it credit for.

I like to know why I’m picking my tools and what problems they solve. Therefore, I’m going to spend a little time explaining how I see these three testing methods and how they relate to operations, and then go over some current solutions and find the holes in them.

Unit Testing

The nice thing about Chef is that, due to its framework-y nature, we have our units spelled out for us already. Namely, the cookbook. Test Kitchen is great for this in the Open Source context — it’s designed from the start to run your tests on numerous platforms, which is exactly what you want when writing cookbooks you plan for others outside your organization to use. This is pretty much a solved problem thanks to that. For your internal organization, things like minitest-chef-handler go a long way towards helping you test cookbooks.

Acceptance Testing

Acceptance Testing is asking the question “does this work?” from an end-user perspective. I view this as the equivalent of your superiors, say the head of the company or maybe your division, asking that question. It’s an awfully important question to ask, which is why there are so many solutions for it already.

Nagios is an Acceptance Testing system, as is Sensu and Zabbix and other monitoring systems. They ask this question sometimes hundreds of times a minute. From an operations perspective, “does this work?” is functionally equivalent to “is it running?” — acceptance testing outside of that is probably best left to the people who developed your software you administer.

Integration Testing

So, Unit Testing is the cookbooks and Acceptance Testing covers the externals of what you maintain. What’s left?

Here’s a few things you might have done with Chef and watched blow up in your face:

  • Configured any kind of replication with chef, really any kind of replication at all
  • Made assumptions about how networked machines interoperate with each other in the wild
  • Made assumptions about how machines that are working properly interact with machines that aren’t working properly

What do all these things have in common? They all are things we might automate with Chef, but they are things that are not necessarily external and they certainly aren’t functions of the unit. In the real world, nobody runs a single recipe on a machine.

The Real World

In the real world, we:

  • Run multiple recipes on a single server to configure it cohesively
  • Expect the functions of a server to play nicely with other servers on the network
  • Expect the network to be a composite of servers that work together to provide a set of services
  • Expect the services to work

All I’m really saying in this entire article is that #2 and #3 above aren’t really accounted for.

Unit Testing isn’t always a solution

Unit Testing is great and solves some real problems, but unfortunately we spend too little time on determining what problems Unit Testing solves that Chef doesn’t solve already. Chef, by its very nature as a configuration enforcement framework, is really just a big old fat test suite for server installs.

Consider this example. While you may think it’s contrived, more than a few Unit Tests in the wild do things like this — in fact, I assert most of them do at some level.

Here’s an example of a file resource being created:

recipe.rb
file "/tmp/some_file" do
  content "woot"
  user "erikh"
  group "erikh"
end

And here’s our unit test:

test.rb
def test_file
  assert(File.exist?("/tmp/some_file"))
end

What happens if Chef can’t create /tmp/some_file? I think we all know — ironically enough, Chef aborts. The test suite itself never actually runs because Chef didn’t finish.

This is duplicating effort and I really, really, really, really hate duplicating effort. Computers are supposed to save time, not waste it. And if this were a shell script that ran (set -e aside) and then we ran this test suite, it might make sense. But we’re using Chef, which actually tests that it succeeded and explodes if it doesn’t.

It’s also why I think it’s great for open source cookbooks — it allows me, the end-user to assert that you, the open source author, at least made some attempt to define what your cookbook should do. When something changes in your cookbook, related to patches you get or changes you’ve made, you’re aware that you’ve broken the contract and I can be a responsible person and verify you have done so before deploying the latest and greatest and dealing with consequences.

For example: what happens if the contents of the file aren’t woot? Well, chef overwrites it with woot. Determining whether or not you intended that to change is probably a better use of your time.
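
For completeness, here is what checking the content (rather than bare existence) would look like, in the same style as the test above; note that it still just re-verifies something Chef already enforces on every run:

def test_file_content
  assert_equal("woot", File.read("/tmp/some_file"))
end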

However, is it as good for internal systems? I’m not so sure. We’re not bound by responsibility to anyone but our stakeholders, and the fact is that most of our stakeholders in internal systems don’t really care about how the systems are set up, because that’s why they hired us in the first place. Changing things is what we do, and changing things in two places to satisfy some notion of “testability”, when the only thing we’re accomplishing is ensuring Chef does what it says on the label, is probably not a full solution.

To put it another way, in a system where change is free and constant, unit tests have less purpose with an already validating system like Chef to support it. Either your shit was set up properly, or it wasn’t, and you can actually do almost all of that in a Chef run without writing any additional unit tests. I am just suggesting the usefulness of it is diminished is all, not that it’s a bad idea.

Running everything on one box isn’t a solution

Other testing tools have you doing the equivalent of integration testing by converging tons of cookbooks on a single system and then running a test suite against the result. This just doesn’t reflect reality. Unless your entire network is a single server in a closet somewhere, your reality never, ever, ever, works like this.

I think it’s safe to say that the reason most people start using Chef in the first place is because they have more than one server to manage. There are tons of problems you can find just by testing network interoperation that aren’t even possible to test this way, or if possible, require severe convolution (like moving a standard service to a different port so you can use two of them on the same box) which doesn’t reflect reality either.

While this is a callous way to describe it, I’m going to call this “Bullshit Testing”, because in many ways you’re testing bullshit you’d never do in production, and bullshitting yourself with the results you get back from your test suite. Bullshit simultaneously does not solve the problem and largely exists to instill confidence, as a Princeton Professor once said. Good book, btw.

Integration Testing finds lots of things

And here are some concrete examples! These are all actual bugs I fixed while developing the testing suite, that I don’t think would have been possible to find with the systems that already exist. Our monitoring system would have found all of these though, long after they had become a problem.

Our BIND cookbook was misconfiguring the slaves’ ability to accept updates from the master, which accepts most of its updates from nsupdate and similar tooling. A test which checks both the master and the slave for the DNS record failed, exposing this issue.

Our collectd tooling was using a very chatty plugin that sent metrics to graphite, which resulted in the creation of thousands of whisper databases over very short periods of time. The cookbooks that we were using for graphite throttled whisper database creation to 50/minute. A test which verified that our collectd tooling was getting the metrics to graphite exposed the issue. After some debugging, we realized the data was getting there, but it was taking upwards of 15 minutes to get out of the cache and onto disk; unacceptable for an environment that uses autoscaling. Disabling the plugin (which wasn’t necessary for us) got everything working acceptably again.

The same collectd tooling came with a python graphite shipper — when the graphite server was unavailable in certain scenarios, the python plugin would spin endlessly and write its inability to connect to syslog. Our syslog tooling writes both to the server and to remote centralized stores — which would have meant that while we were fixing a broken graphite server, there’s a high likelihood we would have ended up filling the disk on pretty much every machine on the network. A test that determines whether or not syslog is working broke and exposed the issue, because the oncoming flood of messages left the receiver so far behind that it didn’t write the message to disk in time for the test to check it.

I don’t know about you folks, but any time I can find something that’s going to break and find out before my monitoring system tells me is a win.

Did you like these ideas?

If you did, there’s great news! I’m working on a workflow and testing system which, while beta quality at the moment, attempts to meet your needs. If you don’t like the system, come help me make it better. If you don’t like how I’m doing it, or just don’t like me, but see the value of this kind of testing, please do something! We need more alternatives and approaches to this problem.

Anyhow, chef-workflow is here. It’s big, it’s vast, and it’s an advanced tool for people with advanced needs. I’ll be writing more articles about different aspects of the system as I get time, so watch this space.

Stop Using Bourne Shell


Specifically, for scripts. Feel free to use the shell all you want for your command-line goodness.

In this article I’m going to assert that not only are shell scripts harder to read and write (which is not very hard to prove), but that they don’t really win you anything in the performance department, either. Frankly, this is a side-effect of more powerful machines, and more complicated shells and shell subsystems than anything; don’t throw your interactive bash functions away just yet. Similar criticisms could be made about make, but that’s a rant that’s been driven into the ground already.

And I’m just going to get it out right now, for those of you reading that still think a pair of suspenders and a giant beard are still in fashion: C Shell doesn’t solve the problem either.

Been Dazed and Confused For So Long

Some of the earliest criticisms I ran into about the bourne shell were from the Unix-Haters Handbook. There was no shortage of examples, which I strongly recommend you read if you haven’t.

Bringing in our own examples, let’s start small. Here’s a great way to copy foo to bar while being completely confused as to how you managed to do so:

cp.sh
#!/bin/sh
cat bar foo >bar

Which works because bar is overwritten (to a zero-size file) before cat executes, which then concatenates both files, bar and foo, but because bar is now empty, only the contents of foo end up in stdout, which then get sent to bar. Voila, cp!

Let’s do something more “complex”. Maybe you’d like to take all the files in a directory, loop over them and do something magical:

loop.sh
#!/bin/sh

for i in $(find . -type f)
do
  cat $i
done

And then someone does this in your directory:

trollface.sh
#!/bin/sh

echo "HA HA" >"my cat's breath smells like cat food"

And suddenly your script breaks, but does so silently. set -e to the rescue, but we haven’t really solved the problem. find with -exec would work here, which is fine until we need to do something more than run one command, which we could solve with shell functions in a completely, wonderfully indirect and ugly way. Let’s do something “straightforward” that lets us do more than one thing.

director_of_shell_engineering_at_bell_labs.sh
#!/bin/sh
set -e

filelist=$(find . -type f)

while [ "$filelist" != "" ]
do
  line=$(echo "$filelist" | head -1)
  filelist=$(echo "$filelist" | tail +2)
  ls "$line"
  cat "$line"
done

True story, I spent about 10 minutes getting this code right in a test directory. And the great thing is now, depending on the system you’re on, this doesn’t work either. Why, might you ask?

Let’s talk about what #!/bin/sh means to different systems. First off, be forewarned that most systems today use bash as the replacement for /bin/sh. Most of the stuff in here consists of builtins in bash, which means they run directly in the interpreter. In Bourne Shell, though, things like echo and [ (yes, look in your /bin directory) are not, and are actually separate binaries on your system that run in subshells.

Why does this matter? Two major reasons; one, you are writing against bash, which we’ll get into why that’s not really solving problems later, and two, if you run into a proper Bourne Shell with echo and [ invoked as real programs, those quotes in the script work completely differently.

To understand this, let’s look at what happens in this line:

iterators_in_sh_are_hard_lets_go_shopping.sh
line=$(echo "$filelist" | head -1)

First, $() is equivalent to backticks which markdown is unfortunately completely failing at letting me express. So, the innards of it get evaluated first. These variables are turned into meaningful content and sent to a subshell that runs echo, do not pass go, do not collect $200. This is partially why the quotes are there; if we stripped them, multiple lines would get represented as multiple arguments as the variable is interpolated into a string of content.

The real fun is in this line though:

omg_a_real_shell_mom.sh
while [ "$filelist" != "" ]

What happens here? Can you tell why it’s broken?

Here, I’ll give you a hint: is this a syntax error?

yep.sh
[ != ]

Obviously so, since there are no arguments on either side of the conditional. This happens in the plain case when $filelist has content (the right side is still missing) and the case where $filelist is empty, which is the predicate for terminating the while loop (both sides are empty). if statements and other bourne shell reserved words have the same problem.

This is because [ (which is a synonym for test, which is where the manual lives) is a command invoked in a subshell, but the whole command is evaluated first, then passed to /bin/[.

Feel free to try this yourself: /bin/[ "" != "" ]. This way you don’t hit the builtin, but your shell still handles the variable/quote system. You may need to escape the [ and ] if you’re on zsh.

Here’s the solution to the problem, which you might have encountered in scripts (it’s a fairly common pattern) that need to be portable:

fixed_yo.sh
while [ "x$filelist" != "x" ]

Why does this matter?

Remember folks, we just walked a list of files in an error-free and portable way. That’s it. We could talk about how globs are going to save the world and how every globbing system works differently on every shell and even between versions of shells or settings within the shell. Or we could talk about how convoluted this gets with globs and emulating additional parameters to find that select content based on predicates. We could talk about how using filenames with spaces in them is retarded and you should feel bad for letting them on to your filesystem. We could talk about this all day and probably for the rest of our lives, but the reality is I’m about to show you an easier way.

Here’s an example in ruby:

better.rb
#!/usr/bin/env ruby

Dir["**/*"].each { |file| system("ls", file) if File.file?(file) }

Here’s an example in perl:

better.pl
#!/usr/bin/env perl
use File::Find;

find(\&wanted, ".");

sub wanted { system("ls", $_) if -f $_ }

Both examples are not very great, to be frank, but have none of the problems your shell script will. They will not break on files with spaces in them. They’re also shorter, and arguably easier to read; they don’t even try to reinvent ls, although both are more than capable of doing so, and there are lots of good reasons to do it the ruby/perl way rather than calling system. They also work portably; it’s trivial to adjust these programs to run on Windows, for example, if you want. Good luck doing that in bash.

BERT ERMAGHERD PERL AND RERBY ER SER SLER AND YERS TER MUCH MERMERY

Let’s examine that for a minute. I’ve been in this profession long enough to remember working on servers where the mere suggestion of running all your init scripts against bash would probably put you in a position where you were considering touching up that old resume, “just in case”. bash is now the shell du jour on most systems and is frequently enough used for the init system, which is where you’ll find 99% of this argument lives. It does solve some of the aforementioned problems with more traditional bourne shell. bash has arrays, locals, even extended predicates for the builtin test and a million other things that really don’t make shell any easier to read, but certainly more functional. In fact, I’d posit it makes the problem worse, but going through bash and zsh’s feature set is beyond the scope of this article.

But really, and this is the real meat-and-potatoes of this argument, bash scripts don’t solve a problem in our modern environment that other tools don’t solve better for an acceptable trade of resources.

irb, which is close to a realistic baseline for a ruby program, comes in at about 10k resident. perl -de1 comes in at around 5k resident on my machine. python with enough to run os.system() comes in at 4.5k. bash comes in at 1.2k. This machine has 16GB of ram and fits on my lap. Any server that has memory issues these days is not going to have it because its supporting init tools were written in a non-shell language.

To address runtime speed with extreme hyperbole, this argument doesn’t even hold water if you were booting services on your phone. Your phone, if it was made in the last year, probably has multiple cores and a clock speed of greater than 1Ghz per core. It’ll deal more than just fine with some small perl scripts that launch programs.

As for support, go on, find me a popular linux distribution that doesn’t have a copy of perl or python installed already. Don’t worry, I’ll wait. Emphasis on popular, as anyone can build a useless linux distribution.

Statically linked /bin/sh is somewhat hard to find by default these days as well, so that argument can go do something to itself too. Besides, there’s nothing keeping you from statically linking a modern scripting language.

To put it another way, shell for init at this point is the technical embodiment of the Peter Principle. The level of effort (go on, look if you want) required to support shell as an init system is staggering on any unix system these days, and the result is hard to read and modify. Tools like FreeBSD’s rc.subr and ports are notably gory mixes of shell and make that force even the most hardened shell programmer to run away screaming in terror, which is largely why all the modern supporting tools are written in something that’s not make and not shell. Linux systems are no better: look into sysconfig and debian’s init orchestration and you’ll find it’s not that much different, and we haven’t even started on how the init scripts themselves look. Tools like autoconf are finally seeing real competition, where a key feature is abstracting the shell away from the end user.

The existing solutions are so complicated largely because these systems are compensating for limitations in shell; for example, a simple export can completely alter the way your shell program works, whether or not it’s actually expecting that to happen. Getting quoting right is a notorious pain in the ass. Parsing is a pain in the ass. This shows in the supporting tooling too: tools like /usr/bin/stat exist solely to drive shell scripts and are considerably harder to use than the calls they emulate (see the sketch below). This is long before we get into the portability of any of this, as our aforementioned savior find is one of the biggest offenders when it comes to cross-unix feature sets. Don’t even get me started on getopt.
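To make the stat point concrete, here’s a quick sketch of my own; the same information stat(1) contorts itself to expose through platform-specific format flags is a single method call away in a scripting language:

stat_demo.rb
#!/usr/bin/env ruby

# File.stat wraps stat(2) directly: no format strings, and no guessing
# whether this platform's stat(1) wants -c or -f.
st = File.stat("/etc/passwd")
puts st.mtime                # modification time as a Time object
puts st.size                 # size in bytes
printf("%o\n", st.mode)      # permission bits, in octal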

So, in conclusion, unix peoples: build a better system. I don’t care if it’s in perl, python, ruby, lua… I just want to solve problems, and I’m sure someone else has the time and interest for these arguments. I’m just tired of building “works for me” tools, or portable ones that are fragile and incomprehensible, simply because of the language choice that was forced upon me to remain compliant with the system. As configuration management takes a more prominent seat in driving how we build systems, having a portable base will only get more important, and shell has simply risen to its level of incompetence.

Your Ops Team Is Probably Important

| Comments

I’m not exactly the voice of reason most of the time, and I’m not a big fan of blogs that pride opinions over technical content. That said, I see quite a bit of bickering, from developers and operations folk alike, about the concept of “NoOps” and its good and bad qualities. As someone who did full-time web development for over a decade and is now a bit more than two years into a full transition into operations, I see a lot of misguided accusations over the expectations that devs and ops teams put on each other. It’s my hope to bridge those perceptions in this article, based on the contrasts I’ve discovered myself and the changes that have occurred in recent years.

It’s ok to not have an Ops Team.

If you can get by without an ops team — good for you. It’s not a requirement for a lot of modern applications, especially at small scale, and the service offerings out there are maturing at a breakneck pace; Heroku, Google App Engine, Engine Yard, and Joyent all offer compelling alternatives to having devoted operations personnel. To put it another way, if you all work at home, is there really a need for an office janitor? No.

It’s important to understand, though, what an operations team offers you. My experience in development, especially for the web, is that the product team is looking at the road ahead, while the development team is typically concerned with the here and now, save the occasional large refactor or requirement thrown downstream that requires a long-term vision of how the application(s) will be structured.

Your ops team is constantly bombarded by the present via pages, service requests from other staff, and so forth. But between those pockets of (let’s be honest here) interruptions, your ops team should be functioning as a lighthouse on the shore of the future, planning ahead and actively seeking out ways to keep your environment reliable and convenient to work in.

“NoOps” services can offer a great deal of this these days, from taking the thinking out of capacity planning and disaster recovery, to automating the provisioning of servers and so forth. It’ll be exciting to see where they take things in the future. That said, it’s not a complete solution, and removes the human element from the equation.

Your ops team provides a greater sum than answering pages and configuring servers.

Your ops people are as much a human resource as a technical one. Remember, you the developer are their customer, and the services they build to make your life easier are their product. Keeping you happy is as much a part of their job as making sure the load balancer is distributing traffic evenly.

What frequently baffles me in this new era of ops is how many developers aren’t exploiting this resource. With rare exception, here are a few technical things your ops person will probably be able to help you with:

  • Understanding the penalties and advantages of applications that fork
  • Understanding why that query that takes 0.1ms to run and returns 800,000 rows is still slow
  • Telling you that no, that extremely complicated thing you did that eats tons of resources won’t actually be a problem.

These are all things that you are very unlikely to know as a developer. I’m not saying this as hyperbole; I’m saying this from experience on both ends of the spectrum. They’re just not problems that most developers consider (or, in the last case, things developers think are a problem but actually aren’t), and your operations team lives in that world, around your systems, data, and network, so why not utilize them?

The reality is, what everyone sees as “Ops” is changing. Rapidly.

To further this point, I think it’s important to understand that both the DevOps movement and NoOps (even though a lot of operations people are unwilling to admit it) are changing how we do operations, moving it forward every day.

I started as a wet-behind-the-ears 19 year old web developer at a famous bookstore in Portland, OR. I’m 34 now. In that time I’ve seen the landscape of operations attract a lot of archetypes, and the systems have changed too. For example, I was asked not to run “bash” on the mail server at said bookstore because it used too much ram, and to use “csh” or “sh” instead. I was introduced to this thing called “ssh” after years of using “telnet”, and it took a while to understand why I had to use something different. Nowadays people get pissed when faced with a traditional bourne shell, and nobody — nobody — would question the use of ssh. I worked with CCNAs and System Administrators who, by all accounts, could easily have done my job in a heartbeat if it had been at all interesting to them.

Old man rambling aside, at some point that changed. I started walking into shops where the “system administrator” was more skilled with a crimper than a shell prompt, and where I genuinely felt that if I didn’t tell this person what the hell to do, we were screwed. I know I’m not alone in this feeling. Having the “why do we have to upgrade $x again, it’s going to take me a week to make packages of this shit / the OS vendor doesn’t support it, fuck off” argument, or getting the rare opportunity to see how something was set up and finding, if it wasn’t pasted straight from Google, a complex sea of shell scripts that made the most criticized contributors to DailyWTF seem tame by comparison. I specifically remember a qmail configuration done so poorly and haphazardly that nobody in the company — even the sysadmin who set it up — had the cojones to reboot the machine. It wasn’t an uptime thing; they genuinely didn’t think they could bring it back up.

I hated these people. I still hate them, and I’m guessing if you’re a developer with any long-term experience, you’ve probably run into these people and despise them too. Control freaks, “BOFHs”, and people generally compensating for incompetence through laziness and hand-waving.

The thing that DevOps is preaching is the antithesis of this — supporting developers to realize their needs on the systems, and architecting beautiful systems, networks, and automation to deliver on promises to the company and the engineering teams. NoOps, in my opinion, is the guerilla warfare version of DevOps, namely “I never want to work with a shitbird like that again, so I’m going to write software to render them extinct.”

Both camps have been largely successful. I’ve recently started interviewing again, and the bar 3 or 4 years ago was at my feet compared to the questions I’m being asked now. I can’t see people with that attitude and that level of technical prowess finding work anywhere in this modern climate. What’s funny is that I firmly believe those folks at the start of my career were really just what we’re getting back to now, after a long September of shitbirds ruining the party for everyone else. Now that these types aren’t needed anymore, the real fun can happen for everyone.

Perhaps we should just can this whole argument and call it NoShitBirds, and instead of focusing on roles, worry about what the non-shitbirds are adding to the company.

The Best Ruby Debugger: Kernel.p!

| Comments

So there are a lot of options out there for debugging, but a lot of people miss the simplest one, and that’s p. p is a tool for inspecting the contents of objects and printing them to standard output. This article goes into how to use p, when to use p, and how p works (so you can make it work with your own stuff!).

In short, p is frequently equal to or better than most of your silly debuggers.

When Do I Use p?

Anytime you do not know what’s happening to your objects.

  • Method call on the fritz? Use p on the receiver. Maybe you’re working with a different object that’s a duck facsimile, behaving similarly but not completely the same!
  • Don’t know what’s in that string causing the encoding error? Use p to see the literal byte stream with the offending bytes escaped in all their glory (see the sketch after this list).
  • Want to see the instance variables on your ActiveRecord models? Rails supports this (and we’ll see how in a bit) with p!
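To illustrate the encoding case, here’s a minimal sketch of my own; a string whose bytes aren’t valid in its encoding gets printed with those bytes escaped instead of as garbage characters:

p_encoding.rb
# "\xE9" is latin-1 for "é" and is not valid UTF-8 on its own, so inspect
# escapes it rather than printing a mystery character.
s = "caf\xE9"
p s
# outputs "caf\xE9"
p s.valid_encoding?
# false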

How Do I Use p?

Simple! You just put p in front of the expression you want to inspect:

p1.rb
p obj
# outputs something like #<Object:0x007fcce3848cc8>
p obj.some_method # prints the method's return value!
# same kind of output
p "here's a literal string"
# outputs "here's a literal string" (including the quotes)
p 1
# 1 (yes, that's what it outputs)

How Does It Work?

A method called Object#inspect returns the basic representation, which is what we see on the “Object” line in the previous code block.

You can call inspect directly to get that string yourself. p is actually implemented something like this:

p2.rb
module Kernel
  # simplified: the real p also accepts multiple arguments
  def p(obj)
    puts obj.inspect
    obj
  end
  module_function :p
end

You can define inspect in your own classes!

For example:

p3.rb
class Foo
  def inspect
    "#{self.object_id} - this is pretty awesome, aye"
  end
end

f = Foo.new
p f # outputs: "123123123 - this is pretty awesome, aye"

Pretty neat, huh? Speaking of pretty…

What If I Have a Big-Assed Array and/or Hash and Want to Get a Pretty-Printed Version?

Now, folks, that is a heading.

Anyhow, there is another call, just as simple! It’s called pp, and its inspect analogue is called pretty_inspect. It’s not built into ruby directly, but it is in the standard library:

p4.rb
require 'pp' # this is important!

p %w[one two three]

# ["one", "two", "three"]

pp %w[one two three]

# ["one", "two", "three"]
# (a structure this small still fits on one line; pp only wraps when it must)

pp (1..300).to_a

# [1,
#  2,
#  3,
#  ...and so on, one element per line; 300 or so lines of complete awesome,
#  and too much to print here.
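Like inspect, you can hook into the pretty printer for your own objects: pp consults a method called pretty_print if you define one, handing it a printer object to describe your layout with. Here’s a minimal sketch of my own (the Bag class is made up for illustration):

p5.rb
require 'pp'

class Bag
  def initialize(*items)
    @items = items
  end

  # pp hands us a printer; group/seplist handle wrapping and separators.
  def pretty_print(q)
    q.group(1, "Bag<", ">") do
      q.seplist(@items) { |item| q.pp item }
    end
  end
end

pp Bag.new(1, 2, 3)
# Bag<1, 2, 3>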

Happy Debugging!

Login Accounting Explained

| Comments

I put a call out for blog ideas: Devin Austin came up with “Login Accounting”. So let’s talk about that for a bit.

First off, a little disclaimer

This is not security advice; I am not a security expert, or an expert at pretty much anything. Use your head, silly.

Let’s also be clear about something

Login Accounting is pretty much a mess on unix. Programs that manage logins can “opt in” to login accounting; the system does not do this for you inherently, largely as a side effect of how the whole thing works in the first place. Most tools can be configured to write to the basic accounting systems or not, or provide the option as a runtime argument. This means that your login accounting system can lie. Additionally, the systems we’re going to look at are the first thing an intruder will mess with. We’ll look at a few techniques to mitigate the lack of information later, but rest assured there’s not much you can do to make this bulletproof.

All code examples in this article expect Ubuntu 11.10 as the platform. You will see deviation between systems, so be certain you’ve absorbed this article before trying it anywhere else.

utmp, wtmp, lastlog

These are the core systems in unix login accounting; they are append-only databases, more or less, with a system-dependent structure. You can usually read about the structure by typing man utmp or reading the /usr/include/utmp.h file. Note this will be dramatically different between Linux, FreeBSD, Mac OS X, etc.

One can navigate the utmp structure pretty simply by hand, or use the w, who and last commands to do it for you. They exist as three files:

  • /var/run/utmp is what’s currently going on.
  • /var/log/wtmp is what’s happened in the past.
  • /var/log/lastlog is the most recent record for each user (e.g., a specific user’s last login)

Anyhow, let’s have some fun. As for navigating the structure, while it’s system-dependent, it’s really easy. Here’s a small program that walks utmp and sends the pty and username to figlet for every user-related entry. apt-get install build-essential figlet, then gcc -std=c99 -o fig_utmp fig_utmp.c to use it.

fig_utmp.c
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <utmp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  char buf[1024];
  struct utmp ut;

  int fd = open("/run/utmp", O_RDONLY); /* open(2) wants a flag here, not a stdio-style mode string */

  if (fd < 0) {
    perror("could not open file");
    exit(1);
  }

  while(read(fd, &ut, sizeof(struct utmp)) == sizeof(struct utmp)) {
    if(ut.ut_type == USER_PROCESS) {
      snprintf(buf, 1024, "figlet %s - %s", ut.ut_line, ut.ut_user);
      system(buf);
    }
  }

  close(fd);
  return(0);
}

It outputs something like this (I’m holding two logins to the box):

       _          ___                     _ _    _     
 _ __ | |_ ___   / / |           ___ _ __(_) | _| |__  
| '_ \| __/ __| / /| |  _____   / _ \ '__| | |/ / '_ \ 
| |_) | |_\__ \/ / | | |_____| |  __/ |  | |   <| | | |
| .__/ \__|___/_/  |_|          \___|_|  |_|_|\_\_| |_|
|_|                                                    
       _          _____                      _ _    _     
 _ __ | |_ ___   / / _ \            ___ _ __(_) | _| |__  
| '_ \| __/ __| / / | | |  _____   / _ \ '__| | |/ / '_ \ 
| |_) | |_\__ \/ /| |_| | |_____| |  __/ |  | |   <| | | |
| .__/ \__|___/_/  \___/           \___|_|  |_|_|\_\_| |_|
|_|                

Try modifying it to output the hostname as well! (Hint: the struct member is called ut_host)

utmp carries a lot more than just user logins, though; it’s responsible for recording most of the events that happen at a system level. For example, here’s some last output:

erikh@utmptest:~$ last
erikh    pts/0        speyside.local   Sun Mar 25 10:18 - 10:18  (00:00)    
erikh    pts/0        speyside.local   Sun Mar 25 09:17 - 10:06  (00:48)    
reboot   system boot  3.0.0-16-server  Sun Mar 25 09:13 - 10:22  (01:08)    
erikh    pts/1        speyside.local   Sun Mar 25 09:13 - crash  (00:00)    
erikh    pts/0        speyside.local   Sun Mar 25 09:12 - down   (00:00)    
reboot   system boot  3.0.0-16-server  Sun Mar 25 09:11 - 09:13  (00:01)    
reboot   system boot  3.0.0-16-server  Thu Mar 22 00:35 - 00:53  (00:18)    
reboot   system boot  3.0.0-16-server  Thu Mar 22 00:32 - 00:33  (00:01)    
erikh    pts/0        speyside.local   Thu Mar 22 00:31 - down   (00:00)    
erikh    tty1                          Thu Mar 22 00:27 - down   (00:04)    
erikh    tty1                          Thu Mar 22 00:27 - 00:27  (00:00)    
reboot   system boot  3.0.0-16-server  Thu Mar 22 00:20 - 00:31  (00:11)    
erikh    tty1                          Thu Mar 22 00:16 - down   (00:03)    
erikh    tty1                          Thu Mar 22 00:16 - 00:16  (00:00)    
reboot   system boot  3.0.0-12-server  Thu Mar 22 00:16 - 00:20  (00:03)    

Notice all the reboots in there? This is why we filter on USER_PROCESS above: the ut_type field distinguishes a lot more event types than the ones we care about. Anyhow, this is explained better in man utmp, so go read that. There is also the POSIX utmpx, which isn’t really any more consistent across different unices than utmp is.

So, about this lossy login accounting issue…

What to do about it? There are really two options:

  • Make sure your things are logging utmp entries.
  • Use something else, like log scanning.

In reality only one of these is a serious choice. There are other things, like auditd and PAM controls, that can assist here, but not much really. Log scanning and having tight control over how users get into your systems is the way to go. Since log scanning is a deep enough topic for its own article, we’ll cover it separately; a small taste follows below. Stay tuned.
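As that taste, here’s a minimal Ruby sketch of my own. It assumes Ubuntu’s /var/log/auth.log and the stock openssh log format, and simply pulls out sshd’s successful logins, which is exactly the kind of information utmp may or may not have bothered to record for you:

scan_logins.rb
#!/usr/bin/env ruby

# Print accepted sshd logins from the syslog auth facility. The path and
# the message format are Ubuntu/openssh specific; adjust for your platform.
File.foreach("/var/log/auth.log") do |line|
  next unless line =~ /sshd\[\d+\]: Accepted (\S+) for (\S+) from (\S+)/
  method, user, host = $1, $2, $3
  puts "#{user} logged in from #{host} via #{method}"
end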

Conclusion

The utmp system is typically relied on for a lot more than it should be; it’s inconclusive and generally flawed, especially for non-interactive … interaction.