my ideal package manager

:: technical, computers, package manager

So my last post was a simple introduction to what package managers are, and some problems they solve. Here I will put forth my thoughts for the ideal package manager. This will be a much more technical post.

My previous post explained that package managers help you manage updates, get software from trusted sources, and more. Another great feature that package managers may have is giving you a snapshot of the state of your operating system files. This can help ensure that your system remains working and stable even if there are errors during the package update process. Below I will explain features of IPM (the Ideal Package Manager) and how they provide these (and more) benefits.

One of my biggest beefs with current package managers is pre/post install/update/remove scripts. They are an abomination. They can arbitrarily do anything to your system since they are run as root, making them a security problem if you can’t fully trust your packagers. Even if your packagers are angelically benign, they are still human and make mistakes. Files controlled by normal package manager processes are put in a pristine state: you can be sure they are the same bits that were originally packaged. But scripts let packages mutate files without leaving a way to go back (unless you have support for this in the filesystem itself). I have personally been very frustrated by problems caused by package scripts and willy-nilly file modification without preserving history or save-points, so this is a pretty big deal to me.

My solution to scripts is to have a multi-stage package manager. Most package managers have some sort of recipe file for building packages (e.g. Arch Linux PKGBUILD files). These essentially say where to get the source code from, how to validate that the source is correct, how to build the package, and which dependencies you need to build or use the package. Alternatively, some systems have source packages that are essentially the same, but with the source code included. These can be seen as 2-stage package systems. My plan extends this to an arbitrary number of stages. For example, stage 0 may be like a PKGBUILD: a pointer to where the source lives, and instructions to build it. Stage 1 would be like a source package: source code included, but not yet built (this may be the final stage for data packages or dynamic language packages like Python or JavaScript code). Stage 2 would be built, which would be the final stage for most other packages. Stage 3 would be a configured package, and this would take the place of scripts. For example, most Arch Linux packages have no install scripts (which is one of the primary reasons I use Arch Linux), but the Linux kernel does. The reason? While the kernel in the package is compiled, it needs an initramfs file which is configured according to local settings and probably feature detection on the hardware. A package complete with this initramfs can’t be distributed, since it won’t work on different machines the way the kernel itself will, which is why it is generated by a script. But the initramfs is clearly an important file that deserves to be controlled by the package manager, to be tracked and rolled back with the kernel if need be.
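To make the stage idea a bit more concrete, here is a minimal sketch in Python. Everything in it (the `Stage` names, the `Package` fields, the `advance` helper) is made up for illustration; it only models the bookkeeping of moving a package from one stage to the next, not any real building or configuring.

```python
from dataclasses import dataclass
from enum import IntEnum


class Stage(IntEnum):
    RECIPE = 0      # pointer to the source plus build instructions
    SOURCE = 1      # source code included, not yet built
    BUILT = 2       # compiled artifacts
    CONFIGURED = 3  # machine-specific files generated (e.g. an initramfs)


@dataclass(frozen=True)
class Package:
    name: str
    version: str
    stage: Stage
    final_stage: Stage   # where this particular package stops
    files: tuple = ()

    def is_installable(self) -> bool:
        return self.stage == self.final_stage


def advance(pkg: Package, new_files: tuple = ()) -> Package:
    """Return the package at its next stage; the input package is left untouched."""
    if pkg.is_installable():
        raise ValueError(f"{pkg.name} is already at its final stage")
    return Package(pkg.name, pkg.version, Stage(pkg.stage + 1),
                   pkg.final_stage, pkg.files + new_files)
```

Because `advance` returns a new object instead of mutating the old one, every earlier stage is still around afterwards, which is the property that makes auditing and rolling back possible.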

Additionally, multi-stage packages give more options for distribution. You could distribute stage 0 packages for a ports-like build system (a la FreeBSD, Gentoo, Arch Build System). People could download the whole repository of stage 0 packages in little space and the packager would have no further bandwidth concerns. Stage 1 package hosting would provide similar benefits, but shift the hosting of the source to the packager. Stage 2 packages (i.e. compiled) would be what most people get today. Further stages probably can’t be distributed normally, since they probably require specialized configuration on the specific machine, but sysadmins who manage lots of identical machines may distribute these further configured packages as well. Whether or not you distribute each stage of a package, you can optionally keep the stages generated so far for auditing, roll-back, or other concerns.

For this to work, each package stage needs to include the instructions for each later stage (i.e. what dependencies there are to build/download/configure that stage). It may even be advantageous for packages to contain their previous stages as well (e.g. to later audit the build steps, or see any manual configuration that was done at any step). Users of ports-like systems are aware that many packages have various compile-time configuration options. This multi-stage system would allow you to systematically produce artifacts from each stage that can be configured further until the final stage. Most users will probably take the default options everywhere, but occasionally some users will enjoy configuring at stage 0 or 1, yet still getting a package in the end (rather than just depending on build scripts to copy into /bin, /etc, /usr, and so on).

In my opinion root-owned files fall into two categories: package-managed files owned by a single specific package (which may be a dependency for several others), and user-made configuration. Current package managers tend to let multiple packages muck up certain files that they both claim to need. I think a better way to handle files that multiple packages want to change is to break that file out into a separate package. Here one package recipe/stage could generate two separate packages on the next stage, to keep this configuration separated. The next package that needs to modify the same file can once again copy the file, make a new version, and package it up. This would complicate certain packages, since new versions of them would be generated by multiple recipes, but it would be less complicated than not having real package management for those files. With package management, we can roll back these files without needing a special file system that will roll everything back, and we can audit the trail of edits. One technical detail here is that I think packages should be stamped with both a version and a build date, so that generated packages can optionally declare dependencies both on a version number of a package and on a build date, to be sure of using a dependency made recently enough to have the configuration they need.
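The version-plus-build-date stamp could be checked roughly like this. It is only a sketch: `Stamp` and `Requirement` are hypothetical names, and versions are assumed to be simple tuples of integers, which real version schemes often are not.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Stamp:
    version: tuple   # assumed to be a tuple of ints, e.g. (1, 4, 2)
    built: date      # when this particular package file was generated


@dataclass(frozen=True)
class Requirement:
    min_version: tuple
    built_on_or_after: Optional[date] = None   # None means "any build date"

    def satisfied_by(self, stamp: Stamp) -> bool:
        if stamp.version < self.min_version:
            return False
        if self.built_on_or_after is not None and stamp.built < self.built_on_or_after:
            return False
        return True
```

The build date matters for the shared-file packages described above: two builds can carry the same version number while only the later one contains the edit some other recipe made.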

Another good property of multi-stage packages and breaking shared files into their own oft-updated packages is that every stage can be built as a normal non-root user. The reason install scripts need to run as root is that they modify root-owned files. But if we can copy them and modify our own copies instead, it means that a normal user could build these packages in a fakeroot environment, the way AUR packages are built (i.e. before any install scripts need to be run). This turns those potentially nasty scripts that can do anything into scripts that can only modify files owned by the build user! They can still hose any files in “shared” packages, but you will be able to see which recipe trashed that file, as well as being able to roll back.

Also, package dependencies can be broken out with more granularity with respect to which phase of building or use the dependencies are needed in. You may need git or download tools to go from a recipe to a source package (phase 0 to 1), but you don’t need them to build. Permissions could even be added so that unless a package declares a dependency on a network connection (i.e. to get the source), no network requests can be made. That last bit may not be terribly useful, but if you’re paranoid about security it could be a great feature.
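Per-phase dependencies might be declared and queried something like the following. The `RECIPE_DEPS` layout and the package names in it are invented for this example; the point is only that dependencies hang off a stage transition rather than off the package as a whole, and that network access is just another thing a transition has to ask for.

```python
# Hypothetical recipe metadata: dependencies are keyed by the stage
# transition that needs them, e.g. (0, 1) means "recipe -> source package".
RECIPE_DEPS = {
    (0, 1): {"packages": {"git"}, "network": True},
    (1, 2): {"packages": {"gcc", "make"}, "network": False},
    (2, 3): {"packages": set(), "network": False},
}


def needs_for(deps, from_stage, to_stage):
    """Collect the packages and network permission needed to go between two stages."""
    packages, network = set(), False
    for stage in range(from_stage, to_stage):
        step = deps.get((stage, stage + 1), {})
        packages |= step.get("packages", set())
        network = network or step.get("network", False)
    return packages, network
```

Someone who downloads a stage 1 package and builds it locally would call `needs_for(RECIPE_DEPS, 1, 3)` and never have git pulled in or the network opened.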

Since I’ve mentioned building as a non-root user, let’s head to another feature I would like: package managing as a non-root user! Not managing the core OS files, mind you, but when you are a user without root privileges, your options for installing software tend to be… well… build by hand and manage updates… tediously. It would be great to be able to install packages into a subtree other than root (e.g. $HOME/packageroot/) so that individual users can leverage package management for their own stuff.

Of course, the need to modify files is evidence of some poorly planned software. It means there is some configuration that multiple packages want to configure that has to be a single file. It is much better in these cases to have a directory that can contain snippets to concatenate, giving opportunity for both packages and the user to make modifications without trampling on each other. I don’t know of any configuration file, registry, or database that packages need to modify that is very big at all or that can’t be re-generated from a directory of pieces when a program runs, or at boot time, or some such thing. So with a little smart development (the simplest way being that your configuration file can source other files), we can avoid this whole problem. Of course, some operating systems have this problem built in at a very core level, so that every install script (since said OS eschews the wise ways of package management) needs to edit a global registry. This is clearly bad, and software developers should seriously avoid these situations.
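Regenerating one file from a directory of pieces takes very little code. A minimal sketch, assuming snippets end in `.conf` and that sorting by file name gives the intended order (the numeric-prefix naming in the usage note is a common convention, not something this post prescribes):

```python
from pathlib import Path


def assemble_config(snippet_dir: Path) -> str:
    """Concatenate every *.conf snippet, in file-name order, into one config text."""
    parts = []
    for snippet in sorted(snippet_dir.glob("*.conf")):
        # Record where each piece came from so a bad line can be traced back.
        parts.append(f"# from {snippet.name}\n{snippet.read_text().rstrip()}\n")
    return "\n".join(parts)
```

With files named like `10-somepackage.conf` and `90-user.conf`, each package owns exactly one snippet, the user owns another, and nobody has to edit anybody else's file.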

On a lighter note, and back to package managing rather than complaining about crappy software, another thing scripts commonly do in packages is give messages. This is pretty innocuous. Packages can have some sort of field for messages to be printed/logged. That’s not so bad. You can give messages without allowing for arbitrary code execution.

Now that we have our multi-stage packages, let’s talk about where the files should go. Not in /bin, /etc, /usr, etc. The package manager should have its own directory, maybe /ipm (for our imaginary Ideal Package Manager). Inside there it can have directories where each package is unpacked, eg. /ipm/installed/packagename/version/date/. Then /bin/ls will be a symlink pointing to /ipm/current/bin/ls. /ipm/current, however, is also a symlink to /ipm/snapshot/some-time-stamp. Finally, that directory has a mirror of the root directory structure with symlinks to the actual files in their respective unpacked places (in /ipm/installed). Now this may seem like a lot of redirection, but it will be completely transparent — most package managers already use a number of symlinks for versions of programs and libraries. This, however, increases the level of abstraction to where we get even more nice qualities. Upon installing, packages are first put in their versioned/dated places, then linked into a snapshot directory, then finally the /ipm/current symlink is switched when everything else is in place. If a package update is interrupted by power failure, lack of disk space, etc, all the old files will still be there, and the symlinks will still point to valid files and configuration. The only vulnerability would be a failure in the middle of switching one symlink. It’s a pretty small vulnerability, and if it by chance did go wrong and there were a system failure between removing and re-adding the /ipm/current symlink, it would be a very easy fix. Additionally, if we find our new system unsatisfactory, we could roll back our entire system to its previous state by changing one symlink.
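Here is a rough sketch of the snapshot-and-switch step. The function names and the `current.new` temporary name are my own inventions, and `ipm` is assumed to be an absolute path. One detail worth noting: on POSIX systems a rename replaces its destination in a single step, so creating the new symlink under a temporary name and renaming it over the old one avoids ever having a moment with no `current` link at all.

```python
import os
from pathlib import Path


def build_snapshot(ipm: Path, snapshot_name: str, packages: list) -> Path:
    """Mirror the files of each unpacked package into a snapshot tree of symlinks."""
    snapshot = ipm / "snapshot" / snapshot_name
    snapshot.mkdir(parents=True, exist_ok=True)
    for pkg_dir in packages:              # e.g. ipm/installed/<name>/<version>/<date>
        for target in pkg_dir.rglob("*"):
            if target.is_dir():
                continue                  # directories are recreated, not linked
            link = snapshot / target.relative_to(pkg_dir)
            link.parent.mkdir(parents=True, exist_ok=True)
            link.symlink_to(target)
    return snapshot


def switch_current(ipm: Path, snapshot: Path) -> None:
    """Repoint ipm/current at the given snapshot with a single rename."""
    tmp = ipm / "current.new"
    if tmp.is_symlink():
        tmp.unlink()                      # leftover from an interrupted run
    tmp.symlink_to(snapshot)
    os.replace(tmp, ipm / "current")      # rename over the old link in one step
```

Rolling back is the same call with an older snapshot directory: `switch_current(ipm, older_snapshot)`. The same functions work unchanged if `ipm` lives under a home directory instead of at the root, which is all that per-user package management needs at this level.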

The downside to this, naturally, is that we will use up a lot of space if we keep all the old versions around. But this could be configured — keep everything, keep 1, 2, or n latest versions, or throw everything old away as soon as the final symlink is in place (like what most package managers currently do, except more fault-tolerant during installation).
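The retention setting could boil down to a small pure function like the one below. This is a sketch under two assumptions: snapshot names are timestamps that sort chronologically as strings, and the snapshot that `current` points at must never be deleted, even when it is an old one you rolled back to. The function name and parameters are hypothetical.

```python
from typing import Optional


def snapshots_to_delete(snapshots: list, current: str, keep: Optional[int]) -> list:
    """Pick the snapshot names to remove; keep=None means keep everything."""
    if keep is None:
        return []
    ordered = sorted(snapshots, reverse=True)   # newest first
    # Everything past the newest `keep` entries goes, except the live snapshot.
    return [name for name in ordered[keep:] if name != current]
```

Setting `keep=0` gives the behaviour of most current package managers (nothing old survives), while still only deleting after the final symlink has been switched.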

Another advantage of using symlinks is configuration. Most packages ship with some default configuration that users often want to override. This causes conflicts and suffering while updating. With a symlinked system, users could be offered a directory tree such as /ipm/user-overrides/, which would mirror the root file tree just like packages do. While doing updates, the package manager will override symlinks to package files to instead point to the override directory for any file that exists there. That way, rather than editing /etc/foo.conf, you edit /ipm/user-overrides/etc/foo.conf. Your edits are always preserved, and so are all package versions. When new versions of shadowed files arrive, the user can be notified and choose whether to change the overriding file. Additionally, this means the files specifically configured by the user are all in their own directory, giving an easy opportunity to put them under version control. Wouldn’t it be great to keep your machine’s non-default configuration under git and have all that annoying symlinking be done automatically?
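The override lookup is simple enough to sketch in a few lines. Both function names are made up, and paths like `etc/foo.conf` are assumed to be given relative to the root (no leading slash), so they can be joined onto the override directory.

```python
from pathlib import Path


def resolve_target(rel_path: str, package_file: Path, overrides: Path) -> Path:
    """Return what the snapshot symlink for rel_path should point at."""
    override = overrides / rel_path
    return override if override.exists() else package_file


def shadowed_updates(updated_paths: list, overrides: Path) -> list:
    """List updated package files that a user override is currently hiding."""
    return [p for p in updated_paths if (overrides / p).exists()]
```

During an update the package manager would call `resolve_target` for every file it links into the new snapshot, and use `shadowed_updates` to tell the user which of their overrides now shadow a changed default.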

Finally, since current package managers have so many packages, it would be great to be able to leverage them. So it would be good for our package manager to be able to consume .deb, .rpm, .pkg.tar.xz, etc. as 2-stage packages (stage 0 being the external package format, stage 1 being the translation, i.e. configuration and running scripts, to IPM format). That would also imply that the whole AUR would be full of 3-stage packages: all the benefits of Arch packages, with the added benefits of no scripts and symlinking! Additionally, it would be great to hook into the various package managers that exist for different languages, such as pip, cabal-install, npm, raco pkg, cpan, etc. It would be great to have a wrapper for each of these that knows which files produced will be the desired artifacts, and knows which files are needed and mutated from installation to installation, to be able to snapshot and roll back. Many of these formats are designed to be amenable to conversion to OS-level packages, so for many of them it should be straightforward.

To sum up, here is a review of my desired features: symlinked installation, no scripts, multi-stage packages with arbitrarily many stages, non-root user package building, non-installation-wide package management (i.e. in $HOME), and conversion of other package formats. I really would love this system, and if someone tells me of a package manager that has all these features, I will switch in a heartbeat.

Addendum

So some package managers have some of these features. In fact, recently I’ve learned about a couple called Nix and Guix. They both have a lot of these features, and some I have given less thought to. They don’t seem to have any sort of multi-stage packages like I would like. I need to spend some time to try them (especially Guix, which looks to be the most promising of the two), but cursory evaluation tells me that they solve many but not all of the problems I see in package managers. So they may not be my ideal yet, but I’m happy to see the improvement and work that’s gone into them.

One other cool feature I forgot to mention here is the ability to have a mixed system from different types of repositories, a la Debian. For instance, Debian has different repositories for different releases, indicating how cutting-edge and volatile, or alternatively old, safe, and boring packages are. A single system can use one of these as its main package repository but have individual packages installed from any of them. This raises some dependency issues, of course. Guix and Nix take a nice approach of letting multiple versions of packages be installed and linking up each package explicitly to the version it depends on to solve this sort of problem, but most package managers run into trouble with this sort of thing.

Ooooone more cool feature I want in package managers is license info as a standard piece of metadata. I would love to be able to filter package lists by license, get a list of licenses used by a set of packages, etc.