Rant

I can't begin to count the number of times I've started off some new project; I've tried them all with the best of intentions -- molecular dynamics, Quantum Monte Carlo, Hartree-Fock, etc. Of course, these are just toy projects, but I'm far too precious about my code to do a half-arsed job, so I invariably try to write everything with as many good software practices in mind as possible.

Then comes the realisation that I need to do a bunch of configuration parsing for all manner of atomic and molecular properties. And I slowly lose the will to live, leading to the eventual discarding of the project. Fundamentally, I don't care about text parsing. I want to get it over and done with as quickly as possible, but I'm also loath to write software where I end up duplicating the same bit of code to look for tonnes of different tokens. This is an attempt to create as generic a configuration parser as possible that's both easily extensible and maintainable. Let's see what we can do.

The Configuration File Format

Sigh. Talk about tedious. The configurations we're using in computational chemistry aren't enormously complicated; I don't need all of the bells and whistles that some configuration file formats offer. During my postdoctoral work, I used XML for QMC configurations because it was a format that another popular software package uses. It's such an ugly format that I feel justified in dismissing it this time around without needing to think about it too deeply.

Then comes JSON, which seems to be the most popular format for structured data, largely because of the libraries that make manipulating it really easy. JSON is nice enough, but again, from an aesthetic perspective, the endless curly braces are really ugly once you go beyond one or two levels of nesting. The lack of comment support in JSON can be a bit irritating at times as well, so perhaps there's an alternative.

I hadn't really come across TOML before. I'd seen it in the occasional project, but hadn't thought a great deal about it. It's simple, clean, has great third-party library support for parsing (see toml++, which really embraces a lot of modern C++ features), supports comments and can handle numerical data. So without thinking about it too deeply, I decided to give TOML a go.

To start with, we're going to have a pretty simple TOML for atom types, but hopefully the methods we develop later on will be generic enough to allow the configuration file to become arbitrarily complex without too much refactoring:

[AtomTypes.H]
mass = 1.0
nuclear_charge = 1
num_electrons = 1

[AtomTypes.O]
mass = 16.0
nuclear_charge = 8
num_electrons = 8

[...]

The Atom Type Container

There are two factors that will influence the way in which we design the atom type container:

  1. The atom type container needs to support arbitrarily many parameters since these things can get pretty complicated; if we're doing molecular dynamics, we might want to have multipole moment expansions for electrostatics, some bizarre dispersion functional form, etc.
  2. The atom type container needs to support optional parameters; if we're doing an electronic structure calculation, we don't care about Lennard-Jones parameterisations.

Because of consideration number one, we want to keep away from having an enormous constructor for the atom type container. This is just really grim, and it makes the handling of optional parameters a little difficult. The builder design pattern though is perfect for this kind of scenario; we incrementally build up the atom type as we read from the config file, and only set those fields that we care about.

First of all, let's write out AtomType:

class AtomTypeBuilder;  // defined below; declared here so AtomType can befriend it

class AtomType {
public:
    AtomType(std::string name) : name_{name} {}
    friend std::ostream& operator<<(std::ostream& os, const AtomType& a);
    friend AtomTypeBuilder;

    static AtomTypeBuilder create(std::string name);

private:
    std::string name_;
    // Zero-initialised so that an unset field is never read uninitialised
    float mass_ = 0.0f;
    int nuclear_charge_ = 0, num_electrons_ = 0;
};

Nothing particularly interesting here; it's just a container for various parameters that we might want to set for an atom type. How about the builder:

class AtomTypeBuilder {
public:
    AtomTypeBuilder(std::string atom_type_name) : atom_type_{atom_type_name} {}

    AtomType build();
    AtomTypeBuilder& mass(std::optional<float>&& mass);
    AtomTypeBuilder& nuclear_charge(std::optional<int>&& nuclear_charge);
    AtomTypeBuilder& num_electrons(std::optional<int>&& num_electrons);

private:
    AtomType atom_type_;
};

We've omitted a lot of the implementation details, but the reader is directed to the thousands of other articles on the builder pattern if they want to understand how this works. Note that in the builder, we pass in a std::optional for each of the parameter values. This plays nicely with the toml++ library, where asking a node for a typed value via value<T>() yields a std::optional that is simply std::nullopt if the key wasn't present in the TOML. This is what lets us set parameters on the atom type optionally.
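For anyone who'd rather not go hunting through those other articles, here is a minimal sketch of how the omitted pieces might fit together. It uses only the standard library, and a few things in it are assumptions rather than part of the code above: the read-only accessors on AtomType are hypothetical (added so the result can be inspected), the name member is called name_, and each setter simply leaves its field at zero when handed a std::nullopt.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <utility>

class AtomTypeBuilder;  // forward declaration so AtomType can befriend it

class AtomType {
public:
    explicit AtomType(std::string name) : name_{std::move(name)} {}
    static AtomTypeBuilder create(std::string name);

    // Hypothetical read-only accessors, added for illustration only
    const std::string& name() const { return name_; }
    float mass() const { return mass_; }
    int nuclear_charge() const { return nuclear_charge_; }
    int num_electrons() const { return num_electrons_; }

private:
    friend class AtomTypeBuilder;
    std::string name_;
    float mass_ = 0.0f;
    int nuclear_charge_ = 0;
    int num_electrons_ = 0;
};

class AtomTypeBuilder {
public:
    explicit AtomTypeBuilder(std::string name) : atom_type_{std::move(name)} {}

    // Each setter only touches its field when a value is present, and
    // returns *this so that the calls can be chained.
    AtomTypeBuilder& mass(std::optional<float>&& mass) {
        if (mass) atom_type_.mass_ = *mass;
        return *this;
    }
    AtomTypeBuilder& nuclear_charge(std::optional<int>&& nuclear_charge) {
        if (nuclear_charge) atom_type_.nuclear_charge_ = *nuclear_charge;
        return *this;
    }
    AtomTypeBuilder& num_electrons(std::optional<int>&& num_electrons) {
        if (num_electrons) atom_type_.num_electrons_ = *num_electrons;
        return *this;
    }

    // Hand back a copy of the AtomType assembled so far
    AtomType build() { return atom_type_; }

private:
    AtomType atom_type_;
};

// Entry point for the fluent interface: AtomType::create("H").mass(...)...
inline AtomTypeBuilder AtomType::create(std::string name) {
    return AtomTypeBuilder{std::move(name)};
}
```

With this in place, AtomType::create("H").mass(1.008f).num_electrons(1).build() gives back an AtomType with only those two fields set, which is the behaviour we want when a key is missing from the config.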

The TOML Reader

Now that we have the fundamental bits of code that'll allow us to create and wrap an atom type, we need to do some parsing and create some data structure of atom types that we can use going forwards. First of all, let's try to read our TOML:

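#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <toml++/toml.h>

class AtomTypeReader {
public:
    AtomTypeReader(const toml::table& config_table) {
        // as_table() returns nullptr if "AtomTypes" is missing or isn't a table
        const toml::table* atom_types = config_table["AtomTypes"].as_table();
        if (atom_types == nullptr) {
            throw std::runtime_error("Config has no [AtomTypes] table");
        }
        atom_types_config_ = *atom_types;
    }

    AtomType parse_atom_type(std::string atom_type);
    std::map<std::string,std::shared_ptr<AtomType>> parse();

private:
    toml::table atom_types_config_;
};

int main() {
    // parse_file throws toml::parse_error on a malformed file
    toml::table config = toml::parse_file("config.toml");
    AtomTypeReader reader(config);
}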

Here, we're parsing our TOML file and, in the AtomTypeReader constructor, retrieving the part of the config file that we're after: the "AtomTypes" node. Note that we store this as a table for later convenience in the private member atom_types_config_. Let's try to write a method that can interface with the AtomTypeBuilder and construct an AtomType:

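AtomType AtomTypeReader::parse_atom_type(std::string atom_type) {

    // operator[] doesn't throw for a missing key; it returns an empty
    // node_view, so we have to check for the element ourselves
    auto atom_type_table = atom_types_config_[atom_type];
    if (!atom_type_table.is_table()) {
        throw std::runtime_error(
            "No [AtomTypes." + atom_type + "] table in the config");
    }

    // value<T>() yields std::nullopt for any key that wasn't provided
    return AtomType::create(atom_type)
        .mass(atom_type_table["mass"].value<float>())
        .num_electrons(atom_type_table["num_electrons"].value<int>())
        .nuclear_charge(atom_type_table["nuclear_charge"].value<int>())
        .build();
}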

It's as easy as that! We go looking for the appropriate element node within the "AtomTypes" section of our TOML, and throw an error if the element wasn't provided. Then all we do is interface with the builder and return the newly created AtomType from the method. How about a higher-level method that iterates through all atom types provided in the TOML and returns some data structure that can be queried for atom types?

std::map<std::string,std::shared_ptr<AtomType>> AtomTypeReader::parse() {

    std::map<std::string,std::shared_ptr<AtomType>> atom_types;

    for (auto key : atom_types_config_) {
        std::ostringstream ss;
        ss << key.first;
        std::string name = ss.str();
        atom_types[name] = std::make_shared<AtomType>(parse_atom_type(name));
    }

    return atom_types;

}

You'll see that we've chosen a map which maps from the atom type identifier (the element name) to a shared pointer to the newly created atom type we've parsed from the TOML. This way, when we come to assigning an atom type to each atom in our system, we simply have a shared pointer to the atom type, and query the properties we're looking to use.
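To make that last point a bit more concrete, here's a rough sketch of what assigning atom types to atoms might look like. None of this is part of the reader itself: the Atom struct, make_atom and total_mass are all hypothetical names, and AtomType is pared down to a plain struct so that the example stands on its own.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Pared-down stand-in for the AtomType class (assumption: public fields)
struct AtomType {
    std::string name;
    float mass;
};

using AtomTypeMap = std::map<std::string, std::shared_ptr<AtomType>>;

// Hypothetical per-atom record: a position plus a handle to the shared type
struct Atom {
    double x, y, z;
    std::shared_ptr<const AtomType> type;
};

// Look up the type for `element`; map::at throws std::out_of_range
// if the element was never parsed from the config.
inline Atom make_atom(const AtomTypeMap& types, const std::string& element,
                      double x, double y, double z) {
    return Atom{x, y, z, types.at(element)};
}

// Query a property through the shared pointer for every atom
inline double total_mass(const std::vector<Atom>& atoms) {
    double total = 0.0;
    for (const Atom& atom : atoms) {
        total += atom.type->mass;
    }
    return total;
}
```

Every hydrogen in the system ends up pointing at the same AtomType, so the parameters are stored once no matter how many atoms we have, and the const on the pointee stops individual atoms from modifying the shared type.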

Defaults and Optional Arguments

We've neglected to provide any implementations for the AtomTypeBuilder methods thus far. There are plenty of atom type parameters for which we can satisfy ourselves with defaults. We don't need to trouble the user with supplying an atomic mass to 12 significant figures when atomic masses are basically physical constants. However, the user may want to use an isotope and supply their own mass. We therefore want a mechanism whereby, if a field isn't provided in the TOML for an atom type, we can go looking in some defaults that are hard-coded within our software. A structure a bit like the following:

static const std::map<std::string,toml::table> atom_type_defaults{
    {"H", toml::table{{"mass",  1.0}, {"num_electrons", 1}, {"nuclear_charge", 1}}},
    {"O", toml::table{{"mass", 16.0}, {"num_electrons", 8}, {"nuclear_charge", 8}}},
    ...
};

Let's take AtomTypeBuilder::mass as an example, then discuss the behaviour:

AtomTypeBuilder& AtomTypeBuilder::mass(std::optional<float>&& mass) {

    // A mass supplied in the TOML always takes precedence
    if (mass) {
        atom_type_.mass_ = *mass;
        return *this;
    }

    try {
        // at() throws std::out_of_range for an unknown atom type; value()
        // throws std::bad_optional_access if the defaults lack a "mass" key
        atom_type_.mass_ = atom_type_defaults
            .at(atom_type_.name_)["mass"].value<float>().value();
    } catch (const std::exception&) {
        std::cerr << "No default mass parameter for atom type "
                  << atom_type_.name_ << '\n';
    }

    return *this;
}

We start off by checking whether the optional actually contains anything. Dereferencing an empty std::optional is undefined behaviour, so this check has to come before any assignment. If the optional holds a value, we assign it to the mass variable of the builder's AtomType object and there's nothing more to do. However, if the optional was a std::nullopt, we need to go looking in our atom_type_defaults map for a default mass. If we don't have an entry in this map for the atom type we're trying to set a default for, clearly there's nothing more we can do besides report the error and indicate that the user should provide their own value for the atom type's mass. Otherwise though, we just proceed as normal with the default mass, and the user has no need to provide an atom's mass in their configuration TOML.

The above implementation for AtomTypeBuilder::mass is a bit tedious, and we're going to have to do likewise for each field of the AtomType we want the builder to be able to initialise. As such, it becomes prudent to have a generic method that we defer to:

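template<typename T>
void AtomTypeBuilder::set_param(
    std::string name, T& setter, std::optional<T>&& val
) {
    // A value supplied in the TOML always takes precedence
    if (val) {
        setter = *val;
        return;
    }

    try {
        // at() throws std::out_of_range for an unknown atom type; value()
        // throws std::bad_optional_access if the defaults lack this key
        setter = atom_type_defaults
            .at(atom_type_.name_)[name].value<T>().value();
    } catch (const std::exception&) {
        std::cerr << "No default " << name
                  << " parameter for atom type " << atom_type_.name_
                  << '\n';
    }
}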

As a result, our builder's methods look much less cluttered:

AtomTypeBuilder& AtomTypeBuilder::mass(std::optional<float>&& mass) {
    // A named rvalue reference is an lvalue, so it has to be moved along
    set_param<float>("mass", atom_type_.mass_, std::move(mass));
    return *this;
}

AtomTypeBuilder& AtomTypeBuilder::num_electrons(
    std::optional<int>&& num_electrons
) {
    set_param<int>(
        "num_electrons", atom_type_.num_electrons_, std::move(num_electrons));
    return *this;
}

AtomTypeBuilder& AtomTypeBuilder::nuclear_charge(
    std::optional<int>&& nuclear_charge
) {
    // nuclear_charge_ is an int, so T must be int here, not float
    set_param<int>(
        "nuclear_charge", atom_type_.nuclear_charge_, std::move(nuclear_charge));
    return *this;
}
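
If you want to play with the fallback logic without pulling in toml++, the lookup order can be sketched with the standard library alone. This is a stand-in under a few assumptions rather than the code above: the defaults live in a nested std::map of doubles instead of toml::table, resolve_param is a hypothetical free function rather than a builder method, and a missing default throws instead of printing to std::cerr.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the toml::table defaults:
// element name -> parameter name -> value
using Defaults = std::map<std::string, std::map<std::string, double>>;

inline const Defaults& atom_type_defaults() {
    static const Defaults defaults{
        {"H", {{"mass", 1.0}, {"num_electrons", 1.0}, {"nuclear_charge", 1.0}}},
        {"O", {{"mass", 16.0}, {"num_electrons", 8.0}, {"nuclear_charge", 8.0}}},
    };
    return defaults;
}

// Resolve one parameter: the user's value wins, then the hard-coded
// default, and if neither exists there is nothing left to do but throw.
template <typename T>
T resolve_param(const std::string& atom_type, const std::string& name,
                const std::optional<T>& user_value) {
    if (user_value) {
        return *user_value;
    }

    const auto type_it = atom_type_defaults().find(atom_type);
    if (type_it != atom_type_defaults().end()) {
        const auto field_it = type_it->second.find(name);
        if (field_it != type_it->second.end()) {
            return static_cast<T>(field_it->second);
        }
    }

    throw std::runtime_error(
        "No default " + name + " parameter for atom type " + atom_type);
}
```

So resolve_param<float>("H", "mass", 2.014f) keeps the user's deuterium mass, the same call with std::nullopt falls back to the default, and an atom type that's missing from the map throws. Throwing here is a design choice for the sketch: it stops a half-initialised atom type from slipping through quietly.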