Training a MACE-style GNN with **klay** + **kliff**
===================================================

This tutorial shows the minimal end-to-end workflow for

1. defining a Graph Neural Network in **klay** using a YAML file,
2. compiling the model to an FX graph and TorchScripting it,
3. connecting the model to a **kliff** *Lightning* trainer,
4. running a short training loop and exporting a KIM-compatible model.

The same pattern extends to larger datasets, multi-GPU training
(``strategy="ddp"``), and more complex model graphs.

Prerequisites
-------------

* klay
* PyTorch >= 2.2 (built with CUDA if you train on GPU)
* e3nn -> for compiling equivariant models to TorchScript
* kliff
* torch_geometric -> for graph datasets
* torch_scatter -> torch_geometric dependency used by several layers/packages
* lightning -> for the distributed GNN trainer in kliff
* tensorboard, tensorboardX -> for logging from the GNN trainer

You can create a working klay + kliff environment (CPU only) with conda as follows:

.. code-block:: bash

    conda create -n klay-env
    conda activate klay-env
    conda install -c conda-forge python=3.9
    pip install klay
    pip install torch==2.2.0 --index-url https://download.pytorch.org/whl/cpu
    pip install torch_geometric
    pip install lightning
    pip install torch_scatter torch_sparse -f https://data.pyg.org/whl/torch-2.2.0+cpu.html
    pip install kliff
    pip install tensorboard tensorboardX

Directory layout used below
^^^^^^^^^^^^^^^^^^^^^^^^^^^

::

    .
    ├── mace_model.yaml             # model definition (see below)
    ├── Si_training_set_4_configs   # four config files with energies/forces
    └── train_mace.py               # training script (listing follows)

.. note::

    You can use your own dataset, or download the **toy** dataset shown above with

    .. code-block:: bash

        wget https://raw.githubusercontent.com/openkim/kliff/main/examples/Si_training_set_4_configs.tar.gz
        tar -xvf Si_training_set_4_configs.tar.gz

--------------------------------------
Model definition (``mace_model.yaml``)
--------------------------------------

The YAML file enumerates **parameters**, **I/O tensors**, a **layer graph**, and
**named outputs**. klay resolves the ``${...}`` references at build time, so you
only declare each hyper-parameter once.

.. code-block:: yaml

    model_params:
      r_max: 4.0
      n_channels: 32
      num_elems: 2

    model_inputs:
      species: "Tensor (N,)"
      coords: "Tensor (N,3)"
      edge_index0: "Tensor (2,E)"
      contributions: "Tensor (E,)"

    model_layers:
      element_embedding:
        type: OneHotAtomEncoding
        config: {num_elems: 2}
        inputs: {x: model_inputs.species}

      edge_feature0:
        type: SphericalHarmonicEdgeAttrs
        config: {lmax: 1}
        inputs:
          pos: model_inputs.coords
          edge_index: model_inputs.edge_index0
        output: {0: vec0, 1: len0, 2: sh0}

      radial_basis_func:
        type: RadialBasisEdgeEncoding
        config:
          r_max: ${model_params.r_max}
        inputs:
          edge_length: len0

      node_features:
        type: AtomwiseLinear
        config:
          irreps_in_block:
            - {"l": 0, "mul": '${model_params.num_elems}'}
          irreps_out_block:
            - {"l": 0, "mul": '${model_params.n_channels}'}
        inputs: {h: element_embedding}

      conv1:
        type: MACE_layer
        config:
          lmax: 1
          correlation: 2
          num_elements: ${model_params.num_elems}
          hidden_irreps_block:
            - {"l": 0, "mul": '${model_params.n_channels}'}
            - {"l": 1, "mul": '${model_params.n_channels}'}
          input_block: ${model_layers.node_features.config.irreps_out_block}
          node_attr_block: ${model_layers.node_features.config.irreps_in_block}
        inputs:
          vectors: vec0
          node_feats: node_features
          node_attrs: element_embedding
          edge_feats: radial_basis_func
          edge_index: model_inputs.edge_index0

      output_projection:
        type: AtomwiseLinear
        config:
          irreps_in_block:
            - {"l": 0, "mul": '${model_params.n_channels}'}
            - {"l": 1, "mul": '${model_params.n_channels}'}
          irreps_out_block:
            - {"l": 0, "mul": 1}
        inputs: {h: conv1}

      contributions_energy:
        type: KIMAPISumIndex
        inputs:
          src: output_projection
          index: contributions

    model_outputs:
      energy: contributions_energy

-----------------------------------
Training script (``train_mace.py``)
-----------------------------------

The Python driver wires the model into **kliff**’s ``GNNLightningTrainer``. All
training hyper-parameters live in a single ``training_manifest`` dictionary so
they are logged together and can be re-used for checkpoint-free restarts.

.. code-block:: python

    import torch
    torch.set_default_dtype(torch.float64)

    from klay.builder import build_model
    from klay.io import load_config
    from e3nn.util import jit

    # ------------------------------------------------------------------
    # Build & script the model
    # ------------------------------------------------------------------
    mace_model = build_model(load_config("mace_model.yaml"))
    mace_model = jit.script(mace_model)   # TorchScript -> picklable, deterministic

    # ------------------------------------------------------------------
    # Experiment manifest
    # ------------------------------------------------------------------
    workspace = {"name": "GNN_train_example", "random_seed": 12345}

    dataset = {
        "type": "path",
        "path": "Si_training_set_4_configs",
        "shuffle": True,
    }

    model = {
        "name": "MACE1",
        "input_args": ["species", "coords", "edge_index0", "contributions"],
    }

    transforms = {
        "configuration": {
            "name": "RadialGraph",
            "kwargs": {"cutoff": 4.0, "species": ["Si"], "n_layers": 1},
        }
    }

    training = {
        "loss": {
            "function": "MSE",
            "weights": {"config": 1.0, "energy": 1.0, "forces": 10.0},
        },
        "optimizer": {"name": "Adam", "learning_rate": 1e-3},
        "training_dataset": {"train_size": 3},
        "validation_dataset": {"val_size": 1},
        "batch_size": 1,
        "epochs": 10,
        # accelerator/strategy left on "auto" so the same script runs on CPU or GPU
        "accelerator": "auto",
        "strategy": "auto",
    }

    export = {"model_path": "./", "model_name": "MACE1__MO_111111111111_000"}

    training_manifest = {
        "workspace": workspace,
        "model": model,
        "dataset": dataset,
        "transforms": transforms,
        "training": training,
        "export": export,
    }

    # ------------------------------------------------------------------
    # Train
    # ------------------------------------------------------------------
    from kliff.trainer.lightning_trainer import GNNLightningTrainer

    trainer = GNNLightningTrainer(training_manifest, model=mace_model)
    trainer.train()
    trainer.save_kim_model()

--------------------
Running the tutorial
--------------------

.. code:: bash

    python train_mace.py   # prints a Lightning progress bar

With only four Si configurations and 10 epochs, this runs in seconds on a CPU.
The call to ``save_kim_model`` writes a LAMMPS-compatible
``MACE1__MO_111111111111_000`` file plus a JSON metadata block.

Files produced
^^^^^^^^^^^^^^

* ``lightning_logs/...`` – TensorBoard logs, checkpoints
* ``MACE1__MO_111111111111_000`` – portable potential

Next steps
----------

* Swap the tiny path dataset for a real one (e.g. ANI-1x or OC20).
* Increase ``epochs`` and ``batch_size``; pick ``strategy="ddp"`` to distribute
  across multiple GPUs.
* Add more **MACE_layer** blocks or deeper radial graphs in the YAML to improve
  capacity.
* Use ``kliff``’s ``EarlyStopping`` and ``LearningRateMonitor`` callbacks for
  production runs.