An overview of MlFront. Part 3 - MlFront_Boot

Aug 6, 2024

As I was peeling MlFront out of DkCoder, I realized that I could not transfer most of my integration tests; those integration tests relied on a fully functioning DkCoder system.

Out of the practical need for integration tests, I built the MlFront_Boot build system. In a later post I will describe how you can use MlFront_Boot to do a security analysis of your source code. But for now, let’s see how MlFront_Boot works because that will help you understand what the security analysis will accomplish. And for those readers who implement their own build systems … you can treat MlFront_Boot as the MlFront reference build system which can be copied and mimicked.

Here is a minimal MlFront_Boot project. You only need source code arranged in the MlFront package structure:

.
├── AcmeWidgets_Std/
│   └── A.ml
└── BobBuilder_Std/
    └── B.ml

(* AcmeWidgets_Std/A.ml *)
  let print_self () = print_endline "I am an A!"

(* BobBuilder_Std/B.ml *)
  let print_self () = print_endline "I am a B!"
  let () =
    AcmeWidgets_Std.A.print_self ();
    print_self ()

You can already see that MlFront_Boot supports cross-package references from BobBuilder_Std/B.ml to AcmeWidgets_Std/A.ml.

Sidebar: It’s early! At the time of this post, MlFront_Boot does not yet support deeply nested subpackages like AcmeWidgets_Std.Some.Sub.Package, and has a few other gaps. I’ll edit this post and remove this sidebar once sufficient code for security analysis has been ported from DkCoder.

The MlFront_Boot build system is run using the executable mlfront-boot. mlfront-boot can be built from source, but prebuilt binaries are distributed at https://gitlab.com/dkml/build-tools/MlFront/-/releases.

When run, mlfront-boot will analyze the source code:

mlfront-boot -o buildscript

and generate a Windows batch script and a POSIX (macOS/Linux) shell script:

.
├── AcmeWidgets_Std/
├── BobBuilder_Std/
├── buildscript.cmd  <-- generated
└── buildscript.sh   <-- generated

Those generated scripts are simple to audit, fully self-contained and can be checked into source control.

When you run the script (buildscript.cmd on Windows or buildscript.sh on macOS/Linux), your only responsibility is to ensure that ocamlc is available in your PATH. Running the script will build executables for your project:

directory create: target/
file create: target/AcmeWidgets_Std.ml
link create: AcmeWidgets_Std/A.ml -> target/AcmeWidgets_Std__A.ml
compile: AcmeWidgets_Std.A
compile: AcmeWidgets_Std
link create: BobBuilder_Std/B.ml -> target/BobBuilder_Std__B.ml
compile: BobBuilder_Std.B
executable create: BobBuilder_Std.B

You can now run the bytecode executable:

$ ocamlrun target/BobBuilder_Std.B.bc
I am an A!
I am a B!

If you use the -native option at boot time, you can build and run a native executable instead:

$ mlfront-boot -native -o buildscript

$ ./buildscript.sh # or .\buildscript.cmd on Windows

$ target/BobBuilder_Std.B
I am an A!
I am a B!

How does it work?

MlFront_Boot moves through these steps:

Scan the directories for source code. Each module in your project is given a fully-qualified name. For example, the module AcmeWidgets_Std/A.ml is renamed to AcmeWidgets_Std__A (the target/AcmeWidgets_Std__A.ml you saw in the build output) to avoid any module collisions. This transformation happens in memory; nothing is written to disk at this step.
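As a rough illustration of that renaming, here is a minimal sketch. The function name qualified_name is hypothetical and is not part of MlFront_Boot; it only covers the package/module layout shown in this post.

```ocaml
(* Hypothetical helper, not MlFront_Boot's real code: map a source path
   such as "AcmeWidgets_Std/A.ml" to the fully-qualified unit name
   "AcmeWidgets_Std__A". Assumes '/' as the path separator. *)
let qualified_name (path : string) : string =
  path
  |> Filename.remove_extension   (* "AcmeWidgets_Std/A" *)
  |> String.split_on_char '/'    (* ["AcmeWidgets_Std"; "A"] *)
  |> String.concat "__"          (* "AcmeWidgets_Std__A" *)

let () =
  List.iter
    (fun path -> Printf.printf "%s -> %s\n" path (qualified_name path))
    [ "AcmeWidgets_Std/A.ml"; "BobBuilder_Std/B.ml" ]
```

Joining on a double underscore keeps the package and the module visible in the compiled name, so two packages that both contain an A.ml never collide.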

Advanced. At the time of this post, MlFront_Boot uses a limited version of a namespaces specification. Only the scan and the merge strict expressions are used.

Convert each compilation “unit” (from the second post) into a simplified module structure called the “Module Meta Language” (M2l). That conversion is performed by the codept dependency analysis tool. It strips away the types and values and just leaves what modules are used.

For example, these two files:

 (* AcmeWidgets_Std/First.ml *)
 module A = struct
   module Inner = struct let f x = x end
 end

and

 (* AcmeWidgets_Std/Second.ml *)
 let x = 1
 open First.A
 module B = struct let y = Inner.f x end

are converted to the following in-memory:

 (* AcmeWidgets_Std/First.m2l *)
 module A = struct
   module Inner = struct end
 end

and

 (* AcmeWidgets_Std/Second.m2l *)
 open First.A
 module B = struct [%access {Inner}]  end

All the M2l units are given to a codept solver, with the name of each unit being the fully-qualified name from step 1.

When codept evaluates AcmeWidgets_Std/First.m2l, it populates an in-memory “resolved” environment with the following modules:
```
module AcmeWidgets_Std__First = struct
  module A = struct
    module Inner = struct end
  end
end
```
Then when codept evaluates AcmeWidgets_Std/Second.m2l it fails to resolve both the First and Inner module references:
```
(* AcmeWidgets_Std/Second.m2l *)

(* codept does not know the [First] module.
   The correct module is
   [AcmeWidgets_Std__First]! *)
open First.A

(* codept does not know [Inner].
   The correct module is
   [AcmeWidgets_Std__First.A.Inner]. *)
module B = struct [%access {Inner}]  end
```
At this point codept fails and informs MlFront that AcmeWidgets_Std__Second is missing module references to “First” and “Inner”.
MlFront knows that both “First” and “Inner” are relative module references because they do not follow the pattern
```
VendorProject_Unit
│     │      ││
│     │      │└ UPPERCASE
│     │      └ UNDERSCORE
│     └ UPPERCASE
└ UPPERCASE
```
we saw in the last post.
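A simplified check for that pattern might look like the sketch below. The function name looks_like_library_name and the exact rules are assumptions made for illustration; the real MlFront rules from the previous post may be stricter.

```ocaml
let is_upper c = c >= 'A' && c <= 'Z'

(* Simplified approximation (an assumption, not MlFront's real parser):
   a name is treated as a library name when it starts with an uppercase
   letter, has at least two uppercase letters before the first underscore
   (Vendor and Project), and the underscore is followed by an uppercase
   letter (Unit). *)
let looks_like_library_name (name : string) : bool =
  match String.index_opt name '_' with
  | None -> false
  | Some i ->
      let uppercase_before = ref 0 in
      String.iteri
        (fun j c -> if j < i && is_upper c then incr uppercase_before)
        name;
      i > 0
      && is_upper name.[0]
      && !uppercase_before >= 2
      && i + 1 < String.length name
      && is_upper name.[i + 1]

let () =
  List.iter
    (fun name ->
      Printf.printf "%s: %s\n" name
        (if looks_like_library_name name then "absolute" else "relative"))
    [ "First"; "Inner"; "BobBuilder_Std" ]
```

Under these assumptions First and Inner are classified as relative references, while BobBuilder_Std is classified as an absolute one.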

That means MlFront will ask MlFront_Boot through its respond_to_missing_module function if it can find the modules AcmeWidgets_Std__First and AcmeWidgets_Std__Inner.
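MlFront_Boot already knows from step 1 that AcmeWidgets_Std__First is available, and tells MlFront to add AcmeWidgets_Std__First under the alias First. There is no AcmeWidgets_Std__Inner unit, so no alias is added for Inner; it resolves through open First.A once First is known.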

Now AcmeWidgets_Std__Second looks like:
```
 (* AcmeWidgets_Std/Second.m2l *)
 module First = AcmeWidgets_Std__First
 open First.A
 module B = struct [%access {Inner}]  end
```
MlFront gives the slightly tweaked Second.m2l back to codept, which can fully resolve all of the references.

In general, MlFront looks at each .m2l in-memory file, and then asks the build system (ex. MlFront_Boot) to respond with the locations of missing modules for missing relative module references. MlFront will also ask the build system to respond with the locations of missing libraries for missing absolute module references like BobBuilder_Std. Any response results in a slight modification to the .m2l in-memory file. MlFront keeps rerunning codept until there are no more missing module references.
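The loop can be sketched with a toy model. Everything here is an assumption made for illustration: the m2l record, the missing_references stand-in for a codept run, and the signature given to respond_to_missing_module are invented, and the model only tracks top-level names such as First (nested names like Inner are not modeled).

```ocaml
module StringSet = Set.Make (String)

(* Toy stand-in for an in-memory .m2l file. *)
type m2l = {
  unit_name : string;                 (* e.g. "AcmeWidgets_Std__Second" *)
  package : string;                   (* e.g. "AcmeWidgets_Std" *)
  aliases : (string * string) list;   (* alias name, fully-qualified name *)
  references : string list;           (* top-level module names used *)
}

(* Stand-in for a codept run: report references that have no alias yet. *)
let missing_references (u : m2l) : string list =
  List.filter (fun r -> not (List.mem_assoc r u.aliases)) u.references

(* Stand-in for the build system's answer (hypothetical signature):
   look for a unit with that name inside the same package. *)
let respond_to_missing_module ~known_units ~package name =
  let candidate = package ^ "__" ^ name in
  if StringSet.mem candidate known_units then Some candidate else None

(* Rerun the "solver" until a run adds no new alias. *)
let rec resolve ~known_units (u : m2l) : m2l =
  let additions =
    List.filter_map
      (fun r ->
        respond_to_missing_module ~known_units ~package:u.package r
        |> Option.map (fun qualified -> (r, qualified)))
      (missing_references u)
  in
  if additions = [] then u
  else resolve ~known_units { u with aliases = additions @ u.aliases }

let () =
  let known_units =
    StringSet.of_list [ "AcmeWidgets_Std__First"; "AcmeWidgets_Std__Second" ]
  in
  let second =
    { unit_name = "AcmeWidgets_Std__Second";
      package = "AcmeWidgets_Std";
      aliases = [];
      references = [ "First" ] }
  in
  let resolved = resolve ~known_units second in
  Printf.printf "(* %s *)\n" resolved.unit_name;
  List.iter
    (fun (alias, target) -> Printf.printf "module %s = %s\n" alias target)
    resolved.aliases
```

The sketch stops as soon as a pass adds nothing, which also guarantees termination when the build system cannot answer a request.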

After there is a full picture of the project, the build system can write out its build scripts (ex. buildscript.sh and buildscript.cmd for MlFront_Boot). The build system is responsible for replicating the .m2l modifications (ex. we added the module First = AcmeWidgets_Std__First alias earlier) in its build scripts. The build system can use -open SomeModification in its ocamlc / ocamlopt flags to accomplish those modifications.
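To see why an opened module can carry the modification, here is a single-file simulation using nested modules. The module name Second_aliases is hypothetical, and the explicit open stands in for what the -open flag of ocamlc / ocamlopt would do; this is not the layout MlFront_Boot actually generates.

```ocaml
(* Stand-in for the compiled unit AcmeWidgets_Std__First. *)
module AcmeWidgets_Std__First = struct
  module A = struct
    module Inner = struct let f x = x end
  end
end

(* Hypothetical module holding the alias that was added to Second.m2l. *)
module Second_aliases = struct
  module First = AcmeWidgets_Std__First
end

(* Stand-in for AcmeWidgets_Std/Second.ml. The first line plays the role
   of passing -open Second_aliases on the compiler command line; the rest
   is the unmodified source from earlier in the post. *)
module AcmeWidgets_Std__Second = struct
  open Second_aliases
  let x = 1
  open First.A
  module B = struct let y = Inner.f x end
end

let () = Printf.printf "%d\n" AcmeWidgets_Std__Second.B.y
```

The point of the design is that Second.ml itself stays untouched on disk; only the compiler flags change.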

Build systems can also synthesize their own modules at any point of the cycle. MlFront_Boot does not use this functionality very much but DkCoder uses it heavily.

Implicit modules: Modules created automatically after the analysis of a single unit.

For example, with DkCoder, suppose the following unit is scanned:
```
let () =
  Printf.printf
    "All assets are in %s\n"
    (Tr1Assets.LocalDir.v ())
```
When MlFront tells DkCoder that it must respond to the missing Tr1Assets module, DkCoder creates the Tr1Assets module automatically from the contents of an assets folder (images, audio files and other static resources).

Optimistic modules: Modules created automatically in response to many units being scanned.

For example, with DkCoder, when units are scanned in a package directory (ex. AcmeWidgets_Std/Something/*.ml), the units in its subpackages (ex. AcmeWidgets_Std/Something/Deeper/*.ml) are scanned as well, and “subpackage modules” are created automatically. That was a mouthful! Basically, AcmeWidgets_Std.Something.A can only access AcmeWidgets_Std.Something.Deeper.B if someone created the subpackage module AcmeWidgets_Std.Something.Deeper. An optimistic module sees many modules at once, which lets those subpackage modules be generated on your behalf.

In summary, MlFront_Boot, through unique identification of modules and codept-based analysis, is a build system generator that writes portable build shell scripts backed by an accurate enumeration of modules used in a program. You need some more equipment to do the security analysis we mentioned in the first post, but today you learned how the main equipment works.