eBPF File System Monitoring

This blog post is about something I am currently working on at work, so it will be partly about file system monitoring, somewhat about the file system code in the Linux kernel and a whole lot about eBPF, which is very cool.

Say we want to monitor file system changes on a system that hosts thousands of containers for thousands of customers that can be accessed however they like. Most commonly files are modified through SSH/SFTP, but it could be anything they can come up with. It should also work fairly independent of the filesystem. And we want to know about such files changing rather quickly (sub-second would be great), so scanning once a day is not good enough. Of course we are on Linux, because this is where all the cool things happen and also where all the cool kids hang out. We don’t just want to know about all operations happening on files, but operations that would have us read something else from them, i.e. that make their contents change. I.e. a write by itself is not interesting, but a write followed by a flush or a close is interesting!

Existing Monitoring APIs

inotify

This API is most commonly used for filesystem monitoring on Linux I would say (apart from possibly stat-ing in a loop), because it doesn’t require any capabilities or privileges, it’s easy to use and can do most things people care about.

Unfortunately you have to watch every file and directory separately - you can’t watch recursively! “inotify” stands for “inode notify”, because you monitor single inodes. Of course this scales very badly with a large (or very large - as in our case) number of files. On my personal computer with just my stuff on it I have seen Sublime Text exhaust the number of inotify file descriptors multiple times (the limit can be increased, I know).

fanotify

Instead of monitoring single inodes, you can monitor mount points, which is much better, but with many thousands of containers, which all have their own mount namespace, it might not scale well either. It also requires CAP_SYS_ADMIN to use.

eBPF

This is gaining popularity in recent years for all sorts of reasons and people have been doing many cool things with it (see bpftrace and bcc). It allows you to attach small programs to hooks, which are existing functions in user space or in kernel space. These small programs (ideally) do not risk the stability or security of your operating system, because they are compiled to a dedicated bytecode, which is verified by the Kernel when it is attached to a hook. The eBPF verifier makes sure that your program terminates in finite time, that stack size is bounded and that it does not access memory out of bounds. More on that in a later section.

The most important aspects that make eBPF attractive for high performance file system monitoring is that we can pre-filter events in the Kernel, to reduce the amount of events we have to transfer. We can also mark data as soon as the syscall is issued or returned, which can reduce latency. Also we only require CAP_BPF and CAP_PERFMON to attach a eBPF program, but might require more depending on the program and attach point. After it is attached, we don’t need any special permissions in user space to process the events generated by the eBPF program. Of course there is no state about watched items and way fewer layers than for example fanotify (vfs -> fsnotify -> fanotify -> userspace). You can also optimize how exactly events are dispatched. And if you want to add additional information (like the mount namespace, the pid of the process that modified the file, etc.) you can (often) trivially add this information.

So there are definitely good reasons it might be cool, let’s try it!

What to Trace

Maybe this is not a smart way to think about this, but I considered ways I can edit a file, which syscall sequence this corresponds to and then looked at functions that might be good candidates. I am sure I have forgotten many.

Operations

These are the ones I could come up:

  • write a (non-zero) number of times and then close (C API: fwrite, fclose)
  • write a (non-zero) number of times and then fsync (C API: fflush)
  • write a (non-zero) number of times and the kernel does an automatic write-back
  • writev
  • mmap, modify the mapped memory and then call msync
  • mmap, modify the mapped memory and then call close
  • mmap and the kernel does an automatic write-back
  • sendfile
  • splice
  • truncate
  • rename

Now we need to find the functions in the Linux kernel that we need to trace to catch all these different ways a file can be modified. Ideally we want to pick functions that take parameters from which we can easily obtain the information we want to collect.

In user space Linux programs tend to interact with files using file descriptors, which are just integers, but internally these will be mapped to struct file objects. File system related system calls then call into VFS (Virtual File System), which is an abstraction layer over all file systems with functions like vfs_write, vfs_close and many more. VFS functions then use file->f_op of type struct file_operations*, which is just a collection of function pointers like read, open, flush pointing to the actual functions that implement the specific file system (of which there are many).

If for example we traced on the close system call itelf, we would just get the file descriptor in our eBPF program and we would have to get all the useful information from that ourselves, which is often not even possible in a restricted eBPF program, because you cannot call arbitray kernel functions. So it makes sense to trace on vfs_close, which takes a struct file instead. Unfortunately this function particular is inlined! And if there is no jump to that function, we cannot hook into it. The next interesting function in the call chain is filp_close, which calls into f_op directly, so it’s the last function to trace that is not already part of a specific file system.

This also means that flushing on close is not handled by VFS, but by the file system itself and we can’t get away by just tracing the flush operation, but we have to trace filp_close.

The next way to modify a file we consider is fsync. fsync calls into vfs_sync, which calls into vfs_sync_range, which then calls into f_op, so vfs_sync_range is what we want to trace additionally.

Similarly the automatically triggered write-back of dirty pages and msync all end up in vfs_sync_range, but I won’t go into detail for all of these. You may dig through the Kernel source yourself to verify.

You would think writev just does write, but that’s not true. It calls into vfs_writev!

sendfile calls do_splice_direct if the output is a file and eventually ends up calling do_splice_from, which calls fs_op->splice_write directly without a VFS layer function anywhere inbetween. splice also ends up in do_splice_from.

truncate and rename call into vfs_truncate and vfs_rename respectively.

This leaves us with the following functions to trace:

  • filp_close
  • vfs_sync_range
  • vfs_writev
  • do_splice_from
  • vfs_truncate
  • vfs_rename

I had hoped there were fewer.

Tricky Bits

Paths

As a human I cannot deal well with the sort of data that is passed to e.g. filp_close. That’s just a struct file*. I want more - I want paths. I started with filp_close and struggled immensely to get the absolute path from the struct file* only to learn later that this was one of the easy functions.

Essentially a struct path in the Linux kernel is a struct dentry *dentry and a struct vfsmount *mnt. The dentry contains the name of the filesystem object and a pointer to its parent dentry. This is not enough, because the end of this linked list is not the filesystem root, but the root of the mount point. This is why you need the struct vfsmount *mnt as well to get to the full path.

So essentially you have to read the kernel source and figure out ways to get to the data you want from the data you have. It’s not impossible, which is why it’s not listed under problems, but it is tricky, hence the section this piece of text is placed in.

The Verifier

A good chunk of the time, especially in the beginning, I spent on trying to please the eBPF verifier. After you learn how it works and some tricks, it stops being a constant annoyance.

As you might know it is hard (read: in general impossible) to say whether a program terminates. So the eBPF verifier imposes some restrictions on the kinds of programs you write. As the verifier has to enumerate all possible paths of your program and analyze them individually, you may only produce a finite number of paths. In practice this number doesn’t just have to be finite, but bounded (1 million instructions in total at the time of this writing). Essentially this means that you mostly have to use bounded loops and if you can’t tell that your program is finishing in finite time, the verifier will not either.

So instead of this:

while ((node = node->next)) {
    // ...
}

you should probably write this:

for (int i = 0; i < MAX_NUM_NODES; ++i)
{
    /// ...
    node = node->next;
}

I haven’t had much trouble with this limiation and on top of that the verifier has become much smarter about unbounded loops in recent kernel versions.

The really tricky bit are memory accesses though. In regular C code you would confidently write this:

if (offset + size <= BUF_SIZE) {
    bpf_probe_read_kernel(buf + offset, size, some_kernel_ptr);
}

That’s safe, right? Well, go ask the verifier - he will not agree. The verifier keeps track of every register and it’s possible value range, but it will not track the range of a value in relation to another. So if for example the only thing the verifier knows about a value offset is that in all possible universes it might be in the range [0, 4096], another value size is also in [0, 4096] and BUF_SIZE, the size of the buffer buf, is 4096, then there is no way to encode the constraint from if (offset + size <= BUF_SIZE). Consequently it will complain that bpf_probe_read_kernel might cause an out of bounds memory access, because offset and size are both in [0, 4096]. After all it is possible that both offset and size are 4096.

In practice this means that you need to double your buffers a lot, so that e.g. if offset and size are in [0, 4096], the size of buf should be 8192. Then you can never have an out of bounds memory access.

You also need to sprinkle your code with macros that explicitly limit the range of values, like this:

#define PATH_BUF_SIZE 4096
#define LIMIT_PATH_BUF_SIZE(x) ((x) &= (PATH_BUF_SIZE - 1))

// ...

LIMIT_PATH_BUF_SIZE(path_start);

When I started working on this I thought I could get away with just programming carefully, but you need to look at the eBPF assembly and see what’s actually going on to understand some of the errors.

For example look at this funny little thing:

name_len++;
if (name_len > path_start) {
  break;
}
; name_len++;
97: (07) r2 += 1                       ; R2_w=scalar(umin=1,umax=4294967296,
                                       ;   var_off=(0x0; 0x1ffffffff))
98: (bf) r1 = r2                       ; R1_w=scalar(id=6,umin=1,umax=4294967296,
                                       ;   var_off=(0x0; 0x1ffffffff))
                                       ; R2_w=scalar(id=6,umin=1,umax=4294967296,
                                       ;   var_off=(0x0; 0x1ffffffff))
99: (67) r1 <<= 32                     ; R1_w=scalar(smax=9223372032559808512,
                                       ;   umax=18446744069414584320,
                                       ;   var_off=(0x0; 0xffffffff00000000),
                                       ;   s32_min=0,s32_max=0,u32_max=0)
100: (77) r1 >>= 32                    ; R1=scalar(umax=4294967295,
                                       ;   var_off=(0x0; 0xffffffff))
; if (name_len > path_start) {
101: (25) if r1 > 0x1000 goto pc+2051  ; R1=scalar(umax=4096,var_off=(0x0; 0x1fff))

Here name_len is stored in r2. In instruction 97 it is incremented. path_start is just 4096. The umin/umax values in the comments indicate the known ranges of the registers and are just another way to display var_off. The problem with this snippet is that the verifier will not update the range of name_len, despite if (name_len > path_start) break. In general it will exclude paths that you break from and update value ranges accordingly. This is because here name_len is a 32-bit unsigned int and r2 is a 64-bit register, so the increment might push r2 out of 32-bit range. To perform the check in instruction 101, the compiler has to bring r1 back into 32-bit range by shifting it up (instruction 99) and down again (instruction 100). Unfortunately the compiler decided to use another register (r1) for this and as a result the verifier doesn’t know that it should update r2. It’s possible to program around these things, but it can be very annoying and it will frequently bug you.

Actual Problems

While those things above need getting used to and are managable, there are also problems that aren’t solved well with elbow grease.

Paths

I mentioned that filp_close was one of the easy functions, implying there might more difficult ones. The one I had in mind here is vfs_rename. It takes a struct renamedata*, which looks like this:

struct renamedata {
    struct mnt_idmap *old_mnt_idmap;
    struct inode *old_dir;
    struct dentry *old_dentry;
    struct mnt_idmap *new_mnt_idmap;
    struct inode *new_dir;
    struct dentry *new_dentry;
    struct inode **delegated_inode;
    unsigned int flags;
} __randomize_layout;

Just dentrys and inodes. But to build a full path I need a vfsmount! I haven’t found a way to get it. I considered other functions to trace, but I haven’t found a good solution. You cannot call into arbitrary functions in the kernel from eBPF, which is why it’s not as easy as just doing what you would do in the kernel.

My current workaround is to just traverse the dentry list and build a path up until the mount root. Then include this path in the event passed to userspace and the mount namespace id. As root, this should be enough to uniquely identify the file that was modified. I am sure I am still missing something for when the same filesystem is bind-mounted multiple times.

Unstable ABI

Another reason I have to go with the mount root path is that to traverse up to the parent mount, I need to get the struct mount that contains the struct vfsmount. I use container_of for this, but struct mount is not in a public header, so I have to define a (sufficient) subset myself. This is what I did:

struct mount {
    struct hlist_head mnt_hash; // list_head -> hlist_head Al Viro 2014 (same size)
    struct mount *mnt_parent;
    struct dentry *mnt_mountpoint; // Al Viro 2011 inserted
    struct vfsmount mnt;
    // lots more fields, that are partially dependent on compilation flags
};

I added some comments to note when these fields were last changed. Linux tries to have a stable API (“WE DO NOT BREAK USERSPACE”), but it does not have a stable ABI! This means such structs may change. This means your eBPF programs might break on every new kernel release and even if they don’t change, your distro might have a good reason to change it. As unstable as the kernel ABI is, your eBPF programs are likely too.

Inserting new fields or reordering them somehow is annoying enough, but it’s possible refactors take away fields you depend on all together and then you don’t just have to recompile your program, but solve your problem again from the beginning. This might incur a significant maintenance burden, especially considering this is usually not a particularly simple task.

__randomize_layout

I purposely left it in the definition of renamedata above to get you guessing! Many structures in the Linux kernel are marked with __randomize_layout so that its members are shuffled during compile time. This makes certain exploits more difficult as the offsets to critical structures are unknown.

Unfortunately we depend on some of those offsets! The full version of my small struct mount above is actually marked __randomize_layout as well. Meaning, if struct randomization is enabled in your kernel, there is no way to implement this full path traversal properly in an eBPF program.

Fortunately (?) many distros (such as Debian and Ubuntu) disable struct randomization. You can check this by running grep CONFIG_RANDSTRUCT /boot/config-$(uname -r), which should return CONFIG_RANDSTRUCT_NONE=y if it is disabled.

But if you use a distro that has it enabled or have some requirements to enable it, then certain things will be very difficult (or impossible) to do in an eBPF program.

I have later realized that smarter people are aware of this issue. You can use BTF and CO-RE (Compile Once - Run Everywhere) to generate headers with the actual struct layouts. I will only need the file system monitoring on distros that have struct randomization disabled, so I haven’t looked into it yet.

Conclusion

eBPF can be used for high performance file system monitoring, but it is possibly unstable and not very portable. If you control the target system very well, that might not be a huge problem. Also it is definitely some trouble to collect all the possible syscalls that might end up modifying a file on your disk and I am not sure I got them all. Even if I did, new ways could be added tomorrow or in a year. It seems efficient and flexible, but fragile, which is a difficult tradeoff. It’s the kind of thing where the tradeoff might be worth it today, but with an ongoing maintenance burden, that tradeoff might slowly become less favorable.

I am sure I should look into CO-RE and BTF because they alleviate some of my concerns, but I don’t think it will clear them all up.