Episode 93 — Disk full puzzles: filesystem full vs inode exhaustion vs runaway logs

In Episode Ninety-Three, we confront one of the most common yet deceptive challenges in system administration, which is the "disk full" alarm that can bring critical services to a sudden halt. As a seasoned educator in the cybersecurity space, I have observed that a disk reporting zero bytes available is rarely a simple matter of having too many large files. Sometimes the disk still has plenty of free blocks but is logically full because it has run out of metadata pointers, or space is being held hostage by a process that refuses to release a handle to a file that was already deleted. To maintain a professional-grade infrastructure, you must be able to solve these puzzles by identifying which specific resource has actually been exhausted. Today, we will break down the mechanics of storage consumption and metadata limits to give you a structured framework for restoring and maintaining system availability.

Before we continue, a quick note: this audio course is a companion to our Linux Plus books. The first book covers the exam and provides detailed guidance on how best to pass it. The second book is a Kindle-only eBook containing 1,000 flashcards you can use on your mobile device or Kindle. Check them both out at Cyber Author dot me, in the Bare Metal Study Guides Series.

To establish a professional foundation, you must resolve disk alarms by identifying exactly which resource is exhausted through a methodical verification of the filesystem. Your first step should always be to check the overall filesystem capacity and compare it to recent growth patterns to determine if this is a gradual accumulation or a sudden spike. A seasoned educator will remind you that a disk that fills up in minutes is usually the victim of a runaway process, while one that fills up over months is a capacity planning issue. You should look for sudden deviations from your established baseline to understand the velocity of the consumption. Recognizing the difference between a steady climb and a vertical spike is the foundational step in moving from a temporary cleanup to a permanent architectural fix.
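
As a concrete sketch, the capacity check on most Linux systems is the df command; the mount point below is only an example.

    # Report capacity for every mounted filesystem in human-readable units
    df -h

    # Focus on a single mount point, for example /var, and watch how it changes over time
    df -h /var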

You must specifically check for inode usage when the capacity seems to be available but writes continue to fail with a "no space left on device" error. An index node, or inode, is a data structure on a Linux filesystem that stores information about a file, and every file requires exactly one inode regardless of how small it is. If a malfunctioning script or a mail server creates millions of empty zero-byte files, you will run out of these metadata pointers long before you run out of actual gigabytes. You should use specialized flags with your disk reporting tools to verify if your metadata table is at one hundred percent capacity. A cybersecurity professional knows that inode exhaustion is a common symptom of poorly managed session directories or cache folders that lack a proper automated cleanup routine.
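
On most Linux systems the specialized flag in question is the inode option of df; a minimal check, assuming the affected filesystem is mounted at /var, looks like this.

    # Report inode usage instead of block usage; IUse% at 100% means metadata exhaustion
    df -i

    # Narrow the report to one suspect filesystem
    df -i /var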

When hunting for the cause of a full disk, you must recognize runaway logs and unrotated files as the most common culprits in an enterprise environment. A misconfigured application logging at a debug level can generate several gigabytes of text in a single hour, quickly overwhelming even the largest partitions if the rotation service is not configured correctly. You should be prepared to scan the filesystem for these growth patterns, looking for files that are significantly larger than their peers or those that have grown rapidly since the last check. Unrotated logs not only consume space but also make forensic analysis difficult because the files become too large for standard text editors to open. Mastering the identification of these "log bombs" is what allows you to restore service quickly without deleting essential application data.
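
One way to scan for those growth patterns, assuming the logs live under /var/log, is a find search like the sketch below; the size and age thresholds are arbitrary examples rather than recommendations.

    # List files under /var/log larger than 500 megabytes
    find /var/log -type f -size +500M -exec ls -lh {} \;

    # Catch a fast-growing file: anything over 100 megabytes modified in the last hour
    find /var/log -type f -mmin -60 -size +100M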

To find the specific location of the problem, you must find large directories by scanning for the top space consumers across your various mount points. You should use tools that can summarize the usage of each directory tree, allowing you to "drill down" from the root of the filesystem to the specific folder that contains the bulk of the data. This visual or numeric breakdown is essential because it prevents you from wasting time cleaning up small, irrelevant files in the wrong location. A seasoned educator will tell you that the "eighty-twenty rule" often applies here; eighty percent of the space is usually consumed by twenty percent of the directories. Identifying the offending directory tree is the most efficient way to target your remediation efforts during a high-pressure system outage.
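
A common way to perform that drill-down is to summarize one directory level at a time with du and sort the results; this sketch assumes you start at the root of the affected filesystem and that GNU coreutils is available.

    # Summarize first-level directories, staying on one device, with the largest last
    du -xh --max-depth=1 / 2>/dev/null | sort -h | tail -20

    # Repeat the same command inside the biggest directory to keep drilling down
    du -xh --max-depth=1 /var 2>/dev/null | sort -h | tail -20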

One of the most elusive puzzles involves identifying open deleted files that are still consuming hidden disk space despite no longer appearing in the directory structure. When you delete a file that is still being written to or read by an active process, the directory entry is removed, but the disk blocks are not freed until that process closes the file handle. This often happens when an administrator deletes a massive log file to "free up space," only to find that the disk usage reported by the system does not change at all. You can use specialized tools to list open files that have a link count of zero to find these "ghost" consumers. A professional administrator knows that restarting the service or truncating the file is often required to actually reclaim that trapped space.
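
On most distributions the specialized tool is lsof; a minimal sketch for spotting these ghost consumers looks like the following, where the process ID and file descriptor in the last command are hypothetical placeholders.

    # List open files whose link count has dropped to zero
    lsof +L1

    # Alternative: search the full listing for the deleted marker
    lsof | grep '(deleted)'

    # Reclaim the space without a restart by truncating the file through its open handle
    : > /proc/1234/fd/4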

In more modern environments, you should also consider the impact of caches and temporary files that are created by failed or hung jobs. If a backup script fails halfway through its routine, it may leave behind a massive temporary archive in the temporary directory that does not show up in standard application reports. These files often lack descriptive names and can be easily overlooked during a manual search if you are only looking for known application data. You should audit the standard temporary locations and the package manager caches to see if a failed update or an interrupted download is the source of the bloat. Recognizing these "orphaned" files is a vital part of your responsibility as a senior technical expert who maintains system cleanliness.
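
Auditing those locations might look like the sketch below; the apt command assumes a Debian-style system, and dnf or yum systems have equivalent clean operations.

    # See how much space temporary locations and the package cache are holding
    du -sh /tmp /var/tmp /var/cache/apt 2>/dev/null

    # Clear the package manager's downloaded archives in a supported manner
    apt-get clean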

You must also recognize snapshot growth as a source of hidden usage on modern copy-on-write filesystems like the Zettabyte File System, known as ZFS, or the B-tree File System, known as Btrfs. If you take a snapshot of a database and then delete the original data to make room, the disk space will not be freed because the snapshot is still holding those specific blocks to preserve the historical state. The usage might not appear in standard directory scans because the "files" are actually stored in the hidden metadata of the snapshot layer. This can lead to a confusing situation where the sum of all files on the disk is much smaller than the reported "used" space. Understanding the persistence of snapshots is essential for managing storage on advanced volumes where traditional tools may provide an incomplete picture.
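
The commands below give a rough sketch of how snapshot usage can be inspected; they assume a Btrfs volume mounted at the root and a ZFS pool named tank, both of which are hypothetical examples.

    # Btrfs: list subvolumes and snapshots, then show space accounting that includes them
    btrfs subvolume list /
    btrfs filesystem du -s /

    # ZFS: show how much space each snapshot is still holding
    zfs list -t snapshot -o name,used,refer tank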

Let us practice a recovery scenario where you cannot create a new file, and you must decide whether to check the raw capacity or the inode count first. Your first move should be to check the human-readable capacity; if it shows fifty percent free, you must immediately suspect metadata exhaustion and check the inode usage. If the inodes are at one hundred percent, you have found your culprit, and you should begin searching for a directory containing millions of very small files. Second, you would identify the process creating these files and stop it before deleting the accumulated files to free their inodes. This methodical investigation of capacity versus metadata is how you solve disk puzzles with professional authority rather than guessing.
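
That scenario maps onto a short diagnostic sequence; the /var paths below are only a hypothetical example of where millions of tiny files might accumulate.

    # Step one: raw capacity looks healthy, so check the metadata next
    df -h /var
    df -i /var

    # Step two: count entries per directory to find the one holding millions of tiny files
    for d in /var/*/; do echo "$(find "$d" -xdev | wc -l) $d"; done | sort -n | tail -5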

A vital technical rule for any professional administrator is to free space safely by preferring the removal of old logs and temporary caches over active data. You should never "blindly" delete files you do not recognize; instead, use the package manager's cleaning tools or the system's own log rotation commands to reclaim space in a supported manner. Compressing old text files is often a safer first step than outright deletion because it preserves the historical record while drastically reducing the footprint on the disk. A seasoned educator will remind you that the goal is to "relieve the pressure" on the filesystem so that the system can function while you look for a permanent solution. Protecting the integrity of the remaining data is more important than a fast, reckless cleanup.
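
A few supported cleanup commands, assuming a systemd journal, a Debian-style package manager, and a hypothetical application log path, look like this; the retention size is an arbitrary example.

    # Shrink the systemd journal to a fixed size in a supported way
    journalctl --vacuum-size=200M

    # Clear downloaded package archives rather than deleting unfamiliar files by hand
    apt-get clean

    # Compress an old log instead of deleting it, preserving the historical record
    gzip /var/log/myapp/access.log.1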

Once the immediate crisis is averted, you must fix the underlying cause by adjusting the log rotation, the data retention policies, and the application verbosity settings. If a disk filled up because of runaway logs, you must ensure that the rotation service is configured to compress files more aggressively or to keep fewer historical copies. You should also check if an application was accidentally left in "debug" mode, which causes it to write far more data than is necessary for standard production operations. Setting up automated alerts that trigger at eighty percent usage allows you to intervene long before the system hits the critical one hundred percent mark. Implementing these "preventative" controls is what separates a reactive technician from a proactive system administrator.
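
As a sketch of both controls, the rotation policy and the threshold check below use hypothetical paths and arbitrary example values rather than recommended settings.

    # A hypothetical /etc/logrotate.d/myapp policy that compresses and keeps seven copies:
    #   /var/log/myapp/*.log {
    #       daily
    #       rotate 7
    #       compress
    #       missingok
    #       notifempty
    #   }

    # Cron-friendly alert: warn when any filesystem crosses eighty percent usage
    df -P | awk 'NR > 1 && $5+0 >= 80 {print "WARNING: " $6 " is at " $5}'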

To help you remember these recovery building blocks during a high-pressure incident, you should use a simple memory hook: space, inodes, hidden, then logs. First, check the raw "space" available; second, check the "inode" count for metadata exhaustion; third, look for "hidden" open handles to deleted files; and finally, audit the "logs" for runaway growth. By keeping this logic-driven distinction in mind, you can quickly categorize any storage issue and reach for the correct technical tool to solve it. This mental model is a powerful way to organize your technical knowledge and ensure you are always managing the right part of the storage stack. It ensures that you do not miss the "invisible" causes of a full disk while focusing only on the obvious ones.

For a quick mini review of this episode, can you name three specific reasons why a disk might fill up unexpectedly overnight? You should recall that runaway logging due to a configuration error, inode exhaustion from millions of small files, and open handles to deleted files are the primary technical suspects. Each of these scenarios presents a "full disk" symptom but requires a completely different diagnostic command to uncover the truth. By internalizing these threats to availability, you are preparing yourself for the real-world engineering and leadership tasks that define a technical expert. Understanding that a disk can be "full" in multiple ways is essential for passing the exam and maintaining your fleet of servers.

As we reach the conclusion of Episode Ninety-Three, I want you to describe your cleanup order and one prevention step for a server that frequently runs out of space. Your first step should be to identify the largest directories and clear the package manager caches, followed by the truncation of any large, unneeded logs. Finally, you should implement a stricter retention policy in the log rotation configuration to ensure the problem does not recur next week. By verbalizing your strategic choice, you are demonstrating the professional integrity and the technical mindset required for the certification and a successful career in cybersecurity. Managing disk puzzles is the ultimate exercise in professional system resilience and long-term environmental accountability.
