The file table: osquery's secret weapon

The file table: osquery's secret weapon

Fritz Ifert-Miller by Fritz Ifert-Miller on

In this article, we’ll take a deep dive into osquery tables, by talking about osquery’s ‘file table’ which has some powerful abilities, along with its fair share of limitations.

This article is part of an ongoing series in which we look at osquery tables and discuss what they can and cannot do. Check out our previous blog, “How To Write a New Osquery Table” for a great place to get started in the basics of osquery.

In this article, we’ll be taking a slightly deeper dive into osquery tables, by talking about osquery’s file table, which has some powerful abilities, as well as its fair share of limitations.

Specifically, this post will look at a cross-platform way to find files across your infrastructure.

The file table in osquery

The file table in osquery has an incredible degree of utility across many queries, and represents a fundamental cornerstone of osquery’s core capabilities.

Let’s start by taking a look at its basic schema:

+---------------+---------+---------------------------------------+
| COLUMN        | TYPE    | DESCRIPTION                           |
+---------------+---------+---------------------------------------+
| path          | TEXT    | Absolute file path                    |
| directory     | TEXT    | Directory of file(s)                  |
| filename      | TEXT    | Name portion of file path             |
| inode         | BIGINT  | Filesystem inode number               |
| uid           | BIGINT  | Owning user ID                        |
| gid           | BIGINT  | Owning group ID                       |
| mode          | TEXT    | Permission bits                       |
| device        | BIGINT  | Device ID (optional)                  |
| size          | BIGINT  | Size of file in bytes                 |
| block_size    | INTEGER | Block size of filesystem              |
| atime         | BIGINT  | Last access time                      |
| mtime         | BIGINT  | Last modification time                |
| ctime         | BIGINT  | Last status change time               |
| btime         | BIGINT  | (B)irth or (cr)eate time              |
| hard_links    | INTEGER | Number of hard links                  |
| symlink       | INTEGER | 1 if the path is a symlink, else 0    |
| type          | TEXT    | File status                           |
| attributes    | TEXT    | File attrib string.                   |
| volume_serial | TEXT    | Volume serial number                  |
| file_id       | TEXT    | file ID                               |
+---------------+---------+---------------------------------------+

As we can see, there are many metadata attributes that we can use to our advantage both when building queries and refining results. Let’s run a sample query against the file table to inspect a file on our local device:

SELECT *
FROM file
WHERE path = '/Users/fritz-imac/Downloads/github-recovery-codes.txt';

This query returns the following data:

path = /Users/fritz-imac/Downloads/github-recovery-codes.txt
directory = /Users/fritz-imac/Downloads
filename = github-recovery-codes.txt
inode = 20650405
uid = 502
gid = 20
mode = 0644
device = 0
size = 206
block_size = 4194304
atime = 1533646421
mtime = 1532976585
ctime = 1532976860
btime = 1532976585
hard_links = 1
symlink = 1
type = regular

Very cool! We can see that we have a fair bit of information about this github-recovery-codes.txt file. Now, let’s take a moment to clean it up and return the values in a format that we can quickly parse.

Osquery, by default, returns some data in a less than humanly digestible format.

  • Time is represented in UNIX epoch

  • Size is defined in bytes.1

Unix epoch date-times are great because they are the easiest to transform. Using the datetime(value,'unixepoch') syntax, we can convert any date-time result in osquery to an easier read value. Then, we will round our size from bytes to megabytes using the ROUND function ROUND((f.size * 10e-7),4).

Finally, we will join our uid and gid on their respective tables (users and groups) and return their actual names. Let’s try it below:

SELECT
  f.path,
  u.username AS file_owner,
  g.groupname AS group_owner,
  datetime(f.atime, 'unixepoch') AS file_last_access_time,
  datetime(f.mtime, 'unixepoch') AS file_last_modified_time,
  datetime(f.ctime, 'unixepoch') AS file_last_status_change_time,
  datetime(f.btime, 'unixepoch') AS file_created_time,
  ROUND(
    (f.size * 10e - 7),
    4
  ) AS size_megabytes
FROM
  file f
  LEFT JOIN users u ON f.uid = u.uid
  LEFT JOIN groups g ON f.gid = g.gid
WHERE
  path = "/Users/fritz-imac/Downloads/github-recovery-codes.txt"

Running this query results in the following output:

path = /Users/fritz-imac/Downloads/github-recovery-codes.txt
file_owner = fritz-imac
group_owner = staff
file_last_access_time = 2018-08-29 18:43:40
file_last_modified_time = 2018-07-30 18:49:45
file_last_status_change_time = 2018-07-30 18:54:20
file_created_time = 2018-07-30 18:49:45
size_megabytes = 0.0002

Already that is much easier to read! Now we can go hunting for any file we want. But slow down, tiger, there is a big caveat with the file table in osquery.

You have to know where your file is first!

Unfortunately, the file table requires a WHERE clause, meaning that you need to know roughly where an item is before you can go querying for it. This protects against the massive recursion that would be necessitated by searching every single directory and their respective subdirectories on the file system.

The clause WHERE path = can thankfully be massaged through the use of wildcards and a LIKE argument, e.g., WHERE path LIKE "/foo/%" but there are some tricky things to look out for in terms of the way wildcards are handled!

Understanding how single wildcards ‘%’ work within the file table

SELECT * FROM file WHERE path LIKE "/%";

Single wildcards treat the file table like an ogre, in that it’s like an onion: full of layers.

A single % in the file table allows you to find items in a specified layer of the file system. For instance:

SELECT
  *
FROM
  file
WHERE
  path LIKE "/Users/%/Google Drive/%/%";

This query would find any files that were located within directories like:

/Users/username/Google Drive/foo/bar/
/Users/username2/Google Drive/bar/foo/

It will not, however, return any files that were located in subsequent deeper subdirectories. For instance:

/Users/username/Google Drive/foo/bar/baz/filename.ext
/Users/username/Google Drive/foo/bar/baz/qux/quux/filename.ext

This means you can’t print out every file on your file system by running:

SELECT
  *
FROM
  file
WHERE
  path LIKE "/%"

As long as you only use single % wildcards, you will only ever be able to see all files at that layer of the subdirectory hierarchy. So in this instance, the first layer, items in the root directory:

osquery> SELECT path, btime, size, type FROM file WHERE path LIKE "/%";
+----------------------------+------------+------+-----------+
| path                       | btime      | size | type      |
+----------------------------+------------+------+-----------+
| /Applications/             | 1508949222 | 4114 | directory |
| /Library/                  | 1508949320 | 2312 | directory |
| /Network/                  | 1469907150 | 68   | directory |
| /System/                   | 1508949086 | 136  | directory |
| /Users/                    | 1469911752 | 170  | directory |
| /Volumes/                  | 1469907156 | 136  | directory |
| /bin/                      | 1508949475 | 1292 | directory |
| /cores/                    | 1469907149 | 68   | directory |
| /dev/                      | 0          | 4444 | directory |
| /etc/                      | 1508949384 | 4114 | directory |
| /home/                     | 0          | 1    | directory |
| /installer.failurerequests | 1504228163 | 313  | regular   |
| /net/                      | 0          | 1    | directory |
| /private/                  | 1519081368 | 204  | directory |
| /sbin/                     | 1508949475 | 2142 | directory |
| /tmp/                      | 1519081368 | 1326 | directory |
| /usr/                      | 1508948533 | 306  | directory |
| /var/                      | 1510954365 | 986  | directory |
+----------------------------+------------+------+-----------+

Understanding how double wildcards ‘%%’ work within the file table

To recursively search your filesystem, you can use %% the double wildcard.

-- WARNING: Do NOT run the following query it will literally return thousands of
-- files on your computer!!!
SELECT
  *
FROM
  file
WHERE
  path LIKE "/%%"

Double wildcards can only ever be used at the end of a string eg. /foo/bar/%%.

Double wildcards can NEVER be used mid-string (infix)

This means the following query will never return results:

SELECT
  path
from
  file
WHERE
  path LIKE "/Users/%%/UX/%%";

However, you can use the double wildcard to your advantage in limited situations where you want to search within a known parent directory and its respective sub-directories. The more possible sub-directories and recursion, the slower and less performant the query will be.

SELECT
  path
FROM
  file
WHERE
  path LIKE "/Users/fritz-imac/Google Drive/UX/%%";

Mixing single ‘%’ and double wildcards ‘%%’ within the file table

The last fun thing you can do is mix wildcard types, if you want to restrict certain parts of your WHERE clause to a specific layer and permit recursion at another part of the path.

For instance, if I wanted to see how much percentage of disk space the files located in Google Drive are using per device across my entire fleet, I could run the following query:

SELECT SUM(file.size) AS gdrive_size,
  (SELECT (mounts.blocks * mounts.blocks_size)
   FROM mounts
   WHERE PATH = '/') AS total_disk_size,
       (100.0 * SUM(file.size) /
          (SELECT (mounts.blocks * mounts.blocks_size)
           FROM mounts
           WHERE PATH = '/')) AS gdrive_percentage_used
FROM FILE
WHERE file.path LIKE '/Users/%/Google Drive/%%';

Which produces the following output:

+-------------+-----------------+------------------------+
| gdrive_size | total_disk_size | gdrive_percentage_used |
+-------------+-----------------+------------------------+
| 7492526705  | 379000430592    | 1.97691772890512       |
+-------------+-----------------+------------------------+

These queries can be run, but you need to be mindful of how many files your system is looking through to produce the end result.

Limitations of wildcard LIKE searches in the file table

While undeniably useful, the file table is not without limitations. Knowing what those limitations are and how to avoid them will help you make sure you are accurately returning your expected data.

It is worth noting that both of these limitations are outlined in a GitHub issue, in the osquery repository #7306, and may (someday) be addressed.๐Ÿคž๐Ÿป

If you were to run a query like:

SELECT * FROM file WHERE path LIKE '/%%'

You might expect that it would return every file on your file-system, but in practice it does not. This is because osquery will end the query once it encounters something called a symlink loop.

osquery> SELECT COUNT(*) FROM file WHERE path LIKE '/%%';
+----------+
| COUNT(*) |
+----------+
| 1172     |
+----------+

Symlinks (also known as symbolic links) are frequently referred to as shortcuts. They are files which point to another file or folder on the computer.

A symlink loop is when a symlink points back to a parent directory in its path which creates an infinite loop if traversed. We can observe this behavior in practice by querying a folder which contains one of these loops:

As we can see in the screenshot below, the directory /tmp/directory-level-0/ consists of 6 nested folders.

A screenshot showing the symbolic link loop.
Screenshot of macOS showing nested directories

The tree command shows the expected output of our nested directories:

โžœ  directory-level-0 tree
.
โ””โ”€โ”€ directory-level-1
    โ”œโ”€โ”€ directory-level-2
    โ”‚   โ””โ”€โ”€ directory-level-3
    โ”‚       โ””โ”€โ”€ directory-level-4
    โ”‚           โ””โ”€โ”€ directory-level-5
    โ”‚               โ””โ”€โ”€ directory-level-6
    โ””โ”€โ”€ symlink-back-to-level-0 -> /tmp/directory-level-0

However, when we recursively query the parent folder using osquery, we receive an incomplete result. This is because the symbolic link is causing a recursive loop of a child pointing back to a parent, which osquery avoids by preemptively terminating the query:

osquery> SELECT path FROM file WHERE path LIKE '/tmp/directory-level-0/%%';
+-------------------------------------------------------------------------------------+
| path                                                                                |
+-------------------------------------------------------------------------------------+
| /tmp/directory-level-0/directory-level-1/                                           |
| /tmp/directory-level-0/directory-level-1/directory-level-2/                         |
| /tmp/directory-level-0/directory-level-1/directory-level-2/directory-level-3/       |
| /tmp/directory-level-0/directory-level-1/symlink-back-to-level-0/                   |
| /tmp/directory-level-0/directory-level-1/symlink-back-to-level-0/directory-level-1/ |
+-------------------------------------------------------------------------------------+

Unfortunately, the recursive searching in the osquery file table is accomplished via glob, which has no method for skipping over symlinks.

As a result you have to take a targeted approach when recursively searching. Use double wildcards to avoid recursing through paths that may contain symlinks.

Hidden files are not returned by recursive searches

If a file is marked as hidden in the file-system (typically by prepending the filename with a “."), it will not be returned in the results of any recursive file search. This can be observed in the following example query:

osquery> SELECT * FROM file WHERE path LIKE '/Users/test-macbook/git/kolide/test/%' AND filename = '.git';

osquery>

osquery> SELECT path FROM file WHERE path = '/Users/test-macbook/git/kolide/test/.git';
+------------------------------------------+
| path                                     |
+------------------------------------------+
| /Users/test-macbook/git/kolide/test/.git |
+------------------------------------------+

osquery> SELECT path FROM file WHERE directory = '/Users/test-macbook/git/kolide/test' AND filename = '.git';
+------------------------------------------+
| path                                     |
+------------------------------------------+
| /Users/test-macbook/git/kolide/test/.git |
+------------------------------------------+

Keep this in mind if you are searching for files that might be marked as hidden in the filesystem. Otherwise, you may be missing results that you expected.

When should I use recursive queries in the file table in osquery?

Basically, there are two times when it makes sense to use recursive queries in the file table:

  • When there is literally no other option for locating a file of import.

  • When you can limit the degree of recursion that your search is capable of by scoping it to a particular folder, e.g., ('/Users/%/Downloads/%%')

If there is ever any doubt on how expensive a query is after you schedule it, osquery comes with great tools to analyze just that. For example, in a product like 1Passwordยฎ Extended Access Management, you can run a live query against your devices that enumerates all of the queries in your schedule by simply running:

SELECT * FROM osquery_schedule
A screenshot showing the XAM osquery schedule.
Live Query allows you to run osquery SQL across all of your enrolled devices in 1Password Extended Access Management, and immediately get the results in the browser or via API.

The screenshot shows a line-up of sample queries in our schedule. We can short all of those queries by the wall_time to get a sense of how long these take to complete. We can even see how much memory they may be using during execution. These are all indicators to help us avoid executing file queries with too much recursion potential.

Essentially, file tables are a fantastic tool in osquery โ€“ so long as you use them wisely. And if you’d like an out-of-the-box osquery solution for managing your company’s endpoints, why not check out 1Password Extended Access Management? It does all of this, plus a whole lot more.

If you’d like to read more content like this, sign up for our developer newsletter.


  1. Osquery sizes are returned in bytes but some comp-sci items (RAM vs Hard Drives) calculate their size in base-10 and some calculate in base-2. ↩︎

Principal Product Manager

Fritz Ifert-Miller - Principal Product Manager Fritz Ifert-Miller - Principal Product Manager

Tweet about this post