Category Archives: Linux Device Drivers

LFY series on Linux device drivers

The Semester Project – Part VII: File System in Action

This twenty-fourth article, which is part of the series on Linux device drivers, gets the complete real SIMULA file system module in action, with a real hardware partition on your pen drive.

<< Twenty-third Article

Real SFS in action

Code available from rsfs_in_action_code.tbz2 gets to the final tested implementation of the final semester project of Pugs & Shweta. This contains the following:

real_sfs.c – contains the code of earlier real_sfs_minimal.c plus the remaining real SIMULA file system functionalities. Note the file system’s name change from sfs to real_sfs
real_sfs_ops.c & real_sfs_ops.h – contains the earlier code plus the additional operations needed for the enhanced real_sfs.c implementations
real_sfs_ds.h (almost same file as in the previous article, plus a spin lock added into the real SFS info structure to be used for preventing race conditions in accessing the used_blocks array in the same structure)
format_real_sfs.c (same file as in the previous articles) – the real SFS formatter application
Makefile – contains the rules for building the driver real_sfs_final.ko using the real_sfs_*.* files, and the format_real_sfs application using the format_real_sfs.c

With all these and earlier details, Shweta completed their project documentation. And so finally, Shweta & Pugs were all set for their final semester project demo, presentation and viva.

The highlights of their demo (on root shell) are as follows:

Loading real_sfs_final driver: insmod real_sfs_final.ko
Using the previously formatted pen drive partition /dev/sdb1 or Re-formatting it using the format_real_sfs application: ./format_real_sfs /dev/sdb1. Caution: Please check out the complete detailed steps on this from the previous article, before you actually format it
Mount the real SFS formatted partition: mount -t real_sfs /dev/sdb1 /mnt
And … And what? Browse the mounting filesystem. Use your usual shell commands to operate on the file system: ls, cd, touch, vi, rm, chmod, …

Figure 40 shows the real SIMULA file system in action

Figure 40: The real SIMULA file system module in action

Realities behind the action

And if you really want to know, what are the additional enhancements Pugs added to the previous article’s code to get to this level, it is basically the following core system calls as part of the remaining 4 out of 5 set of structures of function pointers (in real_sfs.c):

write_inode (under struct super_operations) – sfs_write_inode() basically gets a pointer to an inode in the VFS’ inode cache, and is expected to sync that with the inode in physical hardware space file system. That is achieved by calling the appropriately modified sfs_update() (defined in real_sfs_ops.c) (adapted from the earlier browse_real_sfs application). The key parameter changes being passing the inode number instead of the filename and the actual timestamp instead of the flag for its update status. And accordingly, the call to filename based sfs_lookup() is being replaced by inode number based sfs_get_file_entry() (defined in real_sfs_ops.c), and additionally now the data blocks are also being freed (using sfs_put_data_block() (defined in real_sfs_ops.c)), if the file size has reduced. Note that sfs_put_data_block() (defined in real_sfs_ops.c) is a transformation of the put_data_block() from the browse_real_sfs application. Also, note the interesting mapping to / from the VFS inode number from / to our zero-based file entry indices, using the macros S2V_INODE_NUM() / V2S_INODE_NUM() in real_sfs_ops.h.
And finally, underlying write is being achieved using write_to_real_sfs(), a function added in real_sfs_ops.c, very similar to read_from_real_sfs() (already there in real_sfs_ops.c), except the direction reversal of the data transfer and marking the buffer dirty to be synced up with the physical content. Alongwith, in real_sfs_ops.c, two wrapper functions read_entry_from_real_sfs() (using read_from_real_sfs()) and write_entry_to_real_sfs() (using write_to_real_sfs()) have been written and used respectively for the specific requirements of reading and writing the file entries, to increase the code readability. sfs_write_inode() and sfs_update() are shown in the code snippet below. sfs_write_inode() has been written in the file real_sfs.c. For others, check out the file real_sfs_ops.c

#if (LINUX_VERSION_CODE < KERNEL_VERSION(2,6,34))
static int sfs_write_inode(struct inode *inode, int do_sync)
#else
static int sfs_write_inode(struct inode *inode, struct writeback_control *wbc)
#endif
{
	sfs_info_t *info = (sfs_info_t *)(inode->i_sb->s_fs_info);
	int size, timestamp, perms;

	printk(KERN_INFO "sfs: sfs_write_inode (i_ino = %ld)\n", inode->i_ino);

	if (!(S_ISREG(inode->i_mode))) // Real SFS deals only with regular files
		return 0;

	size = i_size_read(inode);
	timestamp = inode->i_mtime.tv_sec > inode->i_ctime.tv_sec ?
			inode->i_mtime.tv_sec : inode->i_ctime.tv_sec;
	perms = 0;
	perms |= (inode->i_mode & (S_IRUSR | S_IRGRP | S_IROTH)) ? 4 : 0;
	perms |= (inode->i_mode & (S_IWUSR | S_IWGRP | S_IWOTH)) ? 2 : 0;
	perms |= (inode->i_mode & (S_IXUSR | S_IXGRP | S_IXOTH)) ? 1 : 0;

	printk(KERN_INFO "sfs: sfs_write_inode with %d bytes @ %d secs w/ %o\n",
		size, timestamp, perms);

	return sfs_update(info, inode->i_ino, &size, &timestamp, &perms);
}

int sfs_update(sfs_info_t *info, int vfs_ino, int *size, int *timestamp, int *perms)
{
	sfs_file_entry_t fe; 
	int i;
	int retval;

	if ((retval = sfs_get_file_entry(info, vfs_ino, &fe)) < 0) 
	{   
		return retval; 
	}   
	if (size) fe.size = *size;
	if (timestamp) fe.timestamp = *timestamp;
	if (perms && (*perms <= 07)) fe.perms = *perms;

	for (i = (fe.size + info->sb.block_size - 1) / info->sb.block_size;
		i < SIMULA_FS_DATA_BLOCK_CNT; i++)
	{   
		if (fe.blocks[i])
		{   
			sfs_put_data_block(info, fe.blocks[i]);
			fe.blocks[i] = 0;
		}   
	}   

	return write_entry_to_real_sfs(info, V2S_INODE_NUM(vfs_ino), &fe);
}

create, unlink, lookup (under struct inode_operations) – All the 3 functions sfs_inode_create(), sfs_inode_unlink(), sfs_inode_lookup() have the 2 common parameters (the parent’s inode pointer and the pointer to the directory entry for the file in consideration), and these respectively create, delete, and lookup an inode corresponding to their directory entry pointed by their parameter, say dentry.
sfs_inode_lookup() basically searches for the existence of the filename underneath using the appropriately modified sfs_lookup() (defined in real_sfs_ops.c) (adapted from the earlier browse_real_sfs application – the adoption being replacing the user space lseek()+read() combo by the read_entry_from_real_sfs()). If filename is not found, then it invokes the generic kernel function d_splice_alias() to create a new inode entry in the underlying file system, for the same, and then attaches it to the directory entry pointed by dentry. Otherwise, it just attaches the inode from VFS’ inode cache (using generic kernel function d_add()). This inode, if obtained fresh (I_NEW), needs to be filled in with the real SFS looked up file attributes. In all the above implementations and in those to come, a few basic assumptions have been made, namely:
- Real SFS maintains mode only for user and that is mapped to all 3 of user, group, other of the VFS inode
- Real SFS maintains only one timestamp and that is mapped to all 3 of created, modified, accessed times of the VFS inode.
sfs_inode_create() and sfs_inode_unlink() correspondingly invokes the transformed sfs_create() and sfs_remove() (defined in real_sfs_ops.c) (adapted from the earlier browse_real_sfs application), for respectively creating and clearing the inode entries at the underlying hardware space file system, apart from the usual inode cache operations, using new_inode()+insert_inode_locked(), d_instantiate() & inode_dec_link_count(), instead of the earlier learnt iget_locked(), d_add(). Apart from the permissions and file entry parameters, and replacing lseek()+read() combo by read_entry_from_real_sfs(), sfs_create() has an interesting transformation from user space to kernel space: time(NULL) to get_seconds(). And in both of sfs_create() and sfs_remove() the user space lseek()+write() combo has been replaced by the obvious write_entry_to_real_sfs(). Check out all the above mentioned code pieces in the files real_sfs.c and real_sfs_ops.c, as mentioned.
readpage, write_begin, writepage, write_end (under struct address_space_operations) – All these address space operations are basically to read and write blocks on the underlying filesystem, and are achieved using the respective generic kernel functions mpage_readpage(), block_write_begin(), block_write_full_page(), generic_write_end(). First one is prototyped in <linux/mpage.h> and remaining 3 in <linux/buffer_head.h>. Now, though these functions are generic enough, a little thought would show that the first three of these would ultimately have to do a real SFS specific transaction with the underlying block device (the hardware partition), using the corresponding block layer APIs. And that exactly is achieved by the real SFS specific function sfs_get_block(), which is being passed into and used by the first three functions, mentioned above.
sfs_get_block() (defined in real_sfs.c) is invoked to read a particular block number (iblock) of a file (denoted by an inode), into a buffer head (bh_result), optionally fetching (allocating) a new block. So for that, the block array of corresponding real SFS inode is looked up into and then the corresponding block of the physical partition is fetched using the kernel API map_bh(). Again note that to fetch a new block, we invoke the sfs_get_data_block() (defined in real_sfs_ops.c), which is again a transformation of the get_data_block() from the browse_real_sfs application. Also, in case of a new block allocation, the real SFS inode is also updated underneath, using sfs_update_file_entry() – a one liner implementation in real_sfs_ops.c. Code snippet below shows the sfs_get_block() implementation.

static int sfs_get_block(struct inode *inode, sector_t iblock,
				struct buffer_head *bh_result, int create)
{
	struct super_block *sb = inode->i_sb;
	sfs_info_t *info = (sfs_info_t *)(sb->s_fs_info);
	sfs_file_entry_t fe;
	sector_t phys;
	int retval;

	printk(KERN_INFO "sfs: sfs_get_block called for I: %ld, B: %llu, C: %d\n",
		inode->i_ino, (unsigned long long)(iblock), create);

	if (iblock >= SIMULA_FS_DATA_BLOCK_CNT)
	{
		return -ENOSPC;
	}
	if ((retval = sfs_get_file_entry(info, inode->i_ino, &fe)) < 0)
	{
		return retval;
	}
	if (!fe.blocks[iblock])
	{
		if (!create)
		{
			return -EIO;
		}
		else
		{
			if ((fe.blocks[iblock] = sfs_get_data_block(info)) ==
				INV_BLOCK)
			{
				return -ENOSPC;
			}
			if ((retval = sfs_update_file_entry(info, inode->i_ino, &fe))
				< 0) 
			{   
				return retval;
			}
		}
	}
	phys = fe.blocks[iblock];
	map_bh(bh_result, sb, phys);

	return 0;
}

open, release, read, write, aio_read/read_iter (since kernel v3.16), aio_write/write_iter (since kernel v3.16), fsync (under a regular file’s struct file_operations) – All these operations should basically go through the buffer cache, i.e. should be implemented using the address space operations. And this being a common requirement, the kernel provides a generic set of kernel APIs, namely generic_file_open(), do_sync_read()/new_sync_read() (since kernel v3.16), do_sync_write()/new_sync_write() (since kernel v3.16), generic_file_aio_read()/generic_file_read_iter() (since kernel v3.16), generic_file_aio_write()/generic_file_write_iter() (since kernel v3.16), simple_sync_file()/noop_fsync() (since kernel v2.6.35). Moreover, the address space operations read & write are no longer required to be given since kernel v4.1. Note that there is no API for release, as it is a ‘return 0‘ API. Check out the real_sfs.c file for the exact definition of struct file_operations sfs_fops.
readdir/iterate (since kernel v3.11) (under a directory’s struct file_operations) – sfs_readdir()/sfs_iterate() primarily reads the directory entries of an underlying directory (denoted by file), and fills it up into the VFS directory entry cache (pointed by dirent parameter) using the parameter function filldir, or into the directory context (pointed by ctx parameter) (since kernel v3.11).
As real SFS has only one directory, the top one, this function basically fills up the directory entry cache with directory entries for all the files in the underlying file system, using the transformed sfs_list() (defined in real_sfs_ops.c), adapted from the browse_real_sfs application. Check out the real_sfs.c file for the complete sfs_readdir()/sfs_iterate() implementation, which starts with filling directory entries for ‘.‘ (current directory) and ‘..‘ (parent directory) using parameter function filldir(), or generic kernel function dir_emit_dots() (since kernel v3.11), and then for all the files of the real SFS, using sfs_list().
put_super (under struct super_operations) – The previous custom implementation sfs_kill_sb() (pointed by kill_sb) has been enhanced by separating it into the custom part being put into sfs_put_super() (and now pointed by put_super), and the standard kill_block_super() being directly pointed by kill_sb. Check out the file real_sfs.c for all these changes.

With all these in place, one could see the amazing demo by Pugs in action, as shown in Figure 40. And don’t forget watching the live log in /var/log/messages using a ‘tail -f /var/log/messages‘, matching it with every command you issue on the mounted real SFS file system. This would give you the best insight into when does which system call gets called. Or, in other words which application invokes which system call from the file system front. For tracing all the system calls invoked by an application/command, use strace with the command, e.g. type ‘strace ls‘ instead of just ‘ls‘.

Notes

On distros like Ubuntu, you may find the log under /var/log/syslog instead of /var/log/messages

The Semester Project – Part VI: File System on Block Device

1 Reply

This twenty-third article, which is part of the series on Linux device drivers, enhances the previously written bare bone file system module, to connect with a real hardware partition.

<< Twenty-second Article

Since the last bare bone file system, the first thing which Pugs figured out was how to read from the underlying block device. Following is a typical way of doing it:

struct buffer_head *bh;

bh = sb_bread(sb, block); /* sb is the struct super_block pointer */
// bh->b_data contains the data
// Once done, bh should be released using:
brelse(bh);

To do the above and various other real SFS (Simula File System) operations, Pugs’ felt a need to have his own handle to be a key parameter, which he added as follows (in previous real_sfs_ds.h):

typedef struct sfs_info
{
	struct super_block *vfs_sb; /* Super block structure from VFS for this fs */
	sfs_super_block_t sb; /* Our fs super block */
	byte1_t *used_blocks; /* Used blocks tracker */
} sfs_info_t;

The main idea behind this was to put all required static global variables in a single structure, and point that by the private data pointer of the file system, which is s_fs_info pointer in the struct super_block structure. With that, the key changes in the fill_super_block() (in previous real_sfs_bb.c file) becomes:

Allocate the structure for the handle, using kzalloc()
Initialize the structure for the handle (through init_browsing())
Read the physical super block, verify and translate info from it into the VFS super block (through init_browsing())
Point it by s_fs_info (through init_browsing())
Update the VFS super block, accordingly

Accordingly, the error handling code would need to do the shut_browsing(info) and kfree(info). And, that would additionally also need to go along with the function corresponding to the kill_sb function pointer, defined in the previous real_sfs_bb.c by kill_block_super, called during umount.

Here’s the various code pieces:

#include <linux/slab.h> /* For kzalloc, ... */
...
static int sfs_fill_super(struct super_block *sb, void *data, int silent)
{
	sfs_info_t *info;

	if (!(info = (sfs_info_t *)(kzalloc(sizeof(sfs_info_t), GFP_KERNEL))))
		return -ENOMEM;
	info->vfs_sb = sb;
	if (init_browsing(info) < 0)
	{
		kfree(info);
		return -EIO;
	}
	/* Updating the VFS super_block */
	sb->s_magic = info->sb.type;
	sb->s_blocksize = info->sb.block_size;
	sb->s_blocksize_bits = get_bit_pos(info->sb.block_size);

	...
}

static void sfs_kill_sb(struct super_block *sb)
{
	sfs_info_t *info = (sfs_info_t *)(sb->s_fs_info);

	kill_block_super(sb);
	if (info)
	{
		shut_browsing(info);
		kfree(info);
	}
}

kzalloc() in contrast to kmalloc(), also zeroes out the allocated location. get_bit_pos() is a simple Pugs’ way to compute logarithm base 2, as follows:

static int get_bit_pos(unsigned int val)
{
	int i;

	for (i = 0; val; i++)
	{
		val >>= 1;
	}
	return (i - 1);
}

And init_browsing(), shut_browsing() are basically the transformations of the earlier user-space functions of browse_real_sfs.c into kernel-space code real_sfs_ops.c, prototyped in real_sfs_ops.h. This basically involves the following transformations:

“int sfs_handle” into “sfs_info_t *info“
lseek() & read() into the read from the block device using sb_bread()
calloc() into vmalloc() & then appropriate initialization by zeros.
free() into vfree()

Here’s the transformed init_browsing() and shut_browsing() in real_sfs_ops.c:

#include <linux/fs.h> /* For struct super_block */
#include <linux/errno.h> /* For error codes */
#include <linux/vmalloc.h> /* For vmalloc, ... */

#include "real_sfs_ds.h"
#include "real_sfs_ops.h"

int init_browsing(sfs_info_t *info)
{
	byte1_t *used_blocks;
	int i, j;
	sfs_file_entry_t fe;
	int retval;

	if ((retval = read_sb_from_real_sfs(info, &info->sb)) < 0)
	{
		return retval;
	}
	if (info->sb.type != SIMULA_FS_TYPE)
	{
		printk(KERN_ERR "Invalid SFS detected. Giving up.\n");
		return -EINVAL;
	}

	/* Mark used blocks */
	used_blocks = (byte1_t *)(vmalloc(info->sb.partition_size));
	if (!used_blocks)
	{
		return -ENOMEM;
	}
	for (i = 0; i < info->sb.data_block_start; i++)
	{
		used_blocks[i] = 1;
	}
	for (; i < info->sb.partition_size; i++)
	{
		used_blocks[i] = 0;
	}

	for (i = 0; i < info->sb.entry_count; i++)
	{
		if ((retval = read_from_real_sfs(info,
					info->sb.entry_table_block_start,
					i * sizeof(sfs_file_entry_t),
					&fe, sizeof(sfs_file_entry_t))) < 0)
		{
			vfree(used_blocks);
			return retval;
		}
		if (!fe.name[0]) continue;
		for (j = 0; j < SIMULA_FS_DATA_BLOCK_CNT; j++)
		{
			if (fe.blocks[j] == 0) break;
			used_blocks[fe.blocks[j]] = 1;
		}
	}

	info->used_blocks = used_blocks;
	info->vfs_sb->s_fs_info = info;
	return 0;
}
void shut_browsing(sfs_info_t *info)
{
	if (info->used_blocks)
		vfree(info->used_blocks);
}

Similarly, all other functions in browse_real_sfs.c would also have to be transformed, one by one. Also, note the read from the underlying block device is being captured by the two functions, namely read_sb_from_real_sfs() and read_from_real_sfs(), which are coded as follows:

#include <linux/buffer_head.h> /* struct buffer_head, sb_bread, ... */
#include <linux/string.h> /* For memcpy */

#include "real_sfs_ds.h"

static int read_sb_from_real_sfs(sfs_info_t *info, sfs_super_block_t *sb)
{
	struct buffer_head *bh;

	if (!(bh = sb_bread(info->vfs_sb, 0 /* Super block is the 0th block */)))
	{
		return -EIO;
	}
	memcpy(sb, bh->b_data, SIMULA_FS_BLOCK_SIZE);
	brelse(bh);
	return 0;
}
static int read_from_real_sfs(sfs_info_t *info, byte4_t block,
				byte4_t offset, void *buf, byte4_t len)
{
	byte4_t block_size = info->sb.block_size;
	byte4_t bd_block_size = info->vfs_sb->s_bdev->bd_block_size;
	byte4_t abs;
	struct buffer_head *bh;

	/*
	 * Translating the real SFS block numbering to underlying block device
	 * block numbering, for sb_bread()
	 */
	abs = block * block_size + offset;
	block = abs / bd_block_size;
	offset = abs % bd_block_size;
	if (offset + len > bd_block_size) // Should never happen
	{
		return -EINVAL;
	}
	if (!(bh = sb_bread(info->vfs_sb, block)))
	{
		return -EIO;
	}
	memcpy(buf, bh->b_data + offset, len);
	brelse(bh);
	return 0;
}

All the above code pieces put in together as the real_sfs_minimal.c (based on the file real_sfs_bb.c created earlier), real_sfs_ops.c, real_sfs_ds.h (based on the same file created earlier), real_sfs_ops.h, and a supporting Makefile, along with the previously created format_real_sfs.c application are available from rsfs_on_block_device_code.tbz2.

Real SFS on block device

Once compiled using make, getting the real_sfs_first.ko driver, Pugs didn’t expect it to be way different from the previous real_sfs_bb.ko driver, but at least now it should be reading and verifying the underlying partition. And for that he first tried mounting the usual partition of a pen drive to get an “Invalid SFS detected” message in dmesg output; and then after formatting it. Note the same error of “Not a directory”, etc as in previous article, still existing – as anyways it is still very similar to the previous bare bone driver – the core functionalities yet to be implemented – just that it is now on a real block device partition. Figure 39 shows the exact commands for all these steps.

Figure 39: Connecting the SFS module with the pen drive partition

Note: “./format_real_sfs” and “mount” commands may take lot of time (may be in minutes), depending on the partition size. So, preferably use a partition, say less than 1MB.

With this important step of getting the file system module interacting with the underlying block device, the last step for Pugs would be to do the other transformations from browse_real_sfs.c and accordingly use them in the SFS module.

Twenty-fourth Article >>

The Semester Project – Part V: File System Module Template

2 Replies

This twenty-second article, which is part of the series on Linux device drivers, lays out a bare bone file system module.

<< Twenty-first Article

With the formatting of the pen drive, the file system is all set in the hardware space. Now, it is the turn to decode that using a corresponding file system module in the kernel space, and accordingly provide the user space file system interface, for it to be browsed like any other file systems.

The 5 sets of System Calls

Unlike character or block drivers, the file system drivers involve not just one structure of function pointers, but instead 5 structures of function pointers, for the various interfaces, provided by a file system. These are:

struct file_system_type – contains functions to operate on the super block
struct super_operations – contains functions to operate on the inodes
struct inode_operations – contains functions to operate on the directory entries
struct file_operations – contains functions to operate on the file data (through page cache)
struct address_space_operations – contains page cache operations for the file data

With these, there were many new terms for Pugs. He referred the following glossary to understand the various terms used above and later in the file system module development:

Page cache or Buffer cache: Pool of RAM buffers, each of page size (typically 4096 bytes). These buffers are used as the cache for the file data read from the underlying hardware, thus increasing the performance of file operations
Inode: Structure containing the meta data / information of a file, like permissions, owner, etc. Though file name is a meta data of a file, for better space utilization, in typical Linux file systems, it is not kept in inode, instead in something called directory entries. Collection of inodes, is called an inode table
Directory entry: Structure containing the name and inode number of a file or directory. In typical Linux based file systems, a collection of directory entries for the immediate files and directories of say directory D, is stored in the data blocks of the directory D
Super block: Structure containing the information about the various data structures of the file systems, like the inode tables, … Basically the meta meta data, i.e. meta data for the meta data
Virtual File System (VFS): Conceptual file system layer interfacing the kernel space to user space in an abstract manner, showing “everything” as a file, and translating their operations from user to the appropriate entity in the kernel space

Each one of the above five structures contains a list of function pointers, which needs to be populated depending on what all features are there or to be supported in the file system (module). For example, struct file_system_type may contain system calls for mounting and unmounting a file system, basically operating on its super block; struct super_operations may contain inode read/write system calls; struct inode_operations may contain function to lookup directory entries; struct file_operations may generically operate on the page cached file data, which may in turn invoke page cache operations, defined in the struct address_space_operations. For these various operations, most of these functions will then interface with the corresponding underlying block device driver to ultimately operate with the formatted file system in the hardware space.

To start with Pugs laid out the complete framework of his real SFS module, but with minimal functionality, good enough to compile, load, and not crash the kernel. He populated only the first of these five structures – the struct file_system_type; and left all the others empty. Here’s the exact code of the structure definitions:

#include <linux/fs.h> /* For system calls, structures, ... */

static struct file_system_type sfs;
static struct super_operations sfs_sops;
static struct inode_operations sfs_iops;
static struct file_operations sfs_fops;
static struct address_space_operations sfs_aops;

#include <linux/version.h> /* For LINUX_VERSION_CODE & KERNEL_VERSION */

static struct file_system_type sfs =
{
	name: "sfs", /* Name of our file system */
#if (LINUX_VERSION_CODE < KERNEL_VERSION(2,6,38))
	get_sb:  sfs_get_sb,
#else
	mount:  sfs_mount,
#endif
	kill_sb: kill_block_super,
	owner: THIS_MODULE
};

Note that before Linux kernel version 2.6.38, the mount function pointer was referred as get_sb, and also, it used to have slightly different parameters. And hence, the above #if for it to be compatible at least across 2.6.3x and possibly with 3.x kernel versions – no guarantee for others. Accordingly, the corresponding functions sfs_get_sb() and sfs_mount(), are also #if’d, as follows:

#include <linux/kernel.h> /* For printk, ... */

if (LINUX_VERSION_CODE < KERNEL_VERSION(2,6,38))
static int sfs_get_sb(struct file_system_type *fs_type, int flags,
			const char *devname, void *data, struct vfsmount *vm)
{
	printk(KERN_INFO "sfs: devname = %s\n", devname);

	/* sfs_fill_super this will be called to fill the super block */
	return get_sb_bdev(fs_type, flags, devname, data, &sfs_fill_super, vm);
}
#else
static struct dentry *sfs_mount(struct file_system_type *fs_type,
					int flags, const char *devname, void *data)
{
	printk(KERN_INFO "sfs: devname = %s\n", devname);

	/* sfs_fill_super this will be called to fill the super block */
	return mount_bdev(fs_type, flags, devname, data, &sfs_fill_super);
}
#endif

The only difference in the above 2 functions is that in the later, the VFS mount point related structure has been removed. The printk() in there would display the underlying partition’s device file which the user is going to mount, basically the pen drive’s SFS formatted partition. get_sb_bdev() and mount_bdev() are generic block device mount functions for the respective kernel versions, defined in fs/super.c and prototyped in <linux/fs.h>. Pugs also used them, as most other file system writers do. Are you wondering: Does all file system mount a block device, the same way? Most of it yes, except the part where the mount operation needs to fill in the VFS’ super block structure (struct super_block), as per the super block of the underlying file system – obviously that most probably would be different. But then how does it do that? Observe carefully, in the above functions, apart from passing all the parameters as is, there is an additional parameter sfs_fill_super, and that is Pugs’ custom function to fill the VFS’ super block, as per the SFS file system.

Unlike the mount function pointer, the unmount function pointer has been same (kill_sb) for quite some kernel versions; and in unmounting, there is not even the minimal distinction required across different file systems. So, the generic block device unmount function kill_block_super() has been used directly as the function pointer.

In sfs_fill_super(), Pugs is ideally supposed to read the super block from the underlying hardware-space SFS, and then accordingly translate and fill that into VFS’ super block to enable VFS to provide the user space file system interface. But he is yet to figure that out, as how to read from the underlying block device, in the kernel space. Information of which block device to use, is already embedded into the super_block structure itself, obtained from the user issuing the mount command. But as Pugs decided to get the bare bone real SFS up, first, he went ahead writing this sfs_super_fill() function also as a hard-coded fill function. And with that itself, he registered the Simula file system with the VFS. As any other Linux driver, here’s the file system driver’s constructor and destructor for that:

#include <linux/module.h> /* For module related macros, ... */

static int __init sfs_init(void)
{
	int err;

	err = register_filesystem(&sfs);
	return err;
}

static void __exit sfs_exit(void)
{
	unregister_filesystem(&sfs);
}

module_init(sfs_init);
module_exit(sfs_exit);

Both register_filesystem() and unregister_filesystem() takes pointer to the the struct file_system_type sfs (filled above), as their parameter, to respectively register and unregister the file system described by it.

Hard-coded SFS super block and root inode

And yes, here’s the hard-coded sfs_fill_super() function:

#include "real_sfs_ds.h" /* For SFS related defines, data structures, ... */

static int sfs_fill_super(struct super_block *sb, void *data, int silent)
{
	printk(KERN_INFO "sfs: sfs_fill_super\n");

	sb->s_blocksize = SIMULA_FS_BLOCK_SIZE;
	sb->s_blocksize_bits = SIMULA_FS_BLOCK_SIZE_BITS;
	sb->s_magic = SIMULA_FS_TYPE;
	sb->s_type = &sfs; // file_system_type
	sb->s_op = &sfs_sops; // super block operations

	sfs_root_inode = iget_locked(sb, 1); // obtain an inode from VFS
	if (!sfs_root_inode)
	{
		return -EACCES;
	}
	if (sfs_root_inode->i_state & I_NEW) // allocated fresh now
	{
		printk(KERN_INFO "sfs: Got new root inode, let's fill in\n");
		sfs_root_inode->i_op = &sfs_iops; // inode operations
		sfs_root_inode->i_mode = S_IFDIR | S_IRWXU |
			S_IRGRP | S_IXGRP | S_IROTH | S_IXOTH;
		sfs_root_inode->i_fop = &sfs_fops; // file operations
		sfs_root_inode->i_mapping->a_ops = &sfs_aops; // address operations
		unlock_new_inode(sfs_root_inode);
	}
	else
	{
		printk(KERN_INFO "sfs: Got root inode from inode cache\n");
	}

#if (LINUX_VERSION_CODE < KERNEL_VERSION(3,4,0))
	sb->s_root = d_alloc_root(sfs_root_inode);
#else
	sb->s_root = d_make_root(sfs_root_inode);
#endif
	if (!sb->s_root)
	{
		iget_failed(sfs_root_inode);
		return -ENOMEM;
	}

	return 0;
}

As mentioned earlier, this function is basically supposed to read the underlying SFS super block, and accordingly translate and fill the struct super_block, pointed to by its first parameter sb. So, understanding it is same as understanding the minimal fields of the struct super_block, which are getting filled up. The first three are the block size, its logarithm base 2, and the type/magic code of the Simula file system. As Pugs codes further, we shall see that once he gets the super block from the hardware space, he would instead get these values from that super block, and more importantly verify them, to ensure that the correct partition is being mounted.

After that, the various structure pointers are pointed to their corresponding structure of the function pointers. Last but not least, the root inode’s pointer s_root is pointed to the struct inode structure, obtained from VFS’ inode cache, based on the inode number of root – right now, which has been hard coded to 1 – it may possibly change. If the inode structure is obtained fresh, i.e. for the first time, it is then filled as per the underlying SFS’ root inode’s content. Also, the mode field is being hard-coded to “drwxr-xr-x“. Apart from that, the usual structure pointers are being initialized by the corresponding structure addresses. And finally, the root’s inode is being attached to the super block using d_alloc_root() or d_make_root(), as per the kernel version.

All the above code pieces put in together as the bare bone real_sfs_bb.c, along with the real_sfs_ds.h (based on the same file created earlier), and a supporting Makefile are available from rsfsbb_code.tbz2.

Bare bone SFS module in action

Once compiled using make, getting the real_sfs_bb.ko driver, Pugs did his usual unusual experiments, shown as in Figure 38.

Figure 38: Bare-bone real SFS experiments

Pugs’ experiments (Explanation of Figure 38):

Checked the kernel window /proc/filesystems for the kernel supported file systems
Loaded the real_sfs_bb.ko driver
Re-checked the kernel window /proc/filesystems for the kernel supported file systems. Now, it shows sfs listed at the end
Did a mount of his pen drive partition /dev/sdb1 onto /mnt using the sfs file system. Checked the dmesg logs on the adjacent window. (Keep in mind, that right now, the sfs_fill_super() is not really reading the partition, and hence not doing any checks. So, it really doesn’t matter as to how the /dev/sdb1 is formatted.) But yes, the mount output shows that it is mounted using the sfs file system

Oops!!! But df output shows “Function not implemented”, cd gives “Not a directory”. Aha!! Pugs haven’t implemented any other functions in any of the other four function pointer structures, yet. So, that’s expected.

Note: The above experiments are using “sudo”. Instead one may get into root shell and do the same without a “sudo”.

Okay, so no kernel crashes, and a bare bone file system in action – Yippee. Ya! Ya! Pugs knows that df, cd, … are not yet functional. For that, he needs to start adding the various system calls in the other (four) function pointer structures to be able to do cool-cool browsing, the same way as is done with all other file systems, using the various shell commands. And yes, Pugs is already onto his task – after all he needs to have a geeky demo for his final semester project.

Twenty-third Article >>

The Semester Project – Part IV: Formatting a Pen Drive

2 Replies

This twenty-first article, which is part of the series on Linux device drivers, takes the next step towards writing a file system module by writing a formatting application for your real pen drive.

<< Twentieth Article

Thanks friends for your confidence in Shweta and not trying to help her out in figuring out the issues with her code. She indeed figured out and fixed the following issues in her code:

sfs_read() and sfs_write() need to check for the read and write file permissions before proceeding to read and write, respectively
sfs_write() should free any previously allocated blocks, as write is always over-write
Moreover, the earlier written sfs_remove() also, now needs to free up the allocated blocks

SFS Format for a real partition

There after Pugs took the lead and slightly modified Shweta’s format_sfs.c and sfs_ds.h files to format a real pen drive’s partition. The key change is that instead of creating the default regular file .sfsf to format, now it would be operating on an existing block device file corresponding to an underlying partition, say something like /dev/sdb1. So,

It would get the partition’s size from the partition’s block device file itself, rather than taking it as a command-line argument
Accordingly, it would now expect the partition’s block device file name instead of size, as main()‘s first argument.
Also, it would not need the mark_data_block() to grow the file equal to the partition size

The ioctl command BLKGETSIZE64 gets the 64-bit size of the underlying block device partition, in bytes. Then, it is divided by the block size (SIMULA_FS_BLOCK_SIZE) to get the partition’s size in block units. Here is the corresponding modified snippet of the main() function in format_real_sfs.c (updated format_sfs.c), along with the required typedef in real_sfs_ds.h (updated sfs_ds.h). (Note: For readability, Pugs renamed all uint* by byte*):

typedef unsigned long long byte8_t;
...
byte8_t size;

sfs_handle = open(argv[1], O_RDWR);
if (sfs_handle == -1)
{
	fprintf(stderr, "Error formatting %s: %s\n", argv[1], strerror(errno));
	return 2;
}
if (ioctl(sfs_handle, BLKGETSIZE64, &size) == -1)
{
	fprintf(stderr, "Error getting size of %s: %s\n", argv[1], strerror(errno));
	return 3;
}
sb.partition_size = size / SIMULA_FS_BLOCK_SIZE;

As per the above code, the following additional header files need to be included:

#include <errno.h> /* For errno */
#include <string.h> /* For strerror() */
#include <sys/ioctl.h> /* For ioctl() */
#include <linux/fs.h> /* For BLKGETSIZE64 */

With all the above changes compiled into format_real_sfs, Pugs plugged in his pen drive, partition of which got auto mounted. Then, he took backup of its content and unmounted the same – ready for a real SFS format of the pen drive partition.

Caution: Take a backup of your pen drive’s content – you are formatting it for real. Be careful in choosing the right partition of your pen drive. Otherwise, you may forever lose data from your hard disk or even make your system un-bootable. You have been warned.

Figure 36: Formatting the pen drive

Figure 36 demonstrates all the above but backup steps at root prompt #. Instead, one may use sudo, as well. Note that Pugs got his pen drive partition mounted at /media/10AC-BF1C, and the corresponding device file is /dev/sdb1 (/dev/sdb being the complete pen drive). You may have both these differently. Accordingly, follow the steps for yourself. Also, note that, the real SFS formatting is then started using the following command:

# ./format_real_sfs /dev/sdb1

And then, there is a ^C (Ctrl-C) immediately after that, basically terminating the formatting. Aha! Did Pugs realize something important was there on the pen drive? Not really, as his pen drive is already empty. Actually, what happened is that formatting was going on for quite some time – so Pugs had some doubt about his code changes and so he terminated it. Reviewing his code didn’t yield much, so he reissued the formatting, this time with the time command, to figure out exactly how much time is the formatting taking and then may be debug/fix that. And finally! The formatting is complete but after a whopping 430.88 seconds (7+ minutes), yes minutes. time basically shows the real time taken (includes the time when other processes has been running after context switch), time executed in user space, time executed in kernel space. That’s huge – something needs optimization. And it didn’t take much time for a close review to undermine the issue. The key time taker code would be the clear_file_entries() function. Right now its clearing the file entries one by one, i.e. writing 64-byte sized file entries one by one – that’s pretty non-optimal. A better approach would be to fill up a block with such entries, and then write these blocks one by one. In case of a 512-byte block (i.e. SIMULA_FS_BLOCK_SIZE defined as 512), that would mean 8 file entries in a 512-byte block and then writing these 512-byte blocks one by one. So, Pugs changed the clear_file_entries() function to do the same, and viola! formatting is complete in a little less than 26 seconds. Here’s the answer to your curiosity – the re-written clear_file_entries() function:

void clear_file_entries(int sfs_handle, sfs_super_block_t *sb)
{
	int i;
	byte1_t block[SIMULA_FS_BLOCK_SIZE];

	for (i = 0; i < sb->block_size / sb->entry_size; i++)
	{
		memcpy(block + i * sb->entry_size, &fe, sizeof(fe));
	}
	for (i = 0; i < sb->entry_table_size; i++)
	{
		write(sfs_handle, block, sizeof(block));
	}
}

Now, you may plug out and plug in the pen drive back. And you may wonder that neither it is auto-mounted, nor you are able to mount it. That’s expected, as now it is formatted with a file system which is not yet coded as (kernel) module and there is no one in the kernel to decode the same. So, coding that kernel module would be the ultimate step to get everything working like with any other existing file systems (vfat, ext3, …). If you are worried that your pen drive is spoiled, you may re-format it with the FAT32 (vfat) file system as follows (as root or with sudo):

# mkfs.vfat /dev/sdb1 # Be careful with the correct partition device file

and then plug out & plug in the pen drive to get auto-mounted. But, you know Pugs being a cool carefree guy, instead went ahead to try out browsing the Simula file system created on the pen drive partition.

Browsing the pen drive partition

Obviously, there were slight modifications to the browse_sfs.c application as well, in line with the changes to format_sfs.c. Major one being compulsorily taking the partition’s device file to browse as the command-line argument, instead of browsing the default regular file .sfsf.

All the updated files (real_sfs_ds.h, format_real_sfs.c, browse_real_sfs.c and Makefile) are available from rsfs_code.tar.bz2.

Figure 37: SFS browser on pen drive

Figure 37 shows the browser in action. However, the coolest browsing would be the same way as is done with all other file systems, using the shell commands cd, ls, … Yes, and for that we would need the real SFS module in place. Keep following what’s Pugs upto for getting that in place.

Twenty-second Article >>

File Systems: The Semester Project – Part III

2 Replies

This twentieth article, which is part of the series on Linux device drivers, completes the basic simulation of a file system in user space.

<< Nineteenth Article

Till now, Shweta had implemented 4 basic functionalities of the simulated file system (sfs) browser, namely quit (browser), list (files), create (an empty file), remove (a file). Here she adds 3 little advanced functionalities to get a feel of a complete basic file system:

Changing permissions of a file
Reading from a file
Writing into a file

Here’s a sneak peek into her thinking process as how she came up with her implementations.

For the various command implementations, two very common requirements keep popping quite often:

Getting the index of the entry for a particular filename
Updating the entry for a given filename

Hence these two requirements, captured as the functions sfs_lookup() and sfs_update():

int sfs_lookup(int sfs_handle, char *fn, sfs_file_entry_t *fe)
{
	int i;

	lseek(sfs_handle, sb.entry_table_block_start * sb.block_size, SEEK_SET);
	for (i = 0; i < sb.entry_count; i++)
	{   
		read(sfs_handle, fe, sizeof(sfs_file_entry_t));
		if (!fe->name[0]) continue;
		if (strcmp(fe->name, fn) == 0) return i;
	}   

	return -1; 
}

void sfs_update(int sfs_handle, char *fn, int *size, int update_ts, int *perms)
{
	int i;
	sfs_file_entry_t fe;

	if ((i = sfs_lookup(sfs_handle, fn, &fe)) == -1)
	{
		printf("File %s doesn't exist\n", fn);
		return;
	}
	if (size) fe.size = *size;
	if (update_ts) fe.timestamp = time(NULL);
	if (perms && (*perms <= 07)) fe.perms = *perms;
	lseek(sfs_handle, sb.entry_table_block_start * sb.block_size +
						i * sb.entry_size, SEEK_SET);
	write(sfs_handle, &fe, sizeof(sfs_file_entry_t));
}

sfs_lookup() traverses through all the entries (skipping the invalid i.e. empty filename entries), till it finds the filename match and then returns the index and the entry of the matched entry in the function’s third parameter. It returns -1 in case of no match is found.

sfs_update() uses sfs_lookup() to get the entry and its index for the given filename. Then, it updates it back into the filesystem with: a) size, if passed (i.e. non-NULL), b) current timestamp if update_ts is set, c) permissions, if passed (i.e. non-NULL).

Changing file permissions

In the Simula file system, file permissions are basically a combination ‘rwx‘ for the user only, stored as an integer, with Linux like notation of 4 for ‘r‘, 2 for ‘w‘, 1 for ‘x‘. For changing the permissions of a given filename, sfs_update() can be used best by passing NULL pointer for size, zero for update_ts, and the pointer to permissions to change for perms. So sfs_chperm() would be as follows:

void sfs_chperm(int sfs_handle, char *fn, int perm)
{
	sfs_update(sfs_handle, fn, NULL, 0, &perm);
}

Reading a file

Reading a file is basically sequentially reading & displaying the contents of the data blocks indicated by their position from the blocks array of file’s entry and displaying that on stdout’s file descriptor 1. A couple of things to be taken care of:

File is assumed to be without holes, i.e. block position of 0 in the blocks array indicates no more data block’s for the file
Reading should not go beyond the file size. Special care to be taken while reading the last block with data, as it may be partially valid

Here’s the complete read function, keeping track of valid data using bytes left to read:

uint1_t block[SIMULA_FS_BLOCK_SIZE]; // Shared as global with sfs_write

void sfs_read(int sfs_handle, char *fn)
{
	int i, block_i, already_read, rem_to_read, to_read;
	sfs_file_entry_t fe;

	if ((i = sfs_lookup(sfs_handle, fn, &fe)) == -1)
	{
		printf("File %s doesn't exist\n", fn);
		return;
	}
	already_read = 0;
	rem_to_read = fe.size;
	for (block_i = 0; block_i < SIMULA_FS_DATA_BLOCK_CNT; block_i++)
	{
		if (!fe.blocks[block_i]) break;
		to_read = (rem_to_read >= sb.block_size) ?
						sb.block_size : rem_to_read;
		lseek(sfs_handle, fe.blocks[block_i] * sb.block_size, SEEK_SET);
		read(sfs_handle, block, to_read);
		write(1, block, to_read);
		already_read += to_read;
		rem_to_read -= to_read;
		if (!rem_to_read) break;
	}
}

Writing a file

Interestingly, write is not a trivial function. Getting data from the user through browser is okay. But based on that, free available blocks has to be obtained, filled and then their position be noted sequentially in the blocks array of the file’s entry. Typically, we do this whenever we have received a block full data, except the last block. That’s tricky – how do we know the last block? So, we read till end of input, marked by Control-D on its own line from the user – and that is indicated by a return of 0 from read. And in that case, we check if any non-full block of data is left to be written, and if yes follow the same procedure of obtaining a free available block, filling it up (with the partial data), and updating its position in the blocks array.

After all this, we have finally written the file data, along with the data block positions in the blocks array of the file’s entry. And now it’s time to update file’s entry with the total size of data written, as well as timestamp to currently modified. Once done, this entry has to be updated back into the entry table, which is the last step. And in this flow, the following shouldn’t be missed out during getting & filling up free blocks:

Check for no more block positions available in blocks array of the file’s entry
Check for no more free blocks available in the file system

In either of the 2 cases, the thought is to do a graceful stop with data being written upto the maximum possible, and discarding the rest.

Once again all of these put together are in the function below:

uint1_t block[SIMULA_FS_BLOCK_SIZE]; // Shared as global with sfs_read

void sfs_write(int sfs_handle, char *fn)
{
	int i, cur_read_i, to_read, cur_read, total_size, block_i, free_i;
	sfs_file_entry_t fe; 

	if ((i = sfs_lookup(sfs_handle, fn, &fe)) == -1) 
	{   
		printf("File %s doesn't exist\n", fn);
		return;
	}   
	cur_read_i = 0;
	to_read = sb.block_size;
	total_size = 0;
	block_i = 0;
	while ((cur_read = read(0, block + cur_read_i, to_read)) > 0)
	{   
		if (cur_read == to_read)
		{   
			/* Write this block */
			if (block_i == SIMULA_FS_DATA_BLOCK_CNT)
				break; /* File size limit */
			if ((free_i = get_data_block(sfs_handle)) == -1) 
				break; /* File system full */
			lseek(sfs_handle, free_i * sb.block_size, SEEK_SET);
			write(sfs_handle, block, sb.block_size);
			fe.blocks[block_i] = free_i;
			block_i++;
			total_size += sb.block_size;
			/* Reset various variables */
			cur_read_i = 0;
			to_read = sb.block_size;
		}
		else
		{
			cur_read_i += cur_read;
			to_read -= cur_read;
		}
	}
	if ((cur_read <= 0) && (cur_read_i))
	{
		/* Write this partial block */
		if ((block_i != SIMULA_FS_DATA_BLOCK_CNT) &&
			((fe.blocks[block_i] = get_data_block(sfs_handle)) != -1))
		{
			lseek(sfs_handle, fe.blocks[block_i] * sb.block_size,
									SEEK_SET);
			write(sfs_handle, block, cur_read_i);
			total_size += cur_read_i;
		}
	}

	fe.size = total_size;
	fe.timestamp = time(NULL);
	lseek(sfs_handle, sb.entry_table_block_start * sb.block_size +
						i * sb.entry_size, SEEK_SET);
	write(sfs_handle, &fe, sizeof(sfs_file_entry_t));
}

The last stride

With the above 3 sfs command functions, the final change to the browse_sfs() function in (previous article’s) browse_sfs.c would be to add the cases for handling these 3 new commands of chperm, write, and read.

One of the daunting questions, if it has not yet bothered you, is how do you find the free available blocks. Notice that in sfs_write(), we just called a function get_data_block() and everything went smooth. But think through how would that be implemented. Do you need to traverse all the file entry’s every time to figure out which all has been used and the remaining are free. That would be killing. Instead, an easier technique would be to get the used ones by parsing all the file entry’s, only initially once, and then keep track of them whenever more entries are used or freed up. But for that a complete framework needs to be put in place, which includes:

uint1_t typedef (in sfs_ds.h)
used_blocks dynamic array to keep track of used blocks (in browse_sfs.c)
Function init_browsing() to initialize the dynamic array used_blocks, i.e. allocate and mark the initial used blocks (in browse_sfs.c)
Correspondingly the inverse function shut_browsing() to cleanup the same (in browse_sfs.c)
And definitely the functions get_data_block() and put_data_block() to respectively get and put back the free data blocks based on the dynamic array used_blocks (in browse_sfs.c)

All these thoughts, incorporated in the earlier sfs_ds.h and browse_sfs.c files, along with a Makefile and the earlier formatter application format_sfs.c, are available from sfs_code.tar.bz2. Once compiled into browse_sfs and executed as ./browse_sfs, it shows up as something like in Figure 35.

Figure 35: Demo of Simula file system browser’s new features

Summing up

As Shweta, showed the above working demo to her project-mate, he observed some miss-outs, and challenged her to find them out on her own. He hinted them to be related to the newly added functionality and ‘getting free block’ framework – some even visible from the demo, i.e. Figure 35. Can you help Shweta, find them out? If yes, post them in the comments below.

Twenty-first Article >>

File Systems: The Semester Project – Part II

2 Replies

This nineteenth article, which is part of the series on Linux device drivers, continues with introducing the concept of a file system, by simulating one in user space.

<< Eighteenth Article

In the previous article, Shweta readied the partition on the .sfsf file by formating it with the format_sfs application. To complete the understanding of a file system, the next step in its simulation is to browse and play around with the file system created. Here is Shweta’s first-cut browser application to achieve the same. Let’s have a closer look. sfs_ds.h is the same header file, already created by Shweta.

#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <time.h>

#include "sfs_ds.h"

sfs_super_block_t sb;

void sfs_list(int sfs_handle)
{
	int i;
	sfs_file_entry_t fe;

	lseek(sfs_handle, sb.entry_table_block_start * sb.block_size, SEEK_SET);
	for (i = 0; i < sb.entry_count; i++)
	{
		read(sfs_handle, &fe, sizeof(sfs_file_entry_t));
		if (!fe.name[0]) continue;
		printf("%-15s  %10d bytes  %c%c%c  %s",
			fe.name, fe.size,
			fe.perms & 04 ? 'r' : '-',
			fe.perms & 02 ? 'w' : '-',
			fe.perms & 01 ? 'x' : '-',
			ctime((time_t *)&fe.timestamp)
			);
	}
}
void sfs_create(int sfs_handle, char *fn)
{
	int i;
	sfs_file_entry_t fe;

	lseek(sfs_handle, sb.entry_table_block_start * sb.block_size, SEEK_SET);
	for (i = 0; i < sb.entry_count; i++)
	{
		read(sfs_handle, &fe, sizeof(sfs_file_entry_t));
		if (!fe.name[0]) break;
		if (strcmp(fe.name, fn) == 0)
		{
			printf("File %s already exists\n", fn);
			return;
		}
	}
	if (i == sb.entry_count)
	{
		printf("No entries left\n", fn);
		return;
	}

	lseek(sfs_handle, -(off_t)(sb.entry_size), SEEK_CUR);

	strncpy(fe.name, fn, 15);
	fe.name[15] = 0;
	fe.size = 0;
	fe.timestamp = time(NULL);
	fe.perms = 07;
	for (i = 0; i < SIMULA_FS_DATA_BLOCK_CNT; i++)
	{
		fe.blocks[i] = 0;
	}

	write(sfs_handle, &fe, sizeof(sfs_file_entry_t));
}
void sfs_remove(int sfs_handle, char *fn)
{
	int i;
	sfs_file_entry_t fe;

	lseek(sfs_handle, sb.entry_table_block_start * sb.block_size, SEEK_SET);
	for (i = 0; i < sb.entry_count; i++)
	{
		read(sfs_handle, &fe, sizeof(sfs_file_entry_t));
		if (!fe.name[0]) continue;
		if (strcmp(fe.name, fn) == 0) break;
	}
	if (i == sb.entry_count)
	{
		printf("File %s doesn't exist\n", fn);
		return;
	}

	lseek(sfs_handle, -(off_t)(sb.entry_size), SEEK_CUR);

	memset(&fe, 0, sizeof(sfs_file_entry_t));
	write(sfs_handle, &fe, sizeof(sfs_file_entry_t));
}

void browse_sfs(int sfs_handle)
{
	int done;
	char cmd[256], *fn;
	int ret;

	done = 0;

	printf("Welcome to SFS Browsing Shell v1.0\n\n");
	printf("Block size	 : %d bytes\n", sb.block_size);
	printf("Partition size : %d blocks\n", sb.partition_size);
	printf("File entry size: %d bytes\n", sb.entry_size);
	printf("Entry tbl size : %d blocks\n", sb.entry_table_size);
	printf("Entry count	: %d\n", sb.entry_count);
	printf("\n");
	while (!done)
	{
		printf(" $> ");
		ret = scanf("%[^\n]", cmd);
		if (ret < 0)
		{
			done = 1;
			printf("\n");
			continue;
		}
		else
		{
			getchar();
			if (ret == 0) continue;
		}
		if (strcmp(cmd, "?") == 0)
		{
			printf("Supported commands:\n");
			printf("\t?\tquit\tlist\tcreate\tremove\n");
			continue;
		}
		else if (strcmp(cmd, "quit") == 0)
		{
			done = 1;
			continue;
		}
		else if (strcmp(cmd, "list") == 0)
		{
			sfs_list(sfs_handle);
			continue;
		}
		else if (strncmp(cmd, "create", 6) == 0)
		{
			if (cmd[6] == ' ')
			{
				fn = cmd + 7;
				while (*fn == ' ') fn++;
				if (*fn != '')
				{
					sfs_create(sfs_handle, fn);
					continue;
				}
			}
		}
		else if (strncmp(cmd, "remove", 6) == 0)
		{
			if (cmd[6] == ' ')
			{
				fn = cmd + 7;
				while (*fn == ' ') fn++;
				if (*fn != '')
				{
					sfs_remove(sfs_handle, fn);
					continue;
				}
			}
		}
		printf("Unknown/Incorrect command: %s\n", cmd);
		printf("Supported commands:\n");
		printf("\t?\tquit\tlist\tcreate <file>\tremove <file>\n");
	}
}

int main(int argc, char *argv[])
{
	char *sfs_file = SIMULA_DEFAULT_FILE;
	int sfs_handle;

	if (argc > 2)
	{
		fprintf(stderr, "Incorrect invocation. Possibilities are:\n");
		fprintf(stderr,
			"\t%s /* Picks up %s as the default partition_file */\n",
			argv[0], SIMULA_DEFAULT_FILE);
		fprintf(stderr, "\t%s [ partition_file ]\n", argv[0]);
		return 1;
	}
	if (argc == 2)
	{
		sfs_file = argv[1];
	}
	sfs_handle = open(sfs_file, O_RDWR);
	if (sfs_handle == -1)
	{
		fprintf(stderr, "Unable to browse SFS over %s\n", sfs_file);
		return 2;
	}
	read(sfs_handle, &sb, sizeof(sfs_super_block_t));
	if (sb.type != SIMULA_FS_TYPE)
	{
		fprintf(stderr, "Invalid SFS detected. Giving up.\n");
		close(sfs_handle);
		return 3;
	}
	browse_sfs(sfs_handle);
	close(sfs_handle);
	return 0;
}

The above (shell like) program primarily reads the super block from the partition file (.sfsf by default, or the file provided from command line), and then gets into browsing the file system based on that information, using the browse_sfs() function. Note the check performed for the valid file system on the partition file using the magic number SIMULA_FS_TYPE.

browse_sfs() prints the file system information and provides four basic file system functionalities using the following commands:

quit – to quit the file system browser
list – to list the current files in the file system (using sfs_list())
create <filename> – to create a new file in the file system (using sfs_create(filename))
remove <filename> – to remove an existing file from the file system (using sfs_remove(filename))

Figure 34 shows the browser in action, using the above commands.

Figure 34: Simula file system browser output

sfs_list() traverses through all the file entries in the partition and prints all the non-null filename entries – with file name, size, permissions, and its creation time stamp. sfs_create() looks up for an available (null filename) entry and then updates it with the given filename, size of 0 bytes, permissions of “rwx”, and the current time stamp. And sfs_remove() looks up for an existing file entry, having the filename to be removed, and then nullifies it. The other parts in the above code are more of basic error handling cases like invalid command in browse_sfs(), existing file name in sfs_create(), non-existing file name in sfs_remove(), etc.

Note that the above application is the first-cut one of a full-fledged application. And so the files created right now are just with the basic fixed parameters. And, there is no way to change their permissions, content, etc, yet. However, it must be clear by now that adding those features is just a matter of adding commands and their corresponding functions, which would be provided for a few more interesting features by Shweta as she further works out the details of the browsing application.

Summing up

Once done with the other features as well, the next set of steps would take the project duo further closer to their final goals. Watch out for who takes charge of the next step of moving the file system logic to the kernel space.

Twentieth Article >>

File Systems: The Semester Project

2 Replies

This eighteenth article, which is part of the series on Linux device drivers, kick starts with introducing the concept of a file system, by simulating one in user space.

<< Seventeenth Article

If you have not yet guessed the semester project topic, then you should have a careful look at the table of contents of the various books on Linux device drivers. You’ll find that no book lists a chapter on file systems. Not because they are not used or not useful. In fact, there are close to 100 different file systems existing, as of today, and possibly most of them in active use. Then, why are they not talked in general? And are so many file systems required? Which one of them is the best? Let’s understand these one by one.

In reality, there is no best file system, and in fact there may not be one to come, as well. This is because a particular file system suits best only for a particular set of requirements. Thus, different requirements ask for different best file systems, leading to so many active file systems and many more to come. So, basically they are not generic enough to be talked generically, rather more of a research topic. And the reason for them being the toughest of all drivers is simple: Least availability of generic written material – the best documentation being the code of various file systems. Hmmm! Sounds perfect for a semester project, right?

In order to get to do such a research project, a lot of smaller steps has to be taken. The first one being understanding the concept of a file system itself. That could be done best by simulating one in user space. Shweta took the ownership of getting this first step done, as follows:

Use a regular file for a partition, that is for the actual data storage of the file system (Hardware space equivalent)
Design a basic file system structure (Kernel space equivalent) and implement the format/mkfs command for it, over the regular file
Provide an interface/shell to type commands and operate on the file system, similar to the usual bash shell. In general, this step is achieved by the shell along with the corresponding file system driver in kernel space. But here that translation is embodied/simulated into the user space interface, itself. (Next article shall discuss this in detail)

The default regular file chosen is .sfsf (simula-ting file system file) in the current directory. Starting . (dot) is for it to be hidden. The basic file system design mainly contains two structures, the super block containing the info about the file system, and the file entry structure containing the info about each file in the file system. Here are Shweta’s defines and structures defined in sfs_ds.h

#ifndef SFS_DS_H
#define SFS_DS_H

#define SIMULA_FS_TYPE 0x13090D15 /* Magic Number for our file system */
#define SIMULA_FS_BLOCK_SIZE 512 /* in bytes */
#define SIMULA_FS_ENTRY_SIZE 64 /* in bytes */
#define SIMULA_FS_DATA_BLOCK_CNT ((SIMULA_FS_ENTRY_SIZE - (16 + 3 * 4)) / 4)

#define SIMULA_DEFAULT_FILE ".sfsf"

typedef unsigned int uint4_t;

typedef struct sfs_super_block
{
	uint4_t type; /* Magic number to identify the file system */
	uint4_t block_size; /* Unit of allocation */
	uint4_t partition_size; /* in blocks */
	uint4_t entry_size; /* in bytes */ 
	uint4_t entry_table_size; /* in blocks */
	uint4_t entry_table_block_start; /* in blocks */
	uint4_t entry_count; /* Total entries in the file system */
	uint4_t data_block_start; /* in blocks */
	uint4_t reserved[SIMULA_FS_BLOCK_SIZE / 4 - 8];
} sfs_super_block_t; /* Making it of SIMULA_FS_BLOCK_SIZE */

typedef struct sfs_file_entry
{
	char name[16];
	uint4_t size; /* in bytes */
	uint4_t timestamp; /* Seconds since Epoch */
	uint4_t perms; /* Permissions for user */
	uint4_t blocks[SIMULA_FS_DATA_BLOCK_CNT];
} sfs_file_entry_t;

#endif

Note that Shweta has put some redundant fields in the sfs_super_block_t rather than computing them every time. And practically that is not space inefficient, as any way lot of empty reserved space is available in the super block, which is expected to be the complete first block of a partition. For example, entry_count is a redundant field as it is same as the entry table’s size (in bytes) divided by the entry’s size, which both are part of the super block structure. Moreover, the data_block_start number could also be computed by how many blocks have been used before it, but again not preferred.

Also, note the hard coded assumptions made about the block size of 512 bytes, and some random magic number for the simula file system. This magic number is used to verify that we are dealing with the right file system, when operating on it.

Coming to the file entry, it contains the following for every file:

Name of up to 15 characters (1 byte for ”)
Size in bytes
Time stamp (creation or modification? Not yet decided by Shweta)
Permissions (just one set, for the user)
Array of block numbers for up to 9 data blocks. Why 9? To make the complete entry of SIMULA_FS_ENTRY_SIZE (64).

With the file system design ready, the make file system (mkfs) or more commonly known format application is implemented next, as format_sfs.c:

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

#include "sfs_ds.h"

#define SFS_ENTRY_RATIO 0.10 /* 10% of all blocks */
#define SFS_ENTRY_TABLE_BLOCK_START 1

sfs_super_block_t sb =
{
	.type = SIMULA_FS_TYPE,
	.block_size = SIMULA_FS_BLOCK_SIZE,
	.entry_size = SIMULA_FS_ENTRY_SIZE,
	.entry_table_block_start = SFS_ENTRY_TABLE_BLOCK_START
};
sfs_file_entry_t fe; /* All 0's */

void write_super_block(int sfs_handle, sfs_super_block_t *sb)
{
	write(sfs_handle, sb, sizeof(sfs_super_block_t));
}
void clear_file_entries(int sfs_handle, sfs_super_block_t *sb)
{
	int i;

	for (i = 0; i < sb->entry_count; i++)
	{
		write(sfs_handle, &fe, sizeof(fe));
	}
}
void mark_data_blocks(int sfs_handle, sfs_super_block_t *sb)
{
	char c = 0;

	lseek(sfs_handle, sb->partition_size * sb->block_size - 1, SEEK_SET);
	write(sfs_handle, &c, 1); /* To make the file size to partition size */
}

int main(int argc, char *argv[])
{
	int sfs_handle;

	if (argc != 2)
	{
		fprintf(stderr, "Usage: %s <partition size in 512-byte blocks>\n",
			argv[0]);
		return 1;
	}
	sb.partition_size = atoi(argv[1]);
	sb.entry_table_size = sb.partition_size * SFS_ENTRY_RATIO;
	sb.entry_count = sb.entry_table_size * sb.block_size / sb.entry_size;
	sb.data_block_start = SFS_ENTRY_TABLE_BLOCK_START + sb.entry_table_size;

	sfs_handle = creat(SIMULA_DEFAULT_FILE,
		S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH);
	if (sfs_handle == -1)
	{
		perror("No permissions to format");
		return 2;
	}
	write_super_block(sfs_handle, &sb);
	clear_file_entries(sfs_handle, &sb);
	mark_data_blocks(sfs_handle, &sb);
	close(sfs_handle);
	return 0;
}

Note the heuristic of 10% of total space to be used for the file entries, defined by SFS_ENTRY_RATIO. Apart from that, this format application takes the partition size as input and accordingly creates the default file .sfsf, with:

Its first block as the super block, using write_super_block()
The next 10% of total blocks as the file entry’s zeroed out table, using clear_file_entries()
And the remaining as the data blocks, using mark_data_blocks(). This basically writes a zero at the end, to actually extend the underlying file .sfsf to the partition size

As any other format/mkfs application, format_sfs.c is compiled using gcc, and executed on the shell as follows:

$ gcc format_sfs.c -o format_sfs
$ ./format_sfs 1024 # Partition size in blocks of 512-bytes
$ ls -al # List the .sfsf created with a size of 512 KiBytes

Figure 33 shows the .sfsf pictorially for the partition size of 1024 blocks of 512 bytes each.

Figure 33: Simula file system on 512 KiB of partition (.sfsf)

Summing up

With the above design of simula file system (sfs_ds.h), along with the implementation for its format command (format_sfs.c), Shweta has thus created the empty file system over the simulated partition .sfsf. Now, as listed earlier, the final step in simulating the user space file system is to create the interface/shell to type commands and operate on the empty file system just created on .sfsf. Let’s watch out for Shweta coding that as well, completing the first small step towards understanding their project.

Nineteenth Article >>

Module Interactions

2 Replies

This seventeenth article, which is part of the series on Linux device drivers, demonstrates various interactions with a Linux module.

<< Sixteenth Article

As Shweta and Pugs are gearing up for their final semester project in Linux drivers, they are closing on some final tidbits of technical romancing. This mainly includes the various communications with a Linux module (dynamically loadable and unload-able driver), namely accessing its variables, calling its functions, and passing parameters to it.

Global variables and functions

One might think as what a big deal in accessing the variables and functions of a module, outside it. Just make them global, declare them extern in a header, include the header, and access. In the general application development paradigm, it is this simple – but in kernel development environment, it is not so. Though, recommendations to make everything static by default, has always been there, there were and are cases where non-static globals may be needed. A simple example could be a driver spanning over multiple files and function(s) from one file needed to be called in the other. Now to avoid any kernel collision even with such cases, every module is embodied in its own name space. And we know that two modules with the same name cannot be loaded at the same time. Thus by default zero collision is achieved. However, this also implies that by default nothing from a module can be made really global throughout the kernel, even if we want to. And exactly for such scenarios, the <linux/module.h> header defines the following macros:

EXPORT_SYMBOL(sym)
EXPORT_SYMBOL_GPL(sym)
EXPORT_SYMBOL_GPL_FUTURE(sym)

Each of these exports the symbol passed as their parameter, with additionally putting them in the default, _gpl and _gpl_future sections, respectively. And hence only one of them has to be used for a particular symbol – though the symbol could be either a variable name or a function name. Here’s the complete code (our_glob_syms.c) to demonstrate the same:

#include <linux/module.h>
#include <linux/device.h>

static struct class *cool_cl;
static struct class *get_cool_cl(void)
{
	return cool_cl;
}
EXPORT_SYMBOL(cool_cl);
EXPORT_SYMBOL_GPL(get_cool_cl);

static int __init glob_sym_init(void)
{
	if (IS_ERR(cool_cl = class_create(THIS_MODULE, "cool")))
	/* Creates /sys/class/cool/ */
	{
		return PTR_ERR(cool_cl);
	}
	return 0;
}

static void __exit glob_sym_exit(void)
{
	/* Removes /sys/class/cool/ */
	class_destroy(cool_cl);
}

module_init(glob_sym_init);
module_exit(glob_sym_exit);

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Anil Kumar Pugalia <email@sarika-pugs.com>");
MODULE_DESCRIPTION("Global Symbols exporting Driver");

Each exported symbol also have a corresponding structure placed into each of the kernel symbol table (__ksymtab), kernel string table (__kstrtab), and kernel CRC table (__kcrctab) sections, marking it to be globally accessible. Figure 30 shows a filtered snippet of the /proc/kallsyms kernel window, before and after loading the module our_glob_syms.ko, which has been compiled using the driver’s usual makefile.

Figure 30: Our global symbols module

The following code shows the supporting header file (our_glob_syms.h), to be included by modules using the exported symbols cool_cl and get_cool_cl:

#ifndef OUR_GLOB_SYMS_H
#define OUR_GLOB_SYMS_H

#ifdef __KERNEL__
#include <linux/device.h>

extern struct class *cool_cl;
extern struct class *get_cool_cl(void);
#endif

#endif

Figure 30 also shows the file Module.symvers, generated by compilation of the module our_glob_syms. This contains the various details of all the exported symbols in its directory. Apart from including the above header file, the modules using the exported symbols, possibly should have this file Module.symvers in their build directory.

Note the <linux/device.h> header in the above examples, is being included for the various class related declarations & definitions, which has been already covered under the character drivers discussions.

Module Parameters

Being aware of passing command line arguments to an application, it is a natural quest to ask if something similar can be done with a module. And the answer is yes. Parameters can be passed to a module along with loading it, say using insmod. Interestingly enough and in contrast with the command line arguments to an application, these can be modified even later as well, through sysfs interactions.

The module parameters are setup using the following macro (defined in <linux/moduleparam.h>, included through <linux/module.h>):

module_param(name, type, perm)

where, name is the parameter name, type is the type of the parameter, and perm is the permissions of the sysfs file corresponding to this parameter. Supported type values are: byte, short, ushort, int, uint, long, ulong, charp (character pointer), bool or invbool (inverted boolean). The following module code (module_param.c) demonstrates a module parameter:

#include <linux/module.h>
#include <linux/kernel.h>

static int cfg_value = 3;
module_param(cfg_value, int, 0764);

static int __init mod_par_init(void)
{
	printk(KERN_INFO "Loaded with %d\n", cfg_value);
	return 0;
}

static void __exit mod_par_exit(void)
{
	printk(KERN_INFO "Unloaded cfg value: %d\n", cfg_value);
}

module_init(mod_par_init);
module_exit(mod_par_exit);

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Anil Kumar Pugalia <email@sarika-pugs.com>");
MODULE_DESCRIPTION("Module Parameter demonstration Driver");

Note that before the parameter setup, a variable of the same name and compatible type needs to be defined.

Subsequently, the following steps and experiments are shown in Figures 31 and 32:

Building the driver (module_param.ko file) using the driver’s usual makefile
Loading the driver using insmod (with and without parameters)
Various experiments through the corresponding /sys entries
And finally, unloading the driver using rmmod

Figure 31: Experiments with module parameter

Figure 32: Experiments with module parameter (as root)

Observe the following:

Initial value (3) of cfg_value becomes its default value when insmod is done without any parameters
Permission 0764 gives rwx to the user, rw- to the group, and r– for the others on the file cfg_value under parameters of module_param under /sys/module/

Check for yourself:

Output of dmesg | tail on every insmod and rmmod for the output of printk‘s
Try writing into the /sys/module/module_param/parameters/cfg_value file as a normal user

Summing up

With this, the duo have a fairly good understanding of Linux drivers and they are all set to start working on their final semester project. Any guesses on what their topic is all about? Hint: They have picked up one of the most daunting Linux driver topic. Let us see, how they fair at it.

Eighteenth Article >>

Kernel Window: Peeping through /proc

7 Replies

This sixteenth article, which is part of the series on Linux device drivers, demonstrates the creation and usage of files under the /proc virtual file-system.

<< Fifteenth Article

Useful kernel windows

After many months, Shweta and Pugs got together for a peaceful technical romancing. All through, we have been using all kinds of kernel windows especially through the /proc virtual file-system (using cat), to help us out in decoding the various nitty-gritties of Linux device drivers. Here’s a non-exhaustive summary listing:

/proc/modules – Listing of all the dynamically loaded modules
/proc/devices – Listing of all the registered character and block major numbers
/proc/iomem – Listing of on-system physical RAM & bus device addresses
/proc/ioports – Listing of on-system I/O port addresses (specially for x86 systems)
/proc/interrupts – Listing of all the registered interrupt request numbers
/proc/softirqs – Listing of all the registered soft irqs
/proc/kallsyms – Listing of all the running kernel symbols, including from loaded modules
/proc/partitions – Listing of currently connected block devices & their partitions
/proc/filesystems – Listing of currently active file-system drivers
/proc/swaps – Listing of currently active swaps
/proc/cpuinfo – Information about the CPU(s) on the system
/proc/meminfo – Information about the memory on the system, viz. RAM, swap, …

Custom kernel windows

“Yes, these have been really helpful in understanding and debugging the Linux device drivers. But is it possible for us to also provide some help? Yes, I mean can we create one such kernel window through /proc?”, questioned Shweta.

“Why one? As many as you want. And that’s simple – just use the right set of APIs and there you go.”

“For you every thing is simple.”

“No yaar, this is seriously simple”, smiled Pugs. “Just watch out, me creating one for you.”

And in jiffies, Pugs created the proc_window.c file below (including the changes, which has taken place in kernel v3.10 and v4.3):

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/version.h>
#if (LINUX_VERSION_CODE < KERNEL_VERSION(3,10,0))
#else
#include <linux/fs.h>
#include <linux/seq_file.h>
#endif
#include <linux/proc_fs.h>
#include <linux/jiffies.h>
#include <linux/uaccess.h>

#if (LINUX_VERSION_CODE < KERNEL_VERSION(3,10,0))
#define STR_PRINTF_RET(len, str, args...) len += sprintf(page + len, str, ## args)
#elif (LINUX_VERSION_CODE < KERNEL_VERSION(4,3,0))
#define STR_PRINTF_RET(len, str, args...) len += seq_printf(m, str, ## args)
#else
#define STR_PRINTF_RET(len, str, args...) seq_printf(m, str, ## args)
#endif

static struct proc_dir_entry *parent, *file, *link;
static int state = 0;

#if (LINUX_VERSION_CODE < KERNEL_VERSION(3,10,0))
static int time_read(char *page, char **start, off_t off, int count, int *eof,
	void *data)
#else
static int time_read(struct seq_file *m, void *v)
#endif
{
	int len = 0, val;
	unsigned long act_jiffies;

	STR_PRINTF_RET(len, "state = %d\n", state);
	act_jiffies = jiffies - INITIAL_JIFFIES;
	val = jiffies_to_msecs(act_jiffies);
	switch (state)
	{
		case 0:
			STR_PRINTF_RET(len, "time = %ld jiffies\n", act_jiffies);
			break;
		case 1:
			STR_PRINTF_RET(len, "time = %d msecs\n", val);
			break;
		case 2:
			STR_PRINTF_RET(len, "time = %ds %dms\n",
					val / 1000, val % 1000);
			break;
		case 3:
			val /= 1000;
			STR_PRINTF_RET(len, "time = %02d:%02d:%02d\n",
					val / 3600, (val / 60) % 60, val % 60);
			break;
		default:
			STR_PRINTF_RET(len, "\n");
			break;
	}

	return len;
}
#if (LINUX_VERSION_CODE < KERNEL_VERSION(3,10,0))
static int time_write(struct file *file, const char __user *buffer,
	unsigned long count, void *data)
#else
static ssize_t time_write(struct file *file, const char __user *buffer,
	size_t count, loff_t *off)
#endif
{
	char kbuf[2];

	if (count > 2)
		return count;
	if (copy_from_user(kbuf, buffer, count))
	{
		return -EFAULT;
	}
	if ((count == 2) && (kbuf[1] != '\n'))
		return count;
	if ((kbuf[0] < '0') || ('9' < kbuf[0]))
		return count;
	state = kbuf[0] - '0';
	return count;
}
#if (LINUX_VERSION_CODE < KERNEL_VERSION(3,10,0))
#else
static int time_open(struct inode *inode, struct file *file)
{
	return single_open(file, time_read, NULL);
}

static struct file_operations fops =
{
	.owner = THIS_MODULE,
	.open = time_open,
	.read = seq_read,
	.write = time_write
};
#endif

static int __init proc_win_init(void)
{
	if ((parent = proc_mkdir("anil", NULL)) == NULL)
	{
		return -1;
	}
#if (LINUX_VERSION_CODE < KERNEL_VERSION(3,10,0))
	if ((file = create_proc_entry("rel_time", 0666, parent)) == NULL)
#else
	if ((file = proc_create("rel_time", 0666, parent, &fops)) == NULL)
#endif
	{
		remove_proc_entry("anil", NULL);
		return -1;
	}
#if (LINUX_VERSION_CODE < KERNEL_VERSION(3,10,0))
	file->read_proc = time_read;
	file->write_proc = time_write;
#endif
	if ((link = proc_symlink("rel_time_l", parent, "rel_time")) == NULL)
	{
		remove_proc_entry("rel_time", parent);
		remove_proc_entry("anil", NULL);
		return -1;
	}
#if (LINUX_VERSION_CODE < KERNEL_VERSION(3,10,0))
	link->uid = 0;
	link->gid = 100;
#else
	proc_set_user(link, KUIDT_INIT(0), KGIDT_INIT(100));
#endif
	return 0;
}

static void __exit proc_win_exit(void)
{
	remove_proc_entry("rel_time_l", parent);
	remove_proc_entry("rel_time", parent);
	remove_proc_entry("anil", NULL);
}

module_init(proc_win_init);
module_exit(proc_win_exit);

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Anil Kumar Pugalia <email@sarika-pugs.com>");
MODULE_DESCRIPTION("Kernel window /proc Demonstration Driver");

And then Pugs did the following steps:

Built the driver (proc_window.ko file) using the usual driver’s Makefile.
Loaded the driver using insmod.
Showed various experiments using the newly created proc windows. (Refer to Figure 28.)
And finally, unloaded the driver using rmmod.

Figure 28: Peeping through /proc

Demystifying the details

Starting from the constructor proc_win_init(), three proc entries are created:

Directory anil under /proc (i.e. NULL parent) with default permissions 0755, using proc_mkdir()
Regular file rel_time under the above directory anil with permissions 0666, using create_proc_entry() or proc_create() (since kernel v3.10)
Soft link rel_time_l to the above file rel_time under the same directory anil, using proc_symlink()

The corresponding removal of these three is achieved using remove_proc_entry() in the destructor proc_win_exit(), in chronological reverse order.

For every entry created under /proc, a corresponding struct proc_dir_entry is created. After which many of its fields could be further updated as needed:

mode – Permissions of the file
uid – User ID of the file
gid – Group ID of the file

Since kernel v3.10, specific API proc_set_user() has been provided to update the uid & gid, instead of directly modifying the fields.

Additionally, for a regular file, the following two function pointers for read and write over the file could be provided, respectively:

int (*read_proc)(char *page, char **start, off_t off, int count, int *eof, void *data)
int (*write_proc)(struct file *file, const char __user *buffer, unsigned long count, void *data)

Again, since kernel v3.10, the read & write fields of struct file_operations are to be used respectively, instead of the above two.

write_proc() is very similar to the character device file operation write. And the above code implementation, lets the user to write a digit from 0 to 9, and accordingly sets the internal state.

read_proc() in the above code implementation provides the current state and the time since the system has been booted up in different units based on the current state. These are, jiffies in state 0; milliseconds in state 1; seconds & milliseconds in state 2; hours : minutes : seconds in state 3; and <not implemented> in other states. And to check the computation accuracy, Figure 29 highlights the system up time in the output of top. read_proc()‘s page parameter is a page-sized buffer, typically to be filled up with count bytes from offset off. But more often than not (because of small content), just page is filled up, ignoring all other parameters. Since kernel v3.10, the read logic has to be implemented through something called a sequence file, which is implemented by specifying the pre-defined seq_read() function for the file operation read, and then by providing the above logic in the sequence file’s read function through the file operation open using single_open().

Figure 29: Comparison with top’s output

All the /proc related structure definitions and function declarations are available through <linux/proc_fs.h>. The sequence file related stuff (since kernel v3.10) are available through <linux/seq_file.h>. And the jiffies related function declarations and macro definitions are in <linux/jiffies.h>. On a special note, the actual jiffies are being calculated by subtracting INITIAL_JIFFIES, as on boot-up, jiffies is initialized to INITIAL_JIFFIES instead of zero.

Summing up

“Hey Pugs!!! Why did you create the folder named as anil? Who is this Anil? You could have used my name or may be yours.” suggested Shweta. “Ha!! That’s a surprise. My real name is Anil only. It is just that everyone in college knows me only as Pugs”, smiled Pugs. Watch out for further technical romancing of Pugs aka Anil.

Seventeenth Article >>

Notes

One of the powerful usage of creating our own proc window is for customized debugging by querying. A simple but cool modification of the above driver, could be instead of getting the jiffies, it could read / write hardware registers.

Disk on RAM: Playing destructively

10 Replies

This fifteenth article, which is part of the series on Linux device drivers, experiments with a dummy hard disk on RAM to demonstrate the block drivers.

<< Fourteenth Article

Play first, rules later

After a delicious lunch, theory makes audience sleepy. So, let’s start with the code itself. Code demonstrated is available at dor_code.tar.bz2. This tar ball contains 3 ‘C’ source files, 2 ‘C’ headers, and a Makefile. As usual, executing make will build the ‘disk on ram’ driver (dor.ko) – this time combining the 3 ‘C’ files. Check out the Makefile to see how. make clean would do the usual clean of the built stuff.

Once built, the following are the experimenting steps (Refer to Figures 25, 26, 27):

Load the driver dor.ko using insmod. This would create the block device files representing the disk on 512 KibiBytes (KiB) of RAM, with 3 primary and 3 logical partitions.
Checkout the automatically created block device files (/dev/rb*). /dev/rb is the entire disk of 512 KiB size. rb1, rb2, rb3 are the primary partitions with rb2 being the extended partition and containing the 3 logical partitions rb5, rb6, rb7.
Read the entire disk (/dev/rb) using the disk dump utility dd.
Zero out the first sector of the disk’s first partition (/dev/rb1) again using dd.
Write some text into the disk’s first partition (/dev/rb1) using cat.
Display the initial contents of the first partition (/dev/rb1) using the xxd utility. See Figure 26 for the xxd output.
Display the partition info for the disk using fdisk. See Figure 27 for the fdisk output.
(Quick) Format the third primary partition (/dev/rb3) as vfat filesystem (like your pen drive), using mkfs.vfat (Figure 27).
Mount the newly formatted partition using mount, say at /mnt (Figure 27).
Disk usage utility df would now show this partition mounted at /mnt (Figure 27). You may go ahead and store your files there. But, please remember that these partitions are all on a disk on RAM, and so non-persistent. Hence,
Unloading the driver using ‘rmmod dor‘ would vanish everything. Though the partition needs to be unmounted using ‘umount /mnt‘ before doing that.

Please note that all the above experimenting steps need to be executed with root privileges.

Figure 25: Playing with ‘Disk on RAM’ driver

Figure 26: xxd showing the initial data on the first partition (/dev/rb1)

Figure 27: Formatting the third partition (/dev/rb3)

Now, let’s learn the rules

We have just now played around with the disk on RAM but without actually knowing the rules, i.e. the internal details of the game. So, let’s dig into the nitty-gritties to decode the rules. Each of the three .c files represent a specific part of the driver. ram_device.c and ram_device.h abstract the underlying RAM operations like vmalloc/vfree, memcpy, etc, providing disk operation APIs like init/cleanup, read/write, etc. partition.c and partition.h provide the functionality to emulate the various partition tables on the disk on RAM. Recall the pre-lunch session (i.e. the previous article) to understand the details of partitioning. The code in this is responsible for the partition information like number, type, size, etc that is shown up on the disk on RAM using fdisk. ram_block.c is the core block driver implementation exposing the disk on RAM as the block device files (/dev/rb*) to the user-space. In other words, the four files ram_device.* and partition.* form the horizontal layer of the device driver and ram_block.c forms the vertical (block) layer of the device driver. So, let’s understand that in detail.

The block driver basics

Conceptually, the block drivers are very similar to character drivers, especially with regards to the following:

Usage of device files
Major and minor numbers
Device file operations
Concept of device registration

So, if one already knows character driver implementations, it would be similar to understand the block drivers. Though, they are definitely not identical. The key differences could be listed out as follows:

Abstraction for block-oriented versus byte-oriented devices
Block drivers are designed to be used by I/O schedulers, for optimal performance. Compare that with character drivers to be used by VFS.
Block drivers are designed to be integrated with the Linux’ buffer cache mechanism for efficient data access. Character drivers are pass-through drivers, accessing the hardware directly.

And these trigger the implementation differences. Let’s analyze the key code snippets from ram_block.c, starting at the driver’s constructor rb_init().

First step is to register for a 8-bit (block) major number. And registering for that implicitly means registering for all the 256 8-bit minor numbers associated with that. The function for that is:

int register_blkdev(unsigned int major, const char *name);

major is the major number to be registered. name is a registration label displayed under the kernel window /proc/devices. Interestingly, register_blkdev() tries to allocate & register a freely available major number, when 0 is passed for its first parameter major; and on success, the allocated major number is returned. The corresponding deregistration function is:

void unregister_blkdev(unsigned int major, const char *name);

Both are prototyped in <linux/fs.h>

Second step is to provide the device file operations, through the struct block_device_operations (prototyped in <linux/blkdev.h>) for the registered major number device files. However, these operations are too few compared to the character device file operations, and mostly insignificant. To elaborate, there are no operations even to read and write. That’s surprising. But as we already know that the block drivers need to integrate with the I/O schedulers, the read-write implementation is achieved through something called request queues. So, along with providing the device file operations, the following needs to provided:

Request queue for queuing the read/write requests
Spin lock associated with the request queue for its concurrent access protection
Request function to process the requests queued in the request queue

Also, there is no separate interface for block device file creations, so the following are also provided:

Device file name prefix, commonly referred as disk_name (“rb” in the dor driver)
Starting minor number for the device files, commonly referred as the first_minor

Finally, two block device-specific things are also provided along with the above, namely:

Maximum number of partitions supported for this block device, by specifying the total minors
Underlying device size in units of 512-byte sectors, for the logical block access abstraction

All these are registered through the struct gendisk using the function:

void add_disk(struct gendisk *disk);

The corresponding delete function is:

void del_gendisk(struct gendisk *disk);

Prior to add_disk(), the various fields of struct gendisk need to be initialized, either directly or using various macros/functions like set_capacity(). major, first_minor, fops, queue, disk_name are the minimal fields to be initialized directly. And even before the initialization of these fields, the struct gendisk needs to be allocated using the function:

struct gendisk *alloc_disk(int minors);

where minors is the total number of partitions supported for this disk. And the corresponding inverse function would be:

void put_disk(struct gendisk *disk);

All these are prototyped in <linux/genhd.h>.

Request queue and its request processing function

The request queue also needs to be initialized and set up into the struct gendisk, before the add_disk(). The request queue is initialized by calling:

struct request_queue *blk_init_queue(request_fn_proc *, spinlock_t *);

providing the request processing function and the initialized concurrency protection spinlock, as its parameters. The corresponding queue cleanup function is:

void blk_cleanup_queue(struct request_queue *);

The request (processing) function should be defined with the following prototype:

void request_fn(struct request_queue *q);

And it should be coded to fetch a request from its parameter q, say using

struct request *blk_fetch_request(struct request_queue *q);

and then either process it or initiate the processing. Whatever it does, it should be non-blocking, as this request function is called from a non-process context, and also after taking the queue’s spinlock. So, moreover only the functions not releasing or taking the queue’s spinlock should be used within the request function.

A typical request processing as demonstrated by the function rb_request() in ram_block.c is:

while ((req = blk_fetch_request(q)) != NULL) /* Fetching a request */
{
	/* Processing the request: the actual data transfer */
	ret = rb_transfer(req); /* Our custom function */
	/* Informing that the request has been processed with return of ret */
	__blk_end_request_all(req, ret);
}

Request and its processing

rb_transfer() is our key function, which parses a struct request and accordingly does the actual data transfer. The struct request mainly contains the direction of data transfer, starting sector for the data transfer, total number of sectors for the data transfer, and the scatter-gather buffer for data transfer. The various macros to extract these information from the struct request are as follows:

rq_data_dir(req); /* Operation: 0 - read from device; otherwise - write to device */
blk_req_pos(req); /* Starting sector to process */
blk_req_sectors(req); /* Total sectors to process */
rq_for_each_segment(bv, req, iter) /* Iterator to extract individual buffers */

rq_for_each_segment() is the special one which iterates over the struct request (req) using iter, and extracting the individual buffer information into the struct bio_vec (bv: basic input/output vector) on each iteration. And, then on each extraction, the appropriate data transfer is done, based on the operation type, invoking one of the following APIs from ram_device.c:

void ramdevice_write(sector_t sector_off, u8 *buffer, unsigned int sectors);
void ramdevice_read(sector_t sector_off, u8 *buffer, unsigned int sectors);

Check out the complete code of rb_transfer() in ram_block.c

Summing up

“With that, we have actually learnt the beautiful block drivers by traversing through the design of a hard disk, and playing around with partitioning, formatting, and various other raw operations on a hard disk. Thanks for your patient listening. Now, the session is open for questions. Or, you may post your queries as comments, below.”

Sixteenth Article >>