Tag Archives: Partitions

Disk on RAM: Playing destructively

This fifteenth article, which is part of the series on Linux device drivers, experiments with a dummy hard disk on RAM to demonstrate the block drivers.

<< Fourteenth Article

Play first, rules later

After a delicious lunch, theory makes audience sleepy. So, let’s start with the code itself. Code demonstrated is available at dor_code.tar.bz2. This tar ball contains 3 ‘C’ source files, 2 ‘C’ headers, and a Makefile. As usual, executing make will build the ‘disk on ram’ driver (dor.ko) – this time combining the 3 ‘C’ files. Check out the Makefile to see how. make clean would do the usual clean of the built stuff.

Once built, the following are the experimenting steps (Refer to Figures 25, 26, 27):

  • Load the driver dor.ko using insmod. This would create the block device files representing the disk on 512 KibiBytes (KiB) of RAM, with 3 primary and 3 logical partitions.
  • Checkout the automatically created block device files (/dev/rb*). /dev/rb is the entire disk of 512 KiB size. rb1, rb2, rb3 are the primary partitions with rb2 being the extended partition and containing the 3 logical partitions rb5, rb6, rb7.
  • Read the entire disk (/dev/rb) using the disk dump utility dd.
  • Zero out the first sector of the disk’s first partition (/dev/rb1) again using dd.
  • Write some text into the disk’s first partition (/dev/rb1) using cat.
  • Display the initial contents of the first partition (/dev/rb1) using the xxd utility. See Figure 26 for the xxd output.
  • Display the partition info for the disk using fdisk. See Figure 27 for the fdisk output.
  • (Quick) Format the third primary partition (/dev/rb3) as vfat filesystem (like your pen drive), using mkfs.vfat (Figure 27).
  • Mount the newly formatted partition using mount, say at /mnt (Figure 27).
  • Disk usage utility df would now show this partition mounted at /mnt (Figure 27). You may go ahead and store your files there. But, please remember that these partitions are all on a disk on RAM, and so non-persistent. Hence,
  • Unloading the driver using ‘rmmod dor‘ would vanish everything. Though the partition needs to be unmounted using ‘umount /mnt‘ before doing that.

Please note that all the above experimenting steps need to be executed with root privileges.

Figure 25: Playing with 'Disk on RAM' driver

Figure 25: Playing with ‘Disk on RAM’ driver

Figure 26: xxd showing the initial data on the first partition (/dev/rb1)

Figure 26: xxd showing the initial data on the first partition (/dev/rb1)

Figure 27: Formatting the third partition (/dev/rb3)

Figure 27: Formatting the third partition (/dev/rb3)

Now, let’s learn the rules

We have just now played around with the disk on RAM but without actually knowing the rules, i.e. the internal details of the game. So, let’s dig into the nitty-gritties to decode the rules. Each of the three .c files represent a specific part of the driver. ram_device.c and ram_device.h abstract the underlying RAM operations like vmalloc/vfree, memcpy, etc, providing disk operation APIs like init/cleanup, read/write, etc. partition.c and partition.h provide the functionality to emulate the various partition tables on the disk on RAM. Recall the pre-lunch session (i.e. the previous article) to understand the details of partitioning. The code in this is responsible for the partition information like number, type, size, etc that is shown up on the disk on RAM using fdisk. ram_block.c is the core block driver implementation exposing the disk on RAM as the block device files (/dev/rb*) to the user-space. In other words, the four files ram_device.* and partition.* form the horizontal layer of the device driver and ram_block.c forms the vertical (block) layer of the device driver. So, let’s understand that in detail.

The block driver basics

Conceptually, the block drivers are very similar to character drivers, especially with regards to the following:

  • Usage of device files
  • Major and minor numbers
  • Device file operations
  • Concept of device registration

So, if one already knows character driver implementations, it would be similar to understand the block drivers. Though, they are definitely not identical. The key differences could be listed out as follows:

  • Abstraction for block-oriented versus byte-oriented devices
  • Block drivers are designed to be used by I/O schedulers, for optimal performance. Compare that with character drivers to be used by VFS.
  • Block drivers are designed to be integrated with the Linux’ buffer cache mechanism for efficient data access. Character drivers are pass-through drivers, accessing the hardware directly.

And these trigger the implementation differences. Let’s analyze the key code snippets from ram_block.c, starting at the driver’s constructor rb_init().

First step is to register for a 8-bit (block) major number. And registering for that implicitly means registering for all the 256 8-bit minor numbers associated with that. The function for that is:

int register_blkdev(unsigned int major, const char *name);

major is the major number to be registered. name is a registration label displayed under the kernel window /proc/devices. Interestingly, register_blkdev() tries to allocate & register a freely available major number, when 0 is passed for its first parameter major; and on success, the allocated major number is returned. The corresponding deregistration function is:

void unregister_blkdev(unsigned int major, const char *name);

Both are prototyped in <linux/fs.h>

Second step is to provide the device file operations, through the struct block_device_operations (prototyped in <linux/blkdev.h>) for the registered major number device files. However, these operations are too few compared to the character device file operations, and mostly insignificant. To elaborate, there are no operations even to read and write. That’s surprising. But as we already know that the block drivers need to integrate with the I/O schedulers, the read-write implementation is achieved through something called request queues. So, along with providing the device file operations, the following needs to provided:

  • Request queue for queuing the read/write requests
  • Spin lock associated with the request queue for its concurrent access protection
  • Request function to process the requests queued in the request queue

Also, there is no separate interface for block device file creations, so the following are also provided:

  • Device file name prefix, commonly referred as disk_name (“rb” in the dor driver)
  • Starting minor number for the device files, commonly referred as the first_minor

Finally, two block device-specific things are also provided along with the above, namely:

  • Maximum number of partitions supported for this block device, by specifying the total minors
  • Underlying device size in units of 512-byte sectors, for the logical block access abstraction

All these are registered through the struct gendisk using the function:

void add_disk(struct gendisk *disk);

The corresponding delete function is:

void del_gendisk(struct gendisk *disk);

Prior to add_disk(), the various fields of struct gendisk need to be initialized, either directly or using various macros/functions like set_capacity(). major, first_minor, fops, queue, disk_name are the minimal fields to be initialized directly. And even before the initialization of these fields, the struct gendisk needs to be allocated using the function:

struct gendisk *alloc_disk(int minors);

where minors is the total number of partitions supported for this disk. And the corresponding inverse function would be:

void put_disk(struct gendisk *disk);

All these are prototyped in <linux/genhd.h>.

Request queue and its request processing function

The request queue also needs to be initialized and set up into the struct gendisk, before the add_disk(). The request queue is initialized by calling:

struct request_queue *blk_init_queue(request_fn_proc *, spinlock_t *);

providing the request processing function and the initialized concurrency protection spinlock, as its parameters. The corresponding queue cleanup function is:

void blk_cleanup_queue(struct request_queue *);

The request (processing) function should be defined with the following prototype:

void request_fn(struct request_queue *q);

And it should be coded to fetch a request from its parameter q, say using

struct request *blk_fetch_request(struct request_queue *q);

and then either process it or initiate the processing. Whatever it does, it should be non-blocking, as this request function is called from a non-process context, and also after taking the queue’s spinlock. So, moreover only the functions not releasing or taking the queue’s spinlock should be used within the request function.

A typical request processing as demonstrated by the function rb_request() in ram_block.c is:

while ((req = blk_fetch_request(q)) != NULL) /* Fetching a request */
{
	/* Processing the request: the actual data transfer */
	ret = rb_transfer(req); /* Our custom function */
	/* Informing that the request has been processed with return of ret */
	__blk_end_request_all(req, ret);
}

Request and its processing

rb_transfer() is our key function, which parses a struct request and accordingly does the actual data transfer. The struct request mainly contains the direction of data transfer, starting sector for the data transfer, total number of sectors for the data transfer, and the scatter-gather buffer for data transfer. The various macros to extract these information from the struct request are as follows:

rq_data_dir(req); /* Operation: 0 - read from device; otherwise - write to device */
blk_req_pos(req); /* Starting sector to process */
blk_req_sectors(req); /* Total sectors to process */
rq_for_each_segment(bv, req, iter) /* Iterator to extract individual buffers */

rq_for_each_segment() is the special one which iterates over the struct request (req) using iter, and extracting the individual buffer information into the struct bio_vec (bv: basic input/output vector) on each iteration. And, then on each extraction, the appropriate data transfer is done, based on the operation type, invoking one of the following APIs from ram_device.c:

void ramdevice_write(sector_t sector_off, u8 *buffer, unsigned int sectors);
void ramdevice_read(sector_t sector_off, u8 *buffer, unsigned int sectors);

Check out the complete code of rb_transfer() in ram_block.c

Summing up

“With that, we have actually learnt the beautiful block drivers by traversing through the design of a hard disk, and playing around with partitioning, formatting, and various other raw operations on a hard disk. Thanks for your patient listening. Now, the session is open for questions. Or, you may post your queries as comments, below.”

Sixteenth Article >>

   Send article as PDF   

Understanding the Partitions: A dive inside the hard disk

This fourteenth article, which is part of the series on Linux device drivers, takes you for a walk inside a hard disk.

<< Thirteenth Article

Hard disk design

“Doesn’t it sound like a mechanical engineering subject: Design of hard disk?”, questioned Shweta. “Yes, it does. But understanding it gets us an insight into its programming aspect”, reasoned Pugs while waiting for the commencement of the seminar on storage systems.

Figure 23: Partition listing by fdisk

Figure 23: Partition listing by fdisk

The seminar started with a few hard disks in the presenter’s hand and then a dive down into her system showing the output of fdisk -l, as shown in Figure 23. The first line shows the hard disk size in human friendly format and in bytes. The second line mentions the number of logical heads, logical sectors per track, and the actual number of cylinders on the disk – these together are referred as the geometry of the disk. The 255 heads indicating the number of platters or disks, as one read-write head is needed per disk. Let’s number them say D1, D2, …, D255. Now, each disk would have the same number of concentric circular tracks, starting from outside to inside. In the above case there are 60801 such tracks per disk. Let’s number them say T1, T2, …, T60801. And a particular track number from all the disks forms a cylinder of the same number. For example, tracks T2 from D1, D2, …, D255 will all together form the cylinder C2. Now, each track has the same number of logical sectors – 63 in our case, say S1, S2, …, S63. And each sector is typically 512 bytes. Given this data, one can actually compute the total usable hard disk size, using the following formula:

Usable hard disk size in bytes = (Number of heads or disks) * (Number of tracks per disk) * (Number of sectors per track) * (Number of bytes per sector, i.e. sector size).

For the disk under consideration it would be: 255 * 60801 * 63 * 512 bytes = 500105249280 bytes. Note that this number may be slightly less than the actual hard disk – 500107862016 bytes in our case. The reason for that is that the formula doesn’t consider the bytes in last partial or incomplete cylinder. And the primary reason for that is the difference between today’s technology of organizing the actual physical disk geometry and the traditional geometry representation using heads, cylinders, sectors. Note that in the fdisk output, we referred to the heads, and sectors per track as logical not the actual one. One may ask, if today’s disks doesn’t have such physical geometry concepts, then why to still maintain that and represent them in more of logical form. The main reason is to be able to continue with same concepts of partitioning and be able to maintain the same partition table formats especially for the most prevalent DOS type partition tables, which heavily depend on this simplistic geometry. Note the computation of cylinder size (255 heads * 63 sectors / track * 512 bytes / sector = 8225280 bytes) in the third line and then the demarcation of partitions in units of complete cylinders.

DOS type partition tables

This brings us to the next important topic of understanding the DOS type partition tables. But in the first place, what is a partition and rather why to even partition? A hard disk can be divided into one more logical disks, each of which is called a partition. And this is useful for organizing different types of data separately. For example, different operating system data, user data, temporary data, etc. So, partitions are basically logical divisions and hence need to be maintained through some meta data, which is the partition table. A DOS type partition table contains 4 partition entries, each being a 16-byte entry. And each of these entries can be depicted by the following ‘C’ structure:

typedef struct
{
	unsigned char boot_type; // 0x00 - Inactive; 0x80 - Active (Bootable)
	unsigned char start_head;
	unsigned char start_sec:6;
	unsigned char start_cyl_hi:2;
	unsigned char start_cyl;
	unsigned char part_type;
	unsigned char end_head;
	unsigned char end_sec:6;
	unsigned char end_cyl_hi:2;
	unsigned char end_cyl;
	unsigned int abs_start_sec;
	unsigned int sec_in_part;
} PartEntry;

And this partition table followed by the two-byte signature 0xAA55 resides at the end of hard disk’s first sector, commonly known as Master Boot Record or MBR (in short). Hence, the starting offset of this partition table within the MBR is 512 – (4 * 16 + 2) = 446. Also, a 4-byte disk signature is placed at the offset 440. The remaining top 440 bytes of the MBR are typically used to place the first piece of boot code, that is loaded by the BIOS to boot up the system from the disk. Listing of part_info.c contains these various defines, and the code for parsing and printing a formatted output of the partition table of one’s hard disk.

From the partition table entry structure, it could be noted that the start and end cylinder fields are only 10 bits long, thus allowing a maximum of 1023 cylinders only. However, for today’s huge hard disks, this size is no way sufficient. And hence in overflow cases, the corresponding <head, cylinder, sector> triplet in the partition table entry is set to the maximum value, and the actual value is computed using the last two fields: the absolute start sector number (abs_start_sec) and the number of sectors in this partition (sec_in_part). Listing of part_info.c contains the code for this as well.

#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

#define MBR_DISK_SIGNATURE_OFFSET 440
#define PARTITION_ENTRY_SIZE 16 // sizeof(PartEntry)
#define PARTITION_TABLE_SIZE (4 * PARTITION_ENTRY_SIZE)

typedef struct
{
	unsigned char boot_type; // 0x00 - Inactive; 0x80 - Active (Bootable)
	unsigned char start_head;
	unsigned char start_sec:6;
	unsigned char start_cyl_hi:2;
	unsigned char start_cyl;
	unsigned char part_type;
	unsigned char end_head;
	unsigned char end_sec:6;
	unsigned char end_cyl_hi:2;
	unsigned char end_cyl;
	unsigned int abs_start_sec;
	unsigned int sec_in_part;
} PartEntry;

typedef struct
{
	unsigned char boot_code[MBR_DISK_SIGNATURE_OFFSET];
	unsigned int disk_signature;
	unsigned short pad;
	unsigned char pt[PARTITION_TABLE_SIZE];
	unsigned short signature;
} MBR;

void print_computed(unsigned int sector)
{
	unsigned int heads, cyls, tracks, sectors;

	sectors = sector % 63 + 1 /* As indexed from 1 */;
	tracks = sector / 63;
	cyls = tracks / 255 + 1 /* As indexed from 1 */;
	heads = tracks % 255;
	printf("(%3d/%5d/%1d)", heads, cyls, sectors);
}

int main(int argc, char *argv[])
{
	char *dev_file = "/dev/sda";
	int fd, i, rd_val;
	MBR m;
	PartEntry *p = (PartEntry *)(m.pt);

	if (argc == 2)
	{
		dev_file = argv[1];
	}
	if ((fd = open(dev_file, O_RDONLY)) == -1)
	{
		fprintf(stderr, "Failed opening %s: ", dev_file);
		perror("");
		return 1;
	}
	if ((rd_val = read(fd, &m, sizeof(m))) != sizeof(m))
	{
		fprintf(stderr, "Failed reading %s: ", dev_file);
		perror("");
		close(fd);
		return 2;
	}
	close(fd);
	printf("\nDOS type Partition Table of %s:\n", dev_file);
	printf("  B Start (H/C/S)   End (H/C/S) Type  StartSec    TotSec\n");
	for (i = 0; i < 4; i++)
	{
		printf("%d:%d (%3d/%4d/%2d) (%3d/%4d/%2d)  %02X %10d %9d\n",
			i + 1, !!(p[i].boot_type & 0x80),
			p[i].start_head,
			1 + ((p[i].start_cyl_hi << 8) | p[i].start_cyl),
			p[i].start_sec,
			p[i].end_head,
			1 + ((p[i].end_cyl_hi << 8) | p[i].end_cyl),
			p[i].end_sec,
			p[i].part_type,
			p[i].abs_start_sec, p[i].sec_in_part);
	}
	printf("\nRe-computed Partition Table of %s:\n", dev_file);
	printf("  B Start (H/C/S)   End (H/C/S) Type  StartSec    TotSec\n");
	for (i = 0; i < 4; i++)
	{
		printf("%d:%d ", i + 1, !!(p[i].boot_type & 0x80));
		print_computed(p[i].abs_start_sec);
		printf(" ");
		print_computed(p[i].abs_start_sec + p[i].sec_in_part - 1);
		printf(" %02X %10d %9d\n", p[i].part_type,
			p[i].abs_start_sec, p[i].sec_in_part);
	}
	printf("\n");
	return 0;
}

As the above code (part_info.c) is an application, compile it to an executable (./part_info) as follows: gcc part_info.c -o part_info, and then run ./part_info /dev/sda to check out your primary partitioning information on /dev/sda. Figure 24 shows the output of ./part_info on the presenter’s system. Compare it with the fdisk output as shown in figure 23.

Figure 24: Output of ./part_info

Figure 24: Output of ./part_info

Partition types and Boot records

Now as this partition table is hard-coded to have 4 entries, that dictates the maximum number of partitions, we can have, namely 4. These are called primary partitions, each having a type associated with them in their corresponding partition table entry. The various types are typically coined by the various operating system vendors and hence it is a sort of mapping to the various operating systems, for example, DOS, Minix, Linux, Solaris, BSD, FreeBSD, QNX, W95, Novell Netware, etc to be used for/with the particular operating system. However, this is more of ego and formality than a real requirement.

Apart from these, one of the 4 primary partitions can actually be labeled as something referred as an extended partition, and that has a special significance. As name suggests, it is used to further extend the hard disk division, i.e. to have more partitions. These more partitions are referred as logical partitions and are created within the extended partition. Meta data of the logical partitions is maintained in a linked list format, allowing unlimited number of logical partitions, at least theoretically. For that, the first sector of the extended partition, commonly referred to as Boot Record or BR (in short), is used in a similar way as MBR to store (the linked list head of) the partition table for the logical partitions. Subsequent linked list nodes are stored in the first sector of the subsequent logical partitions, referred to as Logical Boot Record or LBR (in short). Each linked list node is a complete 4-entry partition table, though only the first two entries are used – the first for the linked list data, namely, information about the immediate logical partition, and second one as the linked list’s next pointer, pointing to the list of remaining logical partitions.

To compare and understand the primary partitioning details on your system’s hard disk, follow these steps (as root user and hence with care):

  • ./part_info /dev/sda # Displays the partition table on /dev/sda
  • fdisk -l /dev/sda # To display and compare the partition table entries with the above

In case you have multiple hard disks (/dev/sdb, …), or hard disk device file with other names (/dev/hda, …), or an extended partition, you may try ./part_info <device_file_name> on them as well. Trying on an extended partition would give the information about the starting partition table of the logical partitions.

Summing up

“Right now we carefully and selectively played (read only) with our system’s hard disk. Why carefully? As otherwise, we may render our system non-bootable. But no learning is complete without a total exploration. Hence, in our post-lunch session, we would create a dummy disk on RAM and do the destructive explorations.”

Fifteenth Article >>

   Send article as PDF