sparse

Convert a file into a sparse file, or concatenate multiple files and/or stdin into a single sparse file (assuming, of course, that your operating system and filesystem support sparse files).

sparse.c (3.8KB)

what is a sparse file?

Here's a non-sparse file filled with zeroes:

$ dd if=/dev/zero of=zeroes count=1000
$ ls -l zeroes
-rw-r--r--    1 emikulic users      512000 Dec 29 08:39 zeroes
$ du zeroes
504     zeroes

On a UFS-like filesystem, you can create a file, seek forwards into an area that doesn't exist yet, and write something. The size of the file will extend to cover everything up to the end of what you wrote. The space between the beginning of the file and the beginning of your data will read back as zeroes, but these zeroes will not be written out to disk blocks and will not use disk space. UFS remembers that there are zeroes there by recording a hole, instead of a block address, in the inode and whatever indirect blocks would normally point to those zeroed-out disk blocks.

$ dd if=/dev/zero of=sparse count=1 seek=999
$ ls -l sparse
-rw-r--r--    1 emikulic users      512000 Dec 29 08:39 sparse
$ du sparse
8       sparse
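The same thing can be done from C. This is only a minimal sketch of the seek-then-write idea, not code taken from sparse.c; the function names make_sparse and report are made up for illustration, and the 512-byte unit for st_blocks is an assumption that holds on common Unix systems but is not guaranteed everywhere.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create `path` with a hole of `hole_bytes`,
 * followed by one 512-byte block that is actually written.
 * Returns 0 on success, -1 on error. */
int make_sparse(const char *path, off_t hole_bytes)
{
    char block[512] = {0};
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

    if (fd == -1)
        return -1;
    /* Seek past the end of the (empty) file, then write one block.
     * Nothing is written for the region that was skipped over. */
    if (lseek(fd, hole_bytes, SEEK_SET) == (off_t)-1 ||
        write(fd, block, sizeof block) != (ssize_t)sizeof block) {
        close(fd);
        return -1;
    }
    return close(fd);
}

/* Hypothetical helper: print the apparent size next to the allocated
 * space. Assumes st_blocks counts 512-byte units. */
int report(const char *path)
{
    struct stat st;

    if (stat(path, &st) == -1)
        return -1;
    printf("%s: %lld bytes long, %lld bytes allocated\n", path,
           (long long)st.st_size, (long long)st.st_blocks * 512);
    return 0;
}
```

Calling make_sparse("sparse", 999 * 512) followed by report("sparse") should roughly mirror the dd example above: the length is 512000 bytes, and on a filesystem that supports holes the allocated figure should be much smaller.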

sparse is a utility that reads regular files with blocks of zeroes in them and writes them out as sparse files.

examples

$ time sparse 1.?.part 1.??.part outfile.sparse
real    0m26.993s
user    0m0.977s
sys     0m2.281s

$ time cat 1.?.part 1.??.part > outfile.plain
real    0m42.154s
user    0m0.102s
sys     0m4.516s

$ du outfile.sparse outfile.plain
40560   outfile.sparse
155012  outfile.plain

$ cat whatever.* | sparse - outfile.from.stdin

performance notes

Reading from files always clocks in a little faster than reading from stdin. Avoid piping things into sparse when you don't actually need to.

Just to be really clear:

$ sparse in out

is faster than:

$ sparse - out < in

or:

$ cat in | sparse - out

or any variation thereof.

sparse reads and writes chunks of the same size. I tried modifying the reading code so that it used a huge read buffer, made fewer read() calls, and juggled partial and full blocks around. It ran slower than the naïve version. The change was rolled back.

I tried modifying the reading code to use mmap() instead of read() where possible. It increased time spent in userland, decreased time spent in the kernel, and overall had no effect on the runtime. This change was also rolled back.

what do I do with this?

That's up to you, really.

I've used it for sparsifying partially downloaded files made by BitTorrent and Overnet. Note that the mainline BitTorrent client, running on Unix, will create sparse files by default. This utility is only useful if you've copied or transferred a partially downloaded file from somewhere, thereby expanding the zeroes.

Sparse files require less CPU time and less IO time to read and write compared to non-sparse files full of zeroes.