Convert a file into a sparse file, or concatenate multiple files and/or stdin into a sparse file (assuming, of course, that your operating system and filesystem support sparse files).
what is a sparse file?
Here's a non-sparse file filled with zeroes:
$ dd if=/dev/zero of=zeroes count=1000
$ ls -l zeroes
-rw-r--r-- 1 emikulic users 512000 Dec 29 08:39 zeroes
$ du zeroes
504 zeroes
On a UFS-like filesystem, create a file, seek forwards into an area that doesn't exist yet, and write something. The size of the file will extend to cover everything up to the end of what you wrote. The space between the beginning of the file and the beginning of your data will be filled with zeroes, but these zeroes will not be written out to disk blocks and will not use disk space. UFS will remember there are zeroes there by writing this down in the inode and whatever indirect blocks would normally point to those zeroed-out disk blocks.
$ dd if=/dev/zero of=sparse count=1 seek=999
$ ls -l sparse
-rw-r--r-- 1 emikulic users 512000 Dec 29 08:39 sparse
$ du sparse
8 sparse
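The same seek-then-write trick can be done from a program. The following is a minimal Python sketch, not part of the utility itself; the function names make_sparse_file and sizes are made up for illustration, and it assumes a POSIX system where st_blocks is reported in 512-byte units. Whether the allocated size actually comes out smaller than the apparent size depends on the filesystem.

```python
import os
import tempfile

def make_sparse_file(path, hole_bytes, payload=b"x"):
    """Seek past `hole_bytes` bytes that do not exist yet, then write `payload`."""
    with open(path, "wb") as f:
        f.seek(hole_bytes)   # nothing is written for the skipped range
        f.write(payload)     # the file now extends to hole_bytes + len(payload)

def sizes(path):
    """Return (apparent size, bytes allocated on disk) for `path`."""
    st = os.stat(path)
    # Assumption: st_blocks counts 512-byte units, as on POSIX systems.
    # The attribute is not available on Windows.
    return st.st_size, st.st_blocks * 512

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "sparse")
        # Mirrors `dd count=1 seek=999`: one 512-byte block after 999 skipped blocks.
        make_sparse_file(path, 999 * 512, b"\0" * 512)
        apparent, allocated = sizes(path)
        print("apparent:", apparent, "allocated:", allocated)
```

The apparent size corresponds to what ls -l reports and the allocated size to what du reports.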
sparse is a utility that reads regular files with blocks of zeroes in them and writes them out as sparse files.
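The general idea can be sketched in a few lines of Python. This is an illustration of the technique under stated assumptions, not the utility's actual source: the names sparsify and read_full and the 64 KiB chunk size are hypothetical. Each chunk is read, and a chunk that is entirely zeroes is skipped with a seek instead of being written. A final truncate extends the file if the output ends in a hole.

```python
import os
import sys

CHUNK = 64 * 1024  # assumed chunk size; the real utility's value is not stated

def read_full(src, n):
    """Read up to n bytes, looping over short reads (pipes may return partial data)."""
    parts = []
    while n > 0:
        data = src.read(n)
        if not data:          # end of input
            break
        parts.append(data)
        n -= len(data)
    return b"".join(parts)

def sparsify(sources, dst_path, chunk=CHUNK):
    """Concatenate binary file objects in `sources` into dst_path, skipping zero chunks."""
    zeros = bytes(chunk)
    with open(dst_path, "wb") as dst:
        for src in sources:
            while True:
                block = read_full(src, chunk)
                if not block:
                    break
                if block == zeros[:len(block)]:
                    # All zeroes: move the file offset forward without writing.
                    dst.seek(len(block), os.SEEK_CUR)
                else:
                    dst.write(block)
        # If the data ended in a hole, nothing was written at the tail,
        # so set the file length explicitly to the current offset.
        dst.truncate(dst.tell())

if __name__ == "__main__" and len(sys.argv) >= 3:
    *names, out = sys.argv[1:]
    # "-" stands for stdin, matching the command-line convention shown in this document.
    files = [sys.stdin.buffer if n == "-" else open(n, "rb") for n in names]
    sparsify(files, out)
    for f in files:
        if f is not sys.stdin.buffer:
            f.close()
```

The output is byte-for-byte identical to a plain concatenation; only the on-disk allocation differs, and only on filesystems that support holes.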
$ time sparse 1.?.part 1.??.part outfile.sparse
real 0m26.993s
user 0m0.977s
sys 0m2.281s
$ time cat 1.?.part 1.??.part > outfile.plain
real 0m42.154s
user 0m0.102s
sys 0m4.516s
$ du outfile.sparse outfile.plain
40560 outfile.sparse
155012 outfile.plain
$ cat whatever.* | sparse - outfile.from.stdin
Reading from files always clocks in a little faster than reading from stdin. Avoid piping things into sparse when you don't actually need to.
Just to be really clear:
$ sparse in out
is faster than:
$ sparse - out < in
$ cat in | sparse - out
or any variation thereof.
sparse reads and writes chunks of the same size. I tried modifying the reading code so that it used a huge read buffer, made fewer read() calls, and juggled partial and full blocks around. It ran slower than the naïve version. The change was rolled back.
I tried modifying the reading code to use mmap() instead of read() where possible. It increased time spent in userland, decreased time spent in the kernel, and overall had no effect on the runtime. This change was also rolled back.
what do I do with this?
That's up to you, really.
I've used it for sparsifying partially downloaded files made by BitTorrent and Overnet. Note that the mainline BitTorrent client, running on Unix, will create sparse files by default. This utility is only useful if you've copied or transferred a partially downloaded file from somewhere, thereby expanding the zeroes.
Sparse files require less CPU time and less IO time to read and write compared to non-sparse files full of zeroes.