ZFS Deduplication:
Task: Users tend to keep a lot of similar files in their archives. Is it possible to save space by using deduplication?
Lab: We will create a ZFS file system with deduplication turned on and see if it helps.
Let's model the following situation: we have a file system which is used as an archive. We'll create separate file systems for each user and imagine that they store similar files there.
We will use the ZFS pool called labpool that we created in the first exercise.
Create a file system with deduplication and compression:
root@solaris:~# zfs create -o dedup=on -o compression=gzip labpool/archive
Create users' file systems (we'll call them a, b, c, d for simplicity):
root@solaris:~# zfs create labpool/archive/a
root@solaris:~# zfs create labpool/archive/b
root@solaris:~# zfs create labpool/archive/c
root@solaris:~# zfs create labpool/archive/d
Check their "dedup" property:
root@solaris:~# zfs get dedup labpool/archive/a
NAME               PROPERTY  VALUE  SOURCE
labpool/archive/a  dedup     on     inherited from labpool/archive
Child file systems inherit properties from their parents.
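To see this inheritance for the whole tree at once, zfs get accepts -r and walks a file system and all of its descendants (output omitted here; all four children should report dedup=on, inherited from labpool/archive):
root@solaris:~# zfs get -r dedup labpool/archive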
Create an archive from /usr/share/man/man1, for example.
root@solaris:~# tar czf /tmp/man1.tar.gz /usr/share/man/man1
And copy it four times into the file systems we've just created. Don't forget to check the deduplication ratio after each copy.
root@solaris:~# cd /labpool/archive
root@solaris:/labpool/archive# ls -lh /tmp/man1.tar.gz
-rw-r--r--   1 root     root        3.2M Oct  3 15:30 /tmp/man1.tar.gz
root@solaris:/labpool/archive# zpool list labpool
NAME      SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
labpool  1.46G  7.99M  1.45G   0%  1.00x  ONLINE  -
root@solaris:/labpool/archive# cp /tmp/man1.tar.gz a/
root@solaris:/labpool/archive# zpool list labpool
NAME      SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
labpool  1.46G  12.6M  1.45G   0%  1.00x  ONLINE  -
root@solaris:/labpool/archive# cp /tmp/man1.tar.gz b/
root@solaris:/labpool/archive# zpool list labpool
NAME      SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
labpool  1.46G  12.7M  1.45G   0%  2.00x  ONLINE  -
root@solaris:/labpool/archive# cp /tmp/man1.tar.gz c/
root@solaris:/labpool/archive# zpool list labpool
NAME      SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
labpool  1.46G  12.7M  1.45G   0%  2.00x  ONLINE  -
root@solaris:/labpool/archive# zpool list labpool
NAME      SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
labpool  1.46G  12.5M  1.45G   0%  3.00x  ONLINE  -
root@solaris:/labpool/archive# cp /tmp/man1.tar.gz d/
root@solaris:/labpool/archive# zpool list labpool
NAME      SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
labpool  1.46G  12.5M  1.45G   0%  4.00x  ONLINE  -
It might take a couple of seconds for ZFS to commit those changes and report the correct dedup ratio. Just repeat the command if you don't see the results listed above.
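By the way, the DEDUP column of zpool list is also exposed as a standalone pool property, which is handier if you want to script the check:
root@solaris:/labpool/archive# zpool get dedupratio labpool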
Remember that we also enabled compression (compression=gzip) when we created the file system? Check the compression ratio:
root@solaris:/labpool/archive# zfs get compressratio labpool/archive
NAME             PROPERTY       VALUE  SOURCE
labpool/archive  compressratio  1.00x  -
The reason is simple: the files we placed in the file system were already compressed, so gzip couldn't shrink them further. Sometimes compression saves you space, sometimes deduplication does.
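If you'd like to see compression actually pay off, store an uncompressed copy of the same data and check the ratio again. A minimal sketch (the /tmp/man1.tar name is just an example; remove the copy afterwards so the pool numbers later in this lab still match):
root@solaris:~# tar cf /tmp/man1.tar /usr/share/man/man1
root@solaris:~# cp /tmp/man1.tar /labpool/archive/a/
root@solaris:~# zfs get compressratio labpool/archive/a
root@solaris:~# rm /labpool/archive/a/man1.tar
Manual pages are plain text, so this time the ratio should climb well above 1.00x.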
It's interesting to note that ZFS deduplicates at the block level, not at the file level. That means that even a single file containing many identical blocks will be deduplicated. Let's check this. Create a new ZFS pool:
root@solaris:~# zpool create ddpool raidz /dev/dsk/c2d8 /dev/dsk/c2d9 /dev/dsk/c2d10 /dev/dsk/c2d11
As you remember, when we create a ZFS pool, by default a new ZFS filesystem with the same name is created and mounted. We just have to turn deduplication on:
root@solaris:~# zfs set dedup=on ddpool
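To double-check that the file system exists and picked up the new setting, list it and query the property (the sizes in your output will depend on your disks):
root@solaris:~# zfs list ddpool
root@solaris:~# zfs get dedup ddpool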
Now let's create a big file that contains 1000 copies of the same block. In the following commands we figure out the ZFS record size, create a single file of exactly that size, and then append it to our big file 1000 times.
root@solaris:~# zfs get recordsize ddpool
NAME    PROPERTY    VALUE  SOURCE
ddpool  recordsize  128K   default
root@solaris:~# mkfile 128k 128k-file
root@solaris:~# for i in {1..1000} ; do cat 128k-file >> 1000copies-file ; done
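As a sanity check, the resulting file should be about 1000 × 128 KB, i.e. roughly 125 MB (note that {1..1000} is bash brace expansion; in another shell, use an ordinary while loop instead):
root@solaris:~# ls -lh 1000copies-file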
Now we can copy this file to /ddpool and see the result:
root@solaris:~# cp 1000copies-file /ddpool
root@solaris:~# zpool list
NAME      SIZE  ALLOC   FREE  CAP  DEDUP     HEALTH  ALTROOT
ddpool    748M   357K   748M   0%  1000.00x  ONLINE  -
labpool  1.46G  12.4M  1.45G   0%  4.00x     ONLINE  -
rpool    15.6G  7.91G  7.72G  50%  1.00x     ONLINE  -
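If you are curious what deduplication looks like under the hood, zdb can print statistics for the dedup table (DDT). It is a read-only diagnostic tool and its output format varies between releases, so treat this as optional exploration:
root@solaris:~# zdb -DD ddpool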
How can this help in real life? Imagine you have a policy that requires creating and storing an archive every day. The archive's content doesn't change much from day to day, but you still have to create it daily. Most of the blocks in the archives will be identical, so they can be deduplicated very efficiently. Let's demonstrate this using our system's manual directories.
root@solaris:~# tar cvf /tmp/archive1.tar /usr/share/man/man1
root@solaris:~# tar cvf /tmp/archive2.tar /usr/share/man/man1 /usr/share/man/man2
Clean up our /ddpool file system and copy both files there:
root@solaris:~# rm /ddpool/*
root@solaris:~# cp /tmp/archive* /ddpool
root@solaris:~# zpool list ddpool
NAME    SIZE  ALLOC  FREE  CAP  DEDUP  HEALTH  ALTROOT
ddpool  748M  18.2M  730M   2%  1.90x  ONLINE  -
The man1 part of the second archive is byte-identical to the first archive, since tar writes its members in the order given and both archives start with /usr/share/man/man1; that is why most of the blocks deduplicate. Think about situations in your own work where deduplication could help in the same way.
Homework exercise: compress both archive files with gzip, clean up /ddpool, and copy the compressed files again. Check whether this affects the deduplication ratio.
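If you get stuck, here is a sketch of the steps; the result is left for you to discover. Before running it, try to predict whether the compressed archives will still share blocks:
root@solaris:~# gzip /tmp/archive1.tar /tmp/archive2.tar
root@solaris:~# rm /ddpool/*
root@solaris:~# cp /tmp/archive*.gz /ddpool
root@solaris:~# zpool list ddpool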