In the beginning, Linux was a free, general-purpose OS, and it was not clear how Linux companies would make money from it. In 1999, Red Hat went public and started to develop a real business plan. A few years later, in 2003, one of its main competitors, SuSE Linux, was acquired by Novell. Since then, both companies have worked hard to reduce their involvement in desktop solutions and to develop a segment known as the “server market”.
One of the key technologies of the enterprise server market is the Storage Area Network (SAN): an infrastructure that abstracts storage resources. When Linux companies started to compete in the server market, Linux had support for accessing SAN storage (Fibre Channel and iSCSI drivers) and advanced disk partitioning (LVM and EVMS), but no free shared-storage filesystem. So Red Hat acquired Sistina’s GFS, a shared-storage filesystem, imported some work from the OpenGFS developers, released it under an open source license, and evolved it into GFS2.
Meanwhile, Novell looked around and found that Oracle had an ongoing open source project named OCFS2. It was a general-purpose refactoring of the original OCFS filesystem, which Oracle had developed years earlier to handle clustering for its database product. So Novell decided to integrate OCFS2 into its SUSE Linux Enterprise Server platform and advertise it as a top-notch, mature filesystem for clustering in SAN environments.
Unluckily, what Novell’s marketing department didn’t actually know is that OCFS2 has never been production-ready.
Over the last two years, I’ve deployed a number of OCFS2 filesystems on Novell SLES 10 SP2 and run into the following troubles:
- In certain situations, the filesystem reports “Not enough disk space” even though df reports only 50–60% usage, due to a bug in inode allocation when the disk is heavily fragmented. This bug was reported over two years ago and is still in the wild!
- If a node crashes, OCFS2 has no support for intelligent fencing like Red Hat’s, so if your cluster has several nodes, you may need to restore quorum manually.
- There are several race conditions in file locking that lead to corruption in shared BDB databases and similar faults.
- Sharing OCFS2 folders with Samba on the nodes crashes the kernel, due to a bug in the distributed locking routines. This bug was reported over a year ago and is still marked “NEW” in Oracle’s bugzilla.
- In the event of a system crash, OCFS2 may not recover automatically and needs a fsck. In that case, fsck takes forever, may report critical errors, and can ultimately fail, leaving the filesystem unusable and unrecoverable.
- Restoring a SAN filesystem of several terabytes from backup onto OCFS2 takes longer than you’d expect. How much longer? More.
Every attempt to fix these problems using Novell RPM packages, Oracle-released source packages, stock Linux kernels, experimental Linux branches, and patches found on bugzilla failed miserably. Furthermore, it’s pretty clear that Oracle treats its users as beta-testers.
Buyers beware: OCFS2 sucks.
Is GFS2 any better? Yes (it’s genuinely designed as an enterprise product and is integrated with Red Hat’s clustering suite), but it’s still too slow for enterprise applications.
Bottom line: don’t believe the marketing vapor; Linux on a SAN in 2010 is still a no-go.