Folks, I got a great response to this question. Plenty of good info. Some people sent scripts, some provided step-by-steps, etc. Too much good info to summarize, so I have included the good bits from all the emails below. Thanks go out to:

bergman@panix.com, "Seth Rothenberg", "Rasana Atreya", mike.marcell@cntcorp.com, carl.staroscik@barclayscapital.com, Birger Wathne, mark@sowerby.org, "Tom Jones", mgherman@ozemail.com.au, "Bill Armand", John DiMarco

-----Original Message-----
From: Michael Cunningham [SMTP:archive@securityinsight.com]
Sent: Saturday, June 03, 2000 7:58 AM
To: sun-managers@sunmanagers.ececs.uc.edu
Subject: Poor mans HA help needed

Folks, I am currently building an HA system using two Ultra 2s with 3 A1000s dual-attached. I purchased Veritas Volume Manager and VxFS so I could have everything fully journaled and avoid an fsck in case I have to fail over. Now I am trying to figure out how to get the disk groups to fail back and forth between the systems. The boss wouldn't fork out the bucks for the Veritas HA product, so I am writing custom Perl scripts to handle the monitoring and failover. How do I make the disk groups fail back and forth? Do I set up all the drives on one system, then add the other system and pull in the disk configurations somehow? Anyone know any web sites that describe how to do something like this? Anyone have any sample scripts? Basic howtos? Anything would be helpful.

Basically I am looking for: how do I set up Volume Manager on both systems to allow me to move the disk groups back and forth between them? How do I actually fail the disks back and forth? I have read a bit about import and deport on the Veritas site. Any trouble spots I should watch out for? I know I have to be very careful about who controls what when. Is an fsck still possible even using vxfs? How can I avoid it at all costs? I am looking at about 200 GB of data mirrored across all 3 arrays, so a full fsck would be a disaster :( I will summarize of course. Thanks.

Mike

-----------------------------------------------------------------------------

I did the exact same thing with two E450s and two A1000s for a customer last year. Basically you configure everything on one server, with your boot disk in rootdg and the A1000 drives in another diskgroup (we called it A1000). I have attached scripts that were manually executed to "unmount" the volumes on one server and then "mount" them on the other. Keep in mind that the SCSI_INITIATOR_ID must be different between the two servers. Feel free to email me with any questions you may have. I would recommend that you check SunSolve for the SCSI_INITIATOR_ID change, as well as the Sun Cluster docs on docs.sun.com. I've also picked up info from deja.com.

-----------------------------------------------------------------------------

You can do:

    vxdg import -C disk_group
    vxrecover -sb -g disk_group
    fsck -F vxfs /dev/vx/rdsk/disk_group/volume
    mount /volume

to bring the volumes online on the new host. It would be best to unmount and deport the diskgroup from the other host first, although in a failover situation that may not be possible. In any scripts you write, ENSURE that the disks are not being seen by the other host, or it is going to get pretty ugly. Given that vxfs is a journalling filesystem, replaying the log should take significantly less time than a full fsck. The only other thing that comes to mind is that you would not want to mount these filesystems automatically on either host at boot, as that could again lead to awkward situations.
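To make the release side of that concrete, here is a minimal sketch of the corresponding sequence on the host giving up the diskgroup (disk_group and /volume are placeholder names; the fuser step is borrowed from a later reply):

    #!/bin/sh
    # Sketch: release the diskgroup so the peer can import it.
    # disk_group and /volume are placeholder names.
    fuser -ck /volume                       # kick off anything still using the filesystem
    umount /volume                          # repeat for each filesystem in the group
    vxvol -g disk_group -o verbose stopall  # stop all volumes in the group
    vxdg deport disk_group                  # hand the group over to the peer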
As a note, I have never tried exactly what you are looking to do, but the import/deport works fine, as I am using it in another situation. Good luck, and I'd strongly suggest that you make sure you have enough time to test thoroughly on the system before it goes into production.

----------------------------------------------------------------------------

Have a look at www.high-availability.com; their product RSF-1 may well meet your needs and be affordable. It's very easy to administer with rc-style scripts, and it has a very good framework for the nodes to monitor each other's status, not just via the network (you wouldn't want a network glitch to trigger a failover and leave you with a split brain).

----------------------------------------------------------------------------

What you have to do (from memory, so there may be holes):

On the host surrendering the service:
- unexport the file systems
- unmount the file systems
- kill processes using the file systems if necessary
- you may need to stop/start lockd and statd to get rid of file locks before a file system will umount
- deport the disk group

On the host taking over the service:
- make sure the other host has given up the file systems cleanly or is down
- import the disk group
- fsck the file systems (since these are logging file systems, fsck should normally consist of rolling the transactions in the log forward or backward; you will almost never see a full fsck)
- mount the file systems
- export the file systems

----------------------------------------------------------------------------

Assuming that you have cabled the disks to both boxes, you can just do what VCS does: vxdg deport the disk group on one server and vxdg import it on the other. If the live server crashes, vxdg import on the backup server and then fsck all the filesystems. You'll have to be very careful that you don't end up with the disk group live on both servers, as you'll get corruption and probably a panic.

---------------------------------------------------------------------------

HA is hard to get right! Your boss is naive if he expects you to write Perl scripts to auto-failover. However, manual failover is doable by unmounting from one machine and then mounting on the other (you can't mount on both without risking corrupting the filesystem). This can be made transparent to NFS clients if you mount from a virtual IP address and move that IP address to the other machine.

---------------------------------------------------------------------------

Yes, this can be done. I'm going to implement VCS on my servers soon, but it will be replacing our current setup, which is all a bunch of csh scripts. It's actually not bad, as it's easy to debug, but I'm not a big fan of csh. Anyway, I can send them to you tomorrow.

Basically, you'll need to license both servers for VxFS and VxVM, as if they were separate servers. Then make sure that both systems can see the storage array, via format. Create your volumes and mount points on systemA. Create the same mount points on systemB. If you put them in the vfstab, make sure they're marked NO for mount-at-boot on both. As far as failing over, in your scripts you basically want to import the diskgroup on sysA, then mount each of the filesystems, start any apps and so forth (see the sketch below). You'll of course have to have error trapping and verify that all the mounts are present and so on.
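Roughly, combining the import commands from the first reply with the virtual-IP idea above, the takeover half might look like this (a sketch only; datadg, /db/data01, hme0:1, 192.168.1.10 and the application step are made-up examples):

    #!/bin/sh
    # Sketch: take over the shared diskgroup, one filesystem and the service address.
    # datadg, /db/data01, hme0:1 and 192.168.1.10 are made-up example names.

    vxdg import -C datadg || exit 1              # force the import; -C clears the old host's lock
    vxrecover -sb -g datadg                      # start the volumes, resync in the background

    fsck -F vxfs /dev/vx/rdsk/datadg/data01      # replay the vxfs intent log
    mount -F vxfs /dev/vx/dsk/datadg/data01 /db/data01

    ifconfig hme0:1 plumb                        # bring up the virtual service address
    ifconfig hme0:1 192.168.1.10 netmask 255.255.255.0 up

    # start the application(s) here, e.g. your database startup script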
To fail over in the other direction, you basically reverse what you just did: stop all applications running from the mounts, verify that nothing is present or running on the filesystems (fuser -ck FILESYSTEM), unmount the filesystems, then deport the disk group. Once this is done, the other server can run the same startup script to import the disk group, and so on. Hopefully you're familiar with Veritas administration, as you'll need to know all of the command lines, but that's pretty easy.

Obviously, both systems need to know about each other and have ways of verifying which state they're in, who has control of the disks, whether all the disks actually got unmounted/mounted, and so on. Since our scripts are not that robust (they had been around as legacy for a long time before anyone in my group came along), we have 24x7 operators available to issue the failover commands manually. But automating it isn't really that much more difficult. Once you can successfully flop back and forth with your scripts, it's just a matter of writing another piece of code that monitors the systems and, based on your conditions, runs the failover script. This is really all VCS does. It's really just a pretty script doing the monitoring, although very cool and with lots of bells and whistles.

VCS uses ethernet heartbeats to test for system uptime. If you have multiple interfaces (or a quad card), you can do this very easily too. Connect an ethernet cable from one system to the other, give the interfaces addresses/hostnames (hb1, hb2, etc.), then ping from one system to the other on a set schedule and verify that the ping comes back (a sketch of such a check follows). That's really all the heartbeats do. You can just as easily use your public interfaces as the heartbeats. After that, you have to set your conditions on which to fail over, how critical, and so on.
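A bare-bones version of that kind of heartbeat check might look like the following (a sketch only; the peer hostname hb2, the retry counts and the path /etc/scripts/failover.sh are made-up examples, and the ping syntax assumes Solaris):

    #!/bin/sh
    # Sketch: poll the peer over the private heartbeat link and run the
    # failover script after several consecutive misses.
    PEER=hb2
    MISSES=0
    while true
    do
        if ping $PEER 5 > /dev/null 2>&1        # Solaris ping: "ping host timeout"
        then
            MISSES=0
        else
            MISSES=`expr $MISSES + 1`
        fi
        if [ $MISSES -ge 3 ]
        then
            /etc/scripts/failover.sh            # import the diskgroup, mount, start apps
            exit 0
        fi
        sleep 10
    done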
--------------------------------------------------------------------------

I had a similar situation at my previous company: 2 E450s with a dual-attached A5000. The only thing was that I did not have to do hot failovers. However, my ideas might be of help:

#!/bin/sh
#
########
#
# Script to either:
#  - deport the yantra diskgroup from machine yantra03 (production), so it
#    may be imported on machine yantra02 (backup).
#  - deport the yantra diskgroup from machine yantra02 (backup), so it may
#    be imported on machine yantra03 (production).
#
# Rootdg cannot be moved from one system to another; so to share one ssa
# box between two hosts, there must be other diskgroups involved.

while true
do
    echo ''
    echo 'To deport diskgroup yantra (make unavailable) type: d'
    echo 'To import diskgroup yantra (make available) type : i\n'
    echo 'Please make selection:'
    if read SELECTION
    then
        case $SELECTION in
        d)  echo '\nPreparing to deport diskgroup yantra.'
            break
            ;;
        i)  echo '\nPreparing to import diskgroup yantra.'
            break
            ;;
        *)  echo '\nInvalid response! Script being aborted.'
            exit 0
            ;;
        esac
    fi
done

while true
do
    echo ''
    echo 'WARNING if deporting!!!\n'
    echo 'This script will make the database unavailable on this machine.'
    echo 'Shutdown database and Yantra BEFORE running this script.'
    echo 'Remember to run the corresponding script on the alternate machine.\n'
    echo 'Continue? [y,n]: '
    if read RESPONSE
    then
        case $RESPONSE in
        n|N)    echo 'Diskgroup yantra unchanged.\n'
                exit 0
                ;;
        y|Y)    break
                ;;
        *)      echo 'Invalid response! Script being aborted.\n'
                exit 0
                ;;
        esac
    fi
done

if test "$SELECTION" = d
then
    # Deport: release the diskgroup so the alternate machine can import it
    umount /db/data01
    umount /db/data02
    umount /db/data03
    umount /db/data04
    umount /app/yantra
    umount /oracle
    umount /db/data05
    umount /backup
    umount /db/archive
    umount /db/export
    umount /db/data06
    umount /db/data07
    umount /db/data08
    umount /db/data09
    vxvol -g yantra -o verbose stopall
    vxdg deport yantra
    cp /etc/vfstab /etc/vfstab.save
    cp /etc/vfstab.nodb /etc/vfstab
    echo 'Remember to remove VxVM processes from root crontab.\n'
else
    # Import: take the diskgroup over on this machine
    vxdg import yantra
    vxvol -g yantra -o verbose startall
    mount /db/data01
    mount /db/data02
    mount /db/data03
    mount /db/data04
    mount /app/yantra
    mount /oracle
    mount /db/data05
    mount /backup
    mount /db/archive
    mount /db/export
    mount /db/data06
    mount /db/data07
    mount /db/data08
    mount /db/data09
    cp /etc/vfstab /etc/vfstab.save
    cp /etc/vfstab.db /etc/vfstab
    echo 'Remember to setup VxVM processes in root crontab.\n'
    echo 'See /etc/scripts/cron-schedule-yantra03'
fi

exit 0

--------

Of course, the problem with this script is that everything is hard-coded, but it should give you the general idea.

---------------------------------------------------------------------

I don't have any experience with Veritas file management or volume management (we use SDS), but I do have experience with Veritas FirstWatch HA. The capabilities it provides require a lot of tuning, enough that you may break even by writing them yourself. The hard part is reliably recognizing when the remote system has died, etc. You might look for information in the Linux arena on a package called "HeartBeat". It should be portable enough to run on Solaris, and it should provide the most crucial part of the HA package for you.

---------------------------------------------------------------------

On a real Sun cluster (2.2; a very, very, very simplified picture), all the disk configuration information is stored in a private region (private to the cluster software only, not used for data) on the shared disk hardware. There are references in the Veritas config file on each node that point to that private region. In the simplest case, a disk group is only mounted on one machine at a time. When the current master node fails (and the disk group is no longer mounted), another node detects the failure, mounts the private region specified, then completes the disk group import based on the data it finds in the private region.

=> Any trouble spots I should watch out for? I know I have to be
=> very careful about who controls what when. Is an fsck still
=> possible even using vxfs? How can I avoid it at all costs?

Let's see....
- not detecting a failure
- too sensitive a detection (ping-pong from one server to another)
- split-brain problems (both (N) servers think they are the master)
- corruption of shared configuration data
- contention for "master" status at startup (particularly when multiple servers start up simultaneously, such as after a power failure)
- problems with starting applications after a failover, and preservation of the state of application data (different from the accurate, uncorrupted storage of the data itself; for example, a cluster may take 30 seconds to fail over, but the Oracle app may take 2 hours to roll back the saved logs so it's at the correct state)

There are a number of white papers and technical docs on docs.sun.com or sunsolve.sun.com describing the design and administration of a cluster.
While the references are to Sun's solution, you can gain a lot from their analysis of the problems.
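One more note from me, given how many of the replies warn about split brain: a failover script probably shouldn't force the import while the other node still answers on the heartbeat link. A rough sketch of such a guard (hb2 and datadg are placeholder names):

    #!/bin/sh
    # Sketch: refuse to force-import the diskgroup while the peer still answers.
    # hb2 and datadg are placeholder names.
    PEER=hb2
    if ping $PEER 5 > /dev/null 2>&1
    then
        echo "$PEER still answers; not importing datadg. Deport it there first."
        exit 1
    fi
    vxdg import -C datadg
    vxrecover -sb -g datadg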