Failure Recoveries
Recovering from a Link Failure / R2 Site Failure - Normal Establish
If the SRDF link between sites fails, or the R2 site fails, the RDF pair state will be Partitioned. Once the issue with the link/site has been resolved, the RDF pair state will change to Suspended.
In this case service remains running on the R1 devices and the R2 devices remain Write Disabled; the SRDF device groups just need to be established to resume normal replication.
Before running an establish, a clone copy of the R2s should be taken. This would allow recovery to a consistent point at the R2 site should there be a disaster at the R1 site while the establish is in progress.
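A minimal sketch of taking that clone copy with symclone is shown below. The array ID (1234), the pair file name and the clone target devices are placeholders, and the exact options may vary by Solutions Enabler version.
# clone_pairs.txt lists one "R2_device clone_target" pair per line (illustrative)
symclone -sid 1234 -f clone_pairs.txt create -copy          # create full-copy clone sessions
symclone -sid 1234 -f clone_pairs.txt activate -consistent  # activate all pairs at a consistent point
symclone -sid 1234 -f clone_pairs.txt query                 # check copy progress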
- Disable consistency on the group: symrdf -g srdf_APP01 disable
- If the link has been down for some time, the establish should be initiated in adaptive copy mode (a consolidated sketch of the full resync sequence follows this list): symrdf -g srdf_APP01 set mode acp_disk
- Once in the appropriate mode, run the establish. This will sync the R2s with the R1s: symrdf -g srdf_APP01 establish
- Run a query to verify the disks are synchronised: symrdf -g srdf_APP01 query
- Once the number of invalid tracks approaches 0, set the mode back to async and enable consistency on the group:
symrdf -g srdf_APP01 set mode async
symrdf -g srdf_APP01 enable
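For reference, a consolidated sketch of the resync sequence above. The group name is as in the examples; the verify polling step is an assumption and its options may vary by Solutions Enabler version.
symrdf -g srdf_APP01 disable                      # drop consistency protection before changing mode
symrdf -g srdf_APP01 set mode acp_disk            # adaptive copy disk mode for a large backlog of invalid tracks
symrdf -g srdf_APP01 establish                    # resynchronise the R2s from the R1s
symrdf -g srdf_APP01 verify -synchronized -i 300  # poll every 5 minutes until the pairs report synchronised
symrdf -g srdf_APP01 set mode async               # return to normal SRDF/A operation
symrdf -g srdf_APP01 enable                       # re-enable consistency on the group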
Recovering from an R1 Site Failure
This procedure will reverse the direction of replication, so proceed with caution!
This procedure assumes that the R1 site has failed and VCS has brought service up on the R2 devices. In this case the RDF pair state will be Partitioned and the R2 devices would have been forced into an RW state to allow service to start at the R2 site. Once the problems at the R1 site have been resolved, the RDF pair state will be Split and both R1s and R2s will be in an RW state.
This procedure will swap the personality of the R1 and R2 devices. The R2s that are being used to provide service will become R1s and the old R1 devices will become R2s.
Before running an establish, a clone copy should be taken. This will allow recovery to a consistent point at the respective site should there be an issue while the establish is in progress.
Once this procedure is complete, UNIX admins can use VCS to manually fail over service so that the original mode of operation is restored.
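As a rough illustration, the pre-flight check and the protective clone of the R2 devices (currently serving production) might look like the following. The array ID, pair file and device names are placeholders, as in the earlier sketch, and the options may vary by Solutions Enabler version.
symrdf -g srdf_APP01 query                                     # confirm the pair state before proceeding
symclone -sid 5678 -f r2_clone_pairs.txt create -copy          # create full-copy clone sessions on the R2 array
symclone -sid 5678 -f r2_clone_pairs.txt activate -consistent  # activate all pairs at a consistent point
symclone -sid 5678 -f r2_clone_pairs.txt query                 # check copy progress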
- Run a query on the device group: symrdf -g srdf_APP01 query
- Once the links are brought back up, it should show that the state is Split because both R1s and R2s are RW enabled.
- Verify with UNIX admins that the services are running on the R2 site.
- Write disable the R1s. This will change the state to 'Failed Over': symrdf -g srdf_APP01 write_disable r1 -force
- Issue the swap command so that the R1s and R2s are reversed. This command will also throw away the old delta sets on the R1 side because all the updates have been made on the R2 side. You will need to disable consistency on the group first (a consolidated sketch of the full failback sequence follows this list):
symrdf -g srdf_APP01 disable
symrdf -g srdf_APP01 swap -refresh R1 { Causes the R1 side to refresh from the remote mirror }
- If the link has been down for some time, change to adaptive copy mode before initiating an establish:
symrdf -g srdf_APP01 set mode acp_disk
symrdf -g srdf_APP01 establish
- Once the disks approach full sync, change the mode back to async and enable consistency on the device group:
symrdf -g srdf_APP01 set mode async
symrdf -g srdf_APP01 enable
- Everything is now in a position for the cluster to be shut down and brought back up at the original primary site if required.
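For reference, a consolidated sketch of the failback sequence above. The group name is as in the examples; the verify polling step is an assumption and its options may vary by Solutions Enabler version.
symrdf -g srdf_APP01 write_disable r1 -force      # old R1s become Write Disabled; state changes to 'Failed Over'
symrdf -g srdf_APP01 disable                      # drop consistency before the swap
symrdf -g srdf_APP01 swap -refresh R1             # swap personalities; mark the R1 side to refresh from the remote mirror
symrdf -g srdf_APP01 set mode acp_disk            # adaptive copy disk mode for the bulk copy
symrdf -g srdf_APP01 establish                    # copy from the new R1s (old R2s) to the new R2s (old R1s)
symrdf -g srdf_APP01 verify -synchronized -i 300  # poll until the pairs are close to full sync
symrdf -g srdf_APP01 set mode async               # back to SRDF/A
symrdf -g srdf_APP01 enable                       # re-enable consistency on the group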