Dear SWITCHengines Users
The stability of the SWITCHengines cluster in Lausanne has improved significantly, and we are confident that the planned work will lead to continued improvement for our customers over the coming days until the incident can be closed.
From the week commencing June 15 we increased the number of processes available to the cluster, which increased the rate at which we were able to delete the problematic bucket data. By the middle of last week we reached the point where the growth of the metadata database had slowed and the processes could run stably without human intervention. The cluster stayed healthy over the weekend, and as a result we were able to remove the rate limits on all volumes on the morning of June 23.
We observed during this period that the deletion and compaction operations that are still needed result in some reduced performance on the cluster. We have throttled the maintenance work to keep the cluster as functional as possible, and together with our Ceph support partners we have revised our strategy so that we can proceed faster without impacting performance.
We are adding 6 new storage nodes to the cluster in Lausanne. These were expedited from our supplier, have been rack-mounted and installed, and will be added to the Ceph cluster over the next week. We will migrate all customers using volumes in Lausanne to these servers and isolate them within the cluster so that they are protected from ongoing operations on object storage. No action is required on the customer side. Once the volume services are protected, the remaining problematic buckets in object storage can be purged faster without impacting live VMs. We will also iterate through the full cluster to ensure the databases are optimised and clean, and remove any remaining S3 rate limits.
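For those interested in the technical detail: in Ceph, isolating a workload onto a dedicated set of servers is normally done by placing those servers under their own CRUSH root (or device class) and pointing the relevant pools at a CRUSH rule that only selects from it. The sketch below illustrates that general approach using standard Ceph commands; the host, root, rule and pool names are placeholders for illustration only and do not describe the exact procedure or naming used on SWITCHengines.

    #!/usr/bin/env python3
    """Illustrative sketch: pin the volume (RBD) pools to a dedicated set of hosts.

    All names below (root bucket, hosts, rule, pools) are placeholders; this shows
    the standard Ceph commands such an isolation step is typically built from, not
    the exact SWITCHengines procedure."""
    import subprocess

    def ceph(*args):
        # Run a Ceph CLI command and fail loudly if it returns an error.
        subprocess.run(["ceph", *args], check=True)

    # 1. Create a separate CRUSH root for the new volume-only servers and move
    #    the new hosts under it.
    ceph("osd", "crush", "add-bucket", "volumes-root", "root")
    for host in ["newnode1", "newnode2", "newnode3"]:  # placeholder hostnames
        ceph("osd", "crush", "move", host, "root=volumes-root")

    # 2. Create a replicated CRUSH rule that only places data under that root,
    #    using hosts as the failure domain.
    ceph("osd", "crush", "rule", "create-replicated",
         "volumes-rule", "volumes-root", "host")

    # 3. Point the volume pools at the new rule; Ceph then backfills their data
    #    onto the isolated servers without any customer-side action.
    for pool in ["volumes", "volumes-ssd"]:  # placeholder pool names
        ceph("osd", "pool", "set", pool, "crush_rule", "volumes-rule")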
Once the volume customers are safely isolated and the cluster has proven stable, we will close the incident and publish a postmortem.
We thank you for your patience during this time. If you have any questions or concerns about the above information, please let us know at engines-support@switch.ch.
All the best,
Ann
--
Ann Harding, Team Lead, Infrastructure & Platform as a Service,
SWITCH, Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
phone +41 44 253 98 14, ann.harding@switch.ch
Working for a better digital world - www.switch.ch
Dear SWITCHengines users,
As part of our commitment to improving the stability of SWITCHengines and modernising its features, we have been preparing upgrades to OpenStack to be carried out over 2020. To minimise service downtime, we have planned and tested a rolling set of updates, component by component, rather than a single longer period of unavailability to upgrade the full system in one go.
Maintenance to upgrade the SWITCHengines authentication component, Keystone, is planned for the morning of Tuesday, June 23rd, between 07:00 and 09:00.
What impact can you expect?
- This change is global across both regions.
- Significant refactoring during the previous upgrade to Ocata now allows us to upgrade with only a few moments of downtime within the longer scheduled maintenance window.
- No immediate changes are necessary on the user side to resume using the service.
- The upgrade improves the security of SWITCHengines by offering stronger password hashing options for newly stored credentials, and it opens the way to performance improvements once the upgrades are complete.
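For context, recent Keystone releases let operators choose the password hashing algorithm applied to newly stored credentials in keystone.conf. The excerpt below illustrates the kind of option involved; the values shown are examples only, not the actual SWITCHengines configuration.

    [identity]
    # Hashing algorithm applied to newly stored passwords:
    # bcrypt (the default), scrypt or pbkdf2_sha512.
    password_hash_algorithm = bcrypt
    # Work factor / rounds used by the selected algorithm (example value).
    password_hash_rounds = 12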
If you have any questions or concerns about the planning or scheduling of this maintenance window, please let us know at engines-support@switch.ch.
Best regards,
Ann
--
Ann Harding, Team Lead, Infrastructure & Platform as a Service,
SWITCH, Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
phone +41 44 253 98 14, ann.harding@switch.ch
Working for a better digital world - www.switch.ch
Dear SWITCHengines Users
Today, the LS cluster again suffered from crashing OSDs, starting at noon and lasting until 18:00. This impacted VM volumes (ceph-standard only) and object storage, both of which are backed by the Ceph storage. The outage occurred after multiple OSDs (the daemons managing the disks) requested a massive amount of RAM. We are still investigating the root cause of the increased memory demand and are in contact with the Ceph developers and the Ceph community. We were able to resolve the problem and the storage is available again.
As already announced, we will isolate the volumes from object storage. Today's outage has led us to the decision to start migrating the data tonight. We will send another update on Monday with more information about the state of the cluster.
We really apologize for any inconvenience you have experienced.
Kind regards,
Your SWITCHengines Team
Dear SWITCHengines users,
We observed control plane issues in Zürich prior to our scheduled maintenance this morning, which caused the maintenance to be cancelled.
While existing VMs running in steady state are not affected, users may experience difficulties starting, stopping or resizing VMs in Zürich at this time. We are investigating and may need to periodically restart control plane services during debugging.
Next update at 12:00.
Regards,
Ann Harding
--
Ann Harding, Team Lead, Infrastructure & Platform as a Service,
SWITCH, Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
phone +41 44 253 98 14, ann.harding@switch.ch
Working for a better digital world - www.switch.ch
Dear SWITCHengines users,
Over the course of this week we have been able to accelerate some of the remedial measures to remove the problematic data from the cluster and to make more resources available to it. As a result, we have updated our recovery plan to accommodate a higher rate of deletions and backfills as we move data within the cluster to safer placement locations.
We have been advised that, while this work is ongoing, it is important to regularly compact the internal databases in the cluster to maintain the upward trajectory. We have observed that this periodically causes slow operations for some customers, and we are adjusting how and when we run the compactions to minimise the impact. If you experience any difficulty, please let us know at engines-support@switch.ch.
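For the technically curious: the internal databases in question are the per-OSD metadata stores (RocksDB), and Ceph provides an administrative command to compact them on demand. The sketch below shows the general shape of such a manual compaction pass using standard Ceph commands; the pacing and selection logic are placeholders, not the exact tooling or schedule we use.

    #!/usr/bin/env python3
    """Illustrative sketch: compact the internal (RocksDB) database of each OSD,
    one at a time, to limit the impact on client I/O. These are standard Ceph
    commands; the pacing is a placeholder, not the SWITCHengines tooling."""
    import json
    import subprocess
    import time

    # List the OSD ids currently present in the cluster.
    out = subprocess.run(["ceph", "osd", "ls", "--format", "json"],
                         check=True, capture_output=True, text=True).stdout
    osd_ids = json.loads(out)

    for osd_id in osd_ids:
        # Ask this OSD daemon to compact its internal database.
        subprocess.run(["ceph", "tell", f"osd.{osd_id}", "compact"], check=True)
        # Pause between OSDs so many compactions do not run at once.
        time.sleep(60)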
Please accept our sincere apologies for this incident and the exceptionally long duration of the recovery measures. At this point we would like to thank you for your continued patience and assure you that we are working as hard as we can to restore normal operations while keeping your data available to you.
Best regards,
Ann
--
Ann Harding, Team Lead, Infrastructure & Platform as a Service,
SWITCH, Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
phone +41 44 253 98 14, ann.harding@switch.ch
Working for a better digital world - www.switch.ch
Dear SWITCHengines users,
We observed instability in the SWITCHengines cluster in LS overnight, connected with the deletions necessary after the long-duration incident earlier this month. A range of data became inaccessible. Some of this has been restored, but a pool of data used by CephFS (POSIX file storage) is currently still unavailable, which impacts the ability to carry out some operations on SWITCHengines VMs, e.g. shelving or snapshotting. We are working to restore the affected placement groups and will issue an update in 2 hours or sooner.
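As background on what restoring the affected placement groups involves: a pool becomes unavailable when some of its placement groups are not active, and the standard Ceph health commands show which ones are affected. The snippet below is a generic diagnostic sketch only, not our internal tooling.

    #!/usr/bin/env python3
    """Illustrative diagnostic sketch: report overall cluster health and any
    placement groups stuck in an inactive state (standard Ceph commands)."""
    import subprocess

    for cmd in (["ceph", "health", "detail"],              # warnings/errors, incl. inactive PGs
                ["ceph", "pg", "dump_stuck", "inactive"]):  # PGs currently not serving I/O
        subprocess.run(cmd, check=True)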
Regards,
Ann
--
Ann Harding, Team Lead, Infrastructure & Platform as a Service,
SWITCH, Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
phone +41 44 253 98 14, ann.harding@switch.ch
Working for a better digital world - www.switch.ch