Dear SWITCHengines users,
As part of our commitment to improve the stability of SWITCHengines and modernise the available features, we have been preparing upgrades to OpenStack to be carried out over 2020. Typically an OpenStack upgrade requires between 7 and 10 individual components to be upgraded. To minimise service downtime we planned and tested a rolling set of updates, component by component, rather than a single larger period of unavailability to upgrade the full system in one go. While many of the components can be upgraded in-service with low risk, three require brief interruptions to service and are scheduled in maintenance windows. The Keystone component for authentication has already been successfully upgraded to OpenStack Pike.
The next scheduled components are Nova, which enables provisioning and deprovisioning of virtual machines, and the network agents, which manage the virtual networking. The strategy will be to upgrade both components in one region, and then both in a second region.
The proposed schedule, all within maintenance windows, is as follows:
Thursday 13th August 21:00-23:00 OpenStack Nova Zürich - provisioning/deprovisioning of VMs
Expected impact: A few moments of unavailability of functionality to provision, deprovision or change VMs. Running VMs are NOT impacted.
Tuesday 18th August 07:00-09:00 OpenStack Network Agents Zürich - virtual networking functions
Expected impact: Virtual networking functionality will be unavailable for a few moments per network agent during this maintenance window. This may impact connectivity to individual VMs on a rolling basis.
Thursday 20th August 21:00-23:00 OpenStack Nova Lausanne - provisioning/deprovisioning of VMs
Expected impact: A few moments of unavailability of functionality to provision, deprovision or change VMs. Running VMs are NOT impacted.
Tuesday 25th August 07:00-09:00 OpenStack Network Agents Lausanne - virtual networking functions
Expected impact: Virtual networking functionality will be unavailable for a few moments per network agent during this maintenance window. This may impact connectivity to individual VMs on a rolling basis.
If you have any questions about the planned upgrades or have concerns about specific proposed dates, please let us know at engines-support@switch.ch.
All the best,
Ann
--
Ann Harding, Team Lead, Infrastructure & Platform as a Service,
SWITCH, Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
phone +41 44 253 98 14, ann.harding@switch.ch
Working for a better digital world - www.switch.ch
Dear SWITCHengines users,
Overnight, the Object Store (S3) part of the Lausanne SWITCHengines cluster had a significant number of failed processes. The remaining rebuild work on the object store from that time (complete reinstall of all OSDs) is currently 50% complete. We are working to restore service and are investigating the cause in detail. In order to bring processes back online, it is necessary to temporarily disable access to S3 for a number of hours today.
Due to the rearchitecting of the cluster in June, VM/volume users are not impacted.
Next update will be in 2 hours.
If you have any issues or concerns, please let us know at engines-support@switch.ch.
All the best,
Ann
Dear SWITCHengines users,
We have published a postmortem on the extended incident affecting SWITCHengines storage in Lausanne, outlining the concrete technical issues that triggered the outage and the measures needed to mitigate it and restore service.
This is available at https://help.switch.ch/engines/status/outages/ to allow you to share it as needed.
If you have any questions, or would like further technical or non-technical detail on the incident or the follow-up, please contact engines-support@switch.ch and we will be happy to arrange a call with you.
This is also an open invitation to let us know of any comments or observations you would like to contribute to the evolving storage strategy at SWITCH.
We thank you for your patience during the very difficult weeks of the incident.
All the best,
Ann
Dear SWITCHengines users,
The stability of the SWITCHengines cluster in Lausanne has significantly improved, and we are confident that the planned work will lead to continuous improvements for our customers over the coming days until the incident can be closed.
From the week commencing June 15 we increased the number of processes available to the cluster, which increased the rate at which we were able to delete the problematic bucket data. In the middle of last week we reached the point where the growth rate of the metadata database fell and processes could run stably without human intervention. The cluster stayed healthy over the weekend, and as a result we were able to remove rate limits on all volumes on the morning of June 23.
We observed in this period that the deletion and compaction operations still needed result in some reduced performance on the cluster. We throttled the maintenance work in order to keep the cluster as functional as possible, and together with our Ceph support partners we revised our strategy to be able to proceed faster without impacting performance.
We are adding 6 new storage nodes to the cluster in Lausanne. These were expedited from our supplier, have been rack-mounted and installed, and will be added to the Ceph cluster over the next week. We will migrate all customers using volumes in Lausanne to these servers and isolate them within the cluster so that they are protected from ongoing operations on object storage. No intervention is required for this on the customer side. Once the volume services are protected, the remaining problematic buckets in object storage can be purged faster without impacting live VMs. We will also iterate through the full cluster, ensuring that databases are optimised and clean, and remove any remaining S3 limits.
When the volume customers are safely isolated and proven stable we will close the incident and publish a postmortem.
We thank you for your patience during this time. If you have any questions or concerns about the above information, please let us know at engines-support@switch.ch.
All the best,
Ann