At Roblox, network alerts are generated from a combination of sources (TSDBs, syslog alerting, and external sources). Initially, all alerts, both low and high priority, were sent to a Slack channel, where they would routinely go unnoticed during network events. Over time, as the Roblox network grew, existing alerting rules were improved, and new ones were added, the volume of alerts started to grow exponentially. These effects were amplified for events such as device outages, which generate multiple alerts (for links, protocols, neighboring devices, etc.).
This means we first need to identify alerts that can be easily remediated or triaged using software, and then add specific rulesets to both Alert Manager and Auto-Remediation to handle those alerts. Data center devices with links that flap or experience errors have been the biggest beneficiaries of auto-remediation. This has been especially important as our data centers (which use a typical spine-leaf design with Equal-Cost Multi-Path, or ECMP, based traffic forwarding) grow horizontally to accommodate more compute capacity. Horizontal growth adds more network devices at the leaf layer, which in turn increases the number of links that can experience errors.
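As a rough illustration of what such a ruleset could look like, the sketch below matches an incoming alert against a small table of rules and returns the remediation action to take, if any. Everything in it is an assumption made for illustration: the `Alert` and `Rule` types, the alert kinds, the thresholds, and the `drain_link` action name are hypothetical and are not taken from the actual Alert Manager or Auto-Remediation implementations.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Alert:
    device: str
    interface: str
    kind: str   # hypothetical alert kinds: "link_flap", "link_errors"
    count: int  # flaps or errors observed in the alerting window


@dataclass(frozen=True)
class Rule:
    kind: str       # alert kind this rule applies to
    threshold: int  # minimum count before software should act
    action: str     # name of the remediation to run


# Illustrative ruleset only; real thresholds and actions would come from operations.
RULES: List[Rule] = [
    Rule(kind="link_flap", threshold=3, action="drain_link"),
    Rule(kind="link_errors", threshold=100, action="drain_link"),
]


def pick_action(alert: Alert, rules: List[Rule] = RULES) -> Optional[str]:
    """Return the action of the first matching rule, or None to leave the alert for a human."""
    for rule in rules:
        if alert.kind == rule.kind and alert.count >= rule.threshold:
            return rule.action
    return None


if __name__ == "__main__":
    flapping = Alert(device="leaf1", interface="eth1", kind="link_flap", count=5)
    print(pick_action(flapping))
```

Returning `None` for anything that does not match a rule keeps the default conservative: only alerts that have been explicitly identified as safe to handle in software are acted on automatically.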
As the Roblox production network has grown and scaled to meet increased player engagement, network reliability has been a primary focus area for our network engineers. A core component of network reliability is uptime, which is directly influenced by the quality and robustness of the network monitoring, alerting, and remediation stack. To be effective, such a stack should typically have the following attributes:
Automatically plugging into a network Source of Truth (SOT) to obtain the operational status of alerting entities (devices, interfaces, etc.) and automatically suppressing alerts based on those parameters. This allows for weeding out false positive alerts and tuning out the noise where necessary.
Alert generation: Generating an alert based on a source of data, for example time series data, SNMP states of various hardware components, or log messages.
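The two attributes above can be sketched together: generate alerts from a data source, then suppress those whose entity is not expected to be healthy according to the SOT. This is a minimal sketch under stated assumptions, not the actual stack. The per-interface error-rate input, the threshold, the status names, and the dictionary standing in for the SOT are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Entity = Tuple[str, str]  # (device, interface)


@dataclass(frozen=True)
class Alert:
    device: str
    interface: str
    message: str


# Hypothetical operational statuses for which alerts are treated as noise.
SUPPRESSED_STATUSES = {"maintenance", "planned", "decommissioned"}


def generate_alerts(error_rates: Dict[Entity, float], threshold: float) -> List[Alert]:
    """Emit one alert per entity whose latest error rate exceeds the threshold."""
    alerts = []
    for (device, interface), rate in sorted(error_rates.items()):
        if rate > threshold:
            message = f"error rate {rate:.1f}/s exceeds {threshold:.1f}/s"
            alerts.append(Alert(device, interface, message))
    return alerts


def suppress(alerts: List[Alert], sot_status: Dict[Entity, str]) -> List[Alert]:
    """Drop alerts for entities the SOT marks with a suppressed status.

    Entities missing from the SOT are kept, so unknown devices still alert.
    """
    return [
        a for a in alerts
        if sot_status.get((a.device, a.interface)) not in SUPPRESSED_STATUSES
    ]


if __name__ == "__main__":
    # Stand-ins for a time series query result and an SOT lookup.
    error_rates = {("leaf1", "eth1"): 12.0, ("leaf1", "eth2"): 40.0, ("leaf2", "eth1"): 0.2}
    sot_status = {("leaf1", "eth1"): "active", ("leaf1", "eth2"): "maintenance"}
    for alert in suppress(generate_alerts(error_rates, threshold=5.0), sot_status):
        print(alert)
```

Keeping generation and suppression as separate steps means the same SOT filter could be applied to alerts from any source, whether time series data, SNMP states, or log messages.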