Generating Events

The generate action requires special discussion because it is a way to create custom events and alerts that are aware of the history of the data.

A stream of JSON events is read and passed through. generate saves these records in a SQLite database, so it can use the full power of SQL to run historical queries, computing aggregates such as averages and maximums over past events.

- generate:
      bbox.linkutilisation.incoming:
        add:
          - severity: warning
          - type: alert
          - title: "Line Utilisation incoming over 90%"
          - text: "average incoming ${avg_incoming}kb close to max incoming ${max_incoming}kb: ratio: ${threshold:1}"
        let:
           avg_incoming: AVG(incomingBytesPerInterval, 5m)
           max_incoming: MAX(incomingBytesPerInterval, 60m)
           threshold: 90.0/100.0
        when: (avg_incoming / max_incoming) > threshold

In short, we take the 5 minute average of the incoming byte rate and the 60 minute maximum, and create an alert when the ratio of that average to the maximum is greater than some threshold.

Under let, we define variables, which may be simple expressions or may contain aggregate functions like AVG or MAX.

All of the input fields in the JSON events can be used in these expressions.
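
For instance, an input event for the Pipe above might look something like this (a minimal sketch: apart from incomingBytesPerInterval, the field names and values are purely illustrative):

{"@timestamp":"2018-05-01T12:00:00.000Z","device":"bbox-01","incomingBytesPerInterval":84,"outgoingBytesPerInterval":23}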

when is the condition that must be true for the alert to be generated.

Note that the title and text may contain ${var:1}, where var is any defined variable or known field. (The ":1" means "one decimal place", which helps to keep reports clean: a value of 0.933, for example, would be rendered as 0.9.)

severity can be 'info', 'warning' or 'error'. The name of the alert (here bbox.linkutilisation.incoming) becomes its aggregation_key, which uniquely identifies the type of alert.

This produces alert events like the following:

{
    "type":"alert",
    "text":"average incoming 84kb close to max incoming 84kb: ratio: 0.9",
    "title":"Line Utilisation incoming over 90%",
    "aggregation_key":"bbox.linkutilisation.incoming",
    "severity":"warning",
    "@timestamp":"2018-05-01T12:00:00.000Z"
}

Filtering Alerts

If you only want to see the alerts, use filter to pass through only those events whose type field has the value "alert":

- filter:
    patterns:
        - type: alert

Here is a similar example, in which the latest DNS timeToRespond is compared against the hourly average avg_lookup_hour.

- generate:
    alert.dns.benchmark:
        #test: true
        add:
            - type: alert
            - severity: warning
            - title: "DNS Lookup over ${threshold}% ${destinationHostname} (${min})"
            - text: "Lookup time ${timeToRespond} greater than hour average ${avg_lookup_hour}: ratio: ${ratio:1}"
        notification: 5m
        any: destinationHostname
        let:
            avg_lookup_hour: AVG(timeToRespond,60m)
            threshold: '150'
            ratio: timeToRespond/avg_lookup_hour
        when: ratio > (threshold/100.0)

But there is an interesting twist. Say our input records look like this:

{"time":"2018-05-01 12:00:00.050","destinationHostname":"example.com","timeToRespond":200.000000,"min":1}
{"time":"2018-05-01 12:00:00.100","destinationHostname":"frodo.co.za","timeToRespond":70.000000,"min":1}
{"time":"2018-05-01 12:00:00.150","destinationHostname":"panoptix.co.za","timeToRespond":50.000000,"min":1}
....

Resolving different hosts can take different times, and generally people only want to know when there's a sudden relative change in resolution time for a particular host. any: destinationHostname ensures that we track the average time to respond individually for each host.
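
If, purely for illustration, the hour average for example.com were around 100ms, then the first record above gives ratio = 200/100 = 2.0, which exceeds 150/100.0 = 1.5, and an alert roughly like this would be raised for that host only (values and exact field set are hypothetical):

{
    "type":"alert",
    "severity":"warning",
    "title":"DNS Lookup over 150% example.com (1)",
    "text":"Lookup time 200 greater than hour average 100: ratio: 2.0",
    "aggregation_key":"alert.dns.benchmark"
}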

Grouping Events by ID

Sometimes a tool used as a probe does not produce single events, but a set of JSON records.

The data for a traceroute Pipe could look something like this, simplified to show just the fields of interest.

{... "scanUUID":"b41657c3-8ba1-42a8-a750-66291015e26a","hopNumber":1 ...}
{... "scanUUID":"b41657c3-8ba1-42a8-a750-66291015e26a","hopNumber":4 ...}
{... "scanUUID":"b41657c3-8ba1-42a8-a750-66291015e26a","hopNumber":7 ...}
{... "scanUUID":"b41657c3-8ba1-42a8-a750-66291015e26a","hopNumber":10 ...}

{... "scanUUID":"d654f821-fcff-48d4-a297-4ecc261d6154","hopNumber":19 ...}
{... "scanUUID":"d654f821-fcff-48d4-a297-4ecc261d6154","hopNumber":6 ...}
{... "scanUUID":"d654f821-fcff-48d4-a297-4ecc261d6154","hopNumber":7 ...}
{... "scanUUID":"d654f821-fcff-48d4-a297-4ecc261d6154","hopNumber":12 ...}

What the records of each scan have in common is a unique scanUUID. We want to look at the maximum hopNumber for a particular id, and compare the previous scan's maximum with the current scan's maximum.

Previously the aggregate functions took time intervals; here they take a field name together with a record index:

- generate:
    alert.bbox.hopCount:
        add:
            - severity: warning
            - type: alert
            - title: "Change of Hop Count"
            - text: "From ${hopCount1} to ${hopCount2}"
        at_end: true
        let:
            hopCount1: MAX(hopNumber,scanUUID:1)
            hopCount2: MAX(hopNumber,scanUUID:0)
        when: ABS(hopCount1-hopCount2) > 2
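
Applied to the sample records above, and assuming that index 0 selects the current scanUUID group and index 1 the previous one, hopCount1 would be 10 and hopCount2 would be 19, so ABS(10-19) = 9 > 2 and an alert along these lines would be produced (a sketch only; the exact field set may differ):

{
    "type":"alert",
    "severity":"warning",
    "title":"Change of Hop Count",
    "text":"From 10 to 19",
    "aggregation_key":"alert.bbox.hopCount"
}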

More than one Alert

You may define multiple alerts watching the same input events:

- generate:
    bbox.temperature:
        add:
            - severity: warning
            - type: alert
            - title: "High Bbox Temperature detected"
            - text: "temperature is ${temperature:1}"
        let:
            temperature: AVG(cpuTemperature,15m)
        when: temperature > 70

    bbox.memusage:
        add:
            - severity: warning
            - type: alert
            - title: "High Bbox Memory Usage detected"
            - text: "memory used ${memUsedPerc}%"
        let:
            memoryUsageAvg: MAX(memoryUsage,15m)
            totalMemoryAvg: MAX(totalMemory,15m)
            memUsedPerc: (memoryUsageAvg/totalMemoryAvg)*100
        when: memUsedPerc > 90

    bbox.diskusage:
        add:
            - severity: warning
            - type: alert
            - title: "High Bbox Disk Usage detected"
            - text: "memory usage is ${usagePercentage}"
        let:
            usagePercentage: AVG(rootPartitionUsagePercentage,15m)
        when: usagePercentage > 80

    bbox.loadaverage:
        add:
            - severity: warning
            - type: alert
            - title: "High Bbox Load detected"
            - text: "load average ${loadAverage5mAvg:1}"
        let:
            loadAverage5mAvg: AVG(loadAverage5m,60m)
        when: loadAverage5mAvg > 3

    bbox.uptime:
        add:
            - severity: warning
            - type: alert
            - title: "Rebooted"
            - text: "uptime of ${uptime}"
        let:
            uptime: AVG(uptimeSeconds,120s)
        when: uptime > 120

Enriching Events

The main requirement of an alert is that it should clearly indicate where it came from. The above alert definitions would be fairly useless without some idea of which site is in trouble, and in addition there are 'magic' numbers like 3 and 80 scattered all over the place.

In the case of the line saturation alert, it would be good to parameterize the alert so that the threshold can be changed easily.

This is a full Pipe definition, where threshold has been defined in the Context. Note that the existing threshold variable under let has been renamed ratio, which is more accurate.

name: line_saturation
input:
    file: incoming.json
context:
    threshold: 90
actions:
- generate:
      bbox.linkutilisation.incoming:
        add:
          - severity: warning
          - type: alert
          - title: "{{name}} Line Utilisation incoming over {{threshold}}%"
          - text: "average incoming ${avg_incoming}kb close to max incoming ${max_incoming}kb: ratio: ${ratio:1} (site {{name}})"
        let:
           avg_incoming: AVG(incomingBytesPerInterval, 5m)
           max_incoming: MAX(incomingBytesPerInterval, 60m)
           ratio: '{{threshold}}/100.0'
        when: (avg_incoming / max_incoming) > ratio
- filter:
    patterns:
        - type: alert
output:
    write: console

You will note that context variables expand with {{threshold}}, not ${threshold}. This is because they are constants that are defined when the Pipe is created, and 'dollar curlies' are for field values and values calculated from them for each incoming event (just as with the add action).
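
To make the distinction concrete, suppose threshold is 90 and the Target is (hypothetically) named site-01. The {{...}} parts are substituted once, when the Pipe is created, while the ${...} parts are filled in from each incoming event, so a generated alert might carry (event values illustrative):

title: "site-01 Line Utilisation incoming over 90%"
text:  "average incoming 84kb close to max incoming 84kb: ratio: 0.9 (site site-01)"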

{{name}} always exists, and by default will be set to the Target name.

This is clearer, because we don't repeat the magic number 90 all over the place. But the real power comes from contexts... you may change threshold for all Targets in the global Context, change it for a particular Target in Target Context, and use Tag Context to set it for particular groups of Targets.