Monitoring Ruby applications with Heka and Grafana

Created by Maciej Mensfeld / @maciejmensfeld / mensfeld.pl

Why would you even bother?

  • Because things don't always work as they should
  • Because hardware fails
  • Because good performance is a constant requirement
  • Because monitoring helps catch issues

But how?

It depends ;)

There are no universal solutions

So let's have a small test-case

  • SOA based architecture
  • 20+ applications (and many more in development)
  • A few Rails-based, mostly Sinatra
  • JSON API based endpoints

So let's have a small test-case

  • 6 internal gems used by those apps
  • Moving towards zero IO apps (no writes to HDD)
  • Getting ready for fully ephemeral Docker containers
  • Heavy duty applications (500-700 req/s)
  • Multiple databases (200GB+)

Naive approach

    
    # Naive approach: wrap every task in a benchmark and persist the timing
    around_filter :method do |task|
      benchmark(task.name) do
        task.process
      end
    end
    
    def benchmark(task_name)
      started_at = Time.now
      result = yield
      # Mongoid simple object persisted on every single call
      Usage.create!(
        task_name:  task_name,
        time_taken: (Time.now.to_f - started_at.to_f) * 1000 # milliseconds
      )
      result
    end

    Naive approach

    • Seemed to work
    • Seemed to be fast enough
    • But it wasn't :-(
    • And if you forget about a MongoDB TTL index, you end up with 500GB+ of logs (see the sketch below)
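
    Under the naive approach, the least you need is a capped lifetime for the samples. Below is a sketch of such a Usage model with a TTL index; the field names follow the benchmark code above, while the 7-day window and the timestamps module are assumptions for illustration.

    class Usage
      include Mongoid::Document
      include Mongoid::Timestamps::Created

      field :task_name,  type: String
      field :time_taken, type: Float # milliseconds

      # MongoDB drops documents once created_at is older than expire_after_seconds
      index({ created_at: 1 }, expire_after_seconds: 7 * 24 * 3600)
    end

    The index still has to be created (e.g. via Mongoid's create_indexes rake task) - forgetting that step is exactly how the 500GB+ collection happens.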

    What would Jesus do?

    He probably doesn't care ;)

    But Mozilla does!

    Heka is an open source stream processing software system developed by Mozilla. Heka is a “Swiss Army Knife” type tool for data processing.

    Be friends with Heka because...

    • It works...
    • ... out of the box
    • There's a Docker container with it
    • It is fast
    • It accepts statsd messaging format (via plugin)
    • Accepts UDP packets (example below)
    • Squashes data for you
    • Works great with InfluxDB
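
    The statsd wire format that Heka's plugin understands is plain text over UDP: "key:1|c" for a counter, "key:12.5|ms" for a timer. A minimal sketch (host and port are assumptions for a locally running Heka agent):

    require 'socket'

    socket = UDPSocket.new
    # Count one request and report how long it took, statsd-style
    socket.send('request:1|c',     0, 'localhost', 8125)
    socket.send('request:12.5|ms', 0, 'localhost', 8125)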

    So let's do this better!

    We know that:

    • Storing data locally is too slow
    • Storing data locally is bad
    • TCP requests slow down business logic
    • Some losses are acceptable (fire & forget)
    • Polluting business logic with monitors is bad

    What do we really need?

    • A way to hook up to anything easily
    • How often things happen
    • How fast they are
    • How many errors occur
    • Nice charts
    • Cross-app comparisons

    What can we do?

    • Let's use AOP to wrap monitoring logic around business code
    • UDP should be more than enough
    • Heka to collect data
    • InfluxDB to store it
    • Grafana to graph it
    • A connection pool to avoid hammering the GC (see the sketch below)
    • Let's also wrap it as a gem :)
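
    As a rough sketch of the sending side, a fire & forget UDP sender behind a connection pool could look like this. The Usage module name matches the handlers shown later, but the connection_pool gem, the host/port settings and the key format are assumptions for illustration, not the exact internals of our gem:

    require 'socket'
    require 'connection_pool'

    module Usage
      HOST = ENV.fetch('HEKA_HOST', 'localhost')     # assumed Heka agent address
      PORT = Integer(ENV.fetch('HEKA_PORT', '8125')) # assumed statsd input port

      # Reuse a handful of sockets instead of allocating one per call
      POOL = ConnectionPool.new(size: 5, timeout: 1) { UDPSocket.new }

      # Counter packet, e.g. "request:1|c"
      def self.increment(key)
        send_packet("#{key}:1|c")
      end

      # Timer packet in milliseconds, e.g. "request:12.5|ms"
      def self.timing(key, ms)
        send_packet("#{key}:#{ms.round(2)}|ms")
      end

      def self.send_packet(payload)
        POOL.with { |socket| socket.send(payload, 0, HOST, PORT) }
      rescue SystemCallError, SocketError
        # Fire & forget: a lost metric must never break business logic
        nil
      end
    end

    The stats.* prefixes and .upper/.mean suffixes seen later in the Grafana queries come from the aggregation on the Heka side, not from the sender.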

    Aspect Oriented Programming

    In computing, AOP is a programming paradigm that aims to increase modularity by allowing the separation of cross-cutting concerns.

    There are many AOP libraries for Ruby. We use Aspector.

    With AOP we can create concerns that we can attach to any method of any class as before, around, or after advice.

    Simple aspect example

    
    class Handler < BaseHandler
      def handle
        # Sends a UDP packet about method usage
        Usage.increment(key)
      end
    
      before options[:method], aspect_arg: true do |aspect|
        aspect.handle
      end
    end
    
    # Counter for Sinatra app requests
    Handler.apply(App, method: :call, key: :request)
    
    # Counter for number of invocations of the process method
    Handler.apply(Processor, method: :process, key: :processor)

    More examples

    
    # Monitor number of invocations, time taken and number of errors
    Usage::ComplexHandler.apply(
      App, method: :call, key: :request
    )
    Usage::ComplexHandler.apply(
      FbService, method: :find, key: :fb_find
    )
    
    # We can always track how often we save Mongoid objects
    Usage::IncrementHandler.apply(
      Mongoid::Document, method: :save, key: :mg_save
    )

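    For reference, an around-style handler along these lines could look roughly like the sketch below. ComplexHandler's internals, the Usage.timing helper and the exact around-advice signature with aspect_arg are assumptions for illustration, not Aspector's documented API; the point is that one concern counts invocations, measures time taken and counts errors in a single place.

    class ComplexHandler < BaseHandler
      def handle
        Usage.increment(key)                               # how often
        started_at = Time.now
        result = yield                                     # run the wrapped method
        Usage.timing(key, (Time.now - started_at) * 1000)  # how fast, in ms
        result
      rescue StandardError
        Usage.increment("#{key}.errors")                   # how many errors
        raise
      end

      # The argument order with aspect_arg: true is an assumption here
      around options[:method], aspect_arg: true do |aspect, proxy, *args, &block|
        aspect.handle { proxy.call(*args, &block) }
      end
    end
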
    Let's gra(fana)ph it!

    • Rich graphing
    • Mixed styling
    • Multiple dashboards
    • InfluxDB query editor
    • Annotations
    • Multiple data sources
    • JSON graph editor

    JSON editor

    
    {
      "target": "",
      "function": "max",
      "column": "value",
      "series": "stats.request.times.upper",
      "query": "select max(value) from \"stats.request.times.upper\"",
      "groupby_field": "",
      "alias": "Requests time upper"
    },

    Downsides

    • Self maintained
    • Hard to debug
    • Silently fails on network issues
    • Needs some configuration between components
    • Won't auto-adapt to application changes
    • Useless for non-app-level issues (proxy, etc.)

    THE END - Q & A

    Maciej Mensfeld

    - Aspector
    - Heka
    - InfluxDB
    - Grafana