tag:milinda.svbtle.com,2014:/feedMilinda Pathirage2015-09-01T07:54:16-07:00Milinda Pathiragehttps://milinda.svbtle.comSvbtle.comtag:milinda.svbtle.com,2014:Post/copying-additional-files-with-gradle-application-plugin2015-09-01T07:54:16-07:002015-09-01T07:54:16-07:00Copying Additional Files With Gradle Application Plugin<p>When building application distributions for Java apps, you often need to bundle default configuration files, other resources, etc. into your application distribution. If you are using <a href="http://gradle.org">Gradle</a> with the <a href="https://docs.gradle.org/current/userguide/application_plugin.html">Gradle Application Plugin</a> to create the application distribution for your project, you can use the following code fragment in your Gradle build script to copy additional files.</p>
<pre><code class="prettyprint lang-groovy">applicationDistribution.from("src/main/resources/conf") {
into "conf"
}
</code></pre>
<p>The code fragment above copies the contents of the <em>src/main/resources/conf</em> directory into a <em>conf</em> directory located in the root of your distribution.</p>
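<p>The same mechanism works for individual files and other directories. A minimal sketch (the <em>LICENSE</em> file and <em>scripts</em> directory below are hypothetical; adjust the paths to your project layout):</p>
<pre><code class="prettyprint lang-groovy">// Copy a single file into the distribution root
applicationDistribution.from("LICENSE")

// Copy helper scripts into bin, making them executable
applicationDistribution.from("scripts") {
    into "bin"
    fileMode = 0755
}
</code></pre>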
tag:milinda.svbtle.com,2014:Post/monitoring-play-web-applications2015-05-20T09:37:23-07:002015-05-20T09:37:23-07:00Publishing Play Web Application Metrics to InfluxDB<p><a href="https://github.com/milinda/metrics-play">metrics-play</a>, a fork of the <a href="https://github.com/kenshoo/metrics-play">metrics-play</a> project, can be used to publish Play web application metrics, including JVM metrics, to InfluxDB via the Graphite protocol.</p>
<p>To add support for publishing Play application metrics to InfluxDB, first add <em>metrics-play</em> with the Graphite reporter to your Play app’s dependencies, as shown below:</p>
<pre><code class="prettyprint">libraryDependencies ++= Seq(
"com.kenshoo" %% "metrics-play" % "2.3.0_0.2.1-graphite",
javaJdbc,
javaEbean,
cache,
javaWs
)
</code></pre>
<p>Then add the Play plugin <code class="prettyprint">com.kenshoo.play.metrics.MetricsPlugin</code> to your <em>play.plugins</em> file. Finally, in the Play application configuration, enable and configure the Graphite reporter as shown below:</p>
<pre><code class="prettyprint">metrics {
graphite {
enabled = true
period = 1
unit = MINUTES
host = localhost
port = 2003
}
}
</code></pre>
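<p>An entry in <em>play.plugins</em> takes the form <code class="prettyprint">priority:classname</code>; the priority value below is only an example:</p>
<pre><code class="prettyprint">10000:com.kenshoo.play.metrics.MetricsPlugin
</code></pre>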
<p>The above steps assume that you have configured the InfluxDB Graphite input plugin as shown below.</p>
<pre><code class="prettyprint">[input_plugins]
# Configure the graphite api
[input_plugins.graphite]
enabled = true
# address = "0.0.0.0" # If not set, is actually set to bind-address.
port = 2003
database = "play-influxdb" # store graphite data in this database
udp_enabled = true # enable udp interface on the same port as the tcp interface
</code></pre>
<p>The current implementation of this plugin doesn’t support route-level metrics. I am hoping to add them based on the <a href="https://github.com/typesafehub/play-plugins/tree/master/statsd">StatsD plugin</a> for Play.</p>
tag:milinda.svbtle.com,2014:Post/cluster-and-service-monitoring-using-grafana-influxdb-and-collecd2015-04-30T11:25:24-07:002015-04-30T11:25:24-07:00Server Monitoring Solution Using Grafana, InfluxDB and collectd<p>A couple of days ago, I wanted to add a couple of new nodes to the Ganglia deployment I maintain to monitor <a href="http://www.hathitrust.org/htrc">HTRC</a> services and cluster nodes. Even though everything looked okay after installing and configuring the Ganglia monitor daemons on the new machines, I couldn’t get them to publish monitoring data to Ganglia’s gmetad. Worse, I couldn’t find any errors (I am not sure whether I looked in the correct location, but I couldn’t find anything). I first tried to install <a href="http://pcp.io">Performance Co-Pilot</a> with Netflix’s <a href="https://github.com/Netflix/vector">Vector</a>, but couldn’t figure out how to set up a central metric collection server. Even though the PCP and Vector combination looked really great, having to type a node’s host name every time I wanted to monitor a server was not what I wanted.</p>
<p>So I decided to give the <a href="http://grafana.org">Grafana</a>, <a href="http://influxdb.com">InfluxDB</a> and <a href="https://collectd.org">collectd</a> combination a try. I was able to get this setup working within a couple of hours, with several dashboards for a subset of servers. Below is a screenshot of one of the dashboards.</p>
<p><a href="https://svbtleusercontent.com/fckfkivngpv9g.png"><img src="https://svbtleusercontent.com/fckfkivngpv9g_small.png" alt="Screen Shot 2015-04-30 at 1.50.02 PM.png"></a></p>
<p>In this post, I am going to discuss how to get these three tools working together as a scalable and flexible monitoring solution for a small cluster of nodes.</p>
<p>First of all, install the latest InfluxDB on one of your nodes and configure the firewall to open up the InfluxDB admin console port and the InfluxDB back-end port. You can find installation instructions <a href="http://influxdb.com/download/">here</a>. The next step is to install collectd on the nodes you want to monitor. If you are using Ubuntu, <a href="https://www.digitalocean.com/community/tutorials/how-to-configure-collectd-to-gather-system-metrics-for-graphite-on-ubuntu-14-04">this</a> is a good document on how to install and configure collectd.</p>
<p>Your next task is to publish the stats collected by the collectd daemons to the InfluxDB instance you just deployed. Some time back we had to use a collectd-to-InfluxDB proxy to get this done, but since version 0.8.4 InfluxDB supports the native collectd protocol. To make this setup work, you have to enable the collectd input plugin for InfluxDB by adding the following configuration to the <em>input_plugins</em> section of the InfluxDB configuration file (/opt/influxdb/shared/config.toml).</p>
<pre><code class="prettyprint">[input_plugins.collectd]
enabled = true
port = 8096
database = "collectd"
typesdb = "/usr/share/collectd/types.db"
</code></pre>
<p>The <strong>typesdb</strong> definition is used by InfluxDB to understand the collectd data it receives. You can copy this file from one of your servers running collectd, or you can install collectd on the server running InfluxDB; the example above shows the scenario where collectd is installed on the same server as InfluxDB. Restart InfluxDB after the necessary configuration is done. You can use whatever <em>port</em> you like in the above configuration. Also, make sure to create a database called <strong>collectd</strong> in InfluxDB via the admin web UI or using the REST API as shown below.</p>
<pre><code class="prettyprint">curl -X POST 'http://influxdb-host-name:8086/db?u=root&p=root' \
-d '{"name": "collectd"}'
</code></pre>
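<p>You can verify that the database was created by listing the databases via the same REST API (this sketch assumes InfluxDB 0.8’s API and the default root credentials used above):</p>
<pre><code class="prettyprint">curl 'http://influxdb-host-name:8086/db?u=root&p=root'
# the response should be a JSON list that includes the "collectd" database
</code></pre>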
<p>The next step is to configure collectd to send data directly to InfluxDB instead of storing it as rrd files. In the collectd configuration file (/etc/collectd/collectd.conf on Ubuntu and /etc/collectd.conf on Red Hat EL), enable the <em>network</em> plugin and configure it to send metrics to InfluxDB as shown below.</p>
<pre><code class="prettyprint">LoadPlugin network
...
<Plugin "network">
  Server "influxdb-host-name" "8096"
</Plugin>
</code></pre>
<p>You can also disable the <em>rrdtool</em> plugin, since you no longer need it (we are sending metrics to InfluxDB instead of writing rrd files). Restart collectd to make the configuration changes take effect.</p>
<p>Next, you can go to the InfluxDB web UI and list the time series in your database (<em>using the ‘list series’ query</em>). If you are getting data from collectd, you should see something like the following:</p>
<p><a href="https://svbtleusercontent.com/5xdl3gjspy66wq.png"><img src="https://svbtleusercontent.com/5xdl3gjspy66wq_small.png" alt="Screen Shot 2015-04-30 at 2.21.22 PM.png"></a></p>
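<p>To drill into one of these series, you can run a query against it from the same UI. A sketch (the series name below is an assumption; substitute one from your own ‘list series’ output):</p>
<pre><code class="prettyprint">select mean(value) from "cpu-0/cpu-idle"
where time > now() - 1h group by time(5m)
</code></pre>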
<p>The final step is to deploy Grafana somewhere and create some dashboard widgets to visualize the time series data you are interested in, as shown in the first image. More information on how to configure an InfluxDB data source for Grafana can be found <a href="http://docs.grafana.org/datasources/influxdb/">here</a>.</p>
<h3 id="related-posts_3">Related Posts <a class="head_anchor" href="#related-posts_3">#</a>
</h3>
<ul>
<li><a href="http://milinda.svbtle.com/monitoring-play-web-applications">Publishing Play Web Application Metrics to InfluxDB</a></li>
</ul>
tag:milinda.svbtle.com,2014:Post/interesting-resources-on-writing2015-04-17T18:33:14-07:002015-04-17T18:33:14-07:00Interesting Resources on Writing<p>Starting to participate in #The100DayProject by writing every day for 100 days got me researching more about writing. Writing is a major part of life as a grad student, but I was far behind on my writing and wanted to improve by writing frequently. It’s well known that writing is hard and will always be hard for a lot of us, but writing more and more will make you better at it. The following two articles contain some interesting ideas and tips on how to get better at writing.</p>
<ul>
<li><a href="http://zettelkasten.de/posts/identity-schedule-serious-writing/">Make Writing a Part of Your Identity</a></li>
<li><a href="http://thesiswhisperer.com/2011/03/24/how-to-write-1000-words-a-day-and-not-go-bat-shit-crazy/">How to write 1000 words a day</a></li>
</ul>
<p>‘<a href="http://zettelkasten.de/posts/identity-schedule-serious-writing/">Make Writing a Part of Your Identity</a>’ has some interesting ideas about how to make writing a habit and why doing so is important. I strongly recommend that article to anyone who wants to get better at writing. It makes a few major points.</p>
<ul>
<li>You have to track your writing if you want to improve it.</li>
<li>You have to make writing a habit, and the article has some great advice on how to do so.</li>
<li>The article introduces a concept called the <strong>Knowledge Cycle</strong>, where the writing process is a cycle of steps: research, read, take notes and write.</li>
</ul>
<p>‘<a href="http://thesiswhisperer.com/2011/03/24/how-to-write-1000-words-a-day-and-not-go-bat-shit-crazy/">How to write 1000 words a day</a>’ is mostly about large writing projects such as a thesis, but the tips below from that post will be helpful to anyone working on a writing project, big or small.</p>
<ul>
<li>
<strong>Spend less time at your desk:</strong> Don’t spend the whole day writing. Dedicate a quarter of your day, or even less depending on the size of your project, to writing; then take a break and return later to clean it up.</li>
<li>
<strong>Remember the two-hour rule:</strong> See the article about ‘<a href="http://www.kenrockwell.com/business/two-hour-rule.htm">The Two-Hour Rule</a>’.</li>
<li>
<strong>Make writing new stuff the first thing you do when you sit down to your desk:</strong> IMO, this can be valid if you have to write full-time or for a task such as thesis writing. But in other cases this may not be completely applicable.</li>
<li>
<strong>Start in the middle:</strong> Don’t attempt to write introductions, conclusions or important transitions first.</li>
<li>
<strong>Write as fast as you can, not as well as you can:</strong> Worrying too much about sentences you write can slow down the process. So write as fast as you can and come back later to clean it up.</li>
<li>
<strong>Leave it to rest… then re-write:</strong> When you write fast, much of what you have written may look crappy. Take a break and then start re-writing.</li>
</ul>
<p>Even though some points may not be applicable unless you are working on a large writing project, I found tips such as ‘write as fast as you can’ and ‘take a break and clean it up’ really helpful for any kind of writing.</p>
tag:milinda.svbtle.com,2014:Post/versioning-rest-apis2015-04-09T20:39:08-07:002015-04-09T20:39:08-07:00Versioning REST APIs<p>Yesterday, a discussion around versioning REST APIs resulted in an interesting sequence of events, where one person even started to attack me and one other person via personal e-mails. So I wanted to explore REST API versioning further, to understand the problem and its solutions better. Let’s start by looking at why versioning is needed.</p>
<p>As Troy Hunt discussed in his popular <a href="http://www.troyhunt.com/2014/02/your-api-versioning-is-wrong-which-is.html">post</a> on API versioning, the main reason is the evolution of software. It’s hard (maybe even impossible) to get software <em>right</em> in the first release. As the world moves on, new requirements arrive, so introducing a new version is unavoidable.</p>
<p>When your API is used by various clients, you may have to maintain multiple versions; it’s not realistic to expect everyone to migrate to a new version within a short time period.</p>
<p>There are multiple popular ways to version a REST API and there are proponents and opponents for each of these solutions.</p>
<ul>
<li>Include the API version in the URL like <a href="http://services.digg.com/2.0/comment.bury">http://services.digg.com/2.0/comment.bury</a>
</li>
<li>Adding a custom request header. For example, Azure <a href="https://msdn.microsoft.com/en-us/library/azure/dd894041.aspx">services accept the custom header</a> <code class="prettyprint">x-ms-version: 2014-02-14</code> to determine which API version to use when responding to a client in some situations.</li>
<li>Modifying the <em>Accept</em> header to include the version like <code class="prettyprint">Accept: application/vnd.musicstore-v1+json</code>
</li>
</ul>
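<p>On the wire, the three approaches look like this (the host and resource below are made up for illustration; the version values follow the examples above):</p>
<pre><code class="prettyprint"># Version in the URL
GET /2.0/orders/42 HTTP/1.1
Host: api.example.com

# Version in a custom request header
GET /orders/42 HTTP/1.1
Host: api.example.com
x-ms-version: 2014-02-14

# Version in the Accept header
GET /orders/42 HTTP/1.1
Host: api.example.com
Accept: application/vnd.musicstore-v1+json
</code></pre>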
<p>People have different views about the solutions mentioned above. Some argue that you should always keep the resource URL constant. Others argue that adding a custom header sucks because it’s not the ideal way (semantically) to describe a resource. Still others say the <code class="prettyprint">Accept</code> header sucks because it is harder to test: you can no longer give out a clickable URL to try the API. But as Troy has already discussed in his <a href="http://www.troyhunt.com/2014/02/your-api-versioning-is-wrong-which-is.html">article</a>, there is no right or wrong way; you should go for the most practical solution for your situation.</p>
<p>If you are interested in this topic, you should read Troy’s post. And there are some other interesting posts as well.</p>
<ul>
<li>
<a href="http://www.lexicalscope.com/blog/2012/03/12/how-are-rest-apis-versioned/">How are REST APIs versioned?</a> - Lists how popular APIs do versioning.</li>
<li><a href="http://barelyenough.org/blog/2008/05/versioning-rest-web-services/">Versioning REST Web Services</a></li>
<li>
<a href="http://blog.steveklabnik.com/posts/2011-07-03-nobody-understands-rest-or-http">Nobody Understands REST or HTTP</a><br>
</li>
</ul>
tag:milinda.svbtle.com,2014:Post/cql-continuous-query-language2015-04-07T19:16:03-07:002015-04-07T19:16:03-07:00CQL - Continuous Query Language<p>In today’s data-driven economy, organizations depend heavily on data analytics to stay competitive. Advances in <em>Big Data</em> related technologies have transformed how organizations interact with data, and as a result more and more data is generated at ever-increasing rates. Much of this data is available as continuous streams, and organizations utilize stream processing technologies to extract insights in real time (or as the data arrives). As a result of this change in how we collect and process data, stream processing platforms like Apache Storm, Spark Streaming and Apache Samza were created, building on about a decade of experience with <em>Big Data</em> processing technologies such as Hadoop.</p>
<p>But these modern platforms lack support for SQL-like declarative query languages and require sound knowledge of imperative-style programming and distributed systems to use effectively. For broader adoption, support for SQL-like continuous query languages, or SQL with streaming extensions, is required. In this post I’m going to discuss one such language, <a href="http://dl.acm.org/citation.cfm?id=1146463">CQL</a>, invented roughly 10 years ago for querying data streams. The theoretical framework and SQL extensions discussed in the CQL paper are still important, and we are using concepts from CQL as a foundation for <a href="https://issues.apache.org/jira/browse/SAMZA-390">Apache Samza’s Streaming SQL</a> implementation.</p>
<h1 id="what-is-cql_1">What is CQL? <a class="head_anchor" href="#what-is-cql_1">#</a>
</h1>
<p>CQL is not SQL, but a SQL-based declarative language for querying streams and stored relations (a.k.a. database tables). The abstract semantics of CQL rely on three types of operations – <em>stream-to-relation</em>, <em>relation-to-relation</em> and <em>relation-to-stream</em> – over two types of data – <strong>streams</strong> and <strong>relations</strong>.</p>
<h2 id="streams-and-relations_2">Streams and relations <a class="head_anchor" href="#streams-and-relations_2">#</a>
</h2>
<ul>
<li>
<strong>Stream</strong> - a (possibly infinite) bag of elements <em>&lt;s, t&gt;</em>, where <em>s</em> is a tuple and <em>t</em> is the timestamp of the element</li>
<li>
<strong>Relation</strong> - a mapping from a time instant to a finite but unbounded bag of tuples. This differs from the general definition of a relation, where there is no notion of time; relations in the context of CQL are known as <em>instantaneous relations</em>, which vary with time.</li>
</ul>
<h2 id="operators_2">Operators <a class="head_anchor" href="#operators_2">#</a>
</h2>
<ul>
<li>
<strong>Stream-to-relation</strong> - produces an (instantaneous) relation from a stream. The window operator (of which there are different types, such as sliding and tumbling) is the only stream-to-relation operator available in CQL.</li>
<li>
<strong>Relation-to-relation</strong> - produces a relation from one or more relations. The selection, projection and aggregation operators in CQL are <em>relation-to-relation</em> operators.</li>
<li>
<strong>Relation-to-stream</strong> - produces a stream from a relation. The difference between the previous and current instantaneous relations is used to convert a relation into a stream.</li>
</ul>
<p>Stream-to-stream operators are absent; they are constructed by combining the three types of operators defined above. The figure below, from the CQL <a href="http://dl.acm.org/citation.cfm?id=1146463">paper</a>, is the best visualization of the abstract semantics defined in CQL.</p>
<p><a href="https://svbtleusercontent.com/3oufi0345afvqa.png"><img src="https://svbtleusercontent.com/3oufi0345afvqa_small.png" alt="cql-semantics.png"></a></p>
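<p>The three operator classes compose into a complete query. For example, a query in the style of the paper’s examples (using its <em>PosSpeedStr</em> stream; treat the exact syntax as a sketch):</p>
<pre><code class="prettyprint">-- [Range 30 Seconds] : stream-to-relation (sliding window)
-- Avg(speed)         : relation-to-relation (aggregation)
-- Istream            : relation-to-stream (emit changes)
SELECT Istream(Avg(speed))
FROM PosSpeedStr [Range 30 Seconds]
</code></pre>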
<h1 id="why-cql-is-interesting_1">Why Is CQL Interesting? <a class="head_anchor" href="#why-cql-is-interesting_1">#</a>
</h1>
<p>Operators like <em>join</em> and some aggregation operators available in SQL are blocking and impossible to evaluate over unbounded streams. So a window operator, which divides the stream into possibly overlapping subsets, is used after the stream scan to reduce the scope of the query to a window extent.</p>
<p>In CQL, the concept of a window is embedded into the semantics via the notion of an <em>instantaneous relation</em>, and this allows query execution engines to implement operators such as joins and aggregations as if they were operating on general relations. In addition, CQL allows the integration of stored relations into streaming queries without any magic, because once a stream is converted to an instantaneous relation, we are basically working with relations.</p>
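<p>For example, a stream-to-table join needs no special machinery, because the windowed stream is just another relation (the stream and table names here are made up):</p>
<pre><code class="prettyprint">-- Join the last 100 order events against a stored Customers table
SELECT Rstream(Orders.order_id, Customers.name)
FROM Orders [Rows 100], Customers
WHERE Orders.customer_id = Customers.customer_id
</code></pre>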
<p>In addition to the semantic features mentioned above, the query execution strategy explained in the CQL paper is also interesting.</p>
<h1 id="cql-query-execution_1">CQL Query Execution <a class="head_anchor" href="#cql-query-execution_1">#</a>
</h1><h2 id="streams-and-insertdelete-streams_2">Streams and Insert/Delete Streams <a class="head_anchor" href="#streams-and-insertdelete-streams_2">#</a>
</h2>
<p>In the CQL runtime, a stream is represented as a sequence of timestamped <em>insert</em> tuples, and a time-varying relation (a bag of tuples) is represented as timestamped <em>insert</em> and <em>delete</em> tuples. These insertions and deletions represent the changing state of the relation, and this technique makes it easy to implement incremental processing of streams.</p>
<p>Synopses are used to maintain intermediate state, such as the current contents of a sliding window or the current state of a relation for a join operation.</p>
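<p>A small informal illustration of the insert/delete encoding: for a stream S with elements &lt;a,1&gt;, &lt;b,2&gt; and &lt;c,3&gt;, the relation S [Rows 2] (the last two tuples) changes at time 3, and that change is encoded as one insertion plus one deletion (the notation here is informal, not from the paper):</p>
<pre><code class="prettyprint">time 1: +<a,1>            window = {a}
time 2: +<b,2>            window = {a, b}
time 3: +<c,3> -<a,3>     window = {b, c}
</code></pre>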
<p>More information about CQL query execution can be found in Section 12 of CQL <a href="http://dl.acm.org/citation.cfm?id=1146463">paper</a>.</p>
<h1 id="limitations_1">Limitations <a class="head_anchor" href="#limitations_1">#</a>
</h1>
<p>Coming soon.</p>
tag:milinda.svbtle.com,2014:Post/freshet-cql-based-clojure-dsl-for-streaming-queries2015-01-05T08:25:10-08:002015-01-05T08:25:10-08:00Freshet - CQL based Clojure DSL for Streaming Queries (Draft)<p><strong><em>This blog post is still a draft.</em></strong></p>
<p>Interest in continuous queries over streams of data has increased over the last couple of years, driven by the need to derive actionable information as soon as possible to stay competitive in a fast-moving world. Existing Big Data technologies designed for batch processing couldn’t handle today’s near real-time requirements, and distributed stream processing systems like Yahoo’s <a href="http://incubator.apache.org/s4/">S4</a>, Twitter’s <a href="https://storm.apache.org/">Storm</a>, <a href="https://spark.apache.org/streaming/">Spark Streaming</a> and LinkedIn’s <a href="http://samza.incubator.apache.org/">Samza</a> were introduced into the fast-growing Big Data ecosystem to tackle them. These systems are robust, fault-tolerant and scalable enough to handle massive volumes of streaming data, but lack first-class support for SQL-like querying capabilities. All of these frameworks provide <a href="http://samza.incubator.apache.org/learn/documentation/0.8/api/overview.html">high-level</a> <a href="https://storm.apache.org/documentation/Trident-tutorial.html">programming</a> <a href="https://spark.apache.org/docs/latest/streaming-programming-guide.html">API</a>s in JVM-compatible languages.</p>
<p>In the golden era of stream processing research, a lot of work was done on query engines and languages for stream processing. But we have yet to widely adapt this work on streaming query languages to the distributed stream processing systems mentioned above that are in use today.</p>
<p>Also, with the transition from batch to real-time Big Data, different architectures were proposed to handle the <a href="http://lambda-architecture.net/">integration of batch and real-time systems (Lambda Architecture)</a> as well as to <a href="http://www.kappa-architecture.com/">revolutionize the way we build today’s systems (Kappa Architecture)</a>. Even though there aren’t any standards (like SQL and relational algebra for databases) for implementing these architectures, <a href="https://github.com/twitter/summingbird">Summingbird</a> implements <a href="http://lambda-architecture.net/">Lambda Architecture</a> based on <a href="http://en.wikipedia.org/wiki/Monoid">monoids</a>, and there are <a href="http://spark-summit.org/2014/talk/applying-the-lambda-architecture-with-spark">other ways</a> to implement <em>Lambda Architecture</em>, such as Spark’s Scala API for streaming and batch processing. Even though it is possible to implement <a href="http://www.kappa-architecture.com/">Kappa Architecture</a> manually using the frameworks mentioned above, there aren’t any high-level frameworks like Summingbird for this purpose. <a href="https://github.com/milinda/Freshet">Freshet</a> tries to fill this gap by adapting the continuous query semantics and execution planning methods discussed by Arasu et al. in their paper <a href="https://cs.uwaterloo.ca/%7Edavid/cs848/stream-cql.pdf">The CQL Continuous Query Language: Semantic Foundations and Query Execution</a> to implement <em>Kappa Architecture</em> on top of Apache Samza.</p>
<p>Before going into details about <a href="https://github.com/milinda/Freshet">Freshet</a>, it’s important to discuss <em>Kappa Architecture</em> and <em>CQL</em>, the fundamental ideas and technologies on which Freshet is based.</p>
<h1 id="kappa-architecture_1">Kappa Architecture <a class="head_anchor" href="#kappa-architecture_1">#</a>
</h1>
<p><em>Kappa Architecture</em>, built around the notion that everything is a stream, was <a href="http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html">proposed</a> as an alternative to Lambda Architecture. In the linked article, the author argues that stream processing is a generalization of data-flow DAGs with support for check-pointing intermediate results and continuous output to the end user. He emphasizes that we can use a current distributed stream processing framework like Apache Samza, combined with a message queue like <a href="http://kafka.apache.org/">Kafka</a>, which retains ordered data, to implement the use cases handled by Lambda Architecture. Reprocessing is accomplished by replaying the stream through a new version of the stream processing code, or through a completely new algorithm.</p>
<h1 id="cql-continuous-query-language_1">CQL - Continuous Query Language <a class="head_anchor" href="#cql-continuous-query-language_1">#</a>
</h1>
<p><a href="https://cs.uwaterloo.ca/%7Edavid/cs848/stream-cql.pdf">CQL</a> - aka the Continuous Query Language - is a SQL-based declarative language for expressing queries over data streams and time-varying relations. CQL’s abstract semantics are based on two data types - <strong>streams</strong> and <strong>relations</strong> - and three types of operations - <em>stream-to-relation</em>, <em>relation-to-relation</em> and <em>relation-to-stream</em>. In CQL, a stream is an infinite bag of tuples and a relation is a mapping from time Τ to a finite but unbounded bag of tuples. This special variant of the standard relation is called an <em>instantaneous relation</em> in the context of CQL, because a relation <strong>R</strong> in CQL represents a finite but unbounded bag of elements at a given time instant τ. CQL takes advantage of well-understood relational semantics and keeps the language simple and queries compact by introducing minimal changes to SQL.</p>
<ul>
<li>Window specifications, derived from SQL-99, to transform streams into relations</li>
<li>Three new operators to transform time-varying relations into streams.</li>
</ul>
<h2 id="cql-sample-emfiltering-a-streamem_2">CQL Sample - <em>Filtering A Stream</em> <a class="head_anchor" href="#cql-sample-emfiltering-a-streamem_2">#</a>
</h2>
<pre><code class="prettyprint">SELECT Rstream(*)
FROM PosSpeedStr [Now]
WHERE speed > 65
</code></pre>
<p>CQL uses SQL for relation-to-relation transformations, but relations in CQL are different from relations in SQL. CQL relations vary with time. CQL introduces two new concepts: insert/delete streams, which encode both streams and relations in a unified way, and synopses, which contain state (e.g. a counter or buffer of messages) for an operator.</p>
<p>Another important point is that we can use traditional database relations in CQL queries, which enables things like the stream-relation joins common in real-world applications.</p>
<h1 id="freshet_1">Freshet <a class="head_anchor" href="#freshet_1">#</a>
</h1>
<p>Let’s come back to <a href="https://github.com/milinda/Freshet">Freshet</a>. <a href="https://github.com/milinda/Freshet">Freshet</a> is a first step towards a complete implementation of Kappa Architecture based on <a href="https://cs.uwaterloo.ca/%7Edavid/cs848/stream-cql.pdf">CQL</a> to support continuous queries. Freshet implements a subset of CQL (select, windowing, aggregates) on top of Apache Samza: the RStream and IStream relation-to-stream operators, tuple- and time-based sliding windows to convert streams to relations, and basic relation-to-relation operators for implementing business logic. Following CQL, Freshet uses <em>insert/delete</em> streams to model instantaneous relations.</p>
<p><a href="https://svbtleusercontent.com/bzsmth0xzik1jq.jpg"><img src="https://svbtleusercontent.com/bzsmth0xzik1jq_small.jpg" alt="freshet-arch.jpg"></a></p>
<p>As shown in the figure above, Freshet is built out of five main logical components.</p>
<ul>
<li>
<strong>Query DSL</strong>: Implemented as a Clojure DSL and used to express CQL queries against streams. Queries expressed in the Freshet DSL are compiled into a streaming relational algebra model and then converted into an execution plan that consists of a set of operators written as Samza <a href="http://samza.incubator.apache.org/learn/documentation/0.8/api/overview.html">stream tasks</a> connected together as a DAG via Kafka queues.</li>
<li>
<strong>Query Compiler</strong>: Compiles the SQL model generated from the DSL into an intermediate representation, which can be converted into an execution plan.</li>
<li>
<strong>Execution Planner</strong>: Generates execution plans (Samza jobs connected via input, intermediate and output streams to form a DAG) based on the intermediate representation and the current status of the Freshet cluster.</li>
<li>
<strong>Scheduler</strong>: Does the actual scheduling of Samza jobs.</li>
<li>
<strong>Query Operators</strong>: Samza stream tasks that implement CQL operators like window, select and aggregate, and view-generation operators like rstream and istream. These operators, connected via intermediate streams, perform stream processing according to the query expressed in the Freshet DSL.</li>
</ul>
<h1 id="freshet-dsl_1">Freshet DSL <a class="head_anchor" href="#freshet-dsl_1">#</a>
</h1>
<p>The Freshet Clojure DSL is inspired by the <a href="http://sqlkorma.com/">Korma</a> Clojure DSL for SQL and follows the same style. There are two main constructs in the current Freshet DSL: <strong>defstream</strong> and <strong>select</strong>. These are the two forms I am planning to support in the initial version; other constructs will be added later.</p>
<h2 id="defstream_2">defstream <a class="head_anchor" href="#defstream_2">#</a>
</h2>
<p>Used to define a new stream. Streams defined using <em>defstream</em> represent Kafka topics in the current implementation; new modifiers will be added to <em>defstream</em> in the future to support different input sources. The most important modifier for <em>defstream</em> is <em>stream-fields</em>, which adds a field name/type mapping to the stream definition. Clojure <em>keywords</em> are used to specify field names, and these keywords get converted to strings internally. There are pre-defined keywords for specifying types, like :string and :integer. Below is how we define a Wikipedia activity stream for use in stream queries.</p>
<pre><code class="prettyprint lang-clojure">(defstream wikipedia-activity
(stream-fields [:title :string
:user :string
:diff-bytes :integer
:diff-url :string
:unparsed-flags :string
:summary :string
:is-minor :boolean
:is-unpatrolled :boolean
:is-special :boolean
:is-talk :boolean
:is-new :boolean
:is-bot-edit :boolean
:timestamp :long])
(ts :timestamp))
</code></pre>
<h2 id="select_2">select <a class="head_anchor" href="#select_2">#</a>
</h2>
<p>Used to define select queries over streams. Stream filtering using <em>where</em> and <em>aggregators</em> will be supported in the initial version, and <em>joins</em> will be added next. Below is a sample select query that filters a stream.</p>
<pre><code class="prettyprint lang-clojure">(select wikipedia-activity
(modifiers :istream)
(window (unbounded))
(where (> :diff-bytes 100)))
</code></pre>
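<p>Aggregations are intended to follow the same shape. A sketch of what an aggregate query could look like (the <code class="prettyprint">aggregate</code>, <code class="prettyprint">range</code> and <code class="prettyprint">group-by</code> forms here are assumptions about the planned API, not released constructs):</p>
<pre><code class="prettyprint lang-clojure">;; Count edits per user over a 60-second sliding window,
;; emitting the updated counts as an insert stream.
(select wikipedia-activity
        (modifiers :istream)
        (window (range 60))
        (aggregate (count :title))
        (group-by :user))
</code></pre>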
<h1 id="query-execution_1">Query Execution <a class="head_anchor" href="#query-execution_1">#</a>
</h1>
<p>Freshet follows the same execution semantics as CQL; the flow of execution is shown below.</p>
<p><a href="https://svbtleusercontent.com/csdnd7s46rbvq.jpg"><img src="https://svbtleusercontent.com/csdnd7s46rbvq_small.jpg" alt="freshet-query-execution.jpg"></a></p>
<p>The window operator converts the input stream into a <em>time-varying relation</em>, and the relation is encoded as an <em>insert/delete</em> stream to make it easy to implement relational operators as streaming operators. The relational part of the query is converted into a DAG of Samza operators that operate on insert/delete streams according to the query definition. Finally, the stream materializer materializes the output stream according to the specification of the original query.</p>
<h1 id="why-samza_1">Why Samza <a class="head_anchor" href="#why-samza_1">#</a>
</h1>
<p>Freshet chose Samza for its initial implementation mainly because:</p>
<ul>
<li>Samza is fully integrated with Kafka</li>
<li>Samza supports and encourages <a href="http://samza.incubator.apache.org/learn/documentation/0.8/container/state-management.html">stateful stream processing</a>
</li>
<li>Samza’s local storage is really useful for implementing CQL synopses</li>
</ul>
<p>Samza’s property-based job configuration is its only real limitation from Freshet’s perspective; a Storm-like topology builder would come in handy for layers like Freshet built on top of Samza.</p>
<h1 id="current-status-and-future-work_1">Current Status and Future Work <a class="head_anchor" href="#current-status-and-future-work_1">#</a>
</h1>
<p>I am currently working on bridging the Clojure DSL and the CQL operator layer (Samza stream tasks). I plan to do the initial release within a couple of weeks. After the initial release of Freshet, I am planning to contribute to Apache Samza’s <a href="https://issues.apache.org/jira/browse/SAMZA-390">Stream Query implementation</a>, which is also based on CQL. Once that is finished, Freshet can be updated to use Apache Samza’s CQL operators directly, rather than having its own.</p>
tag:milinda.svbtle.com,2014:Post/good-reads-october-9th-20142014-10-09T16:46:48-07:002014-10-09T16:46:48-07:00Good Reads: October 9th 2014<ul>
<li> <a href="http://www.mmds.org/#ver21">Mining of Massive Datasets</a> - A very interesting book covering many topics in Big Data, including Map-Reduce, Recommendation Systems, Mining Social-Network Graphs, Dimensionality Reduction and Large-Scale Machine Learning. I am only in the 2nd chapter, but have already found a lot of interesting material related to Map-Reduce, such as modeling relational algebra using Map-Reduce. Those who are interested in large-scale data mining can also follow the free online <a href="https://www.coursera.org/course/mmds">course</a> from the authors of this book.</li>
<li> <a href="https://colah.github.io/posts/2014-10-Visualizing-MNIST/">Visualizing MNIST: An Exploration of Dimensionality Reduction</a> - Another really interesting post, this one about dimensionality reduction in the context of machine learning. The post is well written; even someone unfamiliar with machine learning, deep learning and dimensionality reduction can read it and understand the underlying concepts with the help of the awesome visualizations you will find there.</li>
<li>
<a href="http://cs.ulb.ac.be/public/_media/teaching/infoh417/sql2alg_eng.pdf">Translating SQL into the Relational Algebra</a> - Covers the basic concepts of translating SQL into relational algebra. If you are new to the topic, this document contains brief descriptions, examples and corner cases for most of the translation tasks, such as translating subqueries and joins.</li>
</ul>
tag:milinda.svbtle.com,2014:Post/good-reads-september-30th-20142014-09-30T07:05:43-07:002014-09-30T07:05:43-07:00Good Reads: September 30th 2014<ul>
<li>
<a href="http://www.bailis.org/blog/linearizability-versus-serializability/">Linearizability versus Serializability</a> - Clarifies the differences between <em>linearizability</em> and <em>serializability</em>, two important properties about interleavings of operations in databases and distributed systems. </li>
<li>
<a href="http://muratbuffalo.blogspot.com/2014/09/paper-summary-high-availability.html">Paper Summary: High-availability distributed logging with BookKeeper</a> - Distributed logging with high availability, where many distributed readers are interested in reading the logs.</li>
<li>
<a href="https://www.youtube.com/watch?v=fU9hR3kiOK0">Turning the database inside out with Apache Samza</a> - A different way of thinking about databases and how we develop database-backed applications. Proposes applying the <em>stream</em> abstraction everywhere, from the database to backend web services to the UI.</li>
</ul>
tag:milinda.svbtle.com,2014:Post/academic-writing-with-markdown-pandoc-and-emacs2014-09-25T17:50:03-07:002014-09-25T17:50:03-07:00Academic Writing With Markdown, Pandoc and Emacs<p>LaTeX is the de facto standard for academic writing, and there are several editors available for it. But the problem with these editors, and with editing LaTeX in any general-purpose editor like Emacs, is that LaTeX is not writer-friendly: LaTeX commands dominate your document and can be distracting most of the time. </p>
<p>What we want is a <em>writer-friendly</em> LaTeX editor as described <a href="http://blog.johnjcamilleri.com/2012/02/latex-editor-manifesto/">here</a><sup>1</sup>. But we don’t have an editor with the features described in [1]. One alternative I found<sup>2</sup> is to use a combination of <a href="http://johnmacfarlane.net/pandoc/demo/example9/pandocs-markdown.html">Markdown</a>, <a href="http://johnmacfarlane.net/pandoc/">Pandoc</a> and LaTeX.</p>
<p>The basic process is as follows:</p>
<ol>
<li>Create LaTeX <a href="https://gist.github.com/milinda/96eef25b4a6e4a432b55">header</a> and <a href="https://gist.github.com/milinda/1f25ffbbd7f881d242cf">footer</a> files, where the header includes everything up to the abstract and the footer includes the bibliography and the document end tag. Any package imports or new command definitions can go in the header.</li>
<li>Write the main content in <a href="http://johnmacfarlane.net/pandoc/demo/example9/pandocs-markdown.html">Pandoc Markdown</a>.</li>
<li>Convert Markdown file to LaTeX using <a href="http://johnmacfarlane.net/pandoc/">Pandoc</a>.</li>
<li>Append the generated LaTeX file and the footer file to the header, then use your preferred LaTeX-to-PDF converter.</li>
</ol>
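<p>The four steps boil down to a few shell commands. The file names below are placeholders, and the snippet assumes the header, footer and Markdown files from steps 1 and 2 already exist:</p>
<pre><code class="prettyprint">#!/bin/sh
# Step 3: convert the Markdown body to a LaTeX fragment.
pandoc paper.md -t latex -o body.tex

# Step 4: stitch header, body and footer together, then build the PDF.
cat header.tex body.tex footer.tex > paper.tex
pdflatex paper.tex
</code></pre>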
<p><a href="https://gist.github.com/milinda/d9272e67711d324f18ba">md2latex.sh</a> is a very simple script that automates the above process. It is based on the script from the original article [2]. </p>
<p>You can use <a href="http://www.iawriter.com/mac/">iA Writer</a>, <a href="http://www.hogbaysoftware.com/products/writeroom">WriteRoom</a>, <a href="http://writemonkey.com">WriteMonkey</a>, <a href="http://writeapp.net">Write App</a> or Emacs with <em>writeroom-mode</em> for distraction-free Markdown editing. </p>
<p>This is how <em>writeroom-mode</em> looks in Emacs.</p>
<p><a href="https://svbtleusercontent.com/x78m3nw7qo6ubg.jpg"><img src="https://svbtleusercontent.com/x78m3nw7qo6ubg_small.jpg" alt="writeroom-mode.jpg"></a></p>
<p><small>[2] - I couldn’t find the reference link to the original idea. Will post it as soon as I find it.</small></p>