Started using Spark + Scala this week. Very impressive!

As the data for my dissertation is growing to become really "big data" (several GB), I was looking for new tools, beyond my trusted relational databases (PostgreSQL, MonetDB, etc.).

Spark

I found Apache Spark, which provides Python, Java, and Scala APIs to define queries on big data files. The files are served via Hadoop (delivered with Spark) to parallelize operations on the data.

Starting a Spark cluster is very easy, once you have configured the master correctly. There are some pitfalls, as Spark is very picky regarding hostnames, i.e., you better always use the full hostname with correct domains in all start scripts, config files and your application code. I won't go into the details here.

The performance of Spark is really good. It can run an M4 query on 1×10M records (200MB) in 900ms, and easily handles large data volumes, e.g. 100×1M records (2GB, 8s) or 10k×100k records (20GB, 13min). Very nice for analytical workloads on big data sources. During query execution, Spark effectively uses all 8 cores of my Macbook and I plan to improve the query response times by running my tests on a really big cluster to provide "near-interactive" response times.

Scala

Spark is nice, but what actually motivated me for this post was to praise Scala. As a big fan of CoffeeScript, I like short (but readable) notations instead of useless repetition of names and keywords, as required in many legacy programming languages.

Scala has everything that makes a programmers life easier. Here are my favorite features:

Implicit variable declarations (val obj = MyType())
Short notation for finals (val for final values, var for variables)
Lambda expressions (definition of short inline, anonymous functions)
List comprehension (returning loop results as lists)
Easily passing functions as objects (as in Javascript)
Implicit function calls (using obj.someFunc instead of obj.someFunc())
Everything is an expression (no return required)
Short function keyword (def or => instead of function)

Awesome, I can have all these features and still get the bonus of type-safety! The code-completion in Scala IDE works quite nicely.

Here are a few Scala code examples, implementing the subqueries of my visualization-driven data aggregation (VDDA).

Example 1: M4 grouping function.

    val Q_g  = Q_f.keyBy( row =>
      ( Math.floor( w*(row(iT) - t1)/dt ) + row(iID) * w ).toLong
    )

Example 2: M4 aggregation.

    def aggM4Rows ...
    def toRows4 ...
    val A_m4 = Q_g.map({case (k,row) => (k,toRows4(row))}).reduceByKey(aggM4Rows)

Example 3: Counting the number of unique records.

    val recordCount = Q_m4.distinct.count

Using Spark's Scala API makes these queries easy to define and to read, so that my Spark/Scala implementation of M4/VDDA is not much longer than the SQL queries in my research papers.

Spark + Scala = Big Data processing made easy!

Uwe Jugel

Search This Blog