
22 April, 2015

Started using Spark + Scala this week. Very impressive!

As the data for my dissertation is growing into real "big data" (several gigabytes), I was looking for new tools beyond my trusted relational databases (PostgreSQL, MonetDB, etc.).

Spark

I found Apache Spark, which provides Python, Java, and Scala APIs for defining queries on big data files. The files are served via Hadoop (delivered with Spark), so that operations on the data can be parallelized.
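To give an impression of the API, here is a minimal sketch of such a Spark job in Scala (the file path and column layout are made up for illustration):

    import org.apache.spark.{SparkConf, SparkContext}

    object QuickCount {
      def main(args: Array[String]): Unit = {
        // Run locally on all cores for a first test; point to a cluster later.
        val conf = new SparkConf().setAppName("timeseries-test").setMaster("local[*]")
        val sc   = new SparkContext(conf)

        // Hypothetical CSV with rows of (id, timestamp, value).
        val rows = sc.textFile("hdfs://namenode.company.corp/data/series.csv")
                     .map(_.split(',').map(_.toDouble))

        println("record count: " + rows.count)
        sc.stop()
      }
    }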

Starting a Spark cluster is very easy once you have configured the master correctly. There are some pitfalls, as Spark is very picky regarding hostnames, i.e., you'd better always use the fully qualified hostname (with the correct domain) in all start scripts, config files, and your application code. I won't go into the details here.
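Just one illustration: in the application code, point at the master by its fully qualified URL, matching what the start scripts and config files use. A sketch (hostname is a placeholder; 7077 is Spark's default master port):

    import org.apache.spark.{SparkConf, SparkContext}

    // Use the full hostname, exactly as in spark-env.sh and the start scripts.
    val conf = new SparkConf()
      .setAppName("vdda-experiments")
      .setMaster("spark://master.company.corp:7077")
    val sc = new SparkContext(conf)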

The performance of Spark is really good. It runs an M4 query on 1×10M records (200MB) in 900ms, and easily handles larger data volumes, e.g., 100×1M records (2GB, 8s) or 10k×100k records (20GB, 13min). Very nice for analytical workloads on big data sources. During query execution, Spark effectively uses all 8 cores of my MacBook, and I plan to run my tests on a really big cluster next, to push the response times towards "near-interactive".

Scala

Spark is nice, but what actually motivated this post is to praise Scala. As a big fan of CoffeeScript, I like short (but readable) notation instead of the useless repetition of names and keywords required in many legacy programming languages.

Scala has everything that makes a programmer's life easier. Here are my favorite features:
  • Type inference for variable declarations (val obj = MyType())
  • Short notation for finals (val for final values, var for variables)
  • Lambda expressions (definition of short inline, anonymous functions)
  • List comprehensions (for ... yield returns loop results as collections)
  • Easily passing functions as objects (as in JavaScript)
  • Calling parameterless methods without parentheses (obj.someFunc instead of obj.someFunc())
  • Everything is an expression (no return required)
  • Short function keyword (def or => instead of function)
Awesome, I can have all these features and still get the bonus of type-safety! The code-completion in Scala IDE works quite nicely.
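Here is a tiny (made-up) snippet that shows most of these features at once:

    // Type inference, val/var, lambdas, for/yield, functions as values,
    // parameterless method calls, and "everything is an expression".
    case class Point(t: Long, v: Double)

    val points = List(Point(1, 2.5), Point(2, 7.0), Point(3, 4.5))  // type inferred
    val scale  = (p: Point) => p.v * 10                             // lambda as a value
    val scaled = for (p <- points if p.v > 3) yield scale(p)        // comprehension
    def describe(xs: List[Double]) =                                // no return needed
      if (xs.isEmpty) "empty" else xs.mkString(", ")
    println(describe(scaled))                                       // prints "70.0, 45.0"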

Here are a few Scala code examples, implementing the subqueries of my visualization-driven data aggregation (VDDA).
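The snippets assume some context that is defined elsewhere in my code; roughly, it looks like this (all values are placeholders):

    // Assumed context (a sketch): Q_f is the filtered input relation as an RDD of
    // rows, where each row is an Array[Double] with the column indices below.
    // val Q_f: org.apache.spark.rdd.RDD[Array[Double]] = ...  // loaded via sc.textFile
    val iID = 0; val iT = 1; val iV = 2   // column indices: series id, timestamp, value
    val w   = 1000.0                      // width of the target chart in pixels
    val t1  = 0.0; val t2 = 86400.0       // queried time range
    val dt  = t2 - t1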

Example 1: M4 grouping function.
    // Group key: map the row's timestamp to one of w buckets over [t1, t1 + dt),
    // offset by row(iID) * w so that different time series never share a group.
    val Q_g = Q_f.keyBy( row =>
      ( Math.floor( w * (row(iT) - t1) / dt ) + row(iID) * w ).toLong
    )

Example 2: M4 aggregation.
    // Map each keyed row to the M4 aggregate type, then reduce per group key.
    def aggM4Rows ...
    def toRows4 ...
    val A_m4 = Q_g.map({ case (k, row) => (k, toRows4(row)) }).reduceByKey(aggM4Rows)
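The bodies of aggM4Rows and toRows4 are omitted above. As a purely hypothetical sketch (using the row layout assumed earlier), they could look as follows: toRows4 lifts a row to a 4-tuple of candidate rows, and aggM4Rows keeps, per position, the row with the earliest/latest timestamp and the smallest/largest value:

    type Row = Array[Double]

    // Lift a single row to its four M4 candidates (min t, max t, min v, max v).
    def toRows4(r: Row): (Row, Row, Row, Row) = (r, r, r, r)

    // Merge two candidate tuples, keeping the better row per position.
    def aggM4Rows(a: (Row, Row, Row, Row), b: (Row, Row, Row, Row)): (Row, Row, Row, Row) = (
      if (a._1(iT) <= b._1(iT)) a._1 else b._1,  // first: smallest timestamp
      if (a._2(iT) >= b._2(iT)) a._2 else b._2,  // last:  largest timestamp
      if (a._3(iV) <= b._3(iV)) a._3 else b._3,  // min:   smallest value
      if (a._4(iV) >= b._4(iV)) a._4 else b._4   // max:   largest value
    )

With two helpers of this shape, reduceByKey produces the four extremum rows per group in a single parallel pass.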

Example 3: Counting the number of unique records.
    val recordCount = Q_m4.distinct.count  

Using Spark's Scala API makes these queries easy to define and to read, so that my Spark/Scala implementation of M4/VDDA is not much longer than the SQL queries in my research papers.

Spark + Scala = Big Data processing made easy!


Use rsync instead of scp to resume copying big data files!

For my dissertation I am conducting experiments on big data sources, such as 10k time series with 100k+ records each. The corresponding files comprise several gigabytes of data. Copying such files can take a very long time, since I work from a remote location and am not sitting next to the data centers where the data is processed. Therefore, I need to be able to resume big data file uploads to the machines in those data centers.

I usually use scp to copy files between machines:
scp data/*.csv juve@machine.company.corp:/home/juve/data
Unfortunately, scp can't resume a previous file transfer. However, you can use rsync over ssh to get resumable transfers:
rsync --rsh='ssh' -av --progress --partial data/*.csv \
   juve@machine.company.corp:/home/juve/data 
If you cancel the upload, e.g., via CTRL+C, you can later resume it by simply re-running the same command: thanks to the --partial option, rsync keeps partially transferred files and picks up where it left off.

Very simple. No GUI tools required. Ready for automation.