22 April, 2015

Started using Spark + Scala this week. Very impressive!

As the data for my dissertation is growing into real "big data" territory (several GB), I was looking for new tools beyond my trusted relational databases (PostgreSQL, MonetDB, etc.).

I found Apache Spark, which provides Python, Java, and Scala APIs to define queries on big data files. The files are served via Hadoop (delivered with Spark) to parallelize operations on the data.
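For a first impression, here is a minimal sketch of what such a query looks like in the Scala API. The file path, master URL, and application name are made up for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CsvStats {
  def main(args: Array[String]): Unit = {
    // connect to a (hypothetical) standalone cluster master
    val conf = new SparkConf()
      .setAppName("CsvStats")
      .setMaster("spark://spark-master.example.com:7077")
    val sc = new SparkContext(conf)

    // load a CSV file served via Hadoop/HDFS and split each line into columns
    val rows = sc.textFile("hdfs://spark-master.example.com/data/series.csv")
                 .map(_.split(","))

    // operations like count run in parallel on all workers
    println("records: " + rows.count)
    sc.stop()
  }
}
```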

Starting a Spark cluster is very easy once you have configured the master correctly. There are some pitfalls, as Spark is very picky about hostnames, i.e., you should always use the fully qualified hostname, with the correct domain, in all start scripts, config files, and your application code. I won't go into the details here.
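To illustrate what "full hostnames everywhere" means in practice, here is a sketch with hypothetical host names, using the same fully qualified name in the master config, the worker start command, and the application code:

```shell
# conf/spark-env.sh (Spark 1.x calls this SPARK_MASTER_IP)
SPARK_MASTER_IP=spark-master.example.com

# start a worker against the master, using the same fully qualified URL
./bin/spark-class org.apache.spark.deploy.worker.Worker \
    spark://spark-master.example.com:7077

# and again in the application code:
#   new SparkConf().setMaster("spark://spark-master.example.com:7077")
```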

The performance of Spark is really good. It can run an M4 query on 1×10M records (200MB) in 900ms, and easily handles large data volumes, e.g., 100×1M records (2GB, 8s) or 10k×100k records (20GB, 13min). Very nice for analytical workloads on big data sources. During query execution, Spark effectively uses all 8 cores of my MacBook, and I plan to run my tests on a really big cluster next, to get "near-interactive" response times.

Spark is nice, but what actually motivated this post was to praise Scala. As a big fan of CoffeeScript, I like short (but readable) notation instead of the useless repetition of names and keywords required in many legacy programming languages.

Scala has everything that makes a programmer's life easier. Here are my favorite features:
  • Implicit variable declarations (val obj = MyType())
  • Short notation for finals (val for final values, var for variables)
  • Lambda expressions (definition of short inline, anonymous functions)
  • List comprehension (returning loop results as lists)
  • Easily passing functions as objects (as in Javascript)
  • Parentheses-free method calls (obj.someFunc instead of obj.someFunc())
  • Everything is an expression (no return required)
  • Short function keyword (def or => instead of function)
Awesome, I can have all these features and still get the bonus of type-safety! The code-completion in Scala IDE works quite nicely.
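To illustrate, here is a small made-up example that combines most of these features in plain Scala (no Spark required):

```scala
object ScalaFeatures extends App {
  // implicit (inferred) declaration; val makes it a final value
  val nums = List(1, 2, 3, 4)

  // a lambda expression, stored in a value and passed around like an object
  val square = (x: Int) => x * x

  // list comprehension: for/yield returns the loop results as a list
  val squares = for (n <- nums) yield square(n)

  // everything is an expression: if/else yields a value, no return required
  val label = if (squares.sum > 10) "big" else "small"

  // short def keyword and string interpolation
  def describe = s"squares: $squares ($label)"

  // parentheses-free method call (describe instead of describe())
  println(describe)
}
```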

Here are a few Scala code examples, implementing the subqueries of my visualization-driven data aggregation (VDDA).

Example 1: M4 grouping function.
    val Q_g  = Q_f.keyBy( row =>
      ( Math.floor( w*(row(iT) - t1)/dt ) + row(iID) * w ).toLong )

Example 2: M4 aggregation.
    def aggM4Rows ...
    def toRows4 ...
    val A_m4 = => (k, toRows4(row)) }).reduceByKey(aggM4Rows)

Example 3: Counting the number of unique records.
    val recordCount = Q_m4.distinct.count  

Using Spark's Scala API makes these queries easy to define and to read, so that my Spark/Scala implementation of M4/VDDA is not much longer than the SQL queries in my research papers.

Spark + Scala = Big Data processing made easy!

Use rsync instead of scp to resume copying big data files!

For my dissertation I am conducting experiments on big data sources, such as 10k time series with 100k+ records each. The corresponding files comprise several gigabytes of data. Copying such files may take very long, since I work from a remote location, not sitting next to the data centers where the data is to be processed. Therefore, I need to be able to resume big data file uploads to the machines of the data centers.

I usually use scp to copy files between machines (destination shortened to a generic placeholder here):
scp data/*.csv myserver:data/
Unfortunately, scp can't resume a previous file transfer. However, you can use rsync over ssh instead:
rsync --rsh='ssh' -av --progress --partial data/*.csv myserver:data/
If you cancel the upload, e.g., via CTRL+C, you can later resume it by re-running the same command; the --partial option tells rsync to keep partially transferred files and reuse them on the next run.

Very simple. No GUI tools required. Ready for automation.

25 March, 2015

Readable and Fast Math in SQL

For my dissertation, I write a lot of SQL queries, doing some Math on the data. For instance, the following query computes the relative times from a numeric timestamp t, and scales the result up by 10000.

-- query 1
with Q    as (select t,v from csv_upload),
     Q_b  as (select min(t) as t_min, max(t) as t_max from Q)
select 10000 * (t - (select t_min from Q_b))
             / (select t_max - t_min from Q_b) as tr from Q

As you can see, I use CTEs to keep my code readable ;-). However, the select statements in the final subqueries, which extract scalar values from the computed single-record relations, impair the readability of the actual math that is to be computed.

That is why modern SQL databases allow a nested child subquery to use columns of its parent query (a correlated subquery). The following query computes the same result.

-- query 2
with Q    as (select * from csv_upload),
     Q_b  as (select min(t) as t_min, max(t) as t_max from Q)
select (select 10000 * (t     - t_min)
                     / (t_max - t_min) from Q_b) as tr from Q

Finally, another, and perhaps the best, way of writing such queries is the following.

-- query 3
with Q    as (select * from csv_upload),
     Q_b  as (select min(t) as t_min, max(t) as t_max from Q)
select 10000 * (t     - t_min)
             / (t_max - t_min) as tr from Q,Q_b

Even though all three queries are very similar, and yield the same result, I saw notable differences in query execution time. In general, query 2 was a bit slower, and query 3 was a bit faster than the others.

For my queries, using correlated columns improves readability but decreases performance. If you have computed relations with one record, such as the boundary subquery Q_b, it is safe (and fast) to simply join these relations with your data, as in query 3.

11 March, 2015

A Case for CoffeeScript: Object Composition

I have been using CoffeeScript for over four years now (since 2011) and will never go back.¹
Here is a snippet that may tell you why. It uses several basic features of CoffeeScript that make the code more readable and much shorter than the vanilla JavaScript version of the same code (shown below for comparison).

# Use CoffeeScript and stay DRY! (Don't repeat yourself)
# For instance, by using short notation {a,b,c,...}
# for object composition from variables.
#
# Here is a complete example, using the notation
# to reduce the number of lines of code (LoC)
# of an artificial object creation framework:

$f = framework = do ->
  count = 0
  createProp:  (name,n) -> "This is #{name} no. #{n}"
  enhanceProp: (prop)   -> "#{prop}, enhanced!"
  createAbcObject: ->
    # 1. basic variable setup
    a = $f.createProp "a",count
    b = $f.createProp "b",count
    c = $f.createProp "c",count

    # 2. more fiddling with the variables ...
    if count == 0 then a = $f.enhanceProp a
    count++

    # 3. finally compose and return the a-b-c object
    {a,b,c}

abc1 = $f.createAbcObject()
abc2 = $f.createAbcObject()
abc3 = $f.createAbcObject()

# You can also use it for DRY logging
# to avoid quoting var names
console.log "objects created", {abc1,abc2,abc3}

# OMG! Over 50% LoC saved. Even with all these
# comments, CoffeeScript is still shorter and more
# readable than the JavaScript version of the code.
#
# Stay DRY! Use CoffeeScript!

For comparison, the compiled vanilla JavaScript version:

var $f, abc1, abc2, abc3, framework;

$f = framework = (function() {
  var count;
  count = 0;
  return {
    createProp: function(name, n) {
      return "This is " + name + " no. " + n;
    },
    enhanceProp: function(prop) {
      return "" + prop + ", enhanced!";
    },
    createAbcObject: function() {
      var a, b, c;
      a = $f.createProp("a", count);
      b = $f.createProp("b", count);
      c = $f.createProp("c", count);
      if (count === 0) {
        a = $f.enhanceProp(a);
      }
      count++;
      return {
        a: a,
        b: b,
        c: c
      };
    }
  };
})();

abc1 = $f.createAbcObject();
abc2 = $f.createAbcObject();
abc3 = $f.createAbcObject();

console.log("objects created", {
  abc1: abc1,
  abc2: abc2,
  abc3: abc3
});

¹ Unless somebody pays me enough money to waste my time using vanilla JS ;-).

16 February, 2015

Showing the progress of awk scripts

When running awk scripts on big data files, you may want to know how long the process will take. Here is a simple script that outputs the fraction of the data that has been processed, and an estimate of when the processing will finish:
    BEGIN {
        ecat  = "cat >&2"    # write progress to stderr
        clear = "\r"         # carriage return: overwrite the last line
        start = systime()
        # total line count passed in via: awk -v lines=$(wc -l < file) ...
    }
    {
        if (NR % 1000 == 0) {
            frac    = NR / lines
            elapsed = systime() - start
            eta     = elapsed * (1 - frac) / frac / 60   # remaining minutes
            printf("%s%f%% (ETA: %i minutes)", clear, frac*100, eta) | ecat
        }
    }
The script prints a carriage return to reset the last printed line, so that the fraction and ETA values always stay on the same line in your shell. It writes to stderr and thus does not interfere with the data output on stdout. Example output: 7.061% (ETA: 4 minutes)

26 November, 2014

Switchable inline comments for LaTeX/LyX documents.

For communication with my co-authors, I sometimes use inline comments, i.e., additional highlighted paragraphs within the text of my PDF documents, exported from LyX/LaTeX. I know, I could also use PDF comments, but I like the inline style better. Here is what it looks like:

To create these comments, I use a redefinition of LyX's greyedout notes, turning them into an \fcolorbox (see code below). For exporting a printable/camera-ready PDF, I need to turn off the comments. It took me some time to figure out how to tell LaTeX to completely ignore the comment's body text. I use the environ package for that. Here is the complete code (LaTeX preamble):


% remove old lyxgreyedout notes
\let\lyxgreyedout\undefined \let\endlyxgreyedout\undefined

% redefine lyxgreyedout notes (requires the xcolor package for the colors)
\newsavebox{\greyoutbox}
\newenvironment{lyxgreyedout}
  {\noindent %
   \begin{lrbox}{\greyoutbox}\begin{minipage}{0.95\columnwidth}}
  {\end{minipage}\end{lrbox}\fcolorbox{blue}{yellow!20}{\usebox{\greyoutbox}}\par}

% remove notes for printing: rename the above env and uncomment the next env
\usepackage{environ}
%\NewEnviron{lyxgreyedout}{}% this will ignore all the body content in my greyedouts
And here is the rest of my LaTeX preamble; just for reference:
% I sometimes need a little more space
\newcommand{\vs}{ \textbf{\vspace{2pt}}}

% small fancy links; clickable but still readable (without the http protocol string)

% setting pretolerance to 4500 will remove most overflowing lines, esp.
% for two-column documents. In the final version of a paper, I lower this setting
% and hand-tune the overflows using hyphenation hints
\pretolerance = 4500

% more penalties for misplaced paragraphs,
% (usually not required for academic paper templates)
%\clubpenalty = 10000
%\widowpenalty = 10000
%\displaywidowpenalty = 10000

25 November, 2014

Linux at home + Mac at work = Goodbye Windows!

At home I have had the luxury, freedom, and fun of using a desktop Linux for nearly 10 years now, and I have been quite happy with my fast and responsive Ubuntu/GNOME desktop since 2006. When I started my professional career in 2008, I was shocked and sad to see only Windows machines; even on most servers! I think server-side Windows was really popular in Germany at that time, even though the rest of the world was already moving forward to more OSS and to Linux servers.

Today, over six years later, the situation in my company and in Germany has improved. Thanks to acquisitions, the Linux and Mac crowd has been growing inside my company, and the religious devotion to Windows (in Germany) has waned. At my company, there is now a significant and growing number of Mac users, and Linux servers are everywhere! Many people remotely develop on Linux desktops now, even if they still use a Windows machine as their access point.

I am really happy to have Unix-like operating systems on all my devices now. No more fighting with missing features and cross platform software that behaves awkwardly under Windows. No more Cygwin, crappy XServer substitutes, or having to use MSBuild instead of make.

One of the most annoying things about Windows is the regular freezes that are often impossible to explain. I had them 1-2 times a week, and even more often as the device/OS grew older. In most cases I just had to wait 2-10 minutes until the OS was responsive again; sometimes I had to hard-reset my machine. Let's see if my new MacBook is better in this regard. I never had such issues with my Linux desktops, or at least there was a way to kill the culprit, find an explanation, and eventually set up a mitigation for the problem. On Windows (and Mac) it is usually much harder or impossible to solve such problems.

I am a first-time Mac user now and will probably run into several issues. Stay tuned for regular reports on my progress.

Update: My new MacBook also froze or had to be restarted a few times in the past month; mainly caused by issues when logging into the corporate network, but also by software running wild. Regarding stability, I have to say that my new Mac is only slightly better than a fresh Windows machine. Next time, I will try to get a real Linux machine!

Update 2: The Mac has been running stable for months now.