Apache SystemDS on Google Colab
Copyright © 2020 The Apache Software Foundation.
#-------------------------------------------------------------
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
#
#-------------------------------------------------------------
Run this notebook online at Google Colab ↗.
This Jupyter/Colab-based tutorial interactively walks through development setup and running SystemDS in both
A. standalone mode and B. with Apache Spark.
Flow of the notebook:
- Download and install the dependencies
- Go to section A or B
def run(command):
    print('>> {}'.format(command))
    !{command}
    print('')
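A quick usage example of the helper (the echo command is purely illustrative):
run('echo hello from the run helper')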
Install Java
Let us install OpenJDK 8. More about OpenJDK ↗.
!apt-get update
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
# Run the command below to make OpenJDK 8 the active java installation
!update-alternatives --set java /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
!java -version
Install Apache Maven
SystemDS uses Apache Maven to build and manage the project. More about Apache Maven ↗.
Maven builds SystemDS using its project object model (POM) and a set of plugins. You will find pom.xml at the root of
the codebase!
maven_version = 'apache-maven-3.6.3'
maven_path = f"/opt/{maven_version}"
if not os.path.exists(maven_path):
    run(f"wget -q -nc -O apache-maven.zip https://downloads.apache.org/maven/maven-3/3.6.3/binaries/{maven_version}-bin.zip")
    run('unzip -q -d /opt apache-maven.zip')
    run('rm -f apache-maven.zip')

# Let's use the absolute path instead of the $PATH environment variable.
def maven(args):
    run(f"{maven_path}/bin/mvn {args}")

maven('-v')
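If the full build later turns out to be slow on Colab, Maven's standard -DskipTests flag can skip the unit tests (shown commented out, as an optional variant):
# maven('clean package -q -DskipTests')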
Install Apache Spark
NOTE: If the Spark download fails, make sure the version we are trying to download is still officially available at https://spark.apache.org/downloads.html
spark_version = 'spark-2.4.7'
hadoop_version = 'hadoop2.7'
spark_path = f"/opt/{spark_version}-bin-{hadoop_version}"
if not os.path.exists(spark_path):
    run(f"wget -q -nc -O apache-spark.tgz https://downloads.apache.org/spark/{spark_version}/{spark_version}-bin-{hadoop_version}.tgz")
    run('tar zxvf apache-spark.tgz -C /opt')
    run('rm -f apache-spark.tgz')

os.environ["SPARK_HOME"] = spark_path
# os.environ values are not shell-expanded, so append the resolved path directly
os.environ["PATH"] += f":{spark_path}/bin"
Get Apache SystemDS
Apache SystemDS development happens on GitHub at apache/systemds ↗
!git clone https://github.com/apache/systemds systemds --depth=1
%cd systemds
# Option 1: Build only the java codebase
maven('clean package -q')
# Option 2: For building along with python distribution
# maven('clean package -P distribution')
# !export only affects a transient subshell, so set the variables from Python instead
os.environ["SYSTEMDS_ROOT"] = os.getcwd()
os.environ["PATH"] = f"{os.getcwd()}/bin:" + os.environ["PATH"]
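A quick check that the build produced the artifacts this notebook relies on later (the uber-jar and the launcher script):
!ls -lh target/SystemDS.jar
!ls bin/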
2. Download Haberman data
Data source: https://archive.ics.uci.edu/ml/datasets/Haberman's+Survival
About: The survival of patients who had undergone surgery for breast cancer.
Data Attributes:
- Age of patient at time of operation (numerical)
- Patient's year of operation (year - 1900, numerical)
- Number of positive axillary nodes detected (numerical)
- Survival status (class attribute)
- 1 = the patient survived 5 years or longer
- 2 = the patient died within 5 years
!mkdir -p ../data
!wget -P ../data/ https://web.archive.org/web/20200725014530/https://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data
# Notice that the data is plain CSV with no headers!
!sed -n 1,10p ../data/haberman.data
!echo '{"rows": 306, "cols": 4, "format": "csv"}' > ../data/haberman.data.mtd
# Generate the type description for the data (1 = scale/numerical, 2 = categorical)
!echo '1,1,1,2' > ../data/types.csv
!echo '{"rows": 1, "cols": 4, "format": "csv"}' > ../data/types.csv.mtd
!ls
!ls scripts/algorithms
# Print the algorithm documentation
# (lines 22 to 35 of the script describe the expected inputs and parameters)
!sed -n 22,35p ./scripts/algorithms/Univar-Stats.dml
!./bin/systemds ./scripts/algorithms/Univar-Stats.dml -nvargs X=../data/haberman.data TYPES=../data/types.csv STATS=../data/univarOut.mtx CONSOLE_OUTPUT=TRUE
!sed -n 1,10p ../data/univarOut.mtx
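The statistics land in a plain-text IJV layout (row index, column index, value), so ordinary shell tools can slice them; for example, to pull only the entries for the fourth input column (an illustrative awk filter, assuming that layout):
!awk '$2 == 4' ../data/univarOut.mtx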
%%writefile ../test.dml
# This code acts as a playground for dml code
X = rand(rows = 20, cols = 10)
y = X %*% rand(rows = ncol(X), cols = 1)
w = lm(X = X, y = y)
print(toString(w))
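Before moving to Spark, the same script can also be run in standalone mode with the launcher built earlier (assuming the working directory is still the cloned repo):
!./bin/systemds ../test.dml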
Submit the dml script to Spark with spark-submit. More about Spark Submit ↗
!$SPARK_HOME/bin/spark-submit \
./target/SystemDS.jar -f ../test.dml
# !$SPARK_HOME/bin/spark-submit ./target/SystemDS.jar -f ./scripts/nn/examples/fm-binclass-dummy-data.dml
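spark-submit also accepts the standard Spark resource options; the values below are only illustrative:
# !$SPARK_HOME/bin/spark-submit --master local[*] --driver-memory 4g ./target/SystemDS.jar -f ../test.dml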