Apache SystemDS on Google Colab
Copyright © 2020 The Apache Software Foundation.
#-------------------------------------------------------------
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
#
#-------------------------------------------------------------
Run this notebook online at Google Colab ↗.
This Jupyter/Colab-based tutorial interactively walks through development setup and running SystemDS in both
A. standalone mode and B. with Apache Spark.
Flow of the notebook:
- Download and install the dependencies
- Go to section A or B
def run(command):
    print('>> {}'.format(command))
    !{command}
    print('')
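A quick usage example of the helper (the echo command is purely illustrative):
run('echo hello from the run helper')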
Install Java
Let us install OpenJDK 8. More about OpenJDK ↗.
!apt-get update
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
# Run the command below to make OpenJDK 8 the active java installation
!update-alternatives --set java /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
!java -version
Install Apache Maven
SystemDS uses Apache Maven to build and manage the project. More about Apache Maven ↗.
Maven builds SystemDS using its project object model (POM) and a set of plugins. You will find pom.xml at the root of
the codebase!
maven_version = 'apache-maven-3.6.3'
maven_path = f"/opt/{maven_version}"
if not os.path.exists(maven_path):
    run(f"wget -q -nc -O apache-maven.zip https://downloads.apache.org/maven/maven-3/3.6.3/binaries/{maven_version}-bin.zip")
    run('unzip -q -d /opt apache-maven.zip')
    run('rm -f apache-maven.zip')

# Let's use the absolute path instead of the $PATH environment variable.
def maven(args):
    run(f"{maven_path}/bin/mvn {args}")

maven('-v')
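If the full build later turns out to be slow on Colab, Maven's standard -DskipTests flag can skip the unit tests (shown commented out, as an optional variant):
# maven('clean package -q -DskipTests')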
Install Apache Spark
NOTE: If the Spark download fails, make sure the version we are trying to download is still officially available at https://spark.apache.org/downloads.html
spark_version = 'spark-2.4.7'
hadoop_version = 'hadoop2.7'
spark_path = f"/opt/{spark_version}-bin-{hadoop_version}"
if not os.path.exists(spark_path):
    run(f"wget -q -nc -O apache-spark.tgz https://downloads.apache.org/spark/{spark_version}/{spark_version}-bin-{hadoop_version}.tgz")
    run('tar zxvf apache-spark.tgz -C /opt')
    run('rm -f apache-spark.tgz')

os.environ["SPARK_HOME"] = spark_path
# os.environ values are not shell-expanded, so append the resolved path directly
os.environ["PATH"] += f":{spark_path}/bin"
Get Apache SystemDS
Apache SystemDS development happens on GitHub at apache/systemds ↗
!git clone https://github.com/apache/systemds systemds --depth=1
%cd systemds
# Option 1: Build only the java codebase
maven('clean package -q')
# Option 2: For building along with python distribution
# maven('clean package -P distribution')
# !export only affects a transient subshell, so set the variables from Python instead
os.environ["SYSTEMDS_ROOT"] = os.getcwd()
os.environ["PATH"] = f"{os.getcwd()}/bin:" + os.environ["PATH"]
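A quick check that the build produced the artifacts this notebook relies on later (the uber-jar and the launcher script):
!ls -lh target/SystemDS.jar
!ls bin/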
2. Download Haberman data
Data source: https://archive.ics.uci.edu/ml/datasets/Haberman's+Survival
About: The survival of patients who had undergone surgery for breast cancer.
Data Attributes:
- Age of patient at time of operation (numerical)
- Patient's year of operation (year - 1900, numerical)
- Number of positive axillary nodes detected (numerical)
- Survival status (class attribute)
- 1 = the patient survived 5 years or longer
- 2 = the patient died within 5 years
!mkdir -p ../data
!wget -P ../data/ https://web.archive.org/web/20200725014530/https://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data
# Notice that the data is plain CSV with no headers!
!sed -n 1,10p ../data/haberman.data
!echo '{"rows": 306, "cols": 4, "format": "csv"}' > ../data/haberman.data.mtd
# Generate the type description for the data (1 = scale/numerical, 2 = categorical)
!echo '1,1,1,2' > ../data/types.csv
!echo '{"rows": 1, "cols": 4, "format": "csv"}' > ../data/types.csv.mtd
!ls
!ls scripts/algorithms
# Print the algorithm documentation
# (lines 22 to 35 of the script describe the expected inputs and parameters)
!sed -n 22,35p ./scripts/algorithms/Univar-Stats.dml
!./bin/systemds ./scripts/algorithms/Univar-Stats.dml -nvargs X=../data/haberman.data TYPES=../data/types.csv STATS=../data/univarOut.mtx CONSOLE_OUTPUT=TRUE
!sed -n 1,10p ../data/univarOut.mtx
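The statistics land in a plain-text IJV layout (row index, column index, value), so ordinary shell tools can slice them; for example, to pull only the entries for the fourth input column (an illustrative awk filter, assuming that layout):
!awk '$2 == 4' ../data/univarOut.mtx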
%%writefile ../test.dml
# This code acts as a playground for dml code
X = rand(rows = 20, cols = 10)
y = X %*% rand(rows = ncol(X), cols = 1)
w = lm(X = X, y = y)
print(toString(w))
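Before moving to Spark, the same script can also be run in standalone mode with the launcher built earlier (assuming the working directory is still the cloned repo):
!./bin/systemds ../test.dml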
Submit the dml script to Spark with spark-submit. More about Spark Submit ↗
!$SPARK_HOME/bin/spark-submit \
./target/SystemDS.jar -f ../test.dml
# !$SPARK_HOME/bin/spark-submit ./target/SystemDS.jar -f ./scripts/nn/examples/fm-binclass-dummy-data.dml
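spark-submit also accepts the standard Spark resource options; the values below are only illustrative:
# !$SPARK_HOME/bin/spark-submit --master local[*] --driver-memory 4g ./target/SystemDS.jar -f ../test.dml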