Feature Engineering tool
First, build the Model Matrix CLI distribution file:
sbt> universal:packageBin
The CLI is packaged as a zip file, which should be available at:
modelmatrix/modelmatrix-cli/target/universal/
To list all available commands, run bin/modelmatrix-cli
without arguments:
modelmatrix-cli-0.0.2/bin/modelmatrix-cli
To use custom database settings, update the application.conf file
located at modelmatrix-cli-0.0.2/conf/application.conf
## Provide database configuration for modelmatrix catalog db
# modelmatrix {
# catalog.db {
# url = "jdbc:postgresql://postgrestest001/modelmatrix?user=ezhulenev"
# driver = org.postgresql.Driver
# keepAliveConnection = true
# }
# }
bin/modelmatrix-cli definition list [--name <name>]
bin/modelmatrix-cli definition view features --definition-id <model-definition-id>
bin/modelmatrix-cli definition view source --definition-id <model-definition-id>
bin/modelmatrix-cli definition validate --config <model-config>
bin/modelmatrix-cli definition add [options] --config <model-config>
Option | Example | Description |
---|---|---|
n, name | -n "v123" | matrix model definition name |
c, comment | -c "testing new model" | matrix model definition comment |
f, features | -f "features" | features definition property in config |
# Validate config
bin/modelmatrix-cli definition validate --config ./model-matrix-v123.conf
# Create new definition if it's valid
bin/modelmatrix-cli definition add \
--name "v123" \
--comment "testing" \
--config ./model-matrix-v123.conf
# Check that it appears in list of all definitions
bin/modelmatrix-cli definition list
# Find it by name
bin/modelmatrix-cli definition list --name v123
# View feature definitions
bin/modelmatrix-cli definition view features --definition-id <model-definition-id>
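The validate-then-add sequence above can be wrapped in a small shell function so that an invalid config never reaches `definition add`. This is a minimal sketch, assuming the CLI exits with a non-zero status when validation fails (not confirmed here). `add_definition_if_valid` and the `MM_CLI` variable are hypothetical names introduced for illustration.

```shell
# Path to the CLI launcher; MM_CLI is a hypothetical override, not a CLI feature.
MM_CLI="${MM_CLI:-bin/modelmatrix-cli}"

# add_definition_if_valid NAME COMMENT CONFIG
# Runs 'definition validate' first and only calls 'definition add' on success.
# Assumption: the CLI signals a failed validation with a non-zero exit status.
add_definition_if_valid() {
  name="$1"; comment="$2"; config="$3"
  if "$MM_CLI" definition validate --config "$config"; then
    "$MM_CLI" definition add --name "$name" --comment "$comment" --config "$config"
  else
    echo "validation failed for $config, definition not added" >&2
    return 1
  fi
}
```

With the function loaded, `add_definition_if_valid v123 testing ./model-matrix-v123.conf` would perform both steps from the example in one call.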
bin/modelmatrix-cli instance list [--name <name>] [--definition <model-definition-id>]
bin/modelmatrix-cli instance view features --instance-id <model-instance-id>
bin/modelmatrix-cli instance view columns [options] --instance-id <model-instance-id>
Option | Example | Description |
---|---|---|
f, feature | -f ad_position | filter by feature name |
g, group | -g geographic | filter by feature group |
Check that the input data is compatible with the model definition and can be used to build a model instance:
bin/modelmatrix-cli instance validate --definition-id <model-definition-id> --source <input-source>
bin/modelmatrix-cli instance create [options] --definition-id <model-definition-id> --source <input-source>
Option | Example | Description |
---|---|---|
n, name | -n "v123" | matrix model instance name |
c, comment | -c "test on new data" | matrix model instance comment |
concurrency | --concurrency 10 | number of concurrent transformations (Spark jobs) |
repartition | --repartition 1000 | repartition input source |
cache | --cache true | cache input source |
# Validate Hive table source
bin/modelmatrix-cli instance validate \
--definition-id 1 \
--source hive://mm.clicks_2015_05_05
# Create new instance if it's valid
bin/modelmatrix-cli instance create \
--name "2015-05-05" \
--comment "latest clicks model" \
--concurrency 10 \
--cache true \
--definition-id 1 \
--source hive://mm.clicks_2015_05_05
# Check that it appears in list of all instances
bin/modelmatrix-cli instance list
# Find it by name
bin/modelmatrix-cli instance list --name 2015-05
# View features
bin/modelmatrix-cli instance view features \
--instance-id <id> # id returned from 'create' command
# View columns
bin/modelmatrix-cli instance view columns \
--instance-id <id> # id returned from 'create' command
# View only geographic columns
bin/modelmatrix-cli instance view columns \
--group geographic \
--instance-id <id> # id returned from 'create' command
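The example above names the instance after a date and reads from a table that carries the same date in its name. A small helper can derive the source URI from a date. This is only a sketch: it assumes the `<table>_YYYY_MM_DD` naming pattern seen in the example applies to your tables, and `hive_source_for_date` is a hypothetical helper, not part of the CLI.

```shell
# hive_source_for_date DATABASE TABLE_PREFIX DATE
# Turns a date such as 2015-05-05 into hive://DATABASE.TABLE_PREFIX_2015_05_05.
hive_source_for_date() {
  suffix="$(printf '%s' "$3" | tr '-' '_')"
  printf 'hive://%s.%s_%s\n' "$1" "$2" "$suffix"
}

hive_source_for_date mm clicks 2015-05-05
# prints hive://mm.clicks_2015_05_05
```

The result can then be passed as the `--source` value of `instance validate` and `instance create`, with the same date used for `--name`.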
Check that the input data is compatible with the model instance and can be "featurized":
bin/modelmatrix-cli featurize validate --instance-id <model-instance-id> --source <input-source>
bin/modelmatrix-cli featurize sparse \
--instance-id <model-instance-id> \
--source <input-source> \
--target <output-sink> \
--id-column <id-column>
<id-column> - the name of the column that will be used as the row id
# Validate Hive table source
bin/modelmatrix-cli featurize validate \
--instance-id 1 \
--source hive://mm.clicks_2015_05_05
# Create sparse feature matrix in Hive
bin/modelmatrix-cli featurize sparse \
--instance-id 1 \
--source hive://mm.clicks_2015_05_05 \
--target hive://mm.clicks_features_2015_05_05 \
--id-column auction_id
Hive tables and Parquet files (on HDFS or the local file system) can be used as a source or sink for the Model Matrix CLI:
bin/modelmatrix-cli ... hive://mm.clicks_2015_05_05
bin/modelmatrix-cli ... parquet://file:///Users/mm/clicks_2015_05_05.parquet
bin/modelmatrix-cli ... parquet://hdfs:///Users/mm/clicks_2015_05_05.parquet
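The three URI shapes above can be told apart by their prefix. The following sketch classifies a source or sink string before it is passed to the CLI, which may help catch typos early. `source_kind` is a hypothetical helper that only knows the forms listed above, so the CLI itself may accept URIs that this check rejects.

```shell
# source_kind URI
# Prints "hive" or "parquet" for the URI forms listed above;
# returns 1 (with a message on stderr) for anything else.
source_kind() {
  case "$1" in
    hive://?*)
      echo hive ;;
    parquet://file://?*|parquet://hdfs://?*)
      echo parquet ;;
    *)
      echo "unrecognized source: $1" >&2
      return 1 ;;
  esac
}

source_kind hive://mm.clicks_2015_05_05                           # prints hive
source_kind parquet://hdfs:///Users/mm/clicks_2015_05_05.parquet  # prints parquet
```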