A machine learning package built for humans.

Related tags

aerosolve
Overview

aerosolve

Machine learning for humans.

Build Status Download Download

What is it?

A machine learning library designed from the ground up to be human friendly. It is different from other machine learning libraries in the following ways:

This library is meant to be used with sparse, interpretable features such as those that commonly occur in search (search keywords, filters) or pricing (number of rooms, location, price). It is not as interpretable with problems with very dense non-human interpretable features such as raw pixels or audio samples.

There are a few reasons to focus on interpretability:

  • Your corpus is new and not fully defined and you want more insight into your corpus
  • Having interpretable models lets you iterate quickly. Figure out where the model disagrees most and have insight into what kind of new features are needed.
  • Debugging noisy features. By plotting the feature weights you can discover buggy features or fit them to splines and discover features that are unexpectedly complex (which usually indicates overfitting).
  • You can discover relationships between different variables and your target prediction. e.g. For the Airbnb demand model, plotting graphs of reviews and 3-star reviews is more interpretable than many nested if then else rules.

Graph of reviews and 3-star reviews and feature weight

How to get started?

The artifacts for aerosolve are hosted on bintray. If you use Maven, SBT or Gradle you can just point to bintray as a repository and automatically fetch the artifacts.

Check out the image impression demo where you can learn how to teach the algorithm to paint in the pointillism style of painting. Image Impressionism Demo.

There is also an income prediction demo based on a popular machine learning benchmark. Income Prediction Demo.

Feature Representation

This section dives into the thrift based feature representation.

Features are grouped into logical groups called families of features. The reason for this is so we can express transformations on an entire feature family at once or interact two different families of features together to create a new feature family.

There are three kinds of features per FeatureVector:

  • stringFeatures - this is a map of feature family to binary feature strings. For example "GEO" -> { "San Francisco", "CA", "USA" }
  • floatFeatures - this is a map of feature family to feature name and value. For example "LOC" -> { "Latitude" : 37.75, "Longitude" : -122.43 }
  • denseFeatures - this is a map of feature family to a dense array of floats. Not really used except for the image content analysis code.

Example Representation

Examples are the basic unit of creating training data and scoring. A single example is composed of:

  • context - this is a FeatureVector that occurs once in the example. It could be the features representing a search session for example. e.g. "Keyword" -> "Free parking"
  • example(0..N) - this is a repeated list of FeatureVectors that represent the items being scored. These can correspond to documents in a search session. e.g. "LISTING CITY" -> "San Francisco"

The reasons for having this structure are:

  • having one context for hundreds of items saves a lot of space during RPCs or even on disk
  • you can compute the transforms for the context once, then apply the transformed context repeatedly in conjunction with each item
  • having a list of items allows the use of list based loss functions such as pairwise ranking loss, domination loss etc where we evaluate multiple items at once

Feature Transform language

This section dives into the feature transform language.

Feature transforms are applied with a separate transformer module that is decoupled from the model. This allows the user to break apart transforms or transform data ahead of time of scoring for example. e.g. in an application the items in a corpus may be transformed ahead of time and stored, while the context is not known until runtime. Then at runtime, one can transform the context and combined them with each transformed item to get the final feature vector that is then fed to the models.

Feature transforms allow us to modify FeatureVectors on the fly. This allows engineers to rapidly iterate on feature engineering quickly and in a controlled way.

Here are some examples of feature transforms that are commonly used:

  • List transform. A meta transform that specifies other transforms to be applied
  • Cross transform. Operates only on stringFeatures. Allows interactions between two different string feature families. e.g. "Keyword" cross "LISTING CITY" creates the new feature family "Keyword_x_city" -> "Free parking^San Francisco"
  • Multiscale grid transform Constructs multiple nested grids for 2D coordinates. Useful for modelling geography.

Please see the corresponding unit tests as to what these transforms do, what kind of features they operate on and what kind of config they expect.

Models

This section covers debuggable models.

Although there are several models in the model directory only two are the main debuggable models. The rest are experimental or sub-models that create transforms for the interpretable models.

Linear model. Supports hinge, logistic, epsilon insensitive regression, ranking loss functions. Only operates on stringFeatures. The label for the task is stored in a special feature family and specified by rank_key in the config. See the linear model unit tests on how to set up the models. Note that in conjunction with quantization and crosses you can get incredible amounts of complexity from the "linear" model, so it is not actually your regular linear model but something more complex and can be thought of as a bushy, very wide decision tree with millions of branches.

Spline model. A general additive linear piecewise spline model. The training is done at a higher resolution specified by num_buckets between the min and max of a feature's range. At the end of each iteration we attempt to project the linear piecewise spline into a lower dimensional function such as a polynomial spline with Dirac delta endpoints. If the RMSE of the projection is above threshold, we leave the spline alone in the high resolution piecewise linear mode. This allows us to debug the spline model for features that are buggy or unexpectedly complex (e.g. jumping up and down when we expect some kind of smoothness)

  • Boosted stumps model - small compact model. Not very interpretable but at small sizes useful for feature selection.
  • Decision tree model - in memory only. Mostly used to generate transforms for the linear or spline model.
  • Maxout neural network model. Experimental and mostly used as a comparison baseline.

IDE

If you use intellij, try build first, so that thrift classes is available and to fix the spark compiling error inside intellij, type command+; and click dependency and change related files from test to compile, such as org.apache.spark and org.apache.hadoop:hadoop-common. We keep gradle config as testCompile so that to reduce jar file size.

Support

Hackpad

Dev group

User group

In the wild

Organizations and projects using aerosolve can list themselves here.

Issues
Statistical Machine Intelligence & Learning Engine

Smile Smile (Statistical Machine Intelligence and Learning Engine) is a fast and comprehensive machine learning, NLP, linear algebra, graph, interpola

Haifeng Li 5.3k Sep 15, 2021
A machine learning package built for humans.

aerosolve Machine learning for humans. What is it? A machine learning library designed from the ground up to be human friendly. It is different from o

Airbnb 4.8k Sep 13, 2021
Abstract machine for formal semantics of SIMP (Simple Imperative Language)

SIMP-abstract-machine In 2020/21 I was a Teaching Assistant for the second year module 5CCS2PLD Programming Language Paradigms at King's College Londo

Sten Arthur Laane 26 Mar 2, 2021
An Engine-Agnostic Deep Learning Framework in Java

Deep Java Library (DJL) Overview Deep Java Library (DJL) is an open-source, high-level, engine-agnostic Java framework for deep learning. DJL is desig

Amazon Web Services - Labs 2.1k Sep 16, 2021
Java Statistical Analysis Tool, a Java library for Machine Learning

Java Statistical Analysis Tool JSAT is a library for quickly getting started with Machine Learning problems. It is developed in my free time, and made

null 706 Sep 15, 2021
statistics, data mining and machine learning toolbox

Disambiguation (Italian dictionary) Field of turnips. It is also a place where there is confusion, where tricks and sims are plotted. (Computer scienc

Aurelian Tutuianu 57 Sep 16, 2021
SparkFE is the LLVM-based and high-performance Spark native execution engine which is designed for feature engineering.

Spark has rapidly emerged as the de facto standard for big data processing. However, it is not designed for machine learning which has more and more limitation in AI scenarios. SparkFE rewrite the execution engine in C++ and achieve more than 6x performance improvement for feature extraction. It guarantees the online-offline consistency which makes AI landing much easier. For further details, please refer to SparkFE Documentation.

4Paradigm 67 Jun 10, 2021
On-device wake word detection powered by deep learning.

Porcupine Made in Vancouver, Canada by Picovoice Porcupine is a highly-accurate and lightweight wake word engine. It enables building always-listening

Picovoice 2.5k Sep 15, 2021
Learning Based Java (LBJava)

Learning Based Java LBJava core LBJava examples LBJava maven plugin Compiling the whole package From the root directory run the following command: Jus

CogComp 12 Jun 9, 2019
Now redundant weka mirror. Visit https://github.com/Waikato/weka-trunk for the real deal

weka (mirror) Computing and Mathematical Sciences at the University of Waikato now has an official github organization including a read-only git mirro

Benjamin Petersen 305 Jul 17, 2021
jMonkeyEngine Animation State Machine

jme-anim-state-machine jMonkeyEngine Animation State Machine State machine to make animation states consume from a character controller state. This wa

Niels Drost 4 Jul 5, 2021
Model import deployment framework for retraining models (pytorch, tensorflow,keras) deploying in JVM Micro service environments, mobile devices, iot, and Apache Spark

The Eclipse Deeplearning4J (DL4J) ecosystem is a set of projects intended to support all the needs of a JVM based deep learning application. This mean

Eclipse Foundation 12.2k Sep 9, 2021
Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning

Oryx 2 is a realization of the lambda architecture built on Apache Spark and Apache Kafka, but with specialization for real-time large scale machine l

Oryx Project 1.8k Sep 8, 2021
Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning

Oryx 2 is a realization of the lambda architecture built on Apache Spark and Apache Kafka, but with specialization for real-time large scale machine l

Oryx Project 1.7k Mar 12, 2021
CompreFace is a free and open-source face recognition system from Exadel

CompreFace can be easily integrated into any system without prior machine learning skills. CompreFace provides REST API for face recognition, face verification, face detection, landmark detection, age, and gender recognition and is easily deployed with docker

Exadel 1.3k Sep 19, 2021
Sparkling Water provides H2O functionality inside Spark cluster

Sparkling Water Sparkling Water integrates H2O's fast scalable machine learning engine with Spark. It provides: Utilities to publish Spark data struct

H2O.ai 905 Sep 7, 2021
Java version of LIBLINEAR

This is the Java version of LIBLINEAR. The project site of the original C++ version is located at http://www.csie.ntu.edu.tw/~cjlin/liblinear/ The ups

Benedikt Waldvogel 298 Sep 14, 2021