Technical Architecture
EMBEDICA.AI Anvil
To succeed in the world’s rapidly evolving ecosystem, companies (no matter their industry or size) must use data to continuously develop more innovative operations, processes, and products.
This means embracing the shift to Enterprise AI, using the power of machine learning to enhance – not replace – humans. Working with data is hard and can be very distracting: data preparation alone can take up to 70% of a data science project’s lifespan.
Our product is an “anvil” for data science projects. Instead of steel, the raw material here is data. Anvil exists to make that hard work comfortable.
We do not aim to simplify algorithms; we aim to provide a comfortable tool for running complex data science projects. The targeted users are data scientists, analysts, and data engineers. A simple user interface also enables less advanced data workers to participate, which is especially useful for small and medium-sized enterprises (SMEs).
Anvil is a cloud platform for data scientists who need powerful compute, fast configuration, secure collaboration and easy deployment.
Anvil is the centralized data platform that moves businesses along their data journey from analytics at scale to Enterprise AI, powering self-service analytics while also ensuring the operationalization of machine learning models in production.
An Anvil project is where all data science tasks are held in one place: from data processing to running various models, evaluating them, comparing model performance, and selecting the best-performing ones for deployment. Anvil also provides a powerful explainability feature during model development.
Key Features
- Seamless connectivity to any data, no matter where it’s stored or in what format.
- Faster data cleaning, wrangling, mining, and visualization.
- Unlimited file-size upload, automated data exploration and scalable computing power.
- The latest in machine learning technology (including AutoML and deep learning) all in one place and ready to be operationalized with automation environments, scenarios, and advanced monitoring.
- Allows you to share data and code securely with team members and publish version-controlled reports of your analysis.
- Every step in the data-to-insights process can be done with a visual interface.
- Increase reproducibility and establish a baseline for scientific research or applications.
- Run all data science processes within projects, and access them any time.
- Powerful explainability functionality for model developers who need to interpret advanced models.
Feature Overview
◆ DATA REPOSITORIES
Repositories are logical divisions for datasets: Anvil stores datasets within these divisions and grants access to them there. Users can define data repositories (e.g., for finance, products, and sales) and relate countless uploaded datasets to these divisions for easy access at later stages. Data repositories are bound to teams.
◆ TEAMS
Users of Embedica Anvil can be assigned to Teams. Teams participate in projects, doing all the data-related work.
◆ PROJECTS
Once a project is created, teams can be assigned to it. Teams bring along their accessible datasets (repositories).
When developing projects, we strictly implement the CRISP-DM approach: within a project’s life cycle (and even after it), users can switch back and forth between the different phases of a data science project.
A Centralized, Controlled Environment
- Connect to existing data storage systems and leverage plugins and connectors for access to all data from one, central location.
- Maintain enterprise-level security with data governance features like documentation, projects, task organization, change management, rollback, monitoring, etc.
- Reuse and automate work (via data transformations, code snippets, centralized best practices, automated assistants, and more), to go from raw data to insights faster.
- An organized, enterprise-level tool to leverage the benefits of a large (and growing) set of AI/ML algorithms.
A Common Ground for Experts and Explorers
- Built from the ground up for Data Scientists, Data Engineers, Data Analysts, Business Leaders or Managers, and Team Members.
- From code-free data wrangling using built-in visual processors to the visual interface for machine learning, non-technical users aren’t left in the dust when it comes to contributing to machine learning projects.
A Shortcut to ML/AI Operationalization
- Centrally manage models and update them from one location, integrating with an API without having to modify or inject anything into existing applications.
- Prevent model drift with the ability to easily monitor performance and make the necessary adjustments.
AI TRUST – Explainability
Explainability means creating diverse explanations: training highly optimized, directly interpretable models; creating contrastive explanations of black-box models; using information flow in a high-performing complex model to train simpler, interpretable classifiers; learning disentangled representations; and visualizing information flows in neural networks. These powerful explainability functionalities serve model developers who need to interpret advanced models.
In many applications, trust in an AI system will come from its ability to ‘explain itself’.
Built-in EM EASI AI TRUST module
Train and explain models to boost AI trust and adoption in AI projects.
Anvil explainability is implemented in two ways:
The first is an assistive feature during the model development process, giving the developer insight into the model and helping them understand a complex model’s prediction behavior.
The second is foreseen for post-production: after a model is deployed, if there are any concerns about the model’s behavior, the explainability module provides more detailed inspection.
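To make the development-time use case concrete, here is a minimal Python sketch using scikit-learn’s permutation importance – one common explainability technique, chosen here purely for illustration; it does not depict Anvil’s actual module:

    # Sketch: inspect a complex model's prediction behavior with
    # permutation importance (illustrative; not Anvil's internal code).
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True, as_frame=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

    # Shuffle each feature and measure the drop in held-out accuracy:
    # the features whose shuffling hurts most drive the predictions most.
    result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
    for idx in result.importances_mean.argsort()[::-1][:5]:
        print(f"{X.columns[idx]}: {result.importances_mean[idx]:.3f}")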
Connectivity
Anvil allows you to seamlessly connect to your data no matter where it’s stored or in what format. That means easy access for everyone – whether technical or not – to the data they need.
The system will be capable of connecting to big data environments to fetch structured and unstructured data from clustered setups. With big data connectivity, Anvil’s resources are extended to big data scale in environments like Hadoop.
■ SQL Databases
◆ MySQL
◆ PostgreSQL
◆ Vertica
◆ Amazon Redshift
◆ Pivotal Greenplum
◆ Teradata
◆ IBM Netezza
◆ SAP HANA
◆ Oracle
◆ Microsoft SQL Server (incl. SQL DW)
◆ Google BigQuery
◆ IBM DB2
◆ Exasol
◆ MemSQL
◆ Snowflake
◆ Custom connectivity through JDBC
■ NoSQL Databases
◆ MongoDB
◆ Cassandra
◆ ElasticSearch
■ Hadoop & Spark Supported Distributions
◆ Cloudera
◆ Hortonworks
◆ Google DataProc
◆ MapR
◆ Amazon EMR
◆ DataBricks
■ Hadoop File Formats
◆ CSV
■ Cloud Object Storage
◆ Amazon S3
◆ Google Cloud Storage
◆ Microsoft Azure Storage
■ Remote Data Sources
◆ FTP
◆ SCP
◆ SFTP
◆ HTTP
■ Custom Data Sources – extended connectivity through Anvil Plugins
◆ Connect to REST APIs
◆ Create custom file formats
◆ Connect to own databases
◆ Open-source tools & API integration and customization available
Optimized sync with:
- Amazon S3
ANVIL
- Train the best model in the least amount of time to save human hours.
- Reduce the need for expertise in machine learning by reducing the manual code-writing time.
- Improve the performance of machine learning models.
- Increase reproducibility and establish a baseline for scientific research or applications.
- Scale training datasets to clusters (Hadoop, Spark, Kubernetes).
Exploratory Analytics
Sometimes you need to do a deep dive on your data, but other times, it’s important to understand it at a glance. From exploring available datasets to dashboarding, Anvil makes this type of analysis easy.
Data Analysis
- Automatically detect dataset schema and data types
- Assign semantic meanings to your datasets’ columns
- Build univariate statistics automatically & derive data quality checks
- Dataset audit
- Automatically produce data quality and statistical analysis of entire Anvil datasets
- Support of several back-ends for audit (in-memory, Spark, SQL)
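As a rough sketch of what such an automated audit computes, consider this pandas-based example (hypothetical and simplified; Anvil’s own audit back-ends are not shown):

    import pandas as pd

    def audit(df: pd.DataFrame) -> pd.DataFrame:
        """Per-column univariate statistics and simple quality checks."""
        report = pd.DataFrame({
            "dtype": df.dtypes.astype(str),          # detected data type
            "non_null": df.notna().sum(),            # completeness
            "missing_pct": df.isna().mean() * 100,   # quality-check input
            "unique": df.nunique(),                  # cardinality
        })
        numeric = df.select_dtypes("number")         # stats for numeric columns only
        report["mean"] = numeric.mean()
        report["std"] = numeric.std()
        return report

    df = pd.DataFrame({"price": [10.0, 12.5, None], "country": ["TR", "DE", "TR"]})
    print(audit(df))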
Data Cataloging
- Search for data, comments, features, or models in a centralized catalog.
- Explore data from all your existing connections
Advanced analysis
Interactive visual statistics
- Univariate analysis and statistical tests on single or multiple populations.
- Statistics and tests on multiple populations
- Correlation analysis
- Principal Components Analysis
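For instance, a correlation analysis followed by PCA could look like the following scikit-learn sketch (illustrative only; Anvil exposes these operations interactively):

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    X = load_iris(as_frame=True).data

    # Correlation analysis: pairwise Pearson correlations between features.
    print(X.corr().round(2))

    # Principal Components Analysis on standardized features.
    pca = PCA(n_components=2)
    components = pca.fit_transform(StandardScaler().fit_transform(X))
    print("explained variance ratio:", pca.explained_variance_ratio_)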
Data Visualization
- Create standard charts (histogram, bar charts, etc.) and scale chart computation by leveraging underlying systems (in-database aggregations)
- Create custom charts using:
- Custom Python-based insights (Plotly, Matplotlib)
- Custom interactive, web-based visualisations
Dashboarding
- User-managed reports, projects, teams, and dashboards
- Show streaming history and activity as real-time analysis and anomaly detection
Data Preparation
Traditionally, data preparation takes up to 80 percent of the time of a data project. But Anvil’s data prep features make that process faster and easier, which means more time for higher-impact, creative work. Thanks to the agnostic nature of the underlying ML models, Anvil can be used for cases ranging from cyber security to health care datasets.
Visual Data Transformation
Design your data transformation jobs using a point-and-click interface
- Group
- Filter
- Sort
- Stack
- Join
- Window
- Sync
- Distinct
- Top-N
- Pivot
- Split
■ Scale your transformations by running them directly in distributed computation systems (SQL, Spark)
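The code such a visual flow corresponds to resembles ordinary dataframe operations; here is a pandas sketch of a filter → join → group → sort flow (our illustration, not Anvil’s generated code):

    import pandas as pd

    orders = pd.DataFrame({"order_id": [1, 2, 3], "cust_id": [10, 10, 20],
                           "amount": [99.0, 15.0, 42.0]})
    customers = pd.DataFrame({"cust_id": [10, 20], "country": ["TR", "DE"]})

    result = (
        orders[orders["amount"] > 20]            # Filter processor
        .merge(customers, on="cust_id")          # Join processor
        .groupby("country", as_index=False)      # Group processor
        .agg(total=("amount", "sum"))
        .sort_values("total", ascending=False)   # Sort processor
    )
    print(result)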
Data Repository:
Repositories are logical divisions for datasets (described under Feature Overview above): Anvil stores datasets within these divisions, users can relate countless uploaded datasets to them for easy access at later stages, and each repository is bound to teams.
Data Pool:
Pools hold changed versions of datasets: any change to a dataset, such as changing a data type, renaming or removing a column, or adding a new one, lives in the pool, so you can reshape data according to your requirements while the original dataset remains stored in the data repository.
Table Join: Depending on the join type, the user can merge two different datasets into a new one that is stored permanently.
Dataset Sampling:
First records, random selection, stratified sampling, specific sampling, etc.
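These strategies can be sketched with pandas and scikit-learn (illustrative; the parameter choices are ours, not Anvil’s):

    import pandas as pd
    from sklearn.model_selection import train_test_split

    df = pd.DataFrame({"value": range(100), "label": [0, 1] * 50})

    head_sample = df.head(10)                        # first records
    random_sample = df.sample(n=10, random_state=0)  # random selection
    # Stratified sampling: preserve the label distribution in the sample.
    stratified, _ = train_test_split(df, train_size=10, stratify=df["label"], random_state=0)
    print(len(head_sample), len(random_sample), stratified["label"].value_counts().to_dict())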
Interactive Data Preparation:
- Scale data preparation scripts using in-database (SQL) or in-cluster (Spark) processing
- Automatically turn data preparation scripts into Spark or MapReduce jobs
Machine Learning
Anvil offers the latest machine learning technologies all in one place so that data scientists can focus on what they do best: building and optimizing the right model for the use case at hand.
Automated Machine Learning (AutoML)
AutoML or Automatic Machine Learning is the process of automating algorithm selection, feature generation, hyperparameter tuning, iterative modeling, and model assessment.
AutoML technology makes it easy for everyone, in organizations of every size, to train and evaluate machine learning models. Automating repetitive tasks allows people to focus on the data and the business problems they are trying to solve.
■ Automated ML strategies
- Quick prototypes
- Interpretable models
- High performance
■ Features handling for machine learning
- Support for numerical, categorical, text and vector features
- Automatic preprocessing of categorical features (Dummy encoding, impact coding, hashing, custom preprocessing, etc.)
- Automatic preprocessing of numerical features (Standard scaling, quantile-based binning, custom preprocessing, etc.)
- Data pre-processing and editing for uploading, cleaning, and viewing data fields
- Automatic preprocessing of text features (TF/IDF, Hashing trick, Truncated SVD, Custom preprocessing)
- Various missing values imputation strategies
- Features generation
- Feature-per-feature derived variables (square, square root…)
- Linear and polynomial combinations
- Features selection
- Filter and embedded methods
- Choose between several ML backends to train your models
- TensorFlow
- Keras
- Scikit-learn
- XGBoost
- MLlib
- H2O
- Algorithms
- Python-based
- Logistic regression
- Random Forests
- XGBoost
- Decision Tree
- K-Means
- Spark MLlib-based
- Logistic Regression
- Linear Regression
- Decision Trees
- Random Forest
- H2O-based
- Random Forest
- Linear Regression
- Logistic Regression
- Naïve Bayes
- KNN
- SVM
- Decision Trees
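To make the features handling and algorithm choices above concrete, here is a minimal AutoML-style sketch in scikit-learn – our illustration of the general technique, not Anvil’s engine: per-type automatic preprocessing followed by a search over candidate algorithms.

    import numpy as np
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.impute import SimpleImputer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    X = pd.DataFrame({"age": [25, 32, None, 51], "city": ["ist", "ank", "ist", np.nan]})
    y = [0, 1, 0, 1]

    # Automatic preprocessing: impute + scale numerics, impute + encode categoricals.
    prep = ColumnTransformer([
        ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                          ("scale", StandardScaler())]), ["age"]),
        ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                          ("encode", OneHotEncoder(handle_unknown="ignore"))]), ["city"]),
    ])

    # Algorithm selection: evaluate each candidate and keep the best.
    candidates = {"logreg": LogisticRegression(), "rf": RandomForestClassifier()}
    scores = {name: cross_val_score(Pipeline([("prep", prep), ("model", m)]), X, y, cv=2).mean()
              for name, m in candidates.items()}
    print(max(scores, key=scores.get), scores)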
How can AutoML Help You?
If you’re part of the majority of data scientists who work with tabular or “relational” data (tables with numeric and/or categorical columns), then Embedica Anvil is a great tool to use.
Automated Machine Learning (AutoML)
- Hyperparameter optimisation
- Freely set and search hyperparameters
- Analyzing model training results
- Get insights from your model
- Features importance
- Model parameters
- Publish training results to Anvil Dashboards
- Automatically create ensembles from several models
- Scoring capabilities
- Real-time serverless scoring API
- Distributed batch with Spark
- SQL (in-database scoring)
- Anvil built-in engine
- Model export
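Hyperparameter search of the kind listed above can be sketched with scikit-learn’s GridSearchCV (again an illustration, not Anvil’s interface):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    X, y = load_iris(return_X_y=True)

    # Freely set and search hyperparameters over a grid, with cross-validation.
    search = GridSearchCV(
        RandomForestClassifier(random_state=0),
        param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
        cv=3,
    )
    search.fit(X, y)
    print(search.best_params_, round(search.best_score_, 3))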
Model Deployment
- Model version
- Batch scoring
- Real-time scoring
- Easily manage all your model deployments
- One-click deployment of models
- Export Model
- Users can download the trained model to use wherever they want.
- The model is saved as a pickle file; pickling is the process of converting a Python object into a byte stream so that it can be stored in a file or database, maintain program state across sessions, or be transported over the network (see the sketch after this list).
- Deploy as API
- Users can deploy their model as an API and run predictions and other model-related operations from external sources.
- Users can use the API in their own apps or programs, for their customers or themselves.
- Expose arbitrary functions and models through APIs
- Write custom R, Python or SQL based functions or models
- Automatically turn them into API endpoints for operationalisation
- Docker & Kubernetes
- Deploy models into Docker containers for operationalisation
- Automatically push images to Kubernetes clusters for high scalability
- Model monitoring mechanism
- Control model performances over time
- Automatically retrain models in case of performance drift
- Customize your retraining strategies
- Logging
- Log and audit all queries sent to your models
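As referenced above, here is a minimal sketch of the export / deploy-as-API pattern using pickle and Flask (Flask is our assumption for illustration; Anvil’s real-time scoring API is not shown):

    import pickle

    from flask import Flask, jsonify, request
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression

    # Export: serialize the trained model to a byte stream on disk.
    X, y = load_iris(return_X_y=True)
    with open("model.pkl", "wb") as f:
        pickle.dump(LogisticRegression(max_iter=200).fit(X, y), f)

    # Deploy as API: load the pickle and expose a prediction endpoint.
    app = Flask(__name__)
    with open("model.pkl", "rb") as f:
        model = pickle.load(f)

    @app.route("/predict", methods=["POST"])
    def predict():
        features = request.get_json()["features"]  # e.g. [[5.1, 3.5, 1.4, 0.2]]
        return jsonify(prediction=model.predict(features).tolist())

    if __name__ == "__main__":
        app.run(port=5000)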
Deep Learning
- User-defined model architecture
- Personalize training settings
- Support for multiple inputs for your models
- Support pre-trained models
- Extract features from images
- Tensorboard integration
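A user-defined architecture with personalized training settings might look like the following Keras sketch (layer sizes and settings are arbitrary choices for illustration):

    from tensorflow import keras

    # User-defined model architecture: a small fully connected classifier.
    model = keras.Sequential([
        keras.layers.Input(shape=(4,)),
        keras.layers.Dense(16, activation="relu"),
        keras.layers.Dense(3, activation="softmax"),
    ])

    # Personalized training settings: optimizer, loss, and metrics.
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])

    # Tensorboard integration: write training logs for visual inspection.
    tensorboard = keras.callbacks.TensorBoard(log_dir="./logs")
    # model.fit(X_train, y_train, epochs=10, callbacks=[tensorboard])  # with your data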
Unsupervised Learning
- Automated features engineering (similar to Supervised learning)
- Optional dimensionality reduction
- Outlier detection
- Algorithms
Automation Features
When it comes to streamlining and automating workflows, Anvil allows data teams to put the right processes in place to ensure models are properly monitored and easily managed in production.
Data Flow
- Keep track of the dependencies between your datasets
- Manage the complete data lineage
- Check consistency of data, schema or data types
Partitioning
- Leverage HDFS or SQL partitioning mechanisms to optimize computation time
Metrics & Checks
- Create Metrics assessing data consistency and quality
- Adapt the behavior of your data pipelines and jobs based on Checks against these Metrics
- Leverage Metrics and Checks to measure potential ML model drift over time
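One common way to implement such a drift Metric/Check pair is the population stability index (PSI); a minimal sketch (PSI is our choice of metric for illustration):

    import numpy as np

    def psi(expected, actual, bins=10):
        """Population stability index between a reference and a live sample."""
        edges = np.histogram_bin_edges(expected, bins=bins)
        e = np.histogram(expected, bins=edges)[0] / len(expected)
        a = np.histogram(actual, bins=edges)[0] / len(actual)
        e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)  # avoid log(0)
        return float(np.sum((a - e) * np.log(a / e)))

    rng = np.random.default_rng(0)
    train_scores = rng.normal(0.0, 1.0, 10_000)  # score distribution at training time
    live_scores = rng.normal(0.4, 1.0, 10_000)   # shifted production distribution

    # Check: trigger an action (e.g. retraining) when the Metric crosses a threshold.
    value = psi(train_scores, live_scores)
    print(f"PSI={value:.3f} ->", "retrain" if value > 0.2 else "ok")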
Monitoring
- Track the status of your production scenarios
- Visualize the success and errors of your Anvil Jobs
- Maintenance, customization, monitoring, and alerting services
Automation Scenarios
- Trigger the execution of your data flows and applications on a scheduled or event-driven basis
- Create complete custom execution scenarios by assembling a set of actions to do (steps)
- Leverage built-in steps or define your own steps through a Python API
- Publish the results of the scenarios to various channels through Reporters (send emails with custom templates; attach datasets, logs, files, or reports to your Reporters; send notifications)
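Anvil’s actual Python API is not documented here; purely as a hypothetical illustration of the step/scenario/Reporter pattern, custom steps could be assembled like this (every name below is invented for the sketch):

    # Hypothetical sketch only: 'Scenario' and 'step' are invented names
    # illustrating the pattern, not Anvil's actual Python API.
    from typing import Callable

    class Scenario:
        def __init__(self, name: str):
            self.name = name
            self.steps = []  # ordered list of (step name, callable)

        def step(self, name: str):
            def register(fn: Callable):
                self.steps.append((name, fn))
                return fn
            return register

        def run(self):
            for step_name, fn in self.steps:  # execute steps in order
                print(f"[{self.name}] running step: {step_name}")
                fn()

    nightly = Scenario("nightly-refresh")

    @nightly.step("rebuild-dataset")
    def rebuild():
        print("  rebuilding dataset...")

    @nightly.step("report")
    def report():
        print("  emailing results to the team...")  # Reporter-style action

    nightly.run()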
Automation Environments
- Use dedicated Anvil Automation nodes for production pipelines
- Connect and deploy on production systems (data lakes, databases)
- Activate, use or revert multiple Anvil project bundles
Code
Work in the tools and with the languages you already know: everything can be done with code and fully customized. And for tasks where it’s easier to use a visual interface, Anvil provides the freedom to switch seamlessly between the two.
Collaboration
Anvil was designed from the ground up with collaboration in mind. From knowledge sharing to change management to monitoring, data teams – including scientists, engineers, analysts, and more – can work faster and smarter together. It is a shared platform for Data Scientists, Data Engineers, Data Analysts, Business Leaders, Managers, and others.
Team Activity Monitoring
- Global search to quickly find all project assets, plugins, wiki, reference docs, etc.
- Shared code-based components
- Distribute reusable code snippets for all users
- Package arbitrarily complex functions, operations, or business logic for use by less-technical users
Governance & Security
Anvil makes data governance easy, bringing enterprise-level security with fine-grained access rights and advanced monitoring for admins or project managers who need to oversee teams and departments.
Data Security
- Residing on the customer’s on-premises infrastructure or virtual private cloud, the data is stored on the client’s own machines; Anvil itself is provided as a Docker image.
- Connection to the outside world is made when importing data from Cloud providers (Amazon S3, Microsoft Azure, etc.) or for Embedica Anvil licence validation.
User profiles
Role-based access (fine-grained or custom)
Authentication management
- Use SSO systems
- Connect to your corporate database (LDAP, Active Directory…) to manage users and groups
Enterprise-grade security
- Track and monitor all actions in Anvil using an audit trail
- Authenticate against Hadoop clusters and databases through Kerberos
- Supports user impersonation for full traceability and compliance
Custom policy framework for data protection and external regulations compliance
- Framework capabilities
◊ Document data sources with sensitive information, and enforce good practices
◊ Restrict access to projects and data sources with sensitive information
◊ Audit the sensitive information in an Anvil instance
Architecture
Anvil was built for the modern enterprise, and its architecture ensures that businesses can stay open (i.e., not tied down to a certain technology) and that they can scale their data efforts.
- No client installation: easy-to-use, web-based cloud deployment for Anvil users
- Use dedicated Anvil environments or nodes to design, run, and deploy your ML applications
- Integrations
- Leverage distributed systems to scale computations through Anvil
- Automatically turn Anvil jobs into SQL, Spark, MapReduce, Hive, or Impala jobs for in-cluster or in-database processing to avoid unnecessary data movements or copies
- Modern architecture (Docker, Kubernetes)
- Full AutoML deployment with Docker containers, suitable for all environments
- Traceability and debugging through full system logs