Walkthrough Introduction

This article introduces you to the application-building process by discussing the following topics:

  • Application Design Considerations
  • Application Lifecycle
  • Cluster Deployment Lifecycle

Application Design Considerations

When designing an application that you intend to build using App Workbench:

  • EPIC runs applications unmodified. You can have multiple services running inside a single container, much like a physical installation. For example, both the HDFS NameNode and YARN ResourceManager services can run within the same Docker container.
  • For simplicity, BlueData encourages the use of a single Docker image file for a given application. For example, CDH 6.2 has a single Docker image to run all its services.
  • Per-node service placement is controlled through a service definition catalog (JSON) file.
  • If necessary, an add-on service can be stored in its own Docker image file and be attached as a dependent service into a Hadoop or Spark cluster.

Application Lifecycle

A BlueData EPIC application has the following lifecycle:

  1. The Dockerfile is created. See Creating the Dockerfile.
  2. Deployment metadata is added to a JSON file. See Metadata JSON File.
  3. Node initialization scripts (called startscripts) are added to configure and run services when the container is started. See Startscripts.
  4. A distributable .bin file is created. See Creating the .bin File.
  5. The .bin file is copied to the BlueData EPIC platform and then registered in the App Store. See Registering the Application.
  6. The application is removed from the EPIC platform when no longer needed. See Removing the Application.

The App Workbench allows developers to perform Steps 1-4 of this procedure. Developers can then copy and share the resulting .bin files. The following sections describe each of these steps in detail:

Creating the Dockerfile

Image developers create and use Dockerfiles on an application development machine. Most Dockerfiles start from base OS images provided by BlueData (see Upgrading an Existing Image); however, organizations can build entirely custom base images and add the necessary BlueData components (see About Custom Base Images). EPIC versions 3.1 and earlier embed the Dockerfile into the application .bin file, while EPIC versions 3.2 and later can also use images that reside in your preferred Docker-compatible registry. For backward compatibility, these more recent versions of EPIC continue to support embedded Dockerfiles.
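
For illustration only, a minimal Dockerfile might extend a BlueData-provided base image and layer the application on top. The image name, package, and paths below are placeholders rather than actual BlueData artifact names:

    # Start from a BlueData-supplied base OS image (name is a placeholder).
    FROM bluedata/centos7:latest

    # Install the packages the application needs.
    RUN yum install -y java-1.8.0-openjdk && yum clean all

    # Add the application binaries and the startscript that will run at
    # cluster bring-up (paths are placeholders).
    COPY myapp/ /opt/myapp/
    COPY startscript.sh /opt/myapp/startscript.sh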

Metadata JSON File

Most Big Data applications require running multiple services per node, and running different sets of services on different nodes. The set of services that runs on a given node is controlled by the role assigned to that node. The Catalog metadata JSON file includes the application name, ID, version, logo, and UI preferences. It also lists the services, roles, and role-to-service assignments that will be used when a cluster is created from the image. Services can be registered with BlueData management for auto-start and monitoring. See Manually Creating JSON and Startscript Files and Metadata JSON. If needed, you can define custom EPIC interface elements and services, as described in Adding Conditional UI Elements and Services.
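
The exact schema is described in Metadata JSON; the fragment below is only a sketch of the kinds of fields involved, with all names and identifiers invented for illustration:

    {
      "name": "MyApp",
      "distro_id": "example/myapp",
      "version": "1.0",
      "services": [
        { "id": "myapp_master", "label": "MyApp Master" },
        { "id": "myapp_worker", "label": "MyApp Worker" }
      ],
      "node_roles": [
        { "id": "controller", "services": ["myapp_master"] },
        { "id": "worker",     "services": ["myapp_worker"] }
      ]
    }

When a cluster is created, EPIC uses the role-to-service assignments to determine which services run on each node.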

Startscripts

The application package (.bin file) contains all of the scripts that execute during the cluster bring-up process, including the logic that starts the services assigned to each node's role(s) as defined in the metadata JSON file. These startscripts manage the service start sequence, auto-configure services by updating configuration files with runtime values, and handle cluster expansion and shrinking by adding nodes and reconfiguring existing ones. See Manually Creating JSON and Startscript Files.
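
As a rough sketch, a startscript typically discovers the node's role and starts only the services assigned to that role. The role names and the discovery mechanism below are illustrative; BlueData's actual helper utilities are covered in Manually Creating JSON and Startscript Files:

    #!/bin/bash
    # Discover this node's assigned role (mechanism shown is illustrative).
    ROLE="$(bdvcli --get node.role_id)"

    case "$ROLE" in
      controller)
        # Substitute runtime values into the config, then start the master.
        sed -i "s/@@MASTER_HOST@@/$(hostname -f)/" /opt/myapp/conf/myapp.conf
        /opt/myapp/bin/start-master.sh
        ;;
      worker)
        /opt/myapp/bin/start-worker.sh
        ;;
    esac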

Creating the .bin File

App Workbench allows users to build and package the application components (see Components of an Application .bin File) into a compressed .bin file for easy identification, distribution, and management. The application .bin file can be created on a development machine and then copied to the EPIC Controller host for registration and addition to the App Store screen.

Registering the Application

Applications provided by BlueData normally appear in the App Store screen automatically. Custom applications must be manually added to EPIC before they appear there. To do this:

  1. Copy the .bin file to /srv/bluedata/catalog on the EPIC Controller host.
  2. In the EPIC interface, refresh the App Store screen. It should display a new icon using the logo image from the new .bin file.
  3. Click the Install button for the application. This process makes the application binaries ready and available for future use but does not create any clusters.

See Phase Six: Adding the .bin File to the EPIC App Store.
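
For example, Step 1 above might look like the following from a development machine (the hostname and file name are placeholders):

    # Copy the packaged application to the EPIC Controller host.
    scp myapp-1.0.bin root@epic-controller:/srv/bluedata/catalog/

    # Confirm the file is in place before refreshing the App Store screen.
    ssh root@epic-controller 'ls -l /srv/bluedata/catalog/'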

Removing the Application

The Platform Administrator can delete or disable an application from the App Store screen. Disabling an application prevents new clusters from being created from it; currently running clusters continue to work, including expansion and shrinking. Applications can also be hidden from specific tenants, making them unavailable for use in the affected tenants; users in those tenants will not be able to select the hidden application(s) when creating clusters. Custom (user-created) applications can be completely removed by deleting the cluster(s) that use them and then removing the application .bin file from the /srv/bluedata/catalog directory on the EPIC Controller host.

Cluster Deployment Lifecycle

Clusters in EPIC have the following lifecycle:

  1. The cluster is created or restarted. See Creating the Cluster.
  2. Startscripts automatically configure the cluster to secure and customize the environment. See Bootstrapping.
  3. The cluster processes jobs for users. See Running.
  4. Users may expand or shrink the cluster to meet changing needs. See Scaling the Cluster.
  5. The cluster is stopped or deleted when no longer needed. A stopped cluster can be restarted when needed. See Stopping/Deleting the Cluster.

Creating the Cluster

When creating the cluster:

  • Application experts can create templates with all the necessary choices preconfigured.
  • Users can quick-launch clusters from templates. See About Templates.
  • Clusters can be created using either the EPIC interface or the API. See Creating A New Cluster.
  • Cluster creation options can be customized and presented to the user via custom elements by updating the metadata JSON file. See Adding Conditional UI Elements and Services.
  • bdvcli commands offer a rich set of information about the cluster and tenant from within a container, as sketched after this list.
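
For example, the following queries could be run from inside any cluster node; the key names are illustrative and may vary by EPIC version:

    # Query cluster- and node-level metadata from within a container.
    bdvcli --get cluster.name
    bdvcli --get node.fqdn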

Bootstrapping

Startscripts run immediately after a cluster is started or restarted. Some of the things you can do with a startscript include the following (a brief sketch appears after this list):

  • Set up single-realm or cross-realm Kerberos authentication for Hadoop applications.
  • Enable Transparent Data Encryption (TDE) on Kerberized clusters.
  • Set up custom LDAP/AD integration on all of the Docker nodes in the cluster.
  • Perform SAML integration for the cluster.
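
As one example, an LDAP/AD integration step in a startscript might configure and restart a client such as SSSD on every node. This is only a sketch that assumes the base image ships the relevant packages; the server URI and search base are placeholders:

    # Point the SSSD client at the organization's LDAP server (placeholders).
    cat > /etc/sssd/sssd.conf <<'EOF'
    [sssd]
    services = nss, pam
    domains = example

    [domain/example]
    id_provider = ldap
    ldap_uri = ldap://ldap.example.com
    ldap_search_base = dc=example,dc=com
    EOF
    chmod 600 /etc/sssd/sssd.conf
    systemctl restart sssd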

Running

While a cluster is running:

  • Authorized users can access the cluster via SSH.
  • ActionScripts can be used to bulk-add missing packages for R, Python, Java, and more.
  • ActionScripts can be run using either the EPIC interface or the API.
  • Models and results can be saved to the default TenantStorage DataTap or to shared external storage via DataTap, Git, or S3 locations.
  • Any product can be deployed for testing or use once the cluster has been created.

[Image: sample data science workflow]

A typical data science workflow includes creating, sharing, and iterating on data, models, and results. EPIC includes out-of-the-box reference implementations of Hadoop and Spark clusters, with Jupyter, Zeppelin, and RStudio integrations for Spark. All processing and modeling clusters include a DataTap client that communicates with common storage resources. Different clusters in the same tenant or across different tenants can share assets, depending on the storage security model.

Scaling the Cluster

Running applications in BlueData can be scaled up or down using either the EPIC interface or the API. All registered services automatically start and are reconfigured after the cluster has been scaled up or down. Users specify the trigger for scaling an application, and triggers can be scripted based on utilization data. EPIC provides all of the necessary information, including (but not limited to) resource usage, cluster uptime, and the number of active users.
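
Because triggers are user-defined, a scaling script can be as simple as a scheduled job that inspects utilization and calls the EPIC API. Everything below is hypothetical: the utilization source, endpoint path, and payload are placeholders, not documented API details:

    #!/bin/bash
    # Hypothetical trigger: expand the cluster by one worker when CPU
    # utilization (from your monitoring source) stays above 80 percent.
    UTIL="$(my_monitoring_tool --cluster my-cluster --metric cpu)"  # placeholder
    if [ "$UTIL" -gt 80 ]; then
      curl -X POST "https://epic-controller/api/v1/cluster/my-cluster/expand" \
           -d '{"additional_workers": 1}'   # hypothetical endpoint and payload
    fi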

Stopping/Deleting the Cluster

Clusters can be stopped to release shared resources. When a cluster is stopped, SSH access is disabled, running services stop, and compute resources (e.g., CPU, RAM, and GPU) are released. Storage resources remain allocated to a stopped cluster, and metadata is captured so that the cluster returns to normal function upon restart. A cluster that is no longer needed can also be permanently deleted, which frees all resources allocated to that cluster.