Databricks with BioStudio
Databricks is a unified analytics platform built on Apache Spark and designed for large-scale data processing. It integrates data engineering, data science, and business analytics into one environment: Delta Lake adds reliability with ACID transactions, while MLflow manages the machine learning lifecycle. The platform scales to massive data volumes and complex workloads, and offers collaborative notebooks with real-time co-editing to facilitate teamwork. Databricks integrates with AWS, Azure, Google Cloud, and various databases, provides enterprise-level security and compliance, and handles infrastructure management, making it versatile for data warehousing, ETL, machine learning, and real-time analytics.
Databricks integrates seamlessly with Amazon S3, offering a robust environment for big data processing and analytics. It can read and write data directly to and from S3, allowing users to rely on S3 as a cost-effective and scalable storage solution. Databricks also allows you to mount S3 buckets to the Databricks File System (DBFS), making it easier to access and manage data stored in S3 as if it were part of the local file system.
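As an illustration, direct S3 access from a Databricks notebook can look like the following sketch; the bucket name, paths, and column are placeholders, and spark is the session object Databricks provides in every notebook.

# Sketch of direct S3 access from a Databricks notebook.
# "my-biostudio-bucket", the paths, and the "quality" column are placeholders.
df = spark.read.parquet("s3a://my-biostudio-bucket/raw/samples.parquet")

# Filter and write the result back to S3 as a new dataset.
df.filter(df["quality"] > 30) \
  .write.mode("overwrite") \
  .parquet("s3a://my-biostudio-bucket/processed/samples_filtered")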
BioStudio supports integration with Databricks, enabling users to leverage its capabilities for data exploration and job execution, and it offers various tools to streamline these processes. The diagram below summarizes the integration; currently, BioStudio supports:
               +=============================+
               |       User's Workspace      |
               +=============================+
                              |
                              | Data Sync
                              v
               +=============================+
               |    BioStudio Integration    |
               |       with Databricks       |
               +=============================+
                              |
         +--------------------+--------------------+
         |                    |                    |
         v                    v                    v
+-----------------+  +-----------------+  +-----------------+
|  Cloud Object   |  |  SQL Warehouse  |  |   Databricks    |
|  Storage (AWS,  |  |   Connection    |  |    Clusters     |
|  GCP, Azure)    |  |                 |  |                 |
+-----------------+  +-----------------+  +-----------------+
         |                    |                    |
         v                    v                    v
+-----------------+  +-----------------+  +-----------------+
|    Amazon S3    |  |    SQL Data     |  |  EC2 Instances  |
|                 |  |    Querying/    |  |  (Configured    |
|                 |  |    Analysis     |  |  for HPC with   |
|                 |  |                 |  |  FSx for Lustre)|
+-----------------+  +-----------------+  +-----------------+
                                                   |
                                                   | Connected via FSx for Lustre
                                                   v
                                        +---------------------+
                                        |  High-Performance   |
                                        |  Parallel Workflows |
                                        | (SLURM/SGE Paradigm)|
                                        +---------------------+
🔆 This bucket is associated with Databricks.
🔆 Mount S3 bucket.
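A minimal sketch of mounting a bucket to DBFS from a Databricks notebook; the bucket and mount point are placeholder names, and the cluster is assumed to already have IAM credentials for the bucket.

# Mount a placeholder S3 bucket into DBFS; afterwards it behaves like a local path.
dbutils.fs.mount(
    source="s3a://my-biostudio-bucket",       # placeholder bucket name
    mount_point="/mnt/my-biostudio-bucket",   # where the bucket appears in DBFS
)

# List the mounted files to confirm the mount worked.
display(dbutils.fs.ls("/mnt/my-biostudio-bucket"))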
🔆 Cluster Connection
A complete Databricks workspace, including its resources, can be connected to BioStudio using a personal access token:
[Bioturing-databrick]
host = https://dbc-454984-e221564448e.cloud.databricks.com
token = dap2165411316rdtffdbvrdftgyrtfhgrd321654646
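The [Bioturing-databrick] block above follows the profile syntax of ~/.databrickscfg, so the same credentials can be exercised from Python with the Databricks SDK. A minimal sketch, assuming pip install databricks-sdk and that the profile has been saved to that file:

# Connectivity check against the workspace, assuming the profile shown above
# is stored in ~/.databrickscfg and the databricks-sdk package is installed.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient(profile="Bioturing-databrick")   # profile name from the config above
for cluster in w.clusters.list():                    # list clusters visible to this token
    print(cluster.cluster_name, cluster.state)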
🔆 SQL Warehouses
Generate a token to execute queries from BioStudio.
Create a conda environment / kernel: select Kernels -> fill in all the values.
ub-lalit-e33d9fbdb4e96d0@colabdev-868c4b58bb-ctb8k:~$ conda env list
# conda environments:
#
databrick /data/ub-lalit-e33d9fbdb4e96d0/.conda/envs/databrick
databrick-sqlwarehose /data/ub-lalit-e33d9fbdb4e96d0/.conda/envs/databrick-sqlwarehose
base /miniconda/user
ub-lalit-e33d9fbdb4e96d0@colabdev-868c4b58bb-ctb8k:~$ conda activate databrick-sqlwarehose
(databrick-sqlwarehose) ub-lalit-e33d9fbdb4e96d0@colabdev-868c4b58bb-ctb8k:~$ pip install databricks-sql-connector
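With the connector installed, a warehouse query can then be run from the notebook kernel. A minimal sketch, where the hostname, HTTP path, and token are placeholders to be copied from the warehouse's Connection details page in Databricks:

# Run a test query against a SQL Warehouse via databricks-sql-connector.
# server_hostname, http_path, and access_token below are placeholders.
from databricks import sql

with sql.connect(
    server_hostname="dbc-454984-e221564448e.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/<warehouse-id>",
    access_token="<personal-access-token>",
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("SELECT current_date()")
        print(cursor.fetchall())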
🔆 Compute