Azure Data Lake Storage – Gaining the Azure Data Engineer Associate Certification

Azure Data Lake Storage (ADLS) is a fundamental piece of most enterprise data analytics solutions running on Azure. This product is optimized for Big Data analytics workloads. ADLS accomplishes this by providing storage capacity of up to multiple exabytes of data and supplying access to that data at a throughput of hundreds of gigabytes per second. ADLS Gen2 supports the open source platforms described in Table 1.4.

TABLE 1.4 ADLS‐supported platforms

Platform            Supported version
Azure Databricks    5.1+
Cloudera            6.1+
Hadoop              3.2+
HDInsight           3.6+
Hortonworks         3.1.x+

ADLS Gen2 also integrates with many Azure products, such as Azure Data Factory, Azure Event Hubs, Azure Machine Learning, Azure Stream Analytics, IoT Hub, Power BI, and Azure SQL Database. The following topics cover additional information and capabilities:

  • Gen1 vs. Gen2
  • Hadoop Distributed File System (HDFS)
  • ACL and POSIX security model
  • Hierarchical namespaces

Gen1 vs. Gen2

ADLS Gen1 will be retired as of February 29, 2024. Therefore, we don’t recommend that you build any new solutions on that version. As mentioned earlier, Azure Data Lake Analytics uses Gen1; therefore, we don’t recommend building new data analytics solutions with that product either. ADLS Gen2 supports all the capabilities that exist in ADLS Gen1. The significant change is that Gen2 is aligned with and built on Azure Blob Storage. Building on top of Azure Blob Storage (described later) makes ADLS Gen2 more cost effective and provides diagnostic logging capabilities and access tiers.

Hadoop Distributed File System

If you have used HDFS in the past, you can expect the same experience when using ADLS. This has to do with how you and the operating system interact with data files. Reading, writing, copying, renaming, and deleting are among the file operations you would expect to be able to perform. The Azure Blob Filesystem (ABFS) driver is available in Apache Hadoop environments, including Azure Synapse Analytics, Azure Databricks, and Azure HDInsight. ABFS offers major performance improvements over the previous Windows Azure Storage Blob (WASB) driver when renaming and deleting files. Examples of HDFS commands to create a directory, to copy data from local storage to a cluster, and to list the contents of a directory are shown here:

hdfs dfs -mkdir /brainjammer/
hdfs dfs -copyFromLocal meditation.json /brainjammer/
hdfs dfs -ls /brainjammer/
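
When the target is an ADLS Gen2 container rather than the cluster's default file system, the same commands are typically given a full ABFS URI of the form abfss://<container>@<account>.dfs.core.windows.net/<path>. The following is a minimal sketch of that idea, not a definitive setup: the container name data and the storage account name exampleaccount are placeholders that do not come from the text above, and the helper abfs_uri is a hypothetical convenience function. The script only prints the commands, so it can be tried without a cluster.

```shell
# Build an ABFS URI for a path inside an ADLS Gen2 container.
# $1 = container, $2 = storage account, $3 = path (a leading slash is optional).
abfs_uri() {
  # abfss:// is the TLS-secured ABFS scheme; dfs.core.windows.net is the
  # ADLS Gen2 endpoint of the storage account.
  printf 'abfss://%s@%s.dfs.core.windows.net/%s\n' "$1" "$2" "${3#/}"
}

# Placeholder names, for illustration only.
target="$(abfs_uri data exampleaccount /brainjammer/)"

# Print the commands instead of running them, so no cluster is required.
echo "hdfs dfs -mkdir ${target}"
echo "hdfs dfs -copyFromLocal meditation.json ${target}"
echo "hdfs dfs -ls ${target}"
```

On a cluster where the ABFS driver is configured with credentials for the storage account, removing the echo prefixes would run the commands against the container.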
