Optimized Row Columnar (ORC)
This file format is a columnar format used in the Hadoop ecosystem, like the parquet file format. Both ORC and parquet files are often referred to as self-describing, which means the information that describes the data in the file is contained within the file itself. With most file types, metadata accompanies the file from outside it. For example, in Windows, when you right-click a file in Windows Explorer and select Properties, you see information about the file; that is metadata. With an ORC or parquet file, that descriptive information is stored within the file itself, making the file self-describing.
Facebook originally developed the ORC file format. ORC is supported by Apache Hive and Apache Pig and is optimized for Hadoop Distributed File System (HDFS) read operations. Parquet, on the other hand, works best with Apache Spark and is that platform's default format for reading and writing data. ORC files introduce a concept known as a stripe. Picture a zebra with vertical black and white stripes: since the context here is columnar, which is vertical, as compared to a row, which is horizontal, the image may make the concept easier to grasp. Data structures are conceptual, but they are physically stored in memory or on disk. Therefore, the memory addresses at which the data is stored and the drive sectors on which it is placed have a significant impact on latency. The closer together the addresses and sectors holding related values are, the faster the data can be retrieved. Because analytical queries typically read a few columns across many rows, storing each column's values contiguously gives this vertical approach a proven speed advantage over a horizontal one.
Working with an ORC file is very similar to working with a parquet file. Assume that you have already loaded the data into a DataFrame named df. The following PySpark code snippet writes df to an ORC file, creates a temporary view over it, and then displays its contents:
df.write.format('orc').save('/tmp/orc/brainjammer/tiktok.orc')
spark.sql("CREATE TEMPORARY VIEW BRAINWAVE USING orc OPTIONS (path '/tmp/orc/brainjammer/*')")
spark.sql('SELECT * FROM BRAINWAVE').show()
+---------+---------+-----------+-----------+---------+
| Session | Counter | Electrode | Frequency | Measure |
+---------+---------+-----------+-----------+---------+
|  TikTok |       5 |       AF3 |     THETA |   9.681 |
|  TikTok |       5 |       AF3 |     ALPHA |   3.849 |
|  TikTok |       5 |       AF3 |     GAMMA |   0.738 |
+---------+---------+-----------+-----------+---------+
Partitioning and querying the data in an ORC file are the same as previously shown in the parquet section. If you are going to use Apache Hive, you should use the ORC file format to get the best performance. The best uses of ORC include the following:
- When working with read‐heavy analytical applications
- When working with Apache Hive