Optimized Row Columnar (ORC)
This file format is a columnar format used in the Hadoop ecosystem, like the parquet file format. Both ORC and parquet files are often referred to as self-describing, which means the information that describes the data in the file is contained within the file itself. With most file types, metadata accompanies the file from outside it. For example, in Windows, when you right-click a file in Windows Explorer and select Properties, you see information about the file; that is metadata. With an ORC or parquet file, that descriptive information is stored within the file itself, making the file self-describing.
Facebook originally developed the ORC file format. ORC is supported by Apache Hive and Apache Pig and is optimized for Hadoop Distributed File System (HDFS) read operations. Parquet, on the other hand, works best with Apache Spark and is that platform's default format for reading and writing data. ORC files introduce a concept known as a stripe. Picture a zebra with vertical black and white stripes: since the context here is columnar, which is vertical, as compared to a row, which is horizontal, the image may make the concept easier to grasp. Data structures are conceptual, but they are physically stored in memory or on disk. Therefore, the memory addresses at which the data is stored and the drive sectors on which it is placed have a significant impact on latency. The closer together the addresses and sectors holding related values are, the faster the data can be retrieved. Because analytical queries typically read a few columns across many rows, storing each column's values contiguously gives this vertical approach a proven speed advantage over a horizontal one.
Working with an ORC file is very similar to working with a parquet file. Assume that you have already loaded the data into a DataFrame named df. The following PySpark code snippet writes df to an ORC file, creates a temporary view over it, and then displays its contents:
df.write.format('orc').save('/tmp/orc/brainjammer/tiktok.orc')
spark.sql("CREATE TEMPORARY VIEW BRAINWAVE USING orc OPTIONS (path '/tmp/orc/brainjammer/*')")
spark.sql('SELECT * FROM BRAINWAVE').show()
+---------+---------+-----------+-----------+---------+
| Session | Counter | Electrode | Frequency | Measure |
+---------+---------+-----------+-----------+---------+
|  TikTok |       5 |       AF3 |     THETA |   9.681 |
|  TikTok |       5 |       AF3 |     ALPHA |   3.849 |
|  TikTok |       5 |       AF3 |     GAMMA |   0.738 |
+---------+---------+-----------+-----------+---------+
Partitioning and querying the data in an ORC file are the same as previously shown in the parquet section. If you are going to use Apache Hive, you should use the ORC file format to get the best performance. The best uses of ORC include the following:
- When working with read‐heavy analytical applications
- When working with Apache Hive