Apache Parquet File Format
Apache Parquet files are used in the Hadoop ecosystem. JSON, CSV, and XML are useful when it comes to sharing data between applications, whereas parquet files perform better for temporarily storing intermediate data between different stages in an application. To get an idea of how to work with this file type, consider the following code snippet:
data = [('TikTok', '5', 'AF3', 'THETA', '9.681'),
        ('TikTok', '5', 'AF3', 'ALPHA', '3.849'),
        ('TikTok', '5', 'AF3', 'GAMMA', '0.738')]
columns = ['Scenario', 'Counter', 'Electrode', 'Frequency', 'Value']
df = spark.createDataFrame(data, columns)
df.write.parquet('/tmp/output/brainjammer/tiktok.parquet')
That code snippet creates a DataFrame, which is a location in memory for storing data in the context of Apache Spark and Azure Databricks. Then the data can be written to disk in parquet format using the DataFrame write() method, passing the location to write the file as a parameter. Using the spark.sql() method, you could then create an in‐memory view, named BRAIN in our example, and then query it:
spark.sql("CREATE TEMPORARY VIEW BRAIN USING parquet OPTIONS (path '/tmp/output/brainjammer/*')")
spark.sql('SELECT * FROM BRAIN').show()
The output would look like the following table:
+----------+---------+-----------+-----------+-------+
| Scenario | Counter | Electrode | Frequency | Value |
+----------+---------+-----------+-----------+-------+
| TikTok   | 5       | AF3       | THETA     | 9.681 |
| TikTok   | 5       | AF3       | ALPHA     | 3.849 |
| TikTok   | 5       | AF3       | GAMMA     | 0.738 |
+----------+---------+-----------+-----------+-------+
Here are a few additional benefits of the parquet file format:
- Partitioning
- Compression
- Columnar format
- Support for complex types
Partitioning will be described in more detail later, but for now know that it is a method for grouping certain types of data together so that the data is stored more efficiently and is therefore more efficient to query. The following snippet first partitions the data on the Frequency column and then queries for all rows where Frequency has a value of GAMMA. Calling the show() method would produce a table like the previous one, but containing only rows where Frequency equals GAMMA.
df.write.partitionBy('Frequency').mode('overwrite') \
    .parquet('/tmp/output/brainjammer/tiktokFrequency.parquet')
data = spark.read \
    .parquet('/tmp/output/brainjammer/tiktokFrequency.parquet/Frequency=GAMMA')
data.show()
A file or partition that contains only the data you require is smaller and therefore more efficient to read. Parquet files also have an advantage when it comes to compression, due to their columnar format.
Compression should not be a new term; it means that the data within a file is compacted and reorganized so that it requires less storage space. From a relational perspective, we typically think of data in rows, but a parquet file is organized by column—that is, columnar format. Think about performing a query from a vertical perspective versus a horizontal one, and you should be able to visualize how this results in less data being read. Projection is the term for limiting the columns a SQL query returns; because parquet stores data column by column, a projection can read only the requested columns from disk and skip the rest, making efficient projection a built-in property of the columnar format.
CSV and JSON files are typically used for storing basic data types, like strings, Booleans, and numbers. Parquet files can store more complex data structures like structs, arrays, and even maps, where a map would resemble something like the following:
DESCRIBE BRAIN
+-----------+--------------------+
| name      | type               |
+-----------+--------------------+
| Electrode | string             |
| Frequency | map<string,double> |
+-----------+--------------------+
Then, when you perform a query on data stored in the BRAIN table, you could see output like the following:
+-----------+-------+-------+
| Electrode | key   | value |
+-----------+-------+-------+
| AF3       | THETA | 9.681 |
| AF3       | ALPHA | 3.849 |
| AF3       | GAMMA | 0.738 |
+-----------+-------+-------+
Parquet files are one of the best formats for working with Big Data. The best uses of parquet files include the following:
- When you are using Hadoop
- Working with read‐heavy analytical applications
- Write once, read many (WORM)