Apache Parquet File Format
Apache Parquet files are used in the Hadoop ecosystem. JSON, CSV, and XML are useful when it comes to sharing data between applications, whereas parquet files perform better for temporarily storing intermediate data between different stages in an application. To get an idea of how to work with this file type, consider the following code snippet:
data = [('TikTok', '5', 'AF3', 'THETA', '9.681'),
        ('TikTok', '5', 'AF3', 'ALPHA', '3.849'),
        ('TikTok', '5', 'AF3', 'GAMMA', '0.738')]
columns = ['Scenario', 'Counter', 'Electrode', 'Frequency', 'Value']
df = spark.createDataFrame(data, columns)
df.write.parquet('/tmp/output/brainjammer/tiktok.parquet')
That code snippet creates a DataFrame, which is a location in memory for storing data in the context of Apache Spark and Azure Databricks. Then the data can be written to disk in parquet format using the DataFrame write() method, passing the location to write the file as a parameter. Using the spark.sql() method, you could then create an in‐memory view, named BRAIN in our example, and then query it:
spark.sql("CREATE TEMPORARY VIEW BRAIN USING parquet OPTIONS (path '/tmp/output/brainjammer/*')")
spark.sql('SELECT * FROM BRAIN').show()
The output would look like the following table:
+----------+---------+-----------+-----------+-------+
| Scenario | Counter | Electrode | Frequency | Value |
+----------+---------+-----------+-----------+-------+
| TikTok   | 5       | AF3       | THETA     | 9.681 |
| TikTok   | 5       | AF3       | ALPHA     | 3.849 |
| TikTok   | 5       | AF3       | GAMMA     | 0.738 |
+----------+---------+-----------+-----------+-------+
Here are a few additional benefits of the parquet file format:
- Partitioning
- Compression
- Columnar format
- Support for complex types
Partitioning will be described in more detail later, but for now know that it is a method for grouping certain types of data together so that the data is stored more efficiently and is therefore more efficient to query. The following snippet first partitions the data on the Frequency column and then queries for all rows where Frequency has a value of GAMMA. Calling the show() method would produce a table like the previous one, but containing only rows where Frequency equals GAMMA.
df.write.partitionBy('Frequency').mode('overwrite') \
    .parquet('/tmp/output/brainjammer/tiktokFrequency.parquet')
data = spark.read \
    .parquet('/tmp/output/brainjammer/tiktokFrequency.parquet/Frequency=GAMMA')
data.show()
A file or partition that contains only the data you require is smaller and therefore more efficient to read. Parquet files also have an advantage when it comes to compression, due to their columnar format.
Compression should not be a new term; it means that the data within a file is compacted and reorganized so that it requires less storage space. From a relational perspective, we typically think of data in rows, but a parquet file is organized by column—that is, columnar format. Think about performing a query from a vertical perspective versus a horizontal one, and you should be able to visualize how this results in less data being read. Projection is the term for limiting the columns a SQL query returns; because parquet stores data column by column, a projection can read only the requested columns from disk and skip the rest, making efficient projection a built-in property of the columnar format.
CSV and JSON files are typically used for storing basic data types, like strings, Booleans, and numbers. Parquet files can store more complex data structures like structs, arrays, and even maps, where a map would resemble something like the following:
DESCRIBE BRAIN
+-----------+--------------------+
| name      | type               |
+-----------+--------------------+
| Electrode | string             |
| Frequency | map<string,double> |
+-----------+--------------------+
Then, when you perform a query on data stored in the BRAIN table, you could see output like the following:
+-----------+-------+-------+
| Electrode | key   | value |
+-----------+-------+-------+
| AF3       | THETA | 9.681 |
| AF3       | ALPHA | 3.849 |
| AF3       | GAMMA | 0.738 |
+-----------+-------+-------+
Parquet files are one of the best formats for working with Big Data. The best uses of parquet files include the following:
- When you are using Hadoop
- Working with read‐heavy analytical applications
- Write once, read many (WORM)