Imagine you're prepping for a study session, but your backpack is filled with a jumbled mess of books—math, science, history, and art. Every time you need one, you have to dig through it all. It's tedious and slow. Now, picture this: you organize the backpack by subject—math in one section, science in another, and so on. The next time you need a book, you can grab it right away—much quicker, right?
This simple idea applies to data systems, too. Just like sorting your backpack, V-Order organizes data efficiently, reducing the amount of data scanned during queries and speeding up read times.
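The backpack intuition maps directly onto how columnar engines skip data: Parquet readers keep per-row-group min/max statistics and can skip any row group whose range cannot contain the queried value. Sorted data makes those ranges narrow and non-overlapping, so far fewer row groups need to be read. Here is a toy Python simulation of that effect (the chunk size and stats layout are illustrative, not V-Order's actual on-disk format):

```python
import random

def build_chunks(values, chunk_size):
    """Split values into fixed-size chunks and record min/max per chunk,
    mimicking Parquet row groups and their column statistics."""
    chunks = []
    for i in range(0, len(values), chunk_size):
        chunk = values[i:i + chunk_size]
        chunks.append({"data": chunk, "min": min(chunk), "max": max(chunk)})
    return chunks

def chunks_scanned(chunks, target):
    """Count chunks whose [min, max] range could contain the target value."""
    return sum(1 for c in chunks if c["min"] <= target <= c["max"])

random.seed(0)
values = [random.randint(0, 9999) for _ in range(10_000)]

unsorted_chunks = build_chunks(values, 1000)
sorted_chunks = build_chunks(sorted(values), 1000)

# With unsorted data, nearly every chunk's range spans the target, so almost
# all chunks must be read; after sorting, the stats prune nearly all of them.
print("unsorted chunks scanned:", chunks_scanned(unsorted_chunks, 4242))
print("sorted chunks scanned:  ", chunks_scanned(sorted_chunks, 4242))
```

This is the same mechanism behind the "reduced data scanned" numbers reported later in this post, just on a toy scale.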
So, what exactly is V-Order? 🔍
V-Order is an open-source, Parquet-compliant write-time optimization technique designed to improve read performance and reduce storage needs in data-heavy environments.
How V-Order Optimizes Parquet Files 🛠️
Verti-Scan Technology: Enhances read performance in Verti-Scan-enabled engines such as Power BI and SQL.
Reduced Read Times: Cuts read times by up to 50% in Spark environments.
Efficient Storage: Uses sorting, row group distribution, and compression for better storage.
Resource Efficiency: Lowers network, disk, and CPU usage, reducing costs.
Full Compatibility: Works with Z-Order and time travel for optimized table partitions.
In today’s data-driven world, V-Order streamlines data operations, making them faster, more cost-efficient, and reliable.
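The storage benefit above comes partly from a simple property: sorted data groups identical values together, so the same compression codec shrinks it much further. A toy illustration with a generic codec (zlib here; V-Order's actual encoding is different, this only shows the sorting effect):

```python
import random
import zlib

random.seed(1)
# A low-cardinality column (e.g. country codes) with many repeated values.
column = [random.choice(["US", "IN", "DE", "FR", "GB"]) for _ in range(100_000)]

unsorted_bytes = ",".join(column).encode()
sorted_bytes = ",".join(sorted(column)).encode()

# Sorting creates long runs of identical values, so the identical codec
# compresses the sorted layout far more effectively.
print("unsorted compressed size:", len(zlib.compress(unsorted_bytes)))
print("sorted compressed size:  ", len(zlib.compress(sorted_bytes)))
```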
Before adopting any new optimization technique, it’s crucial to validate its effectiveness in real-world scenarios. In this section, we put V-Order to the test. Through a series of carefully designed experiments, we aim to assess how different factors—such as dataset size, structure, and Fabric configuration—affect its performance. The ultimate goal is clear: to determine where V-Order delivers a tangible benefit and where its impact might be less noticeable.
We ran a set of experiments using a sample employee dataset to showcase the performance and storage benefits of V-Order optimization. Here’s a breakdown of the data and setup:
Number of Records: ~900 million
Dataset Size: 39.69 GB
Schema (Datatype Distribution):
Total Columns: 80
Columns used for querying: personnelnumber (string), fulltimeweeklyworkhours (double), workcountrydesc (string), lastmodifieddatetime (timestamp)
Structure: The dataset includes duplicate records per personnelnumber, which helps us better observe how V-Order affects both query speed and storage efficiency.
Looking Ahead: Future posts will dive deeper into how V-Order performs on de-duplicated datasets, offering a direct comparison between the two scenarios.
Region: West Central US
Workspace Capacity (SKU): FT1
Semantic Model Storage Format: Small semantic model storage format
To ensure V-Order is enabled, we ran a few checks:
Spark Session:
%%spark
spark.conf.get('spark.sql.parquet.vorder.default')
Spark SQL:
%%sql
SHOW TBLPROPERTIES <table-name>
There are several ways to enable V-Order optimization, depending on how you are interacting with the data in your Fabric notebook. Below are three common methods:
Spark Session:
%%spark
spark.conf.set('spark.sql.parquet.vorder.default', "true")
Spark SQL:
%%sql
CREATE TABLE person (id INT, name STRING, age INT)
USING parquet
TBLPROPERTIES("delta.parquet.vorder.default" = "true");
Fabric Lakehouse:
Navigate to your Lakehouse
Right-click → Select Maintenance
Run Optimize command to maximize reading speeds using V-Order → Run Now
Once the table property is set to true, all INSERT, UPDATE, and MERGE commands will follow V-Order optimization.
If V-Order is enabled in the Spark Session, all write commands will follow optimization, even if the table property is set to false.
To optimize existing tables using Spark SQL, you can use the OPTIMIZE command:
%%sql
OPTIMIZE <table|fileOrFolderPath> VORDER;
OPTIMIZE <table|fileOrFolderPath> WHERE <predicate> VORDER;
We also tested the effectiveness of V-Order across a range of Spark pool configurations to ensure it works well under different setups:
Starter Pool: Size: Medium; Node Family: Memory optimized; Nodes: Autoscale (1-10)
Small Pool: Size: Small; Node Family: Memory optimized; Nodes: 3
Medium Pool: Size: Medium; Node Family: Memory optimized; Nodes: 3
Large Pool: Size: Large; Node Family: Memory optimized; Nodes: 3
To evaluate V-Order's impact on write performance, storage optimization, and read efficiency, we followed these steps:
Dataset Ingestion: We first ingested the dataset into the table without V-Order, then repeated the process with V-Order enabled.
Query Execution: After preparing both tables, we executed various queries—basic aggregations, window functions, predicate filters, and group-by operations—to assess read efficiency.
Testing Across Spark Configurations: All tests were conducted across four different Spark pool configurations, with results visualized below.
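The per-run measurement in the steps above amounts to timing each query against the V-Order and non-V-Order tables. A minimal sketch of such a harness (`timed_run` is illustrative; in the actual experiments the callable would be a `spark.sql(...)` invocation, stubbed here with a plain aggregation):

```python
import time

def timed_run(run_query, label):
    """Execute a query callable and report wall-clock time, as measured
    once per query and per Spark pool configuration."""
    start = time.perf_counter()
    result = run_query()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.3f}s")
    return result, elapsed

# Stand-in workload; a real run would time a Spark query instead.
result, seconds = timed_run(lambda: sum(range(1_000_000)), "basic aggregation")
```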
Below is a summarized view of how V-Order influenced write, read, and storage metrics across the various Spark pools.
Write Performance
Write Time (seconds) across Spark pools
Storage Optimization
Storage Size (GB) across Spark pools
Read Efficiency
Query Types: Dataset Operations Overview 🔍
Here’s a brief overview of the query types used to evaluate the performance of V-Order on the dataset:
Basic Aggregations: Operations like Count, Average, and Sum on the fulltimeweeklyworkhours column to test aggregation efficiency.
Window Functions with Predicate: Lag and Lead functions partitioned by personnelnumber and lastmodifieddatetime, filtered by a predicate on personnelnumber and workcountrydesc.
Window Functions without Predicate: Sum and Percentage calculations over a window partitioned by workcountrydesc.
Group By Aggregations: Sum of fulltimeweeklyworkhours grouped by personnelnumber and workcountrydesc.
Average Analytical Query: A complex query using CTEs, window functions, aggregations, and joins to test overall performance.
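To make the query shapes concrete, here is a plain-Python sketch of the group-by aggregation (item 4) on a tiny hypothetical sample; in the experiments this ran as Spark SQL against the full ~900-million-row table:

```python
from collections import defaultdict

# Hypothetical sample rows mirroring the columns used for querying.
rows = [
    {"personnelnumber": "E001", "workcountrydesc": "Germany", "fulltimeweeklyworkhours": 40.0},
    {"personnelnumber": "E001", "workcountrydesc": "Germany", "fulltimeweeklyworkhours": 40.0},
    {"personnelnumber": "E002", "workcountrydesc": "France", "fulltimeweeklyworkhours": 35.0},
]

# Equivalent of: SELECT SUM(fulltimeweeklyworkhours)
#                GROUP BY personnelnumber, workcountrydesc
totals = defaultdict(float)
for row in rows:
    key = (row["personnelnumber"], row["workcountrydesc"])
    totals[key] += row["fulltimeweeklyworkhours"]

print(dict(totals))
```

Note how the duplicate records per personnelnumber (a deliberate property of the dataset) accumulate under the same group key.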
Read Time (seconds) and Data Scanned (MB)
Read Time (Seconds) for Average Analytical Query
Data Scanned (MB) for Various Query Types
Automatic Write Optimization: V-Order triggers a Delta optimization job during Spark writes, indexing and compressing data, significantly reducing storage size by leveraging sorting and compression.
Reduced Data Scanned: Queries on V-Order-enabled tables scan less data, improving query speed and efficiency.
Targeted Query Efficiency: V-Order boosts performance (10-40%) when only specific columns are queried but offers limited benefits for full-table scans.
Faster Execution: Enabling V-Order reduces execution times by ~30%, enhancing processing speed across various data operations.
Optimized Data Read: Compression and sorting reduce data read volume by up to 50%, speeding up queries and lowering resource use.
Fewer Spark Tasks: V-Order reduces the number of Spark executor tasks by about 50%, improving resource allocation and parallelism.
Based on our testing results, here are some recommendations on when to enable V-Order, particularly within a traditional medallion architecture. The decision should depend on your workload characteristics and data access patterns at each layer.
Consider enabling V-Order if your workload involves:
Frequent ad-hoc queries
Analytics
Scenarios where users access specific subsets of columns
| Zone | Delta | Purpose | Enable V-Order | Rationale |
|------|-------|---------|----------------|-----------|
| Bronze | No | Raw data from source system with full change history | No | Bronze tables are primarily written once and rarely read. Since read optimization is minimal, V-Order offers little value here. |
| Silver | Yes | Clean data – apply business logic | Maybe | Silver tables may experience a balanced mix of reads and writes. Enable V-Order if the table is read frequently or used in downstream aggregations; disable it for write-heavy operations. |
| Gold | Yes | Dimension and fact models for reporting and querying | Yes | Gold tables are heavily read for analytics and reporting. V-Order significantly improves query performance in this zone. |
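The per-zone guidance above can be captured as a small decision helper (a sketch only; `enable_vorder` and its `read_heavy` flag are illustrative, not a Fabric API):

```python
def enable_vorder(zone, read_heavy=False):
    """Return whether to enable V-Order for a medallion-architecture zone.
    `read_heavy` captures the Silver-layer judgment call: frequent reads
    or use in downstream aggregations."""
    zone = zone.lower()
    if zone == "bronze":
        return False       # written once, rarely read: little benefit
    if zone == "silver":
        return read_heavy  # enable only when reads dominate writes
    if zone == "gold":
        return True        # heavily read for analytics and reporting
    raise ValueError(f"unknown zone: {zone}")

print(enable_vorder("gold"))                       # True
print(enable_vorder("silver", read_heavy=False))   # False
```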
The benefits of V-Order are most noticeable in read-heavy environments, where full-table scans are not common. In these cases, V-Order can improve query speed and reduce resource usage.
The experiment demonstrates that V-Order is a powerful optimization for improving query performance in Spark environments. On average, V-Order resulted in:
~30% reduction in execution time
~50% less data read
~50% fewer executor tasks
V-Order excels in read-heavy, analytical workloads, especially when queries focus on specific columns. While results may vary based on dataset characteristics, the overall performance boost is consistent across most analytical scenarios.
Its impact is most notable in the silver and gold layers of a medallion architecture, where data is cleaned, modelled, and frequently queried. While less effective for full-table scans or write-heavy operations, V-Order can drive significant performance improvements and optimize resource use in the right scenarios.
Further reading and resources on V-Order and related topics can be found in the appendix and provided links.
@kranthimeda @RajeshDadi @rbhallamudi @gachoure @Ayush_Tiwari