innocence84
Helper I

How to orchestrate streaming pipelines

Hi all, 

 

I have a notebook that reads data from Azure Event Hubs and writes it into the bronze layer using Spark Structured Streaming. We install libraries via pip, so we created an environment that includes these Python libraries.

 

 

Then I will have a notebook that ingests data from the bronze layer into the silver layer, cleansing it along the way.

 

Then I will have a notebook that ingests data from silver to gold.

The client requires minimum latency.
So should I run the first notebook continuously? How can I schedule it to run continuously, so I don't wait for cluster warm-up and can ingest data as soon as it arrives at the event hub?

1 ACCEPTED SOLUTION
v-lgarikapat
Community Support

Hi @innocence84 ,
Great follow-up questions!
Is clicking Run All enough to keep the notebook session alive?
Clicking "Run All" starts the notebook execution, but it does not guarantee that the session will stay alive indefinitely. Spark Structured Streaming requires a continuously active session to keep ingesting data.

To keep the session alive:
  • Do not close the notebook or browser tab.
  • Make sure your Microsoft Fabric capacity is set to stay on (not auto-pause), so the Spark session doesn't terminate due to inactivity or capacity shutdown.
  • Alternatively, use a Fabric Pipeline with a trigger to re-launch the notebook automatically, although this may cause some latency during cluster spin-up unless the capacity is always-on.
So, while "Run All" starts the job, the session must be kept open and the Spark cluster active for it to keep running as a true long-running job.
What is the always-on execution model?
The always-on execution model means the compute resources (Spark capacity) are continuously running, so notebooks and streaming jobs can execute without delays caused by cluster spin-up or cold starts.
In Microsoft Fabric, this is typically achieved by:
  • Keeping the capacity always-on in the Fabric settings (i.e., preventing auto-pause).
  • Using Fabric Pipelines to orchestrate jobs in a way that aligns with this model (e.g., triggering notebooks as soon as new data arrives, or at regular intervals, without waiting for cluster startup).
This model is crucial for low-latency streaming scenarios, where data must be processed immediately, without downtime or lag from cluster initialization.
Ingest, filter, and transform real-time events and send them to a Microsoft Fabric lakehouse - Micro...

If this post helped resolve your issue, please consider giving it Kudos and marking it as the Accepted Solution. This not only acknowledges the support provided but also helps other community members find relevant solutions more easily.
We appreciate your engagement and thank you for being an active part of the community.

Best regards,
LakshmiNarayana.


11 REPLIES
v-lgarikapat
Community Support

Hi @innocence84 ,

If your question has been answered, kindly mark the appropriate response as the Accepted Solution. This small step goes a long way in helping others with similar issues.

We appreciate your collaboration and support!

Best regards,
LakshmiNarayana

Hi @innocence84 ,

If your issue has been resolved, please mark the most helpful reply as the Accepted Solution to close the thread. This helps ensure the discussion remains useful for other community members.

Thank you for your attention, and we look forward to your confirmation.

Best regards,
LakshmiNarayana

Hi @innocence84 ,

 If your issue has been resolved, please consider marking the most helpful reply as the accepted solution. This helps other community members who may encounter the same issue to find answers more efficiently.

If you're still facing challenges, feel free to let us know—we’ll be glad to assist you further.

Looking forward to your response.

Best regards,
LakshmiNarayana.

Hi @v-lgarikapat 

I saw your reply, quoted below. I don't think these points are valid; can you share the documentation URL? There is a default session timeout of 20 minutes, and we can change it, but the session will still stop after the idle time. There is no way to keep the session alive, so I'd request that you share the documentation URL.

To keep the session alive:

  • Do not close the notebook or browser tab.
  • Make sure your Microsoft Fabric capacity is set to stay on (not auto-pause), so the Spark session doesn't terminate due to inactivity or capacity shutdown.
  • Alternatively, use a Fabric Pipeline with a trigger to re-launch the notebook automatically, although this may cause some latency during cluster spin-up unless the capacity is always-on.

 

Regards,

Srisakthi

v-lgarikapat
Community Support

Hi @innocence84 ,

Great question!
To run Notebook 1 as a long-running Spark Structured Streaming job in Microsoft Fabric, follow these steps:
Where to run Notebook 1:
  • Run it inside the Microsoft Fabric Lakehouse environment, specifically in a Lakehouse notebook with the Spark runtime.
  • Fabric supports Spark-based notebooks for streaming workloads using Delta Lake and Event Hubs integration.
How to run it as a long-running job:
1. Create a Lakehouse notebook:
  • In Microsoft Fabric, go to your Lakehouse.
  • Create or open Notebook 1.
  • Write your Structured Streaming code to read from Azure Event Hubs and write to Delta Lake (bronze layer).
2. Use the Spark Structured Streaming API in PySpark. Example:
# The Event Hubs Spark connector expects the connection string to be
# encrypted; `sc` is the predefined SparkContext.
conn = sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt("<your_connection_string>")

df = (
    spark.readStream
    .format("eventhubs")
    .option("eventhubs.connectionString", conn)
    .load()
)

query = (
    df.writeStream
    .format("delta")
    .option("checkpointLocation", "Tables/bronze/_checkpoints/")
    .start("Tables/bronze/")
)

query.awaitTermination()  # block so the streaming job keeps running
3. Set the notebook to run continuously:
  • When starting the notebook, choose "Run All" to execute the code.
  • Keep the notebook session alive (do not rely on scheduled runs).
  • Alternatively, run this notebook from a Microsoft Fabric Pipeline with a trigger, but make sure cluster warm-up is minimized (e.g., keep the capacity always-on).
4. Use Fabric Pipelines for orchestration (optional):
  • You can orchestrate Notebook 1 using Fabric Pipelines with a trigger-based or always-on execution model.
  • This ensures the job runs without manual intervention and maintains a low-latency ingestion pipeline.


If this post helped resolve your issue, please consider giving it Kudos and marking it as the Accepted Solution. This not only acknowledges the support provided but also helps other community members find relevant solutions more easily.

We appreciate your engagement and thank you for being an active part of the community.

Best Regards,
Lakshmi Narayana.

thank you,

 

For solution 1, is clicking Run All enough to keep the notebook session alive?

 

For solution 2, what is the always-on execution model?

v-lgarikapat
Community Support

Hi,
Thank you for reaching out to the Microsoft Community Forum.
To meet your client’s requirement for minimum latency, consider the following implementation strategy:
1. Run Notebook 1 as a long-running Spark Structured Streaming job:
  • This notebook should read from Azure Event Hubs and write to the bronze layer.
  • Running it continuously ensures that data is ingested immediately as it arrives.
  • Avoid scheduling it periodically, as that causes cold-start latency due to cluster warm-up.
2. Ensure Python dependencies are pre-installed:
  • Since you're using pip to install libraries, configure your Fabric runtime environment to include these packages ahead of time.
  • This eliminates delays caused by dependency installation during job startup.
3. Keep the Spark session active:
  • Run the streaming notebook as a persistent job in the Fabric Lakehouse environment.
  • Alternatively, use Microsoft Fabric Pipelines to trigger the notebook without delay when needed.
4. Configure Notebooks 2 and 3 for downstream processing:
  • Notebook 2 (Bronze → Silver) and Notebook 3 (Silver → Gold) can run as streaming jobs, or be scheduled at short intervals using micro-batching, depending on the transformation logic.
5. Use Delta format with checkpointing:
  • Store all data in Delta Lake format for performance, versioning, and streaming support.
  • Implement checkpointing for each streaming job to ensure fault tolerance and recovery.
6. Orchestrate the pipeline using Microsoft Fabric Pipelines:
  • Define dependencies, execution order, and triggers for all three notebooks.
  • Ensure the Fabric capacity stays active during streaming operations.
7. Optionally integrate Data Activator:
  • Use Data Activator to trigger downstream notebooks based on specific data conditions or arrival events.

Streaming data into lakehouse - Microsoft Fabric | Microsoft Learn
If my response has resolved your query, please mark it as the 'Accepted Solution' to assist others. Additionally, a 'Kudos' would be appreciated if you found my response helpful.

Thank you
LakshmiNarayana

How and where can I run Notebook 1 as a long-running Spark Structured Streaming job? 

Srisakthi
Responsive Resident

Hi @innocence84 ,

 

I would like to understand the requirement more. Do you want to achieve everything only with Notebooks?

 

MS Fabric has Real-Time Intelligence, where you can ingest eventstream data, transform it, and store it. I would suggest taking a look at it; it might suit your case.

Also consider whether this flow suits your needs:

1. Set up MS Fabric real-time streaming (you can also specify latency), apply the available transformations, and store the data in the bronze layer (lakehouse).

2. Use notebooks to transform the data into further layers like silver and gold.

Note: There will be cost implications, as the eventstream will poll continuously.

Regards,

Srisakthi

 

If this answer solves your question, please mark it as "Accepted Solution".

We have complex needs like schema validation, etc., so we need to use Spark streaming.
