
Azure Synapse Data Ingestion Using Pipelines (Azure Platform Compare Series Part 4)

Hello fellow Data Enthusiasts and welcome to my Blog,

today we’re checking out how to ingest data into Azure Synapse with Pipelines and the Copy Data Task.

Before you dive into this post, please make sure you know how dynamic linked services work, since we'll use them here.

Integration Dataset

If you've created a dynamic linked service, you can get started with an Integration Dataset. An Integration Dataset is a logical reference to your data. Probably the best feature here is that they can work schemaless, so you don't need a new Integration Dataset for each source you want to extract.

In order to create one, select the Data tab and then change the view to ‘Linked’. Click the + sign and select ‘Integration Dataset’.

This will open a dialogue which lets you choose from all the different sources. In my case I want to ingest some files which are stored in a Data Lake.

The Azure Data Lake Storage Gen2 connector has two steps, where you first select the actual connector and afterwards select which kind of files you want to read/write.

Here is an example of what the CSV connection looks like. Just type a descriptive name and select the linked service, then you're ready to go.

After creation, start creating parameters, since we want to use the same dataset for all CSV files that are available to the selected runtime.

Now it's possible to add dynamic content to the file path and to the parameter of the dynamic linked service.
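
For reference, a parameterized CSV Integration Dataset could look roughly like this in JSON. The dataset name, the parameter names and the linked service parameter are just the ones from my setup, so adapt them to your environment:

{
    "name": "DS_CSV_Generic",
    "type": "Microsoft.Synapse/workspaces/datasets",
    "properties": {
        "linkedServiceName": {
            "referenceName": "AzureDataLakeStorage",
            "type": "LinkedServiceReference",
            "parameters": {
                "P_StorageAccountURLSecretName": {
                    "value": "@dataset().P_StorageAccountURLSecretName",
                    "type": "Expression"
                }
            }
        },
        "parameters": {
            "P_StorageAccountURLSecretName": { "type": "string" },
            "P_Container": { "type": "string" },
            "P_FilePath": { "type": "string" },
            "P_FileName": { "type": "string" }
        },
        "type": "DelimitedText",
        "typeProperties": {
            "location": {
                "type": "AzureBlobFSLocation",
                "fileSystem": {
                    "value": "@dataset().P_Container",
                    "type": "Expression"
                },
                "folderPath": {
                    "value": "@dataset().P_FilePath",
                    "type": "Expression"
                },
                "fileName": {
                    "value": "@dataset().P_FileName",
                    "type": "Expression"
                }
            },
            "columnDelimiter": ",",
            "firstRowAsHeader": true
        },
        "schema": []
    }
}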

Do the same steps for a parquet Integration Dataset. I usually try to store the bronze layer in parquet format, since most ETL tools can work with it. Parquet also supports data types, which is nice for sources that also have some kind of metadata.

Pipeline

Pipelines in Azure Synapse are containers for simple logical tasks that need to be executed in a certain order. On pipeline level all the tasks are called Activities. I recommend creating a folder structure for the different layers. To create a Pipeline, simply click on the Integrate symbol and afterwards on the + for the Pipeline option.

In order to process all files in a certain folder in the Data Lake, we have to start with getting all the filenames. The correct activity to do so is called 'Get Metadata'.

Once you've created a Get Metadata activity, it needs to be connected to an Integration Dataset. All the parameters that were created in the Integration Dataset before are shown here, so you can provide values for them. The Get Metadata activity has a mandatory option called 'Field List'.

In order to get the list of all child items (aka files/folders), select 'Child Items' in the Field List. Downstream activities can then access that list with this expression: @activity('GM_All_Files').output.childItems
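
A minimal Get Metadata activity in the pipeline JSON could look something like this. The activity name, dataset name and parameter values are just examples from my setup; leave the filename empty so the dataset points at the folder:

{
    "name": "GM_All_Files",
    "type": "GetMetadata",
    "typeProperties": {
        "dataset": {
            "referenceName": "DS_CSV_Generic",
            "type": "DatasetReference",
            "parameters": {
                "P_StorageAccountURLSecretName": "StorageAccount1URL",
                "P_Container": "raw",
                "P_FilePath": "sales",
                "P_FileName": ""
            }
        },
        "fieldList": [
            "childItems"
        ]
    }
}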

Now you can create a Filter activity. Point its Items setting at the childItems list from the Get Metadata activity and set a Condition that filters, for example, for a certain file extension like this: @endswith(item().name,'.csv')
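
In pipeline JSON such a Filter activity could look roughly like this (the activity names are the ones used in my expressions):

{
    "name": "FI_All_CSV_Files",
    "type": "Filter",
    "dependsOn": [
        {
            "activity": "GM_All_Files",
            "dependencyConditions": [
                "Succeeded"
            ]
        }
    ],
    "typeProperties": {
        "items": {
            "value": "@activity('GM_All_Files').output.childItems",
            "type": "Expression"
        },
        "condition": {
            "value": "@endswith(item().name,'.csv')",
            "type": "Expression"
        }
    }
}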

To iterate over all files, add a ForEach activity and configure it to loop over @activity('FI_All_CSV_Files').output.Value.

Within this loop it's possible to add multiple activities. In our case we want to use a Copy Data activity where we define the filename as the name of the current item. You can refer to the current item like this: @item(). Afterwards you have access to all its properties; in our case we want the name property.

As sink we use the parquet Integration Dataset. I usually recommend using the same name as the source file and just replacing the file extension. A sketch of the whole loop follows below.
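
Put together, the ForEach activity with the Copy Data activity inside could look something like this. The names are just examples, and the remaining dataset parameters as well as the source/sink settings are omitted for brevity, so treat it as a sketch rather than a copy-paste solution:

{
    "name": "FE_All_CSV_Files",
    "type": "ForEach",
    "dependsOn": [
        {
            "activity": "FI_All_CSV_Files",
            "dependencyConditions": [
                "Succeeded"
            ]
        }
    ],
    "typeProperties": {
        "items": {
            "value": "@activity('FI_All_CSV_Files').output.Value",
            "type": "Expression"
        },
        "activities": [
            {
                "name": "CD_CSV_To_Parquet",
                "type": "Copy",
                "typeProperties": {
                    "source": {
                        "type": "DelimitedTextSource"
                    },
                    "sink": {
                        "type": "ParquetSink"
                    }
                },
                "inputs": [
                    {
                        "referenceName": "DS_CSV_Generic",
                        "type": "DatasetReference",
                        "parameters": {
                            "P_FileName": {
                                "value": "@item().name",
                                "type": "Expression"
                            }
                        }
                    }
                ],
                "outputs": [
                    {
                        "referenceName": "DS_Parquet_Generic",
                        "type": "DatasetReference",
                        "parameters": {
                            "P_FileName": {
                                "value": "@replace(item().name, '.csv', '.parquet')",
                                "type": "Expression"
                            }
                        }
                    }
                ]
            }
        ]
    }
}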

I hope this little guide gave you a short introduction to Synapse Pipelines and how to ingest data into a data platform with them.

Dynamic Linked Services in Azure Data Factory (ADF) and Synapse Analytics

Hello fellow Data Enthusiasts and welcome to my Blog,

recently I've had a client with about 50 linked services, all of which were pretty similar. A Linked Service in ADF and Synapse serves as a connection configuration to external data sources or services. It includes details like endpoint URLs, connection strings, and authentication credentials. Linked Services support various data stores and platforms, such as databases, file systems, and more.

I’m excited to share the improvements I’ve made to the Linked Services, making them more user-friendly. These enhancements not only simplify the process for me but will also greatly benefit your CI/CD workflows.

First of all, some things that I think are barely documented:

  1. Every Linked Service supports parameters (some also show them in the GUI, but even when they aren't displayed, they are still supported)
  2. Linked Services allow you to retrieve information from the Key Vault

If you haven’t set up a Key Vault yet, you’ll need to create one and assign the ‘Key Vault Secrets User’ RBAC role to the system-assigned managed identity of the Synapse Workspace or Azure Data Factory.

Then create a Linked Service to the Key Vault.
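
The Key Vault Linked Service itself is pretty simple. It could look something like this (replace the placeholder with the name of your Key Vault):

{
    "name": "AzureKeyVault",
    "type": "Microsoft.Synapse/workspaces/linkedservices",
    "properties": {
        "annotations": [],
        "type": "AzureKeyVault",
        "typeProperties": {
            "baseUrl": "https://<your-key-vault-name>.vault.azure.net/"
        }
    }
}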

Afterwards create a Linked Service for each type of source system (e.g. Storage Account, SQL Server, Oracle, SAP) via the GUI, which for example looks something like this:

Check the JSON Code of the Linked Service which will look like this:

{
    "name": "AzureDataLakeStorage",
    "type": "Microsoft.Synapse/workspaces/linkedservices",
    "properties": {
        "annotations": [],
        "type": "AzureBlobFS",
        "typeProperties": {
            "url": "https://st0linkedservice.dfs.core.windows.net/"
        },
        "connectVia": {
            "referenceName": "AutoResolveIntegrationRuntime",
            "type": "IntegrationRuntimeReference"
        }
    }
}

and add the "parameters" property and create as many parameters as you need (since the Storage Account Linked Service has only one relevant property, only one parameter is needed):

{
    "name": "AzureDataLakeStorage",
    "type": "Microsoft.Synapse/workspaces/linkedservices",
    "properties": {
        "parameters": {
            "P_StorageAccountURLSecretName": {
                "type": "string"
            }
        },
        "annotations": [],
        "type": "AzureBlobFS",
        "typeProperties": {
            "url": "https://st0linkedservice.dfs.core.windows.net/"
        },
        "connectVia": {
            "referenceName": "AutoResolveIntegrationRuntime",
            "type": "IntegrationRuntimeReference"
        }
    }
}

Now the trick is to set a URL based on a Key Vault secret. To do so, create a Key Vault secret which contains the URL (StorageAccount1URL = https://st0linkedservice.dfs.core.windows.net/) and create another secret with the URL for each storage account you want to connect to (StorageAccount2URL = https://st0linkedservice2.dfs.core.windows.net/).

Go back to the Linked Service JSON Code and insert this snippet in each place where the content differs depending on the source system:

{
    "type": "AzureKeyVaultSecret",
    "store": {
        "referenceName": "AzureKeyVault",
        "type": "LinkedServiceReference"
    },
    "secretName": "@{linkedService().<ParameterName>}"
}

So in my example it looks like this:

{
    "name": "AzureDataLakeStorage",
    "type": "Microsoft.Synapse/workspaces/linkedservices",
    "properties": {
        "parameters": {
            "P_StorageAccountURLSecretName": {
                "type": "string"
            }
        },
        "annotations": [],
        "type": "AzureBlobFS",
        "typeProperties": {
            "url": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "AzureKeyVault",
                    "type": "LinkedServiceReference"
                },
                "secretName": "@{linkedService().P_StorageAccountURLSecretName}"
            }
        },
        "connectVia": {
            "referenceName": "AutoResolveIntegrationRuntime",
            "type": "IntegrationRuntimeReference"
        }
    }
}

Now you have the ability to design Datasets, Pipelines and Mapping Dataflows that intelligently adapt to different systems based on specific rules.
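
For example, an Integration Dataset can simply pass the secret name to the dynamic Linked Service and thereby decide which storage account it connects to. A snippet from such a dataset could look roughly like this (the parameter value is the secret name from above):

"linkedServiceName": {
    "referenceName": "AzureDataLakeStorage",
    "type": "LinkedServiceReference",
    "parameters": {
        "P_StorageAccountURLSecretName": "StorageAccount2URL"
    }
}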

For a seamless CI/CD experience, make sure you’ve got the same Key Vault Secrets set up in each environment, each with its unique content. This ensures smooth coordination, especially when dependent systems operate in different environments.

This post should introduce you to the basic concept of dynamic linked services in Azure Data Factory (ADF) and Azure Synapse Analytics. I hope it will help you to drastically reduce the number of linked services and integration datasets in your future projects.
