
Batch Data Cleansing API 'How To' Guide

This guide demonstrates the basic process of using our Batch Data Cleansing APIs.

A simple cleansing process looks something like this: create and populate an input dataset, start a job that runs your workflow against it, then retrieve the output dataset.

To run a job through the APIs, the workflow must first be created by our Production team to meet your needs and cleanse your data in the way that best suits you.

Each workflow is tailored to a client, so this setup step is required when you first use our APIs, or when creating a new type of job.


The full documentation for each of the Batch Data Cleansing API endpoints can be found in the online API reference.

Authenticate With Client Credentials

All requests must be authenticated using an Authorization: Bearer header. The bearer token is obtained from the Data8 OAuth token server at https://auth.data-8.co.uk/connect/token.

To generate an authentication token, you will need to provide your client ID and client secret, which will be sent to you securely after the Production team has set up a workflow for you.

An example of how the authentication request may look:

// Requires the IdentityModel NuGet package for the discovery and token helpers.
using IdentityModel.Client;

public static async Task<TokenResponse> AuthenticateUserDetails(HttpClient client, string authUrl, string clientId, string clientSecret)
{
    // Read the OAuth server's discovery document (e.g. https://auth.data-8.co.uk/)
    // to find the token endpoint.
    var disco = await client.GetDiscoveryDocumentAsync(authUrl);

    if (disco.IsError)
        throw new ApplicationException(disco.Error);

    // ClientId and ClientSecret will be sent to you along with the details of your workflow.
    var token = await client.RequestClientCredentialsTokenAsync(new ClientCredentialsTokenRequest
    {
        Address = disco.TokenEndpoint,
        ClientId = clientId,
        ClientSecret = clientSecret,
        Scope = "BatchApi"
    });

    if (token.IsError)
        throw new ApplicationException(token.Error);

    return token;
}
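
Once authenticated, attach the returned access token to your HttpClient so that every subsequent API call carries the Authorization: Bearer header. A minimal sketch, assuming the AuthenticateUserDetails method above and IdentityModel's SetBearerToken helper:

// Acquire a token and use it as the bearer token for all later requests.
var token = await AuthenticateUserDetails(client, "https://auth.data-8.co.uk/", clientId, clientSecret);
client.SetBearerToken(token.AccessToken);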

Create a Dataset

Each job requires an input dataset to be supplied in the API call, containing the data to cleanse. These datasets must contain at least the required columns, but can also contain additional data. The minimum requirements of the dataset will be made clear when the workflow is initially set up for you by the Production team.

Each dataset must have a unique name. The name must start with a letter (a-z) and can only include the characters a-z, 0-9 or _.

To create a dataset, make a POST request to https://api.data-8.co.uk/Dataset. When the dataset is created it is marked as incomplete.

Add data to the dataset using the PATCH /Dataset/{name}/data endpoint.

Once all data has been added, use the PUT /Dataset/{name} endpoint to mark the dataset as complete before using it as input to a job.
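
Putting the three calls together, the dataset lifecycle might look like the sketch below, assuming the authenticated HttpClient from above. The request bodies are illustrative assumptions only; the real schema and required columns for your dataset are agreed with the Production team when your workflow is set up.

using System.Net.Http.Json;

var baseUrl = "https://api.data-8.co.uk";
var datasetName = "my_input_dataset_001";

// 1. Create the dataset; it starts in the incomplete state.
//    The body shape here is an assumption for illustration.
var create = await client.PostAsJsonAsync($"{baseUrl}/Dataset", new { name = datasetName });
create.EnsureSuccessStatusCode();

// 2. Add records with one or more PATCH calls. The record shape is an
//    assumption; your workflow defines the required columns.
var records = new[] { new { Name = "A Smith", Postcode = "CH1 1AA" } };
var patch = await client.PatchAsync($"{baseUrl}/Dataset/{datasetName}/data", JsonContent.Create(records));
patch.EnsureSuccessStatusCode();

// 3. Mark the dataset as complete so it can be used as a job input.
var complete = await client.PutAsync($"{baseUrl}/Dataset/{datasetName}", null);
complete.EnsureSuccessStatusCode();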



Starting a Job

The details to be passed to this endpoint will vary depending on how your workflow has been configured. Full details of what each workflow is expecting in terms of input files, datasets and parameters will be agreed with you by the Data8 Production Team.

Start a job by making a POST request to https://api.data-8.co.uk/Job, passing in the job name, workflow name, input filename, input dataset(s) and any necessary parameters.

Datasets must be marked as complete before starting a job!

Once the job has been submitted it can be monitored by polling the GET /Job/{name} endpoint.
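
A rough sketch of starting a job and polling it to completion is shown below, again assuming the authenticated HttpClient from above. The JSON field names and the status values are assumptions for illustration; the real inputs and parameters for your workflow are agreed with the Data8 Production Team.

using System.Net.Http.Json;
using System.Text.Json;

var baseUrl = "https://api.data-8.co.uk";

// Start the job. Field names here are illustrative assumptions.
var start = await client.PostAsJsonAsync($"{baseUrl}/Job", new
{
    name = "my_job_001",
    workflow = "my_workflow",
    datasets = new[] { "my_input_dataset_001" }
});
start.EnsureSuccessStatusCode();

// Poll GET /Job/{name} until the job reaches a terminal status.
// The "status" property and its values are assumptions.
while (true)
{
    var job = await client.GetFromJsonAsync<JsonElement>($"{baseUrl}/Job/my_job_001");
    var status = job.GetProperty("status").GetString();

    if (status == "Complete" || status == "Failed")
        break;

    await Task.Delay(TimeSpan.FromSeconds(30));
}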



Retrieve The Output Dataset

To retrieve the output dataset from the job, make a GET request to the /Dataset/{name}/data endpoint, passing the dataset name, the starting record number, and the number of records to download. Records are downloaded in chunks of up to 1,000, and record numbering starts at 1.

The dataset given by the name parameter must be a completed output dataset. If the dataset is an input dataset, or if it is incomplete, a 400 error is returned.

For example, to download the first 1,000 records of a dataset named my_output_dataset_001, make a GET request to https://api.data-8.co.uk/Dataset/my_output_dataset_001/data with the start parameter set to 1 and the count parameter set to 1000.
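
A loop that pages through a larger output dataset 1,000 records at a time might look like the sketch below; the assumption that records come back as a JSON array is for illustration only.

using System.Net.Http.Json;
using System.Text.Json;

var baseUrl = "https://api.data-8.co.uk";
var allRecords = new List<JsonElement>();
var startRecord = 1;
const int count = 1000;

while (true)
{
    // start and count are query parameters; record numbering starts at 1.
    var chunk = await client.GetFromJsonAsync<JsonElement[]>(
        $"{baseUrl}/Dataset/my_output_dataset_001/data?start={startRecord}&count={count}");

    if (chunk == null || chunk.Length == 0)
        break;

    allRecords.AddRange(chunk);

    if (chunk.Length < count)
        break; // final partial chunk

    startRecord += count;
}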
