Batch Data Cleansing API 'How To' Guide

This guide demonstrates the basic process of using our Batch Data Cleansing APIs.

A simple cleansing process might look something like:

To run a job through the APIs, our Production team must first create a workflow that meets your needs and cleanses the data in the way most suitable for you.

Each workflow is tailored to a client, so this setup step is required when first using our APIs or when creating a new type of job.


The full documentation for any of the Batch Data Cleansing API endpoints can be found here:

Create a Dataset

Each job requires an input dataset to be supplied in the API call, containing the data to cleanse. These datasets must contain at least the required columns, but can also contain additional data. The minimum requirements of the dataset will be made clear when the workflow is initially set up for you by the Production team.

Each dataset must have a unique name. The name must start with a character a-z and can only include the characters a-z, 0-9 or _.
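The naming rule can be checked on the client side before any request is made. The following is a minimal sketch rather than part of the API: the function name is_valid_dataset_name is hypothetical, and it assumes that only lowercase letters are accepted, since the rule above lists a-z.

```python
import re

# Pattern derived from the naming rule above: the first character must be
# a-z, and every following character must be a-z, 0-9 or an underscore.
# Assumption: uppercase letters are rejected, as the rule only lists a-z.
DATASET_NAME_PATTERN = re.compile(r"[a-z][a-z0-9_]*")


def is_valid_dataset_name(name: str) -> bool:
    """Return True if `name` satisfies the stated dataset naming rule."""
    return DATASET_NAME_PATTERN.fullmatch(name) is not None
```

A name such as my_output_dataset_001 passes this check, while names that start with a digit or contain a hyphen do not. Uniqueness can only be confirmed by the API itself.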

To create a dataset, make a POST request to https://api.data-8.co.uk/Dataset. When the dataset is created it is marked as incomplete.

Add data to the dataset using the PATCH /Dataset/{name}/data endpoint.

Once all data has been added, use the PUT /Dataset/{name} endpoint to mark the dataset as complete before using it as input to a job.
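The three calls above (create, add data, mark as complete) can be sketched as a sequence of HTTP requests. This example only builds the requests and does not send them. The request bodies are supplied by the caller because their structure is not described in this guide; sending them as JSON is an assumption, authentication is omitted, and the helper names build_request and dataset_requests are hypothetical. Refer to the full endpoint documentation for the real request format.

```python
import json
import urllib.request

BASE_URL = "https://api.data-8.co.uk"


def build_request(method, path, payload=None):
    """Build (but do not send) a request to the API.

    Assumption: bodies are JSON. Authentication is deliberately left out,
    because this guide does not describe it.
    """
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return urllib.request.Request(
        BASE_URL + path, data=data, headers=headers, method=method
    )


def dataset_requests(name, create_body, data_body, complete_body=None):
    """Return the three requests, in order, that create and complete a dataset.

    The body arguments are placeholders whose structure must be taken from
    the endpoint documentation.
    """
    return [
        # 1. Create the dataset; it starts out marked as incomplete.
        build_request("POST", "/Dataset", create_body),
        # 2. Add data to the dataset (repeat this call as often as needed).
        build_request("PATCH", f"/Dataset/{name}/data", data_body),
        # 3. Mark the dataset as complete so it can be used as job input.
        build_request("PUT", f"/Dataset/{name}", complete_body),
    ]
```

Each returned request could then be sent with urllib.request.urlopen once the authentication details and request bodies agreed for your account have been added.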


Other Endpoints:

Starting a Job

The details to be passed to this endpoint will vary depending on how your workflow has been configured. Full details of what each workflow is expecting in terms of input files, datasets and parameters will be agreed with you by the Data8 Production Team.

Start a job by making a POST request to https://api.data-8.co.uk/Job, passing in the job name, workflow name, input filename, input dataset(s) and any necessary parameters.

Datasets must be marked as complete before starting a job!

Once the job has been submitted it can be monitored by polling the GET /Job/{name} endpoint.
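Polling can be sketched as a simple loop with a fixed delay and an upper limit on attempts. This guide does not describe the response returned by GET /Job/{name}, so the sketch takes two caller-supplied functions: fetch_job, which performs the GET request and returns the parsed response, and is_finished, which decides whether that response means the job has ended. Both names, the 30 second interval and the attempt limit are assumptions chosen for illustration.

```python
import time


def wait_for_job(fetch_job, is_finished, interval_seconds=30.0,
                 max_attempts=120, sleep=time.sleep):
    """Poll until `is_finished` accepts the job response, then return it.

    fetch_job:   callable performing GET /Job/{name} and returning the response.
    is_finished: callable returning True once the response shows the job is done.
    Raises TimeoutError if the job has not finished after `max_attempts` polls.
    """
    for attempt in range(max_attempts):
        job = fetch_job()
        if is_finished(job):
            return job
        # Do not wait after the final attempt; fail straight away instead.
        if attempt < max_attempts - 1:
            sleep(interval_seconds)
    raise TimeoutError(f"Job not finished after {max_attempts} polls")
```

The sleep argument exists so the delay can be replaced in tests. A longer interval may be more appropriate for large jobs.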


Other Endpoints:

Retrieve the Output Dataset

To retrieve the output dataset from the job, make a GET request to the /Dataset/{name}/data endpoint, passing the dataset name, the starting record number (record numbering starts at 1) and the number of records to download. Records are downloaded in chunks of up to 1,000.

The dataset given by the name parameter must be a completed output dataset. If the dataset is an input dataset, or if it is incomplete, a 400 error is returned.

For example, make a GET request to https://api.data-8.co.uk/Dataset/my_output_dataset_001/data, passing the name parameter as 'my_output_dataset_001', the start parameter as 1 and the count parameter as 1000.
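Because each request returns at most 1,000 records, a larger output dataset has to be downloaded in several requests. The following sketch works out the start and count values for each request and builds the matching URL for the /Dataset/{name}/data endpoint. It assumes that start and count are passed as query string parameters and that the total number of records is already known; the function names chunk_ranges and data_url are hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.data-8.co.uk"
MAX_CHUNK = 1000  # records per request, as stated above


def chunk_ranges(total_records, chunk_size=MAX_CHUNK):
    """Return (start, count) pairs covering records 1..total_records."""
    if not 1 <= chunk_size <= MAX_CHUNK:
        raise ValueError(f"chunk_size must be between 1 and {MAX_CHUNK}")
    ranges = []
    start = 1  # record numbering starts at 1
    while start <= total_records:
        count = min(chunk_size, total_records - start + 1)
        ranges.append((start, count))
        start += count
    return ranges


def data_url(name, start, count):
    """Build the GET URL for one chunk of a dataset.

    Assumption: start and count are sent as query string parameters.
    """
    query = urlencode({"start": start, "count": count})
    return f"{BASE_URL}/Dataset/{name}/data?{query}"
```

For a dataset of 2,500 records, chunk_ranges(2500) yields three requests: records 1 to 1,000, 1,001 to 2,000, and 2,001 to 2,500.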