Getting started

Data training

Get started with custom data training

This guide will take you through each step necessary to prepare and upload data to our systems.

1. Data requirements

Lucidtech offers an API that can extract key-information from your documents, whether you have invoices, receipts, ID-documents, Purchase Orders or practically any other type of documents. To make sure that our API provides optimal accuracy we train our models on your data. We use supervised learning for training our machine learning models. This means that the algorithms learn by observing thousands of examples of documents together with their ground truth. The goal of the training process is that Lucidtech’s models learn to produce the correct output for new and previously unseen documents.

When extracting training data, the following considerations should be made:

1.1 Volume

The amount of data needed to create a high quality API depends on the complexity of the API and your requirements. As a general rule of thumb we recommend you to provide at least 5000 documents as a start. When the API is running in production, the model will benefit from continuous learning from all new data.

1.2 Representative data

The training data should be representative for the expected data. This means that the training data should be similar to the expected data for in terms of similarity and variation. For example, if the expected data consists of invoices from thousands of different vendors, then the training data should not only consist of invoices from five different vendors.

1.3 Correctness of data

Incorrect or missing ground truth information can be detrimental to the training process. For this reason it is important that the training data is as accurate as possible.

1.4 Consistency

Ground truth data should adhere to a common format. For example, when extracting dates, all ground truth dates should be listed on the same format to prevent some dates from being written 17.05.18, while others are written 17th of May, 2018.

2. Preparing data for our systems

2.1 Decide what you want to extract

The first step is to decide which data fields you want to extract from your documents. For an invoice this can be total amount, due date and bank account, or it can also be only total amount. For an ID document it can be first name, last name, id-number and nationality. For a travel ticket it can be price, departure date, arrival date, seat number and mean of transportation. Which data fields you want to extract is up to you as a customer to decide, our general recommendation is to keep it as simple as possible. Do not add fields that you will not use, and make sure that the majority of the data you provide contain the fields you specify.

3. Every document needs a ground truth

To start training your custom model you need pairs of documents and their corresponding ground truth. The ground truth is the information you want to extract from the document. Note that every single document needs its own ground truth file.

3.1 Format of the ground truth

The ground truth needs to be a .json-file specified on the following format;


                  {
  "document":{
    "path": "<path/to/document.jpg>",
    "type": "receipt"
  },
  "labels":{
    "<your_first_field_name>":  "<ground_truth_value_for_this_documents_first_field>",
    "<your_second_field_name>":  "<ground_truth_value_for_this_documents_second_field>",
    "<your_third_field_name>":  "<ground_truth_value_for_this_documents_third_field>"
  }
}

If some of your documents does not contain a certain field, then omit that field in the "labels" section of your .json file. Below are some examples of documents and their ground truth .json files.

Examples of receipts with corresponding ground truth values.


            {
  "document":{
    "path": "lucidcab001.jpeg",
    "type": "receipt"
  },
  "labels":{
    "category": "taxi",
    "currency": "EUR",
    "date": "2019-12-31",
    "total_amount": "43.90"
  }
}


            {
  "document":{
    "path": "receipt-cafe.pdf",
    "type": "receipt"
  },
  "labels":{
    "category": "food",
    "currency": "USD",
    "date": "2020-03-02",
    "total_amount": "3.60"
  }
}


            {
  "document":{
    "path": "bigbenbar.png",
    "type": "receipt"
  },
  "labels":{
    "category": "refreshments",
    "currency": "GBP",
    "date": "2020-16-05",
    "total_amount": "18.80"
  }
}

Examples of bill of lading documents with corresponding ground truth values.


            {
  "document":{
    "path": "logistics01.pdf",
    "type": "bill-of-lading"
  },
  "labels":{
    "lading_number": "s00158002",
    "point_of_loading": "Chicago, United States",
    "place_of_delivery": "Sydney, Australia",
    "date": "2015-06-23",
    "carrier": "Lucid Logistics",
    "weight": "20000"
  }
}


            {
  "document":{
    "path": "scan-04.jpeg",
    "type": "bill-of-lading"
  },
  "labels":{
    "lading_number": "3411000",
    "point_of_loading": "St. Louis, United States",
    "place_of_delivery": "Montgomery, United States",
    "date": "2016-02-25",
    "carrier": "ABF FREIGHT LINES",
    "weight": "2900"
  }
}

3.2 File structure

📁 training-data

invoice-43492.json
invoice-43492.jpg
invoice-example-243.json
invoice-example-243.pdf
98-example-invoice.json
98-example-invoice.png

4. Checklist before uploading your data

Every ground truth file contains a path to a unique document.
Every ground truth file contains all the fields you are interested in. Use empty string if a particular document does not contain a field.
All ground truths follow a consistent formatting.

Log in to our platform to start data training

Use our Data Training Platform for safe and compliant data transfer.

Go to platform