Datasets FAQ

Frequently Asked Questions about datasets

What are the supported date formats?

When uploading a dataset to Lang, we support the following date formats:

// ISO-8601 and RFC3339
2006-01-02T15:04:05-0700
2006-01-02 15:04:05.52Z

// MDY with hyphen and slash
01-02-2006
01/02/2006

// YMD
2006/01/02
2006-01-02
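
To check a date column before uploading, the patterns above can be approximated with Python's `datetime.strptime`. This is only a minimal sketch: the format strings are our own translation of the listed examples, the helper names `parse_listed_date` and `find_unparsed` are hypothetical, and the platform's actual parser may accept or reject variants this check does not. Parsing the trailing `Z` with `%z` assumes Python 3.7 or later.

```python
from datetime import datetime

# Assumed strptime equivalents of the listed examples
# (not the platform's own parser).
CANDIDATE_FORMATS = [
    "%Y-%m-%dT%H:%M:%S%z",     # 2006-01-02T15:04:05-0700
    "%Y-%m-%d %H:%M:%S.%f%z",  # 2006-01-02 15:04:05.52Z
    "%m-%d-%Y",                # 01-02-2006
    "%m/%d/%Y",                # 01/02/2006
    "%Y/%m/%d",                # 2006/01/02
    "%Y-%m-%d",                # 2006-01-02
]


def parse_listed_date(value):
    """Return a datetime if value matches one of the listed patterns, else None."""
    text = value.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue  # try the next candidate pattern
    return None


def find_unparsed(values):
    """Return the values that match none of the listed patterns."""
    return [v for v in values if parse_listed_date(v) is None]
```

Running `find_unparsed` over the date column of a file gives the list of values worth reformatting before the upload.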

What's the recommended file size or number of tickets to upload?

We recommend uploading at least 10,000 documents, or 2,000 calls (longer texts), to build a reliable workflow.

Is there a maximum document size to upload?

Our Enterprise product has no size limit, although a self-managed deployment is limited by the hardware it runs on. Most projects should produce a reliable classifier with fewer than 250k documents.

Can I upload short texts? What’s the minimum size of each unit of text?

The technology works well with short texts (tweets, chat conversations, etc.). However, the dataset needs to provide enough context, so we recommend that comments have at least 5 words on average. Shorter interactions will be identified as noise.
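
The 5-word guideline can be checked on a sample before uploading. The sketch below is illustrative only: it counts words by splitting on whitespace, which is an assumption and may not match how the platform tokenizes text or detects noise, and the function names `word_count` and `short_text_report` are hypothetical.

```python
def word_count(text):
    """Count words by splitting on whitespace (an assumed definition of a word)."""
    return len(text.split())


def short_text_report(comments, min_words=5):
    """Summarize how a list of comments compares with the 5-word guideline."""
    counts = [word_count(c) for c in comments]
    average = sum(counts) / len(counts) if counts else 0.0
    # Comments below the threshold are only candidates for noise;
    # the platform's own noise detection is not reproduced here.
    short = [c for c, n in zip(comments, counts) if n < min_words]
    return {
        "average_words": average,
        "short_comments": short,
        "meets_recommendation": average >= min_words,
    }


if __name__ == "__main__":
    sample = ["ok", "thanks a lot", "the app crashes every time I open settings"]
    print(short_text_report(sample))
```

A dataset whose average falls below the threshold may still be usable, but a large share of very short comments suggests adding more context-rich texts.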

Can I upload long texts or documents? What’s the maximum size of each unit of text?

The technology works well with long texts. However, there needs to be a representative number of texts to extract contexts from.

How long does it take to generate a classifier to review?

It depends on the size of the file and the vocabulary. It usually takes 5 to 20 minutes for 5k to 20k comments. A file with 200k texts may take several hours.

Can I upload voice data?

Voice data cannot be uploaded to the platform. You first need to convert it to text with a speech-to-text process. Microsoft, Google, IBM, and Amazon all offer speech-to-text services that are easy to access. We also work with partners in case you want a more tailored approach.
