The torch-dataframe – subsetting and sampling

Subsetting and batching are like dealing cards: it should be random unless you are doing a trick. The image is CC-licensed, by Steven Depolo.

In my previous two posts I covered the most basic data manipulation that you may need. In this post I’ll try to give a quick introduction to some of the sampling methods that we can use in our machine learning projects.

# All posts in the *torch-dataframe* series

1. [Intro to the torch-dataframe][intro]
2. [Modifications][mods]
3. [Subsetting][subs]
4. [The mnist example][mnist ex]
5. [Multilabel classification][multilabel]

[intro]: http://gforge.se/2016/08/deep-learning-with-torch-dataframe-a-gentle-introduction-to-torch/
[mods]: http://gforge.se/2016/08/the-torch-dataframe-basics-on-modifications/
[subs]: http://gforge.se/2016/08/the-torch-dataframe-subsetting-and-sampling/
[mnist ex]: http://gforge.se/2016/08/integration-between-torchnet-and-torch-dataframe-a-closer-look-at-the-mnist-example/
[multilabel]: http://gforge.se/2016/08/setting-up-a-multilabel-classification-network-with-torch-dataframe/

First we load the mtcars dataset the same way as we did [previously][intro post]:

[intro post]: http://gforge.se/2016/08/deep-learning-with-torch-dataframe-a-gentle-introduction-to-torch/

```lua
require 'Dataframe'
mtcars_df = Dataframe("mtcars.csv"):
  rename_column("", "rownames"):
  drop(Df_Array("cyl", "disp", "drat", "vs", "carb"))
```

# Splitting the data

A common strategy is to split our dataset into three sets:

– *train*: the examples that we will train our model on
– *validate*: the examples that we will use to check how our model is doing
– *test*: the examples that we “lock away into a vault” and only look at once we have decided and trained our final model.

The split proportions can differ depending on application and dataset. The default proportions used in the torch-dataframe package are 70%, 20% and 10%. The split is achieved via the `create_subsets` function:

```lua
mtcars_df:create_subsets(Df_Dict{train=5, validate=3, test=2})
```

If you provide custom split proportions and the numbers don't sum to 1, the function automatically normalizes the values and prints a warning: `Warning: You have provided a total ~= 1 (10)`.
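The normalization itself is plain arithmetic. The helper below is a hypothetical sketch in plain Lua, not the package's code: it divides each weight by the total so that the proportions sum to 1.

```lua
-- Hypothetical helper (not part of torch-dataframe):
-- rescale the split weights so that they sum to 1.
local function normalize_proportions(weights)
  local total = 0
  for _, w in pairs(weights) do
    total = total + w
  end
  local normalized = {}
  for name, w in pairs(weights) do
    normalized[name] = w / total
  end
  return normalized, total
end

local props, total = normalize_proportions{train = 5, validate = 3, test = 2}
print(total)           -- 10
print(props.train)     -- 0.5
print(props.validate)  -- 0.3
print(props.test)      -- 0.2
```

With the weights 5, 3 and 2 from the call above, the total is 10, which is the number shown in parentheses in the warning.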

Now you have three subsets in your data that you can access via the `get_subset` method or just via `["/subset_name"]`:

```lua
th> mtcars_df["/train"]:size()
16
[0.0001s]
```

*Note*: In the current implementation (v. 1.5) of torch-dataframe, a subset is a shallow wrapper around the parent data that only contains a list of row indexes. If you print a subset you will see the indexes from the original dataset that are included in that particular subset:

```lua
th> mtcars_df["/test"]

+---------+
| indexes |
+---------+
| 13      |
| 3       |
| 20      |
| 19      |
| 18      |
| 27      |
| 30      |
+---------+

[0.0009s]
```
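To see why a list of indexes is enough to define a subset, here is a minimal plain-Lua sketch of a random split. `split_indexes` is a hypothetical helper, not the torch-dataframe implementation, and the rounding rule (round down, with the last subset taking the remainder) is an assumption.

```lua
-- Hypothetical helper, not the torch-dataframe implementation:
-- shuffle the row numbers 1..n and cut the shuffled list into named subsets.
local function split_indexes(n, proportions, order)
  -- Fisher-Yates shuffle of the row numbers
  local indexes = {}
  for i = 1, n do indexes[i] = i end
  for i = n, 2, -1 do
    local j = math.random(i)
    indexes[i], indexes[j] = indexes[j], indexes[i]
  end

  local subsets, first = {}, 1
  for k, name in ipairs(order) do
    -- Assumption: round down, and let the last subset take whatever remains
    local last = (k == #order) and n or (first + math.floor(n * proportions[name]) - 1)
    subsets[name] = {}
    for i = first, last do
      table.insert(subsets[name], indexes[i])
    end
    first = last + 1
  end
  return subsets
end

-- mtcars has 32 rows; the proportions are the normalized 5/3/2 weights
local subsets = split_indexes(32, {train = 0.5, validate = 0.3, test = 0.2},
                              {"train", "validate", "test"})
print(#subsets.train, #subsets.validate, #subsets.test)  -- 16, 9 and 7
```

Each subset holds only row numbers, so the parent data is never copied. Under these rounding assumptions the sizes happen to agree with the 16 training rows and 7 test rows printed above, but that does not confirm how the package rounds internally.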

# Samplers

Many machine learning procedures follow the same steps:

1. split your data
2. sample from the training subset a random batch
3. perform a calculation on that batch and update your parameters accordingly
4. restart from 2.

The torch-dataframe package therefore tries to make this entire process as painless as possible. We have also extended [torch-dataset’s](https://github.com/twitter/torch-dataset) excellent samplers that allow you to sample using the following approaches:

* **linear**: Does a linear walk through the data. *Note* that the subsetting already does a random permutation to your data unless you only have one subset.
* **ordered**: Sorts the indexes and then does a linear walk through the data.
* **permutation**: Reorganizes the order (permutes) and then walks through the rows. After each epoch you must reset and this resetting creates a new permutation.
* **uniform**: Samples uniformly from the data. This means that within one epoch the same example may occur several times and some won’t appear at all.
* **label permutation**: Permutes the data according to labels. This means that we can make sure that the training examples are evenly distributed between labels.
* **label uniform**: Samples uniformly but according to labels.
* **label distribution**: Samples according to specific distributions.
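To make the difference between **uniform** and **label uniform** concrete, here is a minimal plain-Lua sketch. It is one reading of the descriptions above, not the torch-dataset code: the toy `labels` table and the helper names are made up for illustration, and "label uniform" is assumed to mean picking a label at random first and then a row with that label.

```lua
-- Toy labels for ten rows (hypothetical): six 3s, three 4s and a single 5
local labels = {3, 3, 3, 3, 3, 3, 4, 4, 4, 5}

-- uniform: every row has the same chance, so frequent labels dominate
local function uniform_sample()
  return math.random(#labels)
end

-- Group the row numbers by label for the label-aware sampler
local rows_by_label, label_names = {}, {}
for row, label in ipairs(labels) do
  if not rows_by_label[label] then
    rows_by_label[label] = {}
    label_names[#label_names + 1] = label
  end
  table.insert(rows_by_label[label], row)
end

-- label uniform (as assumed here): pick a label first, then a row with that label
local function label_uniform_sample()
  local label = label_names[math.random(#label_names)]
  local rows = rows_by_label[label]
  return rows[math.random(#rows)]
end

-- Share of draws that land on a given label
local function share_of_label(sampler, label, draws)
  local hits = 0
  for _ = 1, draws do
    if labels[sampler()] == label then hits = hits + 1 end
  end
  return hits / draws
end

print(share_of_label(uniform_sample, 5, 10000))        -- close to 0.1
print(share_of_label(label_uniform_sample, 5, 10000))  -- close to 1/3
```

Under plain uniform sampling the rare label shows up in roughly one draw out of ten, while the label-aware variant gives each of the three labels about the same share of the draws.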

You choose your samplers either during the `create_subsets` call using the `sampler` argument or you can set them later for each subset using the `set_sampler` function. Here is an example where you also must set the label:

```lua
th> mtcars_df["/train"]:set_labels("gear"):set_sampler("label-permutation")
th> mtcars_df["/train"]:get_batch(4)

+-------------------------------------------------------------+
| rownames      | mpg  | hp  | wt   | qsec | am        | gear |
+-------------------------------------------------------------+
| Merc 240D     | 24.4 | 62  | 3.19 | 20   | Automatic | 4    |
| Merc 280C     | 17.8 | 123 | 3.44 | 18.9 | Automatic | 4    |
| Merc 450SLC   | 15.2 | 180 | 3.78 | 18   | Automatic | 3    |
| Maserati Bora | 15   | 335 | 3.57 | 14.6 | Manual    | 5    |
+-------------------------------------------------------------+

false
[0.0063s]
```

You use the sampler by calling `get_batch`. The second value returned from `get_batch` indicates whether `reset_sampler` should be invoked. *Note* that this is only required for some of the samplers; most will always return `false`.
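The reset protocol can be sketched with a stand-in object in plain Lua. `make_subset` below is hypothetical and only mimics the two method names mentioned above; its sampler is a simple walk through a fixed list of indexes, and calling `reset_sampler` as a method on the subset is an assumption.

```lua
-- Stand-in for a subset, NOT the torch-dataframe implementation:
-- get_batch walks through the indexes and reports when it has reached the end.
local function make_subset(indexes)
  local subset = {pos = 0}

  function subset:get_batch(n)
    local batch = {}
    while #batch < n and self.pos < #indexes do
      self.pos = self.pos + 1
      batch[#batch + 1] = indexes[self.pos]
    end
    -- second value: true once the walk has reached the end of the data
    return batch, self.pos >= #indexes
  end

  function subset:reset_sampler()
    self.pos = 0
  end

  return subset
end

local train = make_subset({13, 3, 20, 19, 18, 27, 30})
local epochs_done, rows_seen = 0, 0
while epochs_done < 2 do
  local batch, reset = train:get_batch(4)
  rows_seen = rows_seen + #batch  -- a real loop would update the model here
  if reset then
    train:reset_sampler()
    epochs_done = epochs_done + 1
  end
end
print(rows_seen)  -- 14: two passes over 7 rows, in batches of 4 and 3
```

The training loop never needs to know the size of the subset; it only reacts to the flag. In this stand-in the last batch of an epoch is simply shorter, which may differ from what the real samplers do.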

# Batch to tensor

One of the core features is the ability to export data into tensors that can be used for deep learning. This is done via the `to_tensor` function, which converts the numerical columns into a tensor of size `self:size()` x `#self:get_numerical_colnames()`. As we frequently have some input data that we want to map onto a set of labels/targets, the `Batchframe` subclass has an extension to the `to_tensor` function. There are several options, of which the most common is probably to load the data from an external file and match it with one or more columns within the dataframe. Below is an example of how it looks when both the data and the labels reside in the dataframe:

```lua
th> mtcars_df:as_categorical("am"):head(2)

+-----------------------------------------------------------+
| rownames      | mpg | hp  | wt    | qsec  | am     | gear |
+-----------------------------------------------------------+
| Mazda RX4     | 21  | 110 | 2.62  | 16.46 | Manual | 4    |
| Mazda RX4 Wag | 21  | 110 | 2.875 | 17.02 | Manual | 4    |
+-----------------------------------------------------------+

[0.0026s]
th> mtcars_df["/train"]:
  get_batch(3):
  to_tensor{data_columns = Df_Array("mpg", "hp"),
            label_columns = Df_Array("am", "gear")}
  24.4000   62.0000
  16.4000  180.0000
  19.7000  175.0000
[torch.DoubleTensor of size 3x2]

 1  4
 1  3
 2  5
[torch.DoubleTensor of size 3x2]

{
  1 : "mpg"
  2 : "hp"
}
[0.0032s]
```

You can substitute the `data_columns` with `load_data_fn` or the `label_columns` with `load_label_fn`. Each function receives a single row in the format of a plain table where any information can be used for generating a tensor, e.g. filename of an image. As loading files is time-consuming I often like to do this in parallel.

A convenient way is to set the `batch_args` arguments when creating the subsets where you can specify the data/label retrieving strategies:

```lua
data:create_subsets{
  -- load_img is assumed to be a user-defined function that returns the loaded image
  data_retriever = function(row) return load_img(row.filename) end,
  label_retriever = Df_Array("image_class")
}
```

# Summary

In this post we’ve reviewed some of the core functions for setting up the dataframe for machine learning applications such as data-splitting, subsetting and converting the data into torch-friendly tensors.

G-Forge

A blog about orthopaedic surgery, R, research and more
