Deep learning with torch-dataframe – a gentle introduction to Torch

[![A solid concrete foundation is always important. The image is cc by Sharon Pazner ](http://gforge.se/wp-content/uploads/2016/07/Lego-house-concrete.jpg)](http://gforge.se/wp-content/uploads/2016/07/Lego-house-concrete.jpg) A solid concrete foundation is always important. The image is cc by[
Sharon Pazner
](https://flic.kr/p/nSNQzw)

Handling [tabular data](https://en.wikipedia.org/wiki/Table_(information)) is generally at the heart of most research projects. As I started exploring [Torch](http://torch.ch/), which uses the [Lua](https://www.lua.org/) language for [deep learning](https://en.wikipedia.org/wiki/Deep_learning), I was surprised that there was no package corresponding to the functionality available in R’s [data.frame](https://stat.ethz.ch/R-manual/R-devel/library/base/html/data.frame.html). After some searching I found Alex Mili’s [torch-dataframe](https://github.com/AlexMili/torch-dataframe) package, which I decided to update to my needs. We have been developing the package during the past few months and it has now made it onto the Torch [cheat sheet](https://github.com/torch/torch7/wiki/Cheatsheet#data-formats) (partly the reason for the posting scarcity lately). This series of posts provides a short introduction to the package (version 1.5) and examples of how to implement basic networks in Torch.

# All posts in the *torch-dataframe* series

1. [Intro to the torch-dataframe][intro]
2. [Modifications][mods]
3. [Subsetting][subs]
4. [The mnist example][mnist ex]
5. [Multilabel classification][multilabel]

[intro]: http://gforge.se/2016/08/deep-learning-with-torch-dataframe-a-gentle-introduction-to-torch/
[mods]: http://gforge.se/2016/08/the-torch-dataframe-basics-on-modifications/
[subs]: http://gforge.se/2016/08/the-torch-dataframe-subsetting-and-sampling/
[mnist ex]: http://gforge.se/2016/08/integration-between-torchnet-and-torch-dataframe-a-closer-look-at-the-mnist-example/
[multilabel]: http://gforge.se/2016/08/setting-up-a-multilabel-classification-network-with-torch-dataframe/

# Intro

The _torch-dataframe_ package includes the amazing samplers from Twitter’s [torch-dataset](https://github.com/twitter/torch-dataset) and is fully integrated with the elegant [torchnet](https://github.com/torchnet/torchnet) from Facebook. It is aimed at intermediate-sized projects where the core data fits into memory. This does _not restrict_ your data to a single drive or computer, only your csv file. For image classification I store my labels and the corresponding image filenames in the csv file, and I only retrieve the image data after sampling a batch, so the memory usage is negligible.
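As a sketch of that workflow (the file `labels.csv`, its column names and the `load_images` helper are made up for illustration; `image` is Torch's standard image package, and the accessors `shape` and `get_row` appear in the API list later in this post):

```lua
require 'Dataframe'
require 'image' -- Torch's standard image package

-- labels.csv is assumed to hold a 'label' and a 'filename' column
local df = Dataframe('labels.csv')

-- Only the file paths live in memory; the pixel data is loaded
-- lazily for the rows of the sampled batch.
local function load_images(batch_df)
  local imgs = {}
  for i = 1, batch_df:shape().rows do
    local row = batch_df:get_row(i)
    imgs[i] = image.load(row.filename)
  end
  return imgs
end
```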

# Installing

You can install the package directly using standard `luarocks`:

```
luarocks install torch-dataframe
```

There is also a development version that you can install by cloning the repository. The `develop` branch is generally stable, since we put all new features into sub-branches that are merged only once all the tests pass. You download it via:

```
git clone https://github.com/AlexMili/torch-dataframe
cd torch-dataframe
git checkout develop
luarocks make rocks/torch-dataframe-scm-1.rockspec
```

# Reading a CSV-file

The core idea is that a CSV-file is parsed into a dataframe, which then lets you work with the data. To read a CSV-file you simply provide its name to the constructor (the file is a dump from R’s [mtcars](https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/mtcars.html) and is available for download [here](https://gist.github.com/gforge/8b0e3551f377781e83c6c189867f149d)):

```lua
require 'Dataframe'
mtcars_df = Dataframe('mtcars.csv')
```
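Besides CSV files, a Dataframe can also be built directly from a Lua table via `load_table` (a sketch; `Df_Dict` is the package's table wrapper that lets `argcheck` identify the argument type):

```lua
require 'Dataframe'

local df = Dataframe()
-- Df_Dict wraps a plain Lua table so argcheck can identify the type
df:load_table{data = Df_Dict{
  mpg = {21, 22.8, 18.7},
  cyl = {6, 4, 8}
}}
print(df)
```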

The loading is handled by the `load_csv` function that relies on the `csvigo` library. The data is stored internally in the `self.dataset` variable together with some metadata indicating whether each column is numerical, boolean or string.

# Quick look

We can easily display the data using `print`, which prints the first 10 rows:

```lua
print(mtcars_df)
```

This will print a formatted table:

```
+-----------------------------------------------------------------------+
|         |  mpg | cyl |  disp |  hp | drat |    wt |  qsec | vs | ...  |
+-----------------------------------------------------------------------+
| Mazd... |   21 |   6 |   160 | 110 |  3.9 |  2.62 | 16.46 |  0 | ...  |
| Mazd... |   21 |   6 |   160 | 110 |  3.9 | 2.875 | 17.02 |  0 | ...  |
| Dats... | 22.8 |   4 |   108 |  93 | 3.85 |  2.32 | 18.61 |  1 | ...  |
| Horn... | 21.4 |   6 |   258 | 110 | 3.08 | 3.215 | 19.44 |  1 | ...  |
| Horn... | 18.7 |   8 |   360 | 175 | 3.15 |  3.44 | 17.02 |  0 | ...  |
| Valiant | 18.1 |   6 |   225 | 105 | 2.76 |  3.46 | 20.22 |  1 | ...  |
| Dust... | 14.3 |   8 |   360 | 245 | 3.21 |  3.57 | 15.84 |  0 | ...  |
| Merc... | 24.4 |   4 | 146.7 |  62 | 3.69 |  3.19 |    20 |  1 | ...  |
| Merc... | 22.8 |   4 | 140.8 |  95 | 3.92 |  3.15 |  22.9 |  1 | ...  |
| Merc... | 19.2 |   6 | 167.6 | 123 | 3.92 |  3.44 |  18.3 |  1 | ...  |
| ...                                                                   |
+-----------------------------------------------------------------------+

 * Columns skipped: 'am', 'gear', 'carb'
```

Note that when the table gets wide, the columns are truncated as well and a note is left at the bottom; this was inspired by R’s excellent [dplyr](https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html) package.

If we want to inspect two random rows we can simply write:

```lua
mtcars_df:get_random(2)
```

that outputs:

```
+-------------------------------------------------------------------------------+
|         |  mpg | cyl |  disp |  hp | drat |   wt | qsec | vs | am      | ...  |
+-------------------------------------------------------------------------------+
| Merc... | 17.8 |   6 | 167.6 | 123 | 3.92 | 3.44 | 18.9 |  1 | Auto... | ...  |
| Merc... | 22.8 |   4 | 140.8 |  95 | 3.92 | 3.15 | 22.9 |  1 | Auto... | ...  |
+-------------------------------------------------------------------------------+

 * Columns skipped: 'gear', 'carb'
```
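For a non-random peek, the `head` and `tail` functions from the API list further down give the first or last rows (a sketch of their use):

```lua
print(mtcars_df:head(3)) -- the three first rows
print(mtcars_df:tail(2)) -- the two last rows
```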

# Categorical variables

A common task in deep learning is to classify images. The images are classified into groups and the groups are then converted to numbers ranging from 1 to #classes. I like to be able to look at my data and immediately see the relationship between the image and the class name. This requires converting a string label into numbers while keeping a table that maps each number back to its class. With the Dataframe package this is achieved via:

```lua
mtcars_df:as_categorical('am')
-- Print a subset of the columns
mtcars_df:tostring{columns2skip="^[^a].*"}
```

the output is:

```
+-------------------------------+
|                   |        am |
+-------------------------------+
| Mazda RX4         |    Manual |
| Mazda RX4 Wag     |    Manual |
| Datsun 710        |    Manual |
| Hornet 4 Drive    | Automatic |
| Hornet Sportabout | Automatic |
| Valiant           | Automatic |
| Duster 360        | Automatic |
| Merc 240D         | Automatic |
| Merc 230          | Automatic |
| Merc 280          | Automatic |
| ...                           |
+-------------------------------+

 * Columns skipped: 'mpg', 'cyl', 'disp', 'hp', 'drat', 'wt', 'qsec', 'vs', 'gear', 'carb'
```

The mapping between the numeric codes and the labels can be done using `to_categorical` or `from_categorical`:

```
th> mtcars_df:to_categorical{data = 1, column_name = "am"}
Automatic
                                                                      [0.0002s]
th> mtcars_df:to_categorical{data = torch.Tensor({1,2}), column_name = "am"}
{
  1 : "Automatic"
  2 : "Manual"
}
                                                                      [0.0004s]
th> mtcars_df:from_categorical{data = 1, column_name = "am"}
{
  1 : nan
}
                                                                      [0.0002s]
th> mtcars_df:from_categorical{data = "Manual", column_name = "am"}
{
  1 : 2
}
                                                                      [0.0003s]
```
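Since a categorical column is stored numerically under the hood, it can go straight into a network. A hedged sketch using `to_tensor` (listed in the API below; check the doc folder for the exact return values of your version):

```lua
-- Export the numerical columns, including the categorical codes,
-- as a torch.Tensor ready for training
local data = mtcars_df:to_tensor()
print(data:size()) -- rows × numerical columns
```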

# Statistics

There are also some convenient basic descriptive statistics available. To get the value counts you can use `value_counts`:

```
th> mtcars_df:value_counts('am')

+-------------------+
| values    | count |
+-------------------+
| Automatic |    19 |
| Manual    |    13 |
+-------------------+

                                                                      [0.0008s]
th> mtcars_df:value_counts("am", true) -- normalized values

+---------------------+
| values    |   count |
+---------------------+
| Manual    | 0.40625 |
| Automatic | 0.59375 |
+---------------------+

                                                                      [0.0010s]
```
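A few more descriptive helpers appear in the API list below; the signatures here are hedged sketches, so check the doc folder before relying on them:

```lua
print(mtcars_df:count_na())           -- missing values per column
print(mtcars_df:get_max_value('mpg')) -- largest value in a column
print(mtcars_df:get_min_value('mpg')) -- smallest value in a column
print(mtcars_df:unique('cyl'))        -- the distinct values of a column
```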

# Help – I’m in argument hell

There are plenty of functions at your disposal and keeping track of all the arguments is tricky even for the authors. In addition to the [README](https://github.com/AlexMili/torch-dataframe/blob/master/README.md) we have therefore added a [doc](https://github.com/AlexMili/torch-dataframe/tree/master/doc) folder that contains the entire API. Furthermore, you automatically get help if you get the inputs wrong (courtesy of `argcheck`):

```
th> mtcars_df:drop({})
[string "argcheck"]:56:
  ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
  Dataframe.drop(self, column_name)

   ({
      self        = Dataframe  --
      column_name = string     -- The column to drop
   })

   Delete column from dataset

   Return value: self

   or

   You can also delete multiple columns by supplying a Df_Array

   ({
      self    = Dataframe  --
      columns = Df_Array   -- The columns to drop
   })

   Got: Dataframe, table={ }
```
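For reference, the two call forms described in that help message look like this (`Df_Array` is the package's array wrapper):

```lua
-- Drop a single column
mtcars_df:drop('carb')

-- Drop several columns at once via a Df_Array
mtcars_df:drop(Df_Array('gear', 'am'))
```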

# All properties and functions

Here’s a list of the main Dataframe class’s properties and functions. The list does not include the metatable functions, subclasses or helper classes:

```
mtcars_df:add_cat_key()             mtcars_df:is_string()
mtcars_df:add_column()              mtcars_df:iterator()
mtcars_df:append()                  mtcars_df:load_csv()
mtcars_df:as_categorical()          mtcars_df:load_table()
mtcars_df:as_string()               mtcars_df.n_rows
mtcars_df:assert_has_column()       mtcars_df:new()
mtcars_df:assert_has_not_column()   mtcars_df:output()
mtcars_df:assert_is_index()         mtcars_df:parallel()
mtcars_df:batch()                   mtcars_df:rbind()
mtcars_df.categorical               mtcars_df:remove_index()
mtcars_df:cbind()                   mtcars_df:rename_column()
mtcars_df:clean_categorical()       mtcars_df:resample()
mtcars_df.column_order              mtcars_df:reset_column()
mtcars_df.columns                   mtcars_df:reset_subsets()
mtcars_df:copy()                    mtcars_df:schema.
mtcars_df:count_na()                mtcars_df:set()
mtcars_df:create_subsets()          mtcars_df:set_version()
mtcars_df.dataset                   mtcars_df:shape()
mtcars_df:drop()                    mtcars_df:show()
mtcars_df:exec()                    mtcars_df:shuffle()
mtcars_df:fill_all_na()             mtcars_df:size()
mtcars_df:fill_na()                 mtcars_df:split()
mtcars_df:from_categorical()        mtcars_df:sub()
mtcars_df:get()                     mtcars_df:tail()
mtcars_df:get_cat_keys()            mtcars_df:to_categorical()
mtcars_df:get_column()              mtcars_df:to_csv()
mtcars_df:get_column_order()        mtcars_df:to_tensor()
mtcars_df:get_max_value()           mtcars_df:tostring()
mtcars_df:get_min_value()           mtcars_df:tostring_defaults.
mtcars_df:get_mode()                mtcars_df:transform()
mtcars_df:get_numerical_colnames()  mtcars_df:unique()
mtcars_df:get_random()              mtcars_df:update()
mtcars_df:get_row()                 mtcars_df:upgrade_frame()
mtcars_df:get_subset()              mtcars_df:value_counts()
mtcars_df:has_column()              mtcars_df:version()
mtcars_df:has_subset()              mtcars_df:where()
mtcars_df:head()                    mtcars_df:which()
mtcars_df:insert()                  mtcars_df:which_max()
mtcars_df:is_boolean()              mtcars_df:which_min()
mtcars_df:is_categorical()          mtcars_df:wide2long()
mtcars_df:is_numerical()
```

# Summary

The torch-dataframe package will hopefully allow you to do all the basic things you expect from a data frame. In this post we have covered some of the core functionality: installing the package, loading data and taking a first look at it. The next post will show some of the modifications that the package provides.
