Deep learning with torch-dataframe – a gentle introduction to Torch

[![A solid concrete foundation is always important. The image is cc by Sharon Pazner](http://gforge.se/wp-content/uploads/2016/07/Lego-house-concrete.jpg)](http://gforge.se/wp-content/uploads/2016/07/Lego-house-concrete.jpg) A solid concrete foundation is always important. The image is CC by [Sharon Pazner](https://flic.kr/p/nSNQzw).

Handling [tabular data](https://en.wikipedia.org/wiki/Table_(information)) is generally at the heart of most research projects. As I started exploring [Torch](http://torch.ch/), which uses the [Lua](https://www.lua.org/) language for [deep learning](https://en.wikipedia.org/wiki/Deep_learning), I was surprised that there was no package corresponding to the functionality available in R’s [data.frame](https://stat.ethz.ch/R-manual/R-devel/library/base/html/data.frame.html). After some searching I found Alex Mili’s [torch-dataframe](https://github.com/AlexMili/torch-dataframe) package, which I decided to adapt to my needs. Over the past few months we have been developing the package, and it has now made it onto the Torch [cheat sheet](https://github.com/torch/torch7/wiki/Cheatsheet#data-formats) (this is partly the reason for the recent scarcity of posts). This series of posts provides a short introduction to the package (version 1.5) and examples of how to implement basic networks in Torch.

# All posts in the *torch-dataframe* series

1. [Intro to the torch-dataframe][intro]
2. [Modifications][mods]
3. [Subsetting][subs]
4. [The mnist example][mnist ex]
5. [Multilabel classification][multilabel]

[intro]: http://gforge.se/2016/08/deep-learning-with-torch-dataframe-a-gentle-introduction-to-torch/
[mods]: http://gforge.se/2016/08/the-torch-dataframe-basics-on-modifications/
[subs]: http://gforge.se/2016/08/the-torch-dataframe-subsetting-and-sampling/
[mnist ex]: http://gforge.se/2016/08/integration-between-torchnet-and-torch-dataframe-a-closer-look-at-the-mnist-example/
[multilabel]: http://gforge.se/2016/08/setting-up-a-multilabel-classification-network-with-torch-dataframe/

# Intro

The _torch-dataframe_ package has the amazing samplers from Twitter’s [torch-dataset](https://github.com/twitter/torch-dataset) and is fully integrated with the elegant [torchnet](https://github.com/torchnet/torchnet) from Facebook. The package is aimed at intermediate-size projects where the core data fits into memory. This does _not restrict_ your data to a single drive or computer, only your ‘csv’ file. For image classification I tend to store my labels and the corresponding image filenames in the csv-file. I only retrieve the image data after sampling a batch, and the memory usage is therefore negligible.
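The lazy-loading idea described above can be sketched in plain Lua, without Torch or the package itself. This is only an illustration under stated assumptions: `load_image` and `sample_batch` are hypothetical helpers, the filenames are made up, and the real package uses its own samplers rather than this naive random draw.

```lua
-- Rows as they might come from a small CSV file: only filenames and labels
-- are kept in memory, never the image data itself (made-up sample rows).
local rows = {
  {filename = "img_001.png", label = "cat"},
  {filename = "img_002.png", label = "dog"},
  {filename = "img_003.png", label = "cat"},
  {filename = "img_004.png", label = "bird"},
}

-- Hypothetical stand-in for a real image loader; it only records the call.
local load_count = 0
local function load_image(filename)
  load_count = load_count + 1
  return {source = filename}
end

-- Draw a batch of rows (with replacement) and load images only for those rows.
local function sample_batch(all_rows, batch_size)
  local batch = {}
  for i = 1, batch_size do
    local row = all_rows[math.random(#all_rows)]
    batch[i] = {data = load_image(row.filename), label = row.label}
  end
  return batch
end

local batch = sample_batch(rows, 2)
print(#batch, load_count)
```

Only as many images are loaded as the batch contains, no matter how many rows the table holds.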

# Installing

You can install the package directly using standard luarocks: `luarocks install torch-dataframe`

There is also a development version that you can install by cloning the repository. The develop branch is generally stable, as we put all new features into sub-branches that are merged only once all the tests pass. You download it by cloning the [repository](https://github.com/AlexMili/torch-dataframe) and checking out the develop branch.

# Reading a CSV-file

The core idea is that a CSV-file is parsed into a dataframe that you can then work with. To read a CSV-file you can simply pass its name to the constructor (the file is a dump from R’s [mtcars](https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/mtcars.html) and is available for download [here](https://gist.github.com/gforge/8b0e3551f377781e83c6c189867f149d)):

The loading is handled by the load_csv function, which relies on the csvigo library. The data is stored internally in the self.dataset variable together with some meta-data that indicates whether each column is numerical, boolean or string.
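To make the storage idea concrete, here is a minimal plain-Lua sketch of parsing CSV text into a column-oriented table plus a per-column type record. It is not how load_csv or csvigo work internally: `split`, `convert` and `parse_csv` are hypothetical helpers, quoted fields are not handled, the sample rows are made up, and the type is simply taken from the last value seen in each column.

```lua
-- Made-up sample data in the spirit of the mtcars dump
local csv_text = [[
name,mpg,am
Mazda RX4,21.0,true
Datsun 710,22.8,true
Hornet 4 Drive,21.4,false
]]

-- Split a line on commas (no support for quoted fields)
local function split(line)
  local fields = {}
  for field in (line .. ","):gmatch("([^,]*),") do
    fields[#fields + 1] = field
  end
  return fields
end

-- Convert a raw string to a boolean or number when possible
local function convert(value)
  if value == "true" then return true end
  if value == "false" then return false end
  return tonumber(value) or value
end

-- Returns the columns (name -> list of values) and a schema (name -> Lua type)
local function parse_csv(text)
  local dataset, schema, header = {}, {}, nil
  for line in text:gmatch("[^\r\n]+") do
    local fields = split(line)
    if not header then
      header = fields
      for _, name in ipairs(header) do dataset[name] = {} end
    else
      for i, name in ipairs(header) do
        local value = convert(fields[i])
        table.insert(dataset[name], value)
        schema[name] = type(value)
      end
    end
  end
  return dataset, schema
end

local dataset, schema = parse_csv(csv_text)
print(schema.name, schema.mpg, schema.am)
```

Storing the data by column rather than by row makes it cheap to pull out a whole variable at once, which is what most statistics and conversions need.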

# Quick look

We can easily display the data using print, which shows the first 10 rows:

This will print a formatted table:

Note that when the table gets too wide the columns are also truncated and a note is left at the bottom. This was inspired by R’s excellent [dplyr](https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html) package.

If we want to inspect two random columns we can simply write:

that outputs:

# Categorical variables

A common task for deep learning is to classify images. The images are classified into groups, and the groups are then converted to numbers ranging from 1 to #classes. I like to be able to look at my data and immediately see the relationship between the image and the class name. This requires converting a string label into a number and keeping a table that maps each number to its class. Using the Dataframe package this is achieved via:

the output is:

The mapping between the labels and the numbers can be done using the to_categorical or from_categorical functions:
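Independently of the package's own calls, the underlying label-to-number mapping can be sketched in plain Lua. The helper `build_levels` and the sample labels are hypothetical, and the package may order or store its levels differently. Here the labels are sorted alphabetically so the numbering is deterministic.

```lua
-- Build the list of unique labels (levels) and the reverse lookup table.
local function build_levels(column)
  local seen, levels = {}, {}
  for _, label in ipairs(column) do
    if not seen[label] then
      seen[label] = true
      levels[#levels + 1] = label
    end
  end
  table.sort(levels) -- alphabetical order gives a stable numbering

  local label_to_index = {}
  for index, label in ipairs(levels) do
    label_to_index[label] = index
  end
  return levels, label_to_index
end

local labels = {"dog", "cat", "bird", "cat", "dog"}
local levels, label_to_index = build_levels(labels)

-- String label -> integer class (the direction a network needs)
local encoded = {}
for i, label in ipairs(labels) do
  encoded[i] = label_to_index[label]
end

print(table.concat(encoded, " ")) --> 3 2 1 2 3
-- Integer class -> string label (the direction a human wants)
print(levels[encoded[1]])         --> dog
```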

# Statistics

There are also some convenient basic descriptive statistics available. To get the value counts you can use value_counts:
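What a value count computes can be shown with a few lines of plain Lua. This is a sketch of the concept only, not the package's implementation: `count_values` is a hypothetical helper, and the column values are illustrative.

```lua
-- Count how often each distinct value occurs in a column.
local function count_values(column)
  local counts = {}
  for _, value in ipairs(column) do
    counts[value] = (counts[value] or 0) + 1
  end
  return counts
end

-- Illustrative cylinder counts for a handful of cars
local cyl = {6, 6, 4, 6, 8, 6, 8, 4}
local counts = count_values(cyl)

-- Sort the distinct values so the printed order is stable
local keys = {}
for value in pairs(counts) do keys[#keys + 1] = value end
table.sort(keys)

for _, value in ipairs(keys) do
  print(value, counts[value])
end
```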

# Help – I’m in argument hell

There are plenty of functions at your disposal, and keeping track of all the arguments is tricky even for the authors. In addition to the [README](https://github.com/AlexMili/torch-dataframe/blob/master/README.md) we have therefore also added a [doc](https://github.com/AlexMili/torch-dataframe/tree/master/doc) folder that contains the entire API. You will furthermore automatically get help if you get the inputs wrong (courtesy of argcheck):

# All properties and functions

Here’s a list of all the properties and functions of the main Dataframe class. This list does not include the metatable functions, subclasses or helper classes:

# Summary

The torch-dataframe package will hopefully allow you to do all the basic things that you expect from a data frame. In this post we have covered some of the core functionality for installing, loading and looking at the data. The next post will show some of the manipulations that the package provides.
