Image: [Sharon Pazner](https://flic.kr/p/nSNQzw)
Handling [tabular data](https://en.wikipedia.org/wiki/Table_(information)) is at the heart of most research projects. As I started exploring [Torch](http://torch.ch/), which uses the [Lua](https://www.lua.org/) language for [deep learning](https://en.wikipedia.org/wiki/Deep_learning), I was surprised to find no package corresponding to the functionality available in R’s [data.frame](https://stat.ethz.ch/R-manual/R-devel/library/base/html/data.frame.html). After some searching I found Alex Mili’s [torch-dataframe](https://github.com/AlexMili/torch-dataframe) package, which I decided to update to my needs. We have spent the past few months developing the package, and it has now made it onto the Torch [cheat sheet](https://github.com/torch/torch7/wiki/Cheatsheet#data-formats) (which is partly the reason for the scarcity of posts lately). This series of posts provides a short introduction to the package (version 1.5) and examples of how to implement basic networks in Torch.
# All posts in the *torch-dataframe* series
1. [Intro to the torch-dataframe][intro]
2. [Modifications][mods]
3. [Subsetting][subs]
4. [The mnist example][mnist ex]
5. [Multilabel classification][multilabel]
[intro]: http://gforge.se/2016/08/deep-learning-with-torch-dataframe-a-gentle-introduction-to-torch/
[mods]: http://gforge.se/2016/08/the-torch-dataframe-basics-on-modifications/
[subs]: http://gforge.se/2016/08/the-torch-dataframe-subsetting-and-sampling/
[mnist ex]: http://gforge.se/2016/08/integration-between-torchnet-and-torch-dataframe-a-closer-look-at-the-mnist-example/
[multilabel]: http://gforge.se/2016/08/setting-up-a-multilabel-classification-network-with-torch-dataframe/
# Intro
The _torch-dataframe_ package has the amazing samplers from Twitter’s [torch-dataset](https://github.com/twitter/torch-dataset) and is fully integrated with the elegant [torchnet](https://github.com/torchnet/torchnet) from Facebook. It is aimed at intermediate-size projects where the core data fits into memory. This does _not restrict_ your data to a single drive or computer, only your ‘csv’ file. For image classification I tend to store my labels and the corresponding image filenames in the csv-file. I only retrieve the image data after sampling a batch, and the memory usage is therefore negligible.
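To make the last point concrete, here is a minimal plain-Lua sketch of the idea of keeping only filenames and labels in memory and loading images after a batch has been sampled. It does not use the torch-dataframe API: `rows`, `load_image` and `sample_batch` are hypothetical names, and `load_image` is only a stand-in for a real image reader.

```lua
-- Hypothetical in-memory rows, as they might be parsed from a csv-file:
-- only the filename and the label are kept, never the pixels.
local rows = {
  {filename = "img_001.png", label = "cat"},
  {filename = "img_002.png", label = "dog"},
  {filename = "img_003.png", label = "cat"},
  {filename = "img_004.png", label = "dog"},
}

-- Stand-in for a real image reader; it only records that a load happened.
local loaded = 0
local function load_image(filename)
  loaded = loaded + 1
  return "<pixels of " .. filename .. ">"
end

-- Draw `batch_size` distinct rows at random and load only their images.
local function sample_batch(data, batch_size)
  batch_size = math.min(batch_size, #data)
  -- Partial Fisher-Yates shuffle over a copy of the row indices
  local idx = {}
  for i = 1, #data do idx[i] = i end
  local batch = {}
  for i = 1, batch_size do
    local j = math.random(i, #data)
    idx[i], idx[j] = idx[j], idx[i]
    local row = data[idx[i]]
    batch[i] = {image = load_image(row.filename), label = row.label}
  end
  return batch
end

local batch = sample_batch(rows, 2)
print(#batch, loaded)
```

Only the two sampled images are ever read, so the memory footprint depends on the batch size rather than on the number of rows in the csv-file.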
# Installing
You can install the package directly using standard `luarocks`:
```bash
luarocks install torch-dataframe
```
There is also a development version that you can install by cloning the repository. The `develop` branch is generally stable, as we put all new features into sub-branches that are merged only once all the tests pass. You can install it via:
```bash
git clone https://github.com/AlexMili/torch-dataframe
cd torch-dataframe
git checkout develop
luarocks make rocks/torch-dataframe-scm-1.rockspec
```
# Reading a CSV-file
The core idea is that a CSV-file is parsed into a dataframe that then allows you to work with the data. To read a CSV-file you can simply provide its name in the constructor call (the file is a dump of R’s [mtcars](https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/mtcars.html) and is available for download [here](https://gist.github.com/gforge/8b0e3551f377781e83c6c189867f149d)):
```lua
require 'Dataframe'
mtcars_df = Dataframe('mtcars.csv')
```
The loading is handled by the `load_csv` function, which relies on the `csvigo` library. The data is stored internally in the `self.dataset` variable together with some meta-data that indicates whether each column is numerical, boolean or string.
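To illustrate what such per-column meta-data could look like, here is a small plain-Lua sketch that guesses whether a column of raw csv strings is numerical, boolean or string. This is my own simplification and not how torch-dataframe does it internally; `guess_column_type` is a hypothetical helper.

```lua
-- Guess the type of a column from its raw string values.
-- Empty strings are treated as missing values and ignored.
local function guess_column_type(values)
  local all_numeric, all_boolean, seen = true, true, false
  for _, v in ipairs(values) do
    if v ~= "" then
      seen = true
      if tonumber(v) == nil then all_numeric = false end
      local lower = v:lower()
      if lower ~= "true" and lower ~= "false" then all_boolean = false end
    end
  end
  if not seen then return "string" end      -- nothing to go on
  if all_boolean then return "boolean" end
  if all_numeric then return "numerical" end
  return "string"
end

print(guess_column_type({"21", "22.8", "18.7"}))
print(guess_column_type({"TRUE", "false"}))
print(guess_column_type({"Automatic", "Manual"}))
```

A column is only called numerical or boolean when every non-missing value qualifies; anything else falls back to string.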
# Quick look
We can easily display the data using `print`, which shows the first 10 rows:
```lua
print(mtcars_df)
```
This will print a formatted table:
```
+----------------------------------------------------------------------+
|         | mpg  | cyl | disp  | hp  | drat | wt    | qsec  | vs | ... |
+----------------------------------------------------------------------+
| Mazd... | 21   | 6   | 160   | 110 | 3.9  | 2.62  | 16.46 | 0  | ... |
| Mazd... | 21   | 6   | 160   | 110 | 3.9  | 2.875 | 17.02 | 0  | ... |
| Dats... | 22.8 | 4   | 108   | 93  | 3.85 | 2.32  | 18.61 | 1  | ... |
| Horn... | 21.4 | 6   | 258   | 110 | 3.08 | 3.215 | 19.44 | 1  | ... |
| Horn... | 18.7 | 8   | 360   | 175 | 3.15 | 3.44  | 17.02 | 0  | ... |
| Valiant | 18.1 | 6   | 225   | 105 | 2.76 | 3.46  | 20.22 | 1  | ... |
| Dust... | 14.3 | 8   | 360   | 245 | 3.21 | 3.57  | 15.84 | 0  | ... |
| Merc... | 24.4 | 4   | 146.7 | 62  | 3.69 | 3.19  | 20    | 1  | ... |
| Merc... | 22.8 | 4   | 140.8 | 95  | 3.92 | 3.15  | 22.9  | 1  | ... |
| Merc... | 19.2 | 6   | 167.6 | 123 | 3.92 | 3.44  | 18.3  | 1  | ... |
| ...                                                                  |
+----------------------------------------------------------------------+

 * Columns skipped: 'am', 'gear', 'carb'
```
Note that when the table gets wide, the output also truncates the columns and leaves a note at the bottom. This was inspired by R’s excellent [dplyr](https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html) package.
If we want to inspect two random rows we can simply write:
```lua
mtcars_df:get_random(2)
```
that outputs:
```
+------------------------------------------------------------------------------+
|         | mpg  | cyl | disp  | hp  | drat | wt   | qsec | vs | am      | ... |
+------------------------------------------------------------------------------+
| Merc... | 17.8 | 6   | 167.6 | 123 | 3.92 | 3.44 | 18.9 | 1  | Auto... | ... |
| Merc... | 22.8 | 4   | 140.8 | 95  | 3.92 | 3.15 | 22.9 | 1  | Auto... | ... |
+------------------------------------------------------------------------------+

 * Columns skipped: 'gear', 'carb'
```
# Categorical variables
A common task in deep learning is to classify images. The images are classified into groups, and the groups are then converted to numbers ranging from 1 to #classes. I like to be able to look at my data and immediately see the relationship between the image and the class name. This requires converting a string label into a number and keeping a table that maps each number to its class. With the Dataframe package this is achieved via:
```lua
mtcars_df:as_categorical('am')
-- Print a subset of the columns
mtcars_df:tostring{columns2skip="^[^a].*"}
```
the output is:
```
+-------------------------------+
|                   | am        |
+-------------------------------+
| Mazda RX4         | Manual    |
| Mazda RX4 Wag     | Manual    |
| Datsun 710        | Manual    |
| Hornet 4 Drive    | Automatic |
| Hornet Sportabout | Automatic |
| Valiant           | Automatic |
| Duster 360        | Automatic |
| Merc 240D         | Automatic |
| Merc 230          | Automatic |
| Merc 280          | Automatic |
| ...                           |
+-------------------------------+

 * Columns skipped: 'mpg', 'cyl', 'disp', 'hp', 'drat', 'wt', 'qsec', 'vs', 'gear', 'carb'
```
The mapping between the numbers and the class labels can be done using `to_categorical` or `from_categorical`:
```
th> mtcars_df:to_categorical{data = 1, column_name = "am"}
Automatic
                                                                      [0.0002s]
th> mtcars_df:to_categorical{data = torch.Tensor({1,2}), column_name = "am"}
{
  1 : "Automatic"
  2 : "Manual"
}
                                                                      [0.0004s]
th> mtcars_df:from_categorical{data = 1, column_name = "am"}
{
  1 : nan
}
                                                                      [0.0002s]
th> mtcars_df:from_categorical{data = "Manual", column_name = "am"}
{
  1 : 2
}
                                                                      [0.0003s]
```
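For readers who want to see the underlying idea without the package, the two-way mapping between labels and numbers can be sketched in plain Lua. This is only an illustration under my own assumptions, not the package's implementation: `build_categorical` is a hypothetical helper, and numbering the labels in sorted order is a choice I made to keep the result deterministic.

```lua
-- Build label -> number and number -> label lookups from a column of strings.
-- Labels are numbered in sorted order so the result is deterministic.
local function build_categorical(values)
  local seen, labels = {}, {}
  for _, v in ipairs(values) do
    if not seen[v] then
      seen[v] = true
      labels[#labels + 1] = v
    end
  end
  table.sort(labels)
  local to_number = {}
  for i, label in ipairs(labels) do to_number[label] = i end
  return to_number, labels
end

local am = {"Manual", "Manual", "Manual", "Automatic", "Automatic"}
local to_number, to_label = build_categorical(am)
print(to_number["Manual"], to_label[1])
```

Here `to_label` plays the role of the number-to-class table, while `to_number` goes the other way; a label that is not in the column simply has no entry.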
# Statistics
There are also some convenient basic descriptive statistics available. To get the value counts you can use `value_counts`:
```
th> mtcars_df:value_counts('am')
+-------------------+
| values    | count |
+-------------------+
| Automatic | 19    |
| Manual    | 13    |
+-------------------+
                                                                      [0.0008s]
th> mtcars_df:value_counts("am", true) -- normalized values
+---------------------+
| values    | count   |
+---------------------+
| Manual    | 0.40625 |
| Automatic | 0.59375 |
+---------------------+
                                                                      [0.0010s]
```
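The normalized values are simply the counts divided by the number of rows (19/32 and 13/32). A plain-Lua sketch of the same computation, independent of the package, could look like this; the local `value_counts` function below is my own hypothetical helper and only borrows the name.

```lua
-- Count how often each value occurs; with `normalize` set, the counts
-- are divided by the total number of values.
local function value_counts(values, normalize)
  local counts, total = {}, 0
  for _, v in ipairs(values) do
    counts[v] = (counts[v] or 0) + 1
    total = total + 1
  end
  if normalize and total > 0 then
    -- Assigning to existing keys while traversing with pairs is allowed
    for k, n in pairs(counts) do counts[k] = n / total end
  end
  return counts
end

-- Rebuild a column with the same distribution as the output above
local am = {}
for _ = 1, 19 do am[#am + 1] = "Automatic" end
for _ = 1, 13 do am[#am + 1] = "Manual" end

local counts = value_counts(am)
local shares = value_counts(am, true)
print(counts.Automatic, counts.Manual)
print(shares.Automatic, shares.Manual)
```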
# Help – I’m in argument hell
There are plenty of functions at your disposal, and keeping track of all the arguments is tricky even for the authors. In addition to the [README](https://github.com/AlexMili/torch-dataframe/blob/master/README.md) we have therefore also added a [doc](https://github.com/AlexMili/torch-dataframe/tree/master/doc) folder that contains the entire API. You will furthermore automatically get help if you get the inputs wrong (courtesy of `argcheck`):
```
th> mtcars_df:drop({})
[string "argcheck"]:56:
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Dataframe.drop(self, column_name)

({
   self        = Dataframe  --
   column_name = string     -- The column to drop
})

Delete column from dataset

Return value: self

or

You can also delete multiple columns by supplying a Df_Array

({
   self    = Dataframe  --
   columns = Df_Array   -- The columns to drop
})

Got: Dataframe, table={ }
```
# All properties and functions
Here’s a list of all the properties and functions of the main Dataframe class. This list does not include the metatable functions, subclasses or helper classes:
```
mtcars_df:add_cat_key()             mtcars_df:is_string()
mtcars_df:add_column()              mtcars_df:iterator()
mtcars_df:append()                  mtcars_df:load_csv()
mtcars_df:as_categorical()          mtcars_df:load_table()
mtcars_df:as_string()               mtcars_df.n_rows
mtcars_df:assert_has_column()       mtcars_df:new()
mtcars_df:assert_has_not_column()   mtcars_df:output()
mtcars_df:assert_is_index()         mtcars_df:parallel()
mtcars_df:batch()                   mtcars_df:rbind()
mtcars_df.categorical               mtcars_df:remove_index()
mtcars_df:cbind()                   mtcars_df:rename_column()
mtcars_df:clean_categorical()       mtcars_df:resample()
mtcars_df.column_order              mtcars_df:reset_column()
mtcars_df.columns                   mtcars_df:reset_subsets()
mtcars_df:copy()                    mtcars_df:schema.
mtcars_df:count_na()                mtcars_df:set()
mtcars_df:create_subsets()          mtcars_df:set_version()
mtcars_df.dataset                   mtcars_df:shape()
mtcars_df:drop()                    mtcars_df:show()
mtcars_df:exec()                    mtcars_df:shuffle()
mtcars_df:fill_all_na()             mtcars_df:size()
mtcars_df:fill_na()                 mtcars_df:split()
mtcars_df:from_categorical()        mtcars_df:sub()
mtcars_df:get()                     mtcars_df:tail()
mtcars_df:get_cat_keys()            mtcars_df:to_categorical()
mtcars_df:get_column()              mtcars_df:to_csv()
mtcars_df:get_column_order()        mtcars_df:to_tensor()
mtcars_df:get_max_value()           mtcars_df:tostring()
mtcars_df:get_min_value()           mtcars_df:tostring_defaults.
mtcars_df:get_mode()                mtcars_df:transform()
mtcars_df:get_numerical_colnames()  mtcars_df:unique()
mtcars_df:get_random()              mtcars_df:update()
mtcars_df:get_row()                 mtcars_df:upgrade_frame()
mtcars_df:get_subset()              mtcars_df:value_counts()
mtcars_df:has_column()              mtcars_df:version()
mtcars_df:has_subset()              mtcars_df:where()
mtcars_df:head()                    mtcars_df:which()
mtcars_df:insert()                  mtcars_df:which_max()
mtcars_df:is_boolean()              mtcars_df:which_min()
mtcars_df:is_categorical()          mtcars_df:wide2long()
mtcars_df:is_numerical()
```
# Summary
The torch-dataframe package will hopefully allow you to do all the basic things that you expect from a data frame. In this post we have covered some of the core functionality for installing the package and for loading and looking at the data. The next post will show some of the manipulations that the package provides.