The torch-dataframe – basics on modifications

Forming your data to your needs is crucial. The image i cc by Lennart Tange.

Forming your data to your needs is crucial. The image i cc by Lennart Tange.

In my [previous post][intro post] we took a look at some of the basic functionality. In this post I’ll try to show how to manipulate your dataframe. Note though, the [torch-dataframe][tdf github] is not about data munging, there are far more powerful tools in other languages for this. The aim of the modifications is to do simple tasks without being forced to switch to a different language.

[intro post]: http://gforge.se/2016/08/deep-learning-with-torch-dataframe-a-gentle-introduction-to-torch/
[tdf github]: https://github.com/AlexMili/torch-dataframe

# All posts in the *torch-dataframe* series

1. [Intro to the torch-dataframe][intro post]
2. [Modifications][mods]
3. [Subsetting][subs]
4. [The mnist example][mnist ex]
5. [Multilabel classification][multilabel]

[mods]: http://gforge.se/2016/08/the-torch-dataframe-basics-on-modifications/
[subs]: http://gforge.se/2016/08/the-torch-dataframe-subsetting-and-sampling/
[mnist ex]: http://gforge.se/2016/08/integration-between-torchnet-and-torch-dataframe-a-closer-look-at-the-mnist-example/
[multilabel]: http://gforge.se/2016/08/setting-up-a-multilabel-classification-network-with-torch-dataframe/

We will start as last time with loading the `mtcars` dataset (se [previous post][intro post] for details):

“`lua
require ‘Dataframe’
mtcars_df = Dataframe(“mtcars.csv”)
print(mtcars)
“`

“`
+———————————————————————–+
| | mpg | cyl | disp | hp | drat | wt | qsec | vs | … |
+———————————————————————–+
| Mazd… | 21 | 6 | 160 | 110 | 3.9 | 2.62 | 16.46 | 0 | … |
| Mazd… | 21 | 6 | 160 | 110 | 3.9 | 2.875 | 17.02 | 0 | … |
| Dats… | 22.8 | 4 | 108 | 93 | 3.85 | 2.32 | 18.61 | 1 | … |
| Horn… | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | … |
| Horn… | 18.7 | 8 | 360 | 175 | 3.15 | 3.44 | 17.02 | 0 | … |
| Valiant | 18.1 | 6 | 225 | 105 | 2.76 | 3.46 | 20.22 | 1 | … |
| Dust… | 14.3 | 8 | 360 | 245 | 3.21 | 3.57 | 15.84 | 0 | … |
| Merc… | 24.4 | 4 | 146.7 | 62 | 3.69 | 3.19 | 20 | 1 | … |
| Merc… | 22.8 | 4 | 140.8 | 95 | 3.92 | 3.15 | 22.9 | 1 | … |
| Merc… | 19.2 | 6 | 167.6 | 123 | 3.92 | 3.44 | 18.3 | 1 | … |
| … |
+———————————————————————–+

* Columns skipped: ‘am’, ‘gear’, ‘carb’
“`

# Rename & drop columns

As you can see there are quite a few columns and we would like to rename the first row name column and skip some columns to make it a little easier to overview:

“`lua
th> mtcars_df:
..> rename_column(“”, “rownames”):
..> drop(Df_Array(“cyl”, “disp”, “drat”, “vs”, “carb”))

+——————————————————————-+
| rownames | mpg | hp | wt | qsec | am | gear |
+——————————————————————-+
| Mazda RX4 | 21 | 110 | 2.62 | 16.46 | Manual | 4 |
| Mazda RX4 Wag | 21 | 110 | 2.875 | 17.02 | Manual | 4 |
| Datsun 710 | 22.8 | 93 | 2.32 | 18.61 | Manual | 4 |
| Hornet 4 Drive | 21.4 | 110 | 3.215 | 19.44 | Automatic | 3 |
| Hornet Sportabout | 18.7 | 175 | 3.44 | 17.02 | Automatic | 3 |
| Valiant | 18.1 | 105 | 3.46 | 20.22 | Automatic | 3 |
| Duster 360 | 14.3 | 245 | 3.57 | 15.84 | Automatic | 3 |
| Merc 240D | 24.4 | 62 | 3.19 | 20 | Automatic | 4 |
| Merc 230 | 22.8 | 95 | 3.15 | 22.9 | Automatic | 4 |
| Merc 280 | 19.2 | 123 | 3.44 | 18.3 | Automatic | 4 |
| … |
+——————————————————————-+

[0.0081s]
“`

As all functions return a value you can easily chain operations. The `rename_column` is hopefully rather obvious. The `drop` requires a `Df_Array` as the main argument. This wrapping of tables is a common pattern in torch-dataframe since `argcheck` otherwise will have a hard time due to [ambiguity issues](https://github.com/torch/argcheck/issues/9). The `Df_Array` is a specialized table wrapper without any depth or named keys, the `Df_Dict` is for keyed non-nested tables, e.g. `{a = 1, b = 2}`, while `Df_Tbl` doesn’t have any restrictions.

# Subsetting with `sub`

Subsetting the dataframe is done by the `sub` function:

“`lua
th> mtcars_df:sub(5,7)

+——————————————————————+
| rownames | mpg | hp | wt | qsec | am | gear |
+——————————————————————+
| Hornet Sportabout | 18.7 | 175 | 3.44 | 17.02 | Automatic | 3 |
| Valiant | 18.1 | 105 | 3.46 | 20.22 | Automatic | 3 |
| Duster 360 | 14.3 | 245 | 3.57 | 15.84 | Automatic | 3 |
+——————————————————————+

[0.0025s]
“`

Related functions are `head` and `tail` that allow you to subset `x` first/last rows. You can also use the `:`, the exact same result as above can be achieved via: `mtcars_df[“5:7”]`. The bracket can also be used for retrieving a single row (as a table):

“`lua
th> mtcars_df[5]
{
wt : 3.44
gear : 3
mpg : 18.7
rownames : “Hornet Sportabout”
am : “Automatic”
qsec : 17.02
hp : 175
}
[0.0003s]
“`

Or a column by using the `$` before column name:

“`lua
th> mtcars_df:head(3)[“$hp”]
{
1 : 110
2 : 110
3 : 93
}
“`

The more detailed column retriever is `get_column` that has more options associated with it such as if the values should be as tensors or if categorical values should be returned as integers.

# Where

If we want to select cases based on a criteria we use the where function. If we want to select all cars with automatic and inspect the first three rows we can simply write:

“`lua
th> mtcars_df:where(“am”, “Manual”):head(3)

+————————————————————+
| rownames | mpg | hp | wt | qsec | am | gear |
+————————————————————+
| Mazda RX4 | 21 | 110 | 2.62 | 16.46 | Manual | 4 |
| Mazda RX4 Wag | 21 | 110 | 2.875 | 17.02 | Manual | 4 |
| Datsun 710 | 22.8 | 93 | 2.32 | 18.61 | Manual | 4 |
+————————————————————+

[0.0067s]
“`

We can also use complex functions to find elements of interest:

“`lua
th> mtcars_df:where(function(row) return row.wt * row.hp < 250 end) +----------------------------------------------------------------+ | rownames | mpg | hp | wt | qsec | am | gear | +----------------------------------------------------------------+ | Datsun 710 | 22.8 | 93 | 2.32 | 18.61 | Manual | 4 | | Merc 240D | 24.4 | 62 | 3.19 | 20 | Automatic | 4 | | Fiat 128 | 32.4 | 66 | 2.2 | 19.47 | Manual | 4 | | Honda Civic | 30.4 | 52 | 1.615 | 18.52 | Manual | 4 | | Toyota Corolla | 33.9 | 65 | 1.835 | 19.9 | Manual | 4 | | Toyota Corona | 21.5 | 97 | 2.465 | 20.01 | Automatic | 3 | | Fiat X1-9 | 27.3 | 66 | 1.935 | 18.9 | Manual | 4 | | Porsche 914-2 | 26 | 91 | 2.14 | 16.7 | Manual | 5 | | Lotus Europa | 30.4 | 113 | 1.513 | 16.9 | Manual | 5 | +----------------------------------------------------------------+ [0.0086s] ``` # Update Let's say we want to add a column and assign it a value, we can do `add_column` and then a `update` to populate that column. ```lua th> mtcars_df:add_column(“new_gear”):head(2)

+——————————————————————————+
| rownames | mpg | hp | wt | qsec | am | gear | new_gear |
+——————————————————————————+
| Mazda RX4 | 21 | 110 | 2.62 | 16.46 | Manual | 4 | nan |
| Mazda RX4 Wag | 21 | 110 | 2.875 | 17.02 | Manual | 4 | nan |
+——————————————————————————+

[0.0046s]
th> mtcars_df:update(
..> function(row) return true end,
..> function(row)
..> if (row[“gear”]) then
..> row.new_gear = row[“gear”] + 1
..> end
..> return row
..> end)

+——————————————————————————+
| rownames | mpg | hp | wt | qsec | am | gear | new_gear |
+——————————————————————————+
| Mazda RX4 | 21 | 110 | 2.62 | 16.46 | Manual | 4 | 5 |
| Mazda RX4 Wag | 21 | 110 | 2.875 | 17.02 | Manual | 4 | 5 |
| Datsun 710 | 22.8 | 93 | 2.32 | 18.61 | Manual | 4 | 5 |
| Hornet 4 Drive | 21.4 | 110 | 3.215 | 19.44 | Automatic | 3 | 4 |
| Hornet Sportabout | 18.7 | 175 | 3.44 | 17.02 | Automatic | 3 | 4 |
| Valiant | 18.1 | 105 | 3.46 | 20.22 | Automatic | 3 | 4 |
| Duster 360 | 14.3 | 245 | 3.57 | 15.84 | Automatic | 3 | 4 |
| Merc 240D | 24.4 | 62 | 3.19 | 20 | Automatic | 4 | 5 |
| Merc 230 | 22.8 | 95 | 3.15 | 22.9 | Automatic | 4 | 5 |
| Merc 280 | 19.2 | 123 | 3.44 | 18.3 | Automatic | 4 | 5 |
| … |
+——————————————————————————+

[0.0115s]
“`

# Set

The `set` function is related to update and you can even do a search set pattern:

“`lua
th> mtcars_df:
..> drop(“new_gear”):
..> set(21, “mpg”, Df_Dict{gear = 2}):head(3)

+————————————————————+
| rownames | mpg | hp | wt | qsec | am | gear |
+————————————————————+
| Mazda RX4 | 21 | 110 | 2.62 | 16.46 | Manual | 2 |
| Mazda RX4 Wag | 21 | 110 | 2.875 | 17.02 | Manual | 2 |
| Datsun 710 | 22.8 | 93 | 2.32 | 18.61 | Manual | 4 |
+————————————————————+

[0.0039s]
“`

Although it is perhaps more intuitive to use it with a row index:

“`lua
th> mtcars_df:set(2, Df_Dict{gear = 3}):head(3)

+————————————————————+
| rownames | mpg | hp | wt | qsec | am | gear |
+————————————————————+
| Mazda RX4 | 21 | 110 | 2.62 | 16.46 | Manual | 2 |
| Mazda RX4 Wag | 21 | 110 | 2.875 | 17.02 | Manual | 3 |
| Datsun 710 | 22.8 | 93 | 2.32 | 18.61 | Manual | 4 |
+————————————————————+

[0.0025s]
“`

# Summary

In this post we have covered some of the most basic tools for selecting and updating data. The list isn’t exhaustive and I recommend that you spend some time with the [API-docs][api docs] and experiment.

[api docs]: https://github.com/AlexMili/torch-dataframe/blob/master/doc/README.md

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.