DB Optimizer » R: slicing and dicing data

R: slicing and dicing data

January 4th, 2013

Nicer formating at https://sites.google.com/site/oraclemonitor/r-slicing-and-dicing-data

R can do some awesome data visualizations: http://gallery.r-enthusiasts.com/thumbs.php

Instead of doing one off data visualizations like with Excel, R can automate the process allowing one to visualize many sets of data with the same visualizations.

Installing R is pretty easy http://scs.math.yorku.ca/index.php/R:_Getting_started_with_R

There are lots of blogs out there on getting started with R. The one thing that I didn’t find explained well was slicing and dicing data.

Lets take some data that I want to visualize. The following data shows the performance of network throughput. The throughput is measured by latency of communication in milliseconds (avg_ms) and throughput in MB per second (MB/s).

The parameters are the I/O message size in KB (0KB is actually 1 byte) and the number of concurrent threads sending data (threads)

IOsize ,threads ,avg_ms ,    MB/s
     0 ,      1 ,   .02 ,    .010
     0 ,      8 ,   .04 ,    .024
     0 ,     64 ,   .20 ,    .025
     8 ,      1 ,   .03 , 70.529
     8 ,      8 ,   .04 , 150.389
     8 ,     64 ,   .23 , 48.604
    32 ,      1 ,   .06 , 149.405
    32 ,      8 ,   .07 , 321.392
    32 ,     64 ,   .18 , 73.652
   128 ,      1 ,   .03 , 226.457
   128 ,      8 ,   .01 , 557.196
   128 ,     64 ,   .06 , 180.176
1024 ,      1 ,   .01 , 335.587
1024 ,      8 ,   .01 , 726.876
1024 ,     64 ,   .02 , 714.162

If this data is a file, it can be easily loaded and charted with R.

Find out what directory R is working in:

getwd()

go to a directory with my data and R files:

setwd("C:/Users/Kyle/R")

list files

dir()

load data into a variable

mydata <- read.csv("mydata.csv")

Simple, et voila, the data is loaded. To see the data just type the name of the variable ( the “>” is the R prompt, like “SQL>” in SQL*Plus)

> mydata
   IOsize threads avg_ms    MB.s
1       0       1   0.02   0.010
2       0       8   0.04   0.024
3       0      64   0.20   0.025
4       8       1   0.03 70.529
5       8       8   0.04 150.389
6       8      64   0.23 48.604
7      32       1   0.06 149.405
8      32       8   0.07 321.392
9      32      64   0.18 73.652
10    128       1   0.03 226.457
11    128       8   0.01 557.196
12    128      64   0.06 180.176
13   1024       1   0.01 335.587
14   1024       8   0.01 726.876
15   1024      64   0.02 714.162

Creating a chart is a breeze, just say plot(x,y) where x and y are the values you want to plot.
How to we extract an x and y from mydata?
First pick what to plot. Let’s plot averge ms latency (avg_ms) verse MB per sec (MB.s).
Here is how to extract those columns from the data

x=mydata['avg_ms']
y=mydata['MB.s']

Now plot

> plot(x,y)
Error in stripchart.default(x1, ...) : invalid plotting method

huh … what’s that Error?

If we look at x and/or y, they are actually columns from mydata and plot() wants rows (actually vectors but we’ll get there).

> x
   avg_ms
1    0.02
2    0.04
3    0.20
4    0.03
5    0.04
6    0.23
7    0.06
8    0.07
9    0.18
10   0.03
11   0.01
12   0.06
13   0.01
14   0.01
15   0.02

To transpose a column into a row we can use “t()”

> t(x)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15]
avg_ms 0.02 0.04 0.2 0.03 0.04 0.23 0.06 0.07 0.18 0.03 0.01 0.06 0.01 0.01 0.02

Now we can try plotting again:

> plot(t(x),t(y))

and voila

but let’s address the issue of transforming x and y from columns to rows and specifically into vectors.
Let’s look at the original data and then the transformed data

x=mydata['avg_ms'] # column of data extracted from a data.frame
tx=t(mydata['avg_ms']) # transform the column of data into a row

Look at the datatypes of x and t(x) using the class() function

> class(mydata)

[1] "data.frame"

> class(x)
[1] "data.frame"
> class(tx)
[1] "matrix"

the column is considered a “data.frame” and the row is considered a “matrix”.

The method of extracting a column by it’s column name only works for datatype class data.frame.

If the datatype was a matrix we would be required to supply both the row and column as in matrix[“row”,”column”]

By leaving either row or column empty but keeping the comma in place then it acts as a wild card.

matrix[,”column”] – gives all values in that column

matrix[“row”,] – gives all the values in that row

plot() wants a vector (but it forgivingly works with rows of data as we did above).

R data types

What are these datatypes in R?
There is a simple discussion of data types at http://www.statmethods.net/input/datatypes.html

The types are basically (using “value1:value2” gives a list iterating from value1 to value2 by increments of 1)

integer
- > i=1:5
  > class(i)
  [1] "integer"
  > i
  [1] 1 2 3 4 5
character
- > c=letters[1:5]
  > class(c)
  [1] "character"
  > c
  [1] "a" "b" "c" "d" "e"
(booleans are integers )
- > b=FALSE:TRUE
  > class(b)
  [1] "integer"
  > b
  [1] 0 1
vectors
- > v=c(1,2,3,4,5)
  > class(v)
  [1] "numeric"
  > v
  [1] 1 2 3 4 5
matrix
- > m=matrix(c(1,2,3,4,5))
  > class(m)
  [1] "matrix"
  > m
       [,1]
  [1,]    1
  [2,]    2
  [3,]    3
  [4,]    4
  [5,]    5
data.frames – mixes numeric and character
- > df=matrix(1:5,letters[1:5])      # matrix can't contain character and numeric Error in matrix(1:5, letters[1:5]) : non-numeric matrix extent > > df=data.frame(1:5,letters[1:5]) # dataframe can
  > class(df)
  [1] "data.frame"
  > df
  X1.5 letters.1.5.
  1    1            a
  2    2            b
  3    3            c
  4    4            d
  5    5            e

lists – like an matrix but can mix different data types together such as character, number, matrix

 > a = c(1,2,5.3,6,-2,4) # numeric vector
> # generates 5 x 4 numeric matrix 
> y=matrix(1:20, nrow=5,ncol=4)
> # example of a list with 4 components - 
> # a string, a numeric vector, a matrix, and a scaler 
> w= list(name="Fred", mynumbers=a, mymatrix=y, age=5.3)
> w
$name
[1] "Fred"

$mynumbers
[1]  1.0  2.0  5.3  6.0 -2.0  4.0

$mymatrix
     [,1] [,2] [,3] [,4]
[1,]    1    6   11   16
[2,]    2    7   12   17
[3,]    3    8   13   18
[4,]    4    9   14   19
[5,]    5   10   15   20

$age
[1] 5.3

extract the various parts of a list with list[[“name”]], as in w[[“mymatrix”]]

array – are matrices with more than 2 dimensions
factors

Useful functions on data types

dput(var) – will give structure of var
class(var) – will tell the data type
dim(var) – will set dimension
as.matrix(data.frame) – useful for changing a data.frame into a matrix, though be careful because if there are any character values in the data frame then all entries in the matrix will be charcter

Sometimes R transforms data in ways I don’t predict, but the best strategy is just to force R to do what I want more explicitly.

Converting columns into vectors

When originally selecting out the columns of the data, we could have selected out vectors directly instead of selecting a column and transforming the column to a vector.
Instead of asking for the column which gives a column we can ask for every value in that column
by adding in a “,” infront of the column name. The brackets take the equivalent of x and y coordinates or row and column position. By adding a “,” with no value before it, we are giving a wild card to the row identifier and saying give me all the values for all rows in the column “avg_ms”

x=mydata[,'avg_ms']
> class(x)
[1] "numeric"
> x
[1] 0.02 0.04 0.20 0.03 0.04 0.23 0.06 0.07 0.18 0.03 0.01 0.06 0.01 0.01 0.02

We can also extract the values by the column position instead of column name. The “avg_ms” is column 3

> x=mydata[,3]
> class(x)
[1] "numeric"
> x
[1] 0.02 0.04 0.20 0.03 0.04 0.23 0.06 0.07 0.18 0.03 0.01 0.06 0.01 0.01 0.02

A third way to get the vector format is using “[[ ]]” syntax

> x=mydata[[3]]
> class(x)
[1] "numeric"
> x
[1] 0.02 0.04 0.20 0.03 0.04 0.23 0.06 0.07 0.18 0.03 0.01 0.06 0.01 0.01 0.02

A forth way is with the matrix$col syntax

> x=mydata$avg_ms
> class(x)
[1] "numeric"
> x
[1] 0.02 0.04 0.20 0.03 0.04 0.23 0.06 0.07 0.18 0.03 0.01 0.06 0.01 0.01 0.02

Another way that we’ll talk about in converting a row to a vector is the apply() and as.numeric() functions:
The function apply can also change a column to a vector

> x=mydata['avg_ms']
> class(x)
[1] "data.frame"
> x
   avg_ms
1    0.02
2    0.04
3    0.20
4    0.03
5    0.04
6    0.23
7    0.06
8    0.07
9    0.18
10   0.03
11   0.01
12   0.06
13   0.01
14   0.01
15   0.02
> x=apply(x,1,as.numeric)
> class(x)
[1] "numeric"

> x

                [1] 0.02 0.04 0.20 0.03 0.04 0.23 0.06 0.07 0.18 0.03 0.01 0.06 0.01 0.01 0.02

These vector extractions work for columns but things are different for rows.

Extracting Rows and converting Rows to numeric vectors

The other side other coin is extracting a row into vector format. In mydata, the rows don’t have names, so we have to use position. By specifying row position with no following column names then all column values are given for that row.

> row=mydata[3,]
> class(row)
[1] "data.frame"
> row
IOsize threads avg_ms MB.s
3 0 64 0.2 0.025

The resulting data is a data frame and not a vector (ie a vector is of datatype numeric)
We can use the “as.numeric” function to convert the data.frame to a vector, ie numeric.
The apply() function will apply the “as.numeric” function to multiple values at once. The apply() takes 3 args

input variable
1=row,2=col,1:2=both
function to apply

see http://nsaunders.wordpress.com/2010/08/20/a-brief-introduction-to-apply-in-r/

> ra=apply(row,2,as.numeric)
> class(ra)
[1] "numeric"
> ra
IOsize threads avg_ms MB.s
0.000 64.000 0.200 0.025

The above applies the change to all columns in the given row in a data.frame.

(apply can also be used for example to change all 0 to NULLs

new_matrix = apply(matrix,1:2,function(x)if (x==0) NULL else x)

see http://stackoverflow.com/questions/3505701/r-grouping-functions-sapply-vs-lapply-vs-apply-vs-tapply-vs-by-vs-aggrega)

For selecting the row out directly as a vector, the as.matrix() function can also be used

> row=as.matrix(mydata)[3,]
> class(row)
[1] "numeric"
> row
IOsize threads avg_ms MB.s
0.000 64.000 0.200 0.025

yet another way

> row=c(t(mydata[3,]))
> class(row)
[1] "numeric"
> row
[1] 0.000 64.000 0.200 0.025

( see http://stackoverflow.com/questions/2545228/converting-a-dataframe-to-a-vector-by-rows)

or yet

> row=unlist(mydata[3,])
> class(row)
[1] "numeric"
> row
IOsize threads avg_ms MB.s
0.000 64.000 0.200 0.025

Filtering Data

The data in the CSV file actually represents throughput not only at different I/O send sizes but also for different number of concurrent senders. What if I wanted to just plot the throughput by I/O send size for tests with one thread? How would I filter the data?

IOsize=subset(mydata[,'IOsize'],mydata['threads'] == 1 )
MBs=subset(mydata[,'MB.s'],mydata['threads'] == 1 )
plot(IOsize,MBs)

How about plotting the throughput by I/O size for each number of threads test?
The parameter ‘type=”o”‘ makes the plot a line plot

#extract data

IOsize=subset(mydata[,'IOsize'],mydata['threads'] == 1 )
MBs_1=subset(mydata[,'MB.s'],mydata['threads'] == 1 )
MBs_8=subset(mydata[,'MB.s'],mydata['threads'] == 8 )
MBs_64=subset(mydata[,'MB.s'],mydata['threads'] == 64 )

# create graph

plot(IOsize,MBs_64,type="o")

# plot other lines

lines(IOsize,MBs_1,lty=2,col="green",type="o")
lines(IOsize,MBs_8,lty=3,col="red",type="o")

# add a legend
legend(1,700,c("1 thread","8 threads","64 threads"), cex=0.8,
col=c("green","red","black"), lty=3:1);

Entering data

Instead of entering data via a CSV file it can be entered directly into R

> m=matrix(c(
     0 ,      1 , .02 ,    .010 ,
     0 ,      8 , .04 ,    .024 ,
     0 ,     64 , .20 ,    .025 ,
     8 ,      1 , .03 , 70.529 ,
     8 ,      8 , .04 , 150.389 ,
     8 ,     64 , .23 , 48.604 ,
    32 ,      1 , .06 , 149.405 ,
    32 ,      8 , .07 , 321.392 ,
    32 ,     64 , .18 , 73.652 ,
   128 ,      1 , .03 , 226.457 ,
   128 ,      8 , .01 , 557.196 ,
   128 ,     64 , .06 , 180.176 ,
1024 ,      1 , .01 , 335.587 ,
1024 ,      8 , .01 , 726.876 ,
1024 ,     64 , .02 , 714.162 ),
nrow=4,ncol=15,
dimnames=list(rows=c( 'IOsize' ,'threads' ,'avg_ms' , 'MB/s'
)))
> m

rows      [,1]  [,2]   [,3]   [,4]    [,5]   [,6]    [,7]    [,8]   [,9]   [,10]   [,11]   [,12]    [,13]    [,14]    [,15]

  IOsize  0.00 0.000  0.000  8.000   8.000  8.000  32.000  32.000 32.000 128.000 128.000 128.000 1024.000 1024.000 1024.000

  threads 1.00 8.000 64.000  1.000   8.000 64.000   1.000   8.000 64.000   1.000   8.000  64.000    1.000    8.000   64.000

  avg_ms  0.02 0.040  0.200  0.030   0.040  0.230   0.060   0.070  0.180   0.030   0.010   0.060    0.010    0.010    0.020

  MB/s    0.01 0.024  0.025 70.529 150.389 48.604 149.405 321.392 73.652 226.457 557.196 180.176  335.587  726.876  714.162

> t(m)
        IOsize threads avg_ms    MB/s
   [1,]      0       1   0.02   0.010
   [2,]      0       8   0.04   0.024
   [3,]      0      64   0.20   0.025
   [4,]      8       1   0.03 70.529
   [5,]      8       8   0.04 150.389
   [6,]      8      64   0.23 48.604
   [7,]     32       1   0.06 149.405
   [8,]     32       8   0.07 321.392
   [9,]     32      64   0.18 73.652
[10,]    128       1   0.03 226.457
[11,]    128       8   0.01 557.196
[12,]    128      64   0.06 180.176
[13,]   1024       1   0.01 335.587
[14,]   1024       8   0.01 726.876
[15,]   1024      64   0.02 714.162

The bizarre thing about this is that the nrows corresponds to the number of columns and the matrix comes out transposed. Using t() can re-transpose it, but this is all confusing.
To make it more intuitive add the argument
"byrow=TRUE,"
and add a
"NULL"
for the rowname position in the row and columns name section

m=matrix(c(
     0 ,      1 , .02 ,    .010 ,
     0 ,      8 , .04 ,    .024 ,
     0 ,     64 , .20 ,    .025 ,
     8 ,      1 , .03 , 70.529 ,
     8 ,      8 , .04 , 150.389 ,
     8 ,     64 , .23 , 48.604 ,
    32 ,      1 , .06 , 149.405 ,
    32 ,      8 , .07 , 321.392 ,
    32 ,     64 , .18 , 73.652 ,
   128 ,      1 , .03 , 226.457 ,
   128 ,      8 , .01 , 557.196 ,
   128 ,     64 , .06 , 180.176 ,
1024 ,      1 , .01 , 335.587 ,
1024 ,      8 , .01 , 726.876 ,
1024 ,     64 , .02 , 714.162 ),
nrow=15,ncol=4,byrow=TRUE,
dimnames=list(NULL,c( 'IOsize' ,'threads' ,'avg_ms' , 'MB/s'
)))
> m
     IOsize threads avg_ms    MB/s
[1,]      0       1   0.02   0.010
[2,]      0       8   0.04   0.024
[3,]      0      64   0.20   0.025
[4,]      8       1   0.03 70.529
[5,]      8       8   0.04 150.389
[6,]      8      64   0.23 48.604
[7,]     32       1   0.06 149.405
[8,]     32       8   0.07 321.392
[9,]     32      64   0.18 73.652
[10,]    128       1   0.03 226.457
[11,]    128       8   0.01 557.196
[12,]    128      64   0.06 180.176
[13,]   1024       1   0.01 335.587
[14,]   1024       8   0.01 726.876
[15,]   1024      64   0.02 714.162

Vectorwise maximum/minimum

Another issues is trying to get the max or min of two or more values on a point by point basis.

Using the “min()” function gives a single minimum and not a minimum on a point by point basis.

Use “pmax()” and “pmin()” to get point by point max and min of two or more vectors.

> lat
[1] 44.370 22.558 37.708 73.070 131.950
> std
[1] 37.7 21.6 67.1 136.1 186.0
> min
[1] 0.0 0.6 0.6 1.0 1.0
> pmax(lat-std,min)
[1] 6.670 0.958 0.600 1.000 1.000