R: slicing and dicing data
Nicer formating at https://sites.google.com/site/oraclemonitor/r-slicing-and-dicing-data
- R data types
- Converting columns into vectors
- Extracting Rows and converting Rows to numeric vectors
- Entering data
- Vectorwise maximum/minimum
- Column Sums and Row Sums
R can do some awesome data visualizations: http://gallery.r-enthusiasts.com/thumbs.php
Instead of doing one off data visualizations like with Excel, R can automate the process allowing one to visualize many sets of data with the same visualizations.
Installing R is pretty easy http://scs.math.yorku.ca/index.php/R:_Getting_started_with_R
There are lots of blogs out there on getting started with R. The one thing that I didn’t find explained well was slicing and dicing data.
Lets take some data that I want to visualize. The following data shows the performance of network throughput. The throughput is measured by latency of communication in milliseconds (avg_ms) and throughput in MB per second (MB/s).
The parameters are the I/O message size in KB (0KB is actually 1 byte) and the number of concurrent threads sending data (threads)
IOsize ,threads ,avg_ms , MB/s
0 , 1 , .02 , .010
0 , 8 , .04 , .024
0 , 64 , .20 , .025
8 , 1 , .03 , 70.529
8 , 8 , .04 , 150.389
8 , 64 , .23 , 48.604
32 , 1 , .06 , 149.405
32 , 8 , .07 , 321.392
32 , 64 , .18 , 73.652
128 , 1 , .03 , 226.457
128 , 8 , .01 , 557.196
128 , 64 , .06 , 180.176
1024 , 1 , .01 , 335.587
1024 , 8 , .01 , 726.876
1024 , 64 , .02 , 714.162
If this data is a file, it can be easily loaded and charted with R.
Find out what directory R is working in:
getwd()
go to a directory with my data and R files:
setwd("C:/Users/Kyle/R")
list files
dir()
load data into a variable
mydata <- read.csv("mydata.csv")
Simple, et voila, the data is loaded. To see the data just type the name of the variable ( the “>” is the R prompt, like “SQL>” in SQL*Plus)
> mydata
IOsize threads avg_ms MB.s
1 0 1 0.02 0.010
2 0 8 0.04 0.024
3 0 64 0.20 0.025
4 8 1 0.03 70.529
5 8 8 0.04 150.389
6 8 64 0.23 48.604
7 32 1 0.06 149.405
8 32 8 0.07 321.392
9 32 64 0.18 73.652
10 128 1 0.03 226.457
11 128 8 0.01 557.196
12 128 64 0.06 180.176
13 1024 1 0.01 335.587
14 1024 8 0.01 726.876
15 1024 64 0.02 714.162
Creating a chart is a breeze, just say plot(x,y) where x and y are the values you want to plot.
How to we extract an x and y from mydata?
First pick what to plot. Let’s plot averge ms latency (avg_ms) verse MB per sec (MB.s).
Here is how to extract those columns from the data
x=mydata['avg_ms']
y=mydata['MB.s']
Now plot
> plot(x,y)
Error in stripchart.default(x1, ...) : invalid plotting method
huh … what’s that Error?
If we look at x and/or y, they are actually columns from mydata and plot() wants rows (actually vectors but we’ll get there).
> x
avg_ms
1 0.02
2 0.04
3 0.20
4 0.03
5 0.04
6 0.23
7 0.06
8 0.07
9 0.18
10 0.03
11 0.01
12 0.06
13 0.01
14 0.01
15 0.02
To transpose a column into a row we can use “t()”
> t(x)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15]
avg_ms 0.02 0.04 0.2 0.03 0.04 0.23 0.06 0.07 0.18 0.03 0.01 0.06 0.01 0.01 0.02
Now we can try plotting again:
> plot(t(x),t(y))
and voila
but let’s address the issue of transforming x and y from columns to rows and specifically into vectors.
Let’s look at the original data and then the transformed data
x=mydata['avg_ms'] # column of data extracted from a data.frame
tx=t(mydata['avg_ms']) # transform the column of data into a row
Look at the datatypes of x and t(x) using the class() function
> class(mydata)
[1] "data.frame"
> class(x)
[1] "data.frame"
> class(tx)
[1] "matrix"
the column is considered a “data.frame” and the row is considered a “matrix”.
The method of extracting a column by it’s column name only works for datatype class data.frame.
If the datatype was a matrix we would be required to supply both the row and column as in matrix[“row”,”column”]
By leaving either row or column empty but keeping the comma in place then it acts as a wild card.
matrix[,”column”] – gives all values in that column
matrix[“row”,] – gives all the values in that row
plot() wants a vector (but it forgivingly works with rows of data as we did above).
R data types
What are these datatypes in R?
There is a simple discussion of data types at http://www.statmethods.net/input/datatypes.html
The types are basically (using “value1:value2” gives a list iterating from value1 to value2 by increments of 1)
- integer
-
> i=1:5
> class(i)
[1] "integer"
> i
[1] 1 2 3 4 5
-
- character
-
> c=letters[1:5]
> class(c)
[1] "character"
> c
[1] "a" "b" "c" "d" "e"
-
- (booleans are integers )
-
> b=FALSE:TRUE
> class(b)
[1] "integer"
> b
[1] 0 1
-
- vectors
-
> v=c(1,2,3,4,5)
> class(v)
[1] "numeric"
> v
[1] 1 2 3 4 5
-
- matrix
-
> m=matrix(c(1,2,3,4,5))
> class(m)
[1] "matrix"
> m
[,1]
[1,] 1
[2,] 2
[3,] 3
[4,] 4
[5,] 5
-
- data.frames – mixes numeric and character
-
> df=matrix(1:5,letters[1:5]) # matrix can't contain character and numeric
Error in matrix(1:5, letters[1:5]) : non-numeric matrix extent
>
> df=data.frame(1:5,letters[1:5]) # dataframe can
> class(df)
[1] "data.frame"
> df
X1.5 letters.1.5.
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
-
- lists – like an matrix but can mix different data types together such as character, number, matrix
-
> a = c(1,2,5.3,6,-2,4) # numeric vector > # generates 5 x 4 numeric matrix > y=matrix(1:20, nrow=5,ncol=4) > # example of a list with 4 components - > # a string, a numeric vector, a matrix, and a scaler > w= list(name="Fred", mynumbers=a, mymatrix=y, age=5.3) > w $name [1] "Fred" $mynumbers [1] 1.0 2.0 5.3 6.0 -2.0 4.0 $mymatrix [,1] [,2] [,3] [,4] [1,] 1 6 11 16 [2,] 2 7 12 17 [3,] 3 8 13 18 [4,] 4 9 14 19 [5,] 5 10 15 20 $age [1] 5.3
- extract the various parts of a list with list[[“name”]], as in w[[“mymatrix”]]
-
- array – are matrices with more than 2 dimensions
- factors
Useful functions on data types
- dput(var) – will give structure of var
- class(var) – will tell the data type
- dim(var) – will set dimension
- as.matrix(data.frame) – useful for changing a data.frame into a matrix, though be careful because if there are any character values in the data frame then all entries in the matrix will be charcter
Sometimes R transforms data in ways I don’t predict, but the best strategy is just to force R to do what I want more explicitly.
Converting columns into vectors
When originally selecting out the columns of the data, we could have selected out vectors directly instead of selecting a column and transforming the column to a vector.
Instead of asking for the column which gives a column we can ask for every value in that column
by adding in a “,” infront of the column name. The brackets take the equivalent of x and y coordinates or row and column position. By adding a “,” with no value before it, we are giving a wild card to the row identifier and saying give me all the values for all rows in the column “avg_ms”
x=mydata[,'avg_ms']
> class(x)
[1] "numeric"
> x
[1] 0.02 0.04 0.20 0.03 0.04 0.23 0.06 0.07 0.18 0.03 0.01 0.06 0.01 0.01 0.02
We can also extract the values by the column position instead of column name. The “avg_ms” is column 3
> x=mydata[,3]
> class(x)
[1] "numeric"
> x
[1] 0.02 0.04 0.20 0.03 0.04 0.23 0.06 0.07 0.18 0.03 0.01 0.06 0.01 0.01 0.02
A third way to get the vector format is using “[[ ]]” syntax
> x=mydata[[3]]
> class(x)
[1] "numeric"
> x
[1] 0.02 0.04 0.20 0.03 0.04 0.23 0.06 0.07 0.18 0.03 0.01 0.06 0.01 0.01 0.02
A forth way is with the matrix$col syntax
> x=mydata$avg_ms
> class(x)
[1] "numeric"
> x
[1] 0.02 0.04 0.20 0.03 0.04 0.23 0.06 0.07 0.18 0.03 0.01 0.06 0.01 0.01 0.02
Another way that we’ll talk about in converting a row to a vector is the apply() and as.numeric() functions:
The function apply can also change a column to a vector
> x=mydata['avg_ms']
> class(x)
[1] "data.frame"
> x
avg_ms
1 0.02
2 0.04
3 0.20
4 0.03
5 0.04
6 0.23
7 0.06
8 0.07
9 0.18
10 0.03
11 0.01
12 0.06
13 0.01
14 0.01
15 0.02
> x=apply(x,1,as.numeric)
> class(x)
[1] "numeric"
> x
[1] 0.02 0.04 0.20 0.03 0.04 0.23 0.06 0.07 0.18 0.03 0.01 0.06 0.01 0.01 0.02
These vector extractions work for columns but things are different for rows.
Extracting Rows and converting Rows to numeric vectors
The other side other coin is extracting a row into vector format. In mydata, the rows don’t have names, so we have to use position. By specifying row position with no following column names then all column values are given for that row.
> row=mydata[3,]
> class(row)
[1] "data.frame"
> row
IOsize threads avg_ms MB.s
3 0 64 0.2 0.025
The resulting data is a data frame and not a vector (ie a vector is of datatype numeric)
We can use the “as.numeric” function to convert the data.frame to a vector, ie numeric.
The apply() function will apply the “as.numeric” function to multiple values at once. The apply() takes 3 args
- input variable
- 1=row,2=col,1:2=both
- function to apply
see http://nsaunders.wordpress.com/2010/08/20/a-brief-introduction-to-apply-in-r/
> ra=apply(row,2,as.numeric)
> class(ra)
[1] "numeric"
> ra
IOsize threads avg_ms MB.s
0.000 64.000 0.200 0.025
The above applies the change to all columns in the given row in a data.frame.
(apply can also be used for example to change all 0 to NULLs
new_matrix = apply(matrix,1:2,function(x)if (x==0) NULL else x)
see http://stackoverflow.com/questions/3505701/r-grouping-functions-sapply-vs-lapply-vs-apply-vs-tapply-vs-by-vs-aggrega)
For selecting the row out directly as a vector, the as.matrix() function can also be used
> row=as.matrix(mydata)[3,]
> class(row)
[1] "numeric"
> row
IOsize threads avg_ms MB.s
0.000 64.000 0.200 0.025
yet another way
> row=c(t(mydata[3,]))
> class(row)
[1] "numeric"
> row
[1] 0.000 64.000 0.200 0.025
( see http://stackoverflow.com/questions/2545228/converting-a-dataframe-to-a-vector-by-rows)
or yet
> row=unlist(mydata[3,])
> class(row)
[1] "numeric"
> row
IOsize threads avg_ms MB.s
0.000 64.000 0.200 0.025
Filtering Data
The data in the CSV file actually represents throughput not only at different I/O send sizes but also for different number of concurrent senders. What if I wanted to just plot the throughput by I/O send size for tests with one thread? How would I filter the data?
IOsize=subset(mydata[,'IOsize'],mydata['threads'] == 1 )
MBs=subset(mydata[,'MB.s'],mydata['threads'] == 1 )
plot(IOsize,MBs)
How about plotting the throughput by I/O size for each number of threads test?
The parameter ‘type=”o”‘ makes the plot a line plot
#extract data
IOsize=subset(mydata[,'IOsize'],mydata['threads'] == 1 )
MBs_1=subset(mydata[,'MB.s'],mydata['threads'] == 1 )
MBs_8=subset(mydata[,'MB.s'],mydata['threads'] == 8 )
MBs_64=subset(mydata[,'MB.s'],mydata['threads'] == 64 )
# create graph
plot(IOsize,MBs_64,type="o")
# plot other lines
lines(IOsize,MBs_1,lty=2,col="green",type="o")
lines(IOsize,MBs_8,lty=3,col="red",type="o")
# add a legend
legend(1,700,c("1 thread","8 threads","64 threads"), cex=0.8,
col=c("green","red","black"), lty=3:1);
Entering data
Instead of entering data via a CSV file it can be entered directly into R
> m=matrix(c(
0 , 1 , .02 , .010 ,
0 , 8 , .04 , .024 ,
0 , 64 , .20 , .025 ,
8 , 1 , .03 , 70.529 ,
8 , 8 , .04 , 150.389 ,
8 , 64 , .23 , 48.604 ,
32 , 1 , .06 , 149.405 ,
32 , 8 , .07 , 321.392 ,
32 , 64 , .18 , 73.652 ,
128 , 1 , .03 , 226.457 ,
128 , 8 , .01 , 557.196 ,
128 , 64 , .06 , 180.176 ,
1024 , 1 , .01 , 335.587 ,
1024 , 8 , .01 , 726.876 ,
1024 , 64 , .02 , 714.162 ),
nrow=4,ncol=15,
dimnames=list(rows=c( 'IOsize' ,'threads' ,'avg_ms' , 'MB/s'
)))
> m
rows [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15]
IOsize 0.00 0.000 0.000 8.000 8.000 8.000 32.000 32.000 32.000 128.000 128.000 128.000 1024.000 1024.000 1024.000
threads 1.00 8.000 64.000 1.000 8.000 64.000 1.000 8.000 64.000 1.000 8.000 64.000 1.000 8.000 64.000
avg_ms 0.02 0.040 0.200 0.030 0.040 0.230 0.060 0.070 0.180 0.030 0.010 0.060 0.010 0.010 0.020
MB/s 0.01 0.024 0.025 70.529 150.389 48.604 149.405 321.392 73.652 226.457 557.196 180.176 335.587 726.876 714.162
> t(m)
IOsize threads avg_ms MB/s
[1,] 0 1 0.02 0.010
[2,] 0 8 0.04 0.024
[3,] 0 64 0.20 0.025
[4,] 8 1 0.03 70.529
[5,] 8 8 0.04 150.389
[6,] 8 64 0.23 48.604
[7,] 32 1 0.06 149.405
[8,] 32 8 0.07 321.392
[9,] 32 64 0.18 73.652
[10,] 128 1 0.03 226.457
[11,] 128 8 0.01 557.196
[12,] 128 64 0.06 180.176
[13,] 1024 1 0.01 335.587
[14,] 1024 8 0.01 726.876
[15,] 1024 64 0.02 714.162
The bizarre thing about this is that the nrows corresponds to the number of columns and the matrix comes out transposed. Using t() can re-transpose it, but this is all confusing.
To make it more intuitive add the argument
"byrow=TRUE,"
and add a
"NULL"
for the rowname position in the row and columns name section
m=matrix(c(
0 , 1 , .02 , .010 ,
0 , 8 , .04 , .024 ,
0 , 64 , .20 , .025 ,
8 , 1 , .03 , 70.529 ,
8 , 8 , .04 , 150.389 ,
8 , 64 , .23 , 48.604 ,
32 , 1 , .06 , 149.405 ,
32 , 8 , .07 , 321.392 ,
32 , 64 , .18 , 73.652 ,
128 , 1 , .03 , 226.457 ,
128 , 8 , .01 , 557.196 ,
128 , 64 , .06 , 180.176 ,
1024 , 1 , .01 , 335.587 ,
1024 , 8 , .01 , 726.876 ,
1024 , 64 , .02 , 714.162 ),
nrow=15,ncol=4,byrow=TRUE,
dimnames=list(NULL,c( 'IOsize' ,'threads' ,'avg_ms' , 'MB/s'
)))
> m
IOsize threads avg_ms MB/s
[1,] 0 1 0.02 0.010
[2,] 0 8 0.04 0.024
[3,] 0 64 0.20 0.025
[4,] 8 1 0.03 70.529
[5,] 8 8 0.04 150.389
[6,] 8 64 0.23 48.604
[7,] 32 1 0.06 149.405
[8,] 32 8 0.07 321.392
[9,] 32 64 0.18 73.652
[10,] 128 1 0.03 226.457
[11,] 128 8 0.01 557.196
[12,] 128 64 0.06 180.176
[13,] 1024 1 0.01 335.587
[14,] 1024 8 0.01 726.876
[15,] 1024 64 0.02 714.162
Vectorwise maximum/minimum
> lat
[1] 44.370 22.558 37.708 73.070 131.950
> std
[1] 37.7 21.6 67.1 136.1 186.0
> min
[1] 0.0 0.6 0.6 1.0 1.0
> pmax(lat-std,min)
[1] 6.670 0.958 0.600 1.000 1.000
Column Sums and Row Sums
to sum up rows or colums use “rowSums()” and “colSUms()”
http://stat.ethz.ch/R-manual/R-patched/library/base/html/colSums.html
For more info
for more info on data types and manipulation see
see: http://cran.r-project.org/doc/manuals/R-intro.html
Trackbacks
Comments