Archive

Archive for the ‘Uncategorized’ Category

NoCOUG 21 Feb 2013 at Oracle: Big Data and NoSQL

January 30th, 2013

This past year I joined up on the board of the Northern California Users Group (NoCOUG) and am excited to help the NoCOUG team bring great conferenes to the Bay Area.

Our new conference director Ben Prusinski has scheduled a whole slew of NoSQL and Big Data presentations at Conference 105 on February 21. Has NoCOUG abandoned SQL? You can bet your bottom dollar it hasn’t! In fact, it was Oracle’s own idea to have so many NoSQL and Big Data presentations at Conference 105. Dave Rubin, Director of NoSQL Database Development at Oracle will explain why NoSQL Database and Oracle Database are the perfect match. They are not mortal enemies. They complete and complement each other. They shine in different use cases.

But why Big Data? Can’t Oracle Database handle gynormous amounts of data? Sure it can but, once again, it’s all about using the right tool for the job. Hadoop and Oracle are not mortal enemies. They complete and complement each other. They shine in different use cases.

If you’re a die-hard relational purist and refuse to look NoSQL and Big Data in the eye then have no fear for Ben has scheduled plenty of traditional sessions for Oracle developers as well as database administrators.

We’re looking forward to another awesome NoCOUG conference with something for everybody.

Click here to check the agenda and RSVP.

Kindest regards,

The hard-working board and volunteers of NoCOUG

REGISTER HERE: http://www.nocoug.org/rsvp.html

Auditorium Room 102 Room 103
Registration and Continental Breakfast Thursday 8:00 to 9:00 Registration and Continental Breakfast Thursday 8:00 to 9:00
President Welcome and General Session Thursday 9:00 to 9:30 President Welcome and General Session Thursday 9:00 to 9:30
Thursday 9:30 to 10:30
Thursday 9:30 to 10:30
Thursday 11:00 to 12:00
Big Data: The Big Story
Jean-Pierre Dijcks
Oracle Corp
Topic
NoSQL
Understanding SQLTXPLAIN (SQLT) main report by nav…
Carlos Sierra
Oracle Corp
Topic
Developer
Thursday 11:00 to 12:00
Thursday 13:00 to 14:00
The Sins of SQL Programming that Send the DB to Po…
Abel Macias
Oracle Corp
Topic
Developer
Thursday 13:00 to 14:00
Thursday 14:30 to 15:30
Reduce Database Latency
Josh Lyford
Whiptail Storage
Topic
DBA
Advanced SQL Injection techniques
Slavik Markovich
McAfee
Topic
Developer
Thursday 14:30 to 15:30
Thursday 16:00 to 17:00
Looney Tuner? No, there IS a method to my madness…
Janis Griffin
Confio Software
Topic
Developer
Thursday 16:00 to 17:00

Uncategorized

R: slicing and dicing data

January 4th, 2013

Nicer formating at https://sites.google.com/site/oraclemonitor/r-slicing-and-dicing-data

  1. R data types
  2. Converting columns into vectors
  3. Extracting Rows and converting Rows to numeric vectors
  4. Entering data
  5. Vectorwise maximum/minimum
  6. Column Sums and Row Sums

R can do some awesome data visualizations: http://gallery.r-enthusiasts.com/thumbs.php

Instead of doing one off data visualizations like with Excel, R can automate the process allowing one to visualize many sets of data with the same visualizations.

Installing R is pretty easy http://scs.math.yorku.ca/index.php/R:_Getting_started_with_R

There are lots of blogs out there on getting started with R. The one thing that I didn’t find explained well was slicing and dicing data.

Lets take some data that I want to visualize.  The following data shows the performance of network throughput. The throughput is measured by latency of communication in milliseconds (avg_ms) and throughput in MB per second (MB/s).

The parameters are the I/O message size in KB (0KB is actually 1 byte) and the number of concurrent threads sending data (threads)

IOsize ,threads ,avg_ms ,    MB/s
     0 ,      1 ,   .02 ,    .010
     0 ,      8 ,   .04 ,    .024
     0 ,     64 ,   .20 ,    .025
     8 ,      1 ,   .03 ,  70.529
     8 ,      8 ,   .04 , 150.389
     8 ,     64 ,   .23 ,  48.604
    32 ,      1 ,   .06 , 149.405
    32 ,      8 ,   .07 , 321.392
    32 ,     64 ,   .18 ,  73.652
   128 ,      1 ,   .03 , 226.457
   128 ,      8 ,   .01 , 557.196
   128 ,     64 ,   .06 , 180.176
  1024 ,      1 ,   .01 , 335.587
  1024 ,      8 ,   .01 , 726.876
  1024 ,     64 ,   .02 , 714.162

If this data is a file, it can be easily loaded and charted with R.

Find out what directory R is working in:

getwd()

go to a directory with my data and R files:

setwd("C:/Users/Kyle/R")

list files

dir()

load data into a variable

mydata <- read.csv("mydata.csv")

Simple, et voila, the data is loaded. To see the data just type the name of the variable ( the “>” is the R prompt, like “SQL>” in SQL*Plus)

> mydata
   IOsize threads avg_ms    MB.s
1       0       1   0.02   0.010
2       0       8   0.04   0.024
3       0      64   0.20   0.025
4       8       1   0.03  70.529
5       8       8   0.04 150.389
6       8      64   0.23  48.604
7      32       1   0.06 149.405
8      32       8   0.07 321.392
9      32      64   0.18  73.652
10    128       1   0.03 226.457
11    128       8   0.01 557.196
12    128      64   0.06 180.176
13   1024       1   0.01 335.587
14   1024       8   0.01 726.876
15   1024      64   0.02 714.162

Creating a chart is a breeze, just say plot(x,y) where x and y are the values you want to plot.
How to we extract an x and y from mydata?
First pick what to plot. Let’s plot averge ms latency (avg_ms) verse MB per sec (MB.s).
Here is how to extract those columns from the data

x=mydata['avg_ms']
y=mydata['MB.s']

Now plot

> plot(x,y)
Error in stripchart.default(x1, ...) : invalid plotting method

huh … what’s that Error?

If we look at x and/or y, they are actually columns from mydata and plot() wants rows (actually vectors but we’ll get there).

> x
   avg_ms
1    0.02
2    0.04
3    0.20
4    0.03
5    0.04
6    0.23
7    0.06
8    0.07
9    0.18
10   0.03
11   0.01
12   0.06
13   0.01
14   0.01
15   0.02

To transpose a column into a row we can use “t()”

> t(x)
       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15]
avg_ms 0.02 0.04  0.2 0.03 0.04 0.23 0.06 0.07 0.18  0.03  0.01  0.06  0.01  0.01  0.02

Now we can try plotting again:

> plot(t(x),t(y))

and voila

but let’s address the issue of transforming x and y from columns to rows and specifically into vectors.
Let’s look at the original data and then the transformed data

x=mydata['avg_ms']     #  column of data extracted from a data.frame
tx=t(mydata['avg_ms']) #  transform the column of data into a row

Look at the datatypes of x and t(x) using the class() function

> class(mydata)
[1] "data.frame"
> class(x)
[1] "data.frame"
> class(tx)
[1] "matrix"

the column is considered a “data.frame” and the row is considered a “matrix”.

The method of extracting a column by it’s column name only works for datatype class data.frame.

If the datatype was a matrix we would be required to supply both the row and column as in  matrix[“row”,”column”]

By leaving either row or column empty but keeping the comma in place then it acts as a wild card.

matrix[,”column”] – gives all values in that column

matrix[“row”,] – gives all the values in that row

plot() wants a vector (but it forgivingly works with rows of data as we did above).

R data types

What are these datatypes in R?
There is a simple discussion of data types at http://www.statmethods.net/input/datatypes.html

The types are basically (using “value1:value2” gives a list iterating from value1 to value2 by increments of 1)

  • integer
    • > i=1:5
      > class(i)
      [1] "integer"
      > i
      [1] 1 2 3 4 5
  • character
    • > c=letters[1:5]
      > class(c)
      [1] "character"
      > c
      [1] "a" "b" "c" "d" "e"
  • (booleans are integers )
    • > b=FALSE:TRUE
      > class(b)
      [1] "integer"
      > b
      [1] 0 1
  • vectors
    • > v=c(1,2,3,4,5)
      > class(v)
      [1] "numeric"
      > v
      [1] 1 2 3 4 5
  • matrix
    • > m=matrix(c(1,2,3,4,5))
      > class(m)
      [1] "matrix"
      > m
           [,1]
      [1,]    1
      [2,]    2
      [3,]    3
      [4,]    4
      [5,]    5
  • data.frames – mixes numeric and character
    • > df=matrix(1:5,letters[1:5])      # matrix can't contain character and numeric
      Error in matrix(1:5, letters[1:5]) : non-numeric matrix extent
      >
      > df=data.frame(1:5,letters[1:5])  # dataframe can

      > class(df)
      [1] "data.frame"
      > df
        X1.5 letters.1.5.
      1    1            a
      2    2            b
      3    3            c
      4    4            d
      5    5            e
  • lists – like an matrix but can mix different data types together such as character, number, matrix
    •  > a = c(1,2,5.3,6,-2,4) # numeric vector
      > # generates 5 x 4 numeric matrix 
      > y=matrix(1:20, nrow=5,ncol=4)
      > # example of a list with 4 components - 
      > # a string, a numeric vector, a matrix, and a scaler 
      > w= list(name="Fred", mynumbers=a, mymatrix=y, age=5.3)
      > w
      $name
      [1] "Fred"
      
      $mynumbers
      [1]  1.0  2.0  5.3  6.0 -2.0  4.0
      
      $mymatrix
           [,1] [,2] [,3] [,4]
      [1,]    1    6   11   16
      [2,]    2    7   12   17
      [3,]    3    8   13   18
      [4,]    4    9   14   19
      [5,]    5   10   15   20
      
      $age
      [1] 5.3
    • extract the various parts of a list with  list[[“name”]], as in w[[“mymatrix”]]
  • array – are matrices with more than 2 dimensions
  • factors

Useful functions on data types

  • dput(var) – will give structure of var
  • class(var) – will tell the data type
  • dim(var) – will set dimension
  • as.matrix(data.frame) – useful for changing a data.frame into a matrix, though be careful because if there are any character values in the data frame then all entries in the matrix will be charcter

Sometimes R transforms data in ways I don’t predict, but the best strategy is just to force R to do what I want more explicitly.

Converting columns into vectors

When originally selecting out the columns of the data, we could have selected out vectors directly instead of selecting a column and transforming the column to a vector.
Instead of asking for the column which gives a column we can ask for every value in that column
by adding in a “,” infront of the column name. The brackets take the equivalent of x and y coordinates or row and column position. By adding a “,” with no value before it, we are giving a wild card to the row identifier and saying give me all the values for all rows in the column “avg_ms”

x=mydata[,'avg_ms']
> class(x)
[1] "numeric"
> x
 [1] 0.02 0.04 0.20 0.03 0.04 0.23 0.06 0.07 0.18 0.03 0.01 0.06 0.01 0.01 0.02

We can also extract the values by the column position instead of column name. The “avg_ms” is column 3

> x=mydata[,3]
> class(x)
[1] "numeric"
> x
 [1] 0.02 0.04 0.20 0.03 0.04 0.23 0.06 0.07 0.18 0.03 0.01 0.06 0.01 0.01 0.02

A third way to get the vector format is using “[[ ]]” syntax

> x=mydata[[3]]
> class(x)
[1] "numeric"
> x
 [1] 0.02 0.04 0.20 0.03 0.04 0.23 0.06 0.07 0.18 0.03 0.01 0.06 0.01 0.01 0.02

A forth way is with the matrix$col syntax

> x=mydata$avg_ms
> class(x)
[1] "numeric"
> x
 [1] 0.02 0.04 0.20 0.03 0.04 0.23 0.06 0.07 0.18 0.03 0.01 0.06 0.01 0.01 0.02

Another way that we’ll talk about in converting a row to a vector is the apply() and as.numeric() functions:
The function apply can also change a column to a vector

> x=mydata['avg_ms']
> class(x)
[1] "data.frame"
> x
   avg_ms
1    0.02
2    0.04
3    0.20
4    0.03
5    0.04
6    0.23
7    0.06
8    0.07
9    0.18
10   0.03
11   0.01
12   0.06
13   0.01
14   0.01
15   0.02
> x=apply(x,1,as.numeric)
> class(x)
[1] "numeric"
> x
[1] 0.02 0.04 0.20 0.03 0.04 0.23 0.06 0.07 0.18 0.03 0.01 0.06 0.01 0.01 0.02

These vector extractions work for columns but things are different for rows.

Extracting Rows and converting Rows to numeric vectors

The other side other coin is extracting a row into vector format. In mydata, the rows don’t have names, so we have to use position. By specifying row position with no following column names then all column values are given for that row.

> row=mydata[3,]
> class(row)
[1] "data.frame"
> row
  IOsize threads avg_ms  MB.s
3      0      64    0.2 0.025

The resulting data is a  data frame and not a vector  (ie a vector is of datatype numeric)
We can use the “as.numeric” function to convert the data.frame to a vector, ie numeric.
The apply() function will apply the “as.numeric” function to multiple values at once. The apply() takes 3 args

  • input variable
  • 1=row,2=col,1:2=both
  • function to apply

see http://nsaunders.wordpress.com/2010/08/20/a-brief-introduction-to-apply-in-r/

> ra=apply(row,2,as.numeric)
> class(ra)
[1] "numeric"
> ra
 IOsize threads  avg_ms    MB.s
  0.000  64.000   0.200   0.025

The above applies  the change to all columns in the given row in a data.frame.

(apply can also be used for example to change all 0 to NULLs

new_matrix = apply(matrix,1:2,function(x)if (x==0)  NULL else x)

see http://stackoverflow.com/questions/3505701/r-grouping-functions-sapply-vs-lapply-vs-apply-vs-tapply-vs-by-vs-aggrega)

For selecting the row out directly as a vector, the as.matrix() function can also be used

> row=as.matrix(mydata)[3,]
> class(row)
[1] "numeric"
> row
 IOsize threads  avg_ms    MB.s
  0.000  64.000   0.200   0.025

yet another way

> row=c(t(mydata[3,]))
> class(row)
[1] "numeric"
> row
[1]  0.000 64.000  0.200  0.025

( see http://stackoverflow.com/questions/2545228/converting-a-dataframe-to-a-vector-by-rows)

or yet

> row=unlist(mydata[3,])
> class(row)
[1] "numeric"
> row
 IOsize threads  avg_ms    MB.s
  0.000  64.000   0.200   0.025

Filtering Data

The data in the CSV file actually represents throughput not only at different I/O send sizes but also for different number of concurrent senders. What if I wanted to just plot the throughput by I/O send size for tests with one thread? How would I filter the data?

IOsize=subset(mydata[,'IOsize'],mydata['threads'] == 1 )
MBs=subset(mydata[,'MB.s'],mydata['threads'] == 1 )
plot(IOsize,MBs)

 

How about plotting the throughput by I/O size for each number of threads test?
The parameter ‘type=”o”‘  makes the plot a line plot

#extract data
IOsize=subset(mydata[,'IOsize'],mydata['threads'] == 1 )
MBs_1=subset(mydata[,'MB.s'],mydata['threads'] == 1 )
MBs_8=subset(mydata[,'MB.s'],mydata['threads'] == 8 )
MBs_64=subset(mydata[,'MB.s'],mydata['threads'] == 64 )
# create graph
plot(IOsize,MBs_64,type="o")
# plot other lines
lines(IOsize,MBs_1,lty=2,col="green",type="o")
lines(IOsize,MBs_8,lty=3,col="red",type="o")

# add a legend
legend(1,700,c("1 thread","8 threads","64 threads"), cex=0.8,
   col=c("green","red","black"), lty=3:1);

 

 

 

Entering data

Instead of entering data via a CSV file it can be entered directly into R

> m=matrix(c(
     0 ,      1 ,  .02 ,    .010 ,
     0 ,      8 ,  .04 ,    .024 ,
     0 ,     64 ,  .20 ,    .025 ,
     8 ,      1 ,  .03 ,  70.529 ,
     8 ,      8 ,  .04 , 150.389 ,
     8 ,     64 ,  .23 ,  48.604 ,
    32 ,      1 ,  .06 , 149.405 ,
    32 ,      8 ,  .07 , 321.392 ,
    32 ,     64 ,  .18 ,  73.652 ,
   128 ,      1 ,  .03 , 226.457 ,
   128 ,      8 ,  .01 , 557.196 ,
   128 ,     64 ,  .06 , 180.176 ,
  1024 ,      1 ,  .01 , 335.587 ,
  1024 ,      8 ,  .01 , 726.876 ,
  1024 ,     64 ,  .02 , 714.162 ),
nrow=4,ncol=15,
dimnames=list(rows=c( 'IOsize' ,'threads' ,'avg_ms' , 'MB/s'
)))
> m
rows      [,1]  [,2]   [,3]   [,4]    [,5]   [,6]    [,7]    [,8]   [,9]   [,10]   [,11]   [,12]    [,13]    [,14]    [,15]
  IOsize  0.00 0.000  0.000  8.000   8.000  8.000  32.000  32.000 32.000 128.000 128.000 128.000 1024.000 1024.000 1024.000
  threads 1.00 8.000 64.000  1.000   8.000 64.000   1.000   8.000 64.000   1.000   8.000  64.000    1.000    8.000   64.000
  avg_ms  0.02 0.040  0.200  0.030   0.040  0.230   0.060   0.070  0.180   0.030   0.010   0.060    0.010    0.010    0.020
  MB/s    0.01 0.024  0.025 70.529 150.389 48.604 149.405 321.392 73.652 226.457 557.196 180.176  335.587  726.876  714.162

> t(m)
        IOsize threads avg_ms    MB/s
   [1,]      0       1   0.02   0.010
   [2,]      0       8   0.04   0.024
   [3,]      0      64   0.20   0.025
   [4,]      8       1   0.03  70.529
   [5,]      8       8   0.04 150.389
   [6,]      8      64   0.23  48.604
   [7,]     32       1   0.06 149.405
   [8,]     32       8   0.07 321.392
   [9,]     32      64   0.18  73.652
  [10,]    128       1   0.03 226.457
  [11,]    128       8   0.01 557.196
  [12,]    128      64   0.06 180.176
  [13,]   1024       1   0.01 335.587
  [14,]   1024       8   0.01 726.876
  [15,]   1024      64   0.02 714.162

The bizarre thing about this is that the nrows corresponds to the number of columns and the matrix comes out transposed. Using t() can re-transpose it, but this is all confusing.
To make it more intuitive add the argument
"byrow=TRUE,"
and add a
"NULL"
for the rowname position in the row and columns name section

m=matrix(c(
     0 ,      1 ,  .02 ,    .010 ,
     0 ,      8 ,  .04 ,    .024 ,
     0 ,     64 ,  .20 ,    .025 ,
     8 ,      1 ,  .03 ,  70.529 ,
     8 ,      8 ,  .04 , 150.389 ,
     8 ,     64 ,  .23 ,  48.604 ,
    32 ,      1 ,  .06 , 149.405 ,
    32 ,      8 ,  .07 , 321.392 ,
    32 ,     64 ,  .18 ,  73.652 ,
   128 ,      1 ,  .03 , 226.457 ,
   128 ,      8 ,  .01 , 557.196 ,
   128 ,     64 ,  .06 , 180.176 ,
  1024 ,      1 ,  .01 , 335.587 ,
  1024 ,      8 ,  .01 , 726.876 ,
  1024 ,     64 ,  .02 , 714.162 ),
nrow=15,ncol=4,byrow=TRUE,
dimnames=list(NULL,c( 'IOsize' ,'threads' ,'avg_ms' , 'MB/s'
)))
> m
     IOsize threads avg_ms    MB/s
 [1,]      0       1   0.02   0.010
 [2,]      0       8   0.04   0.024
 [3,]      0      64   0.20   0.025
 [4,]      8       1   0.03  70.529
 [5,]      8       8   0.04 150.389
 [6,]      8      64   0.23  48.604
 [7,]     32       1   0.06 149.405
 [8,]     32       8   0.07 321.392
 [9,]     32      64   0.18  73.652
[10,]    128       1   0.03 226.457
[11,]    128       8   0.01 557.196
[12,]    128      64   0.06 180.176
[13,]   1024       1   0.01 335.587
[14,]   1024       8   0.01 726.876
[15,]   1024      64   0.02 714.162

Vectorwise maximum/minimum

Another issues is trying to get the max or min of two or more values on a point by point basis.
Using the “min()” function gives a single minimum and not a minimum on a point by point basis.
Use “pmax()” and “pmin()” to get point by point max and min of two or more vectors.
> lat
[1]  44.370  22.558  37.708  73.070 131.950
> std
[1]  37.7  21.6  67.1 136.1 186.0
> min
[1] 0.0 0.6 0.6 1.0 1.0
> pmax(lat-std,min)
[1] 6.670 0.958 0.600 1.000 1.000

 

Column Sums and Row Sums

to sum up rows or colums use “rowSums()” and   “colSUms()”

http://stat.ethz.ch/R-manual/R-patched/library/base/html/colSums.html

For more info

for more info on data types and manipulation see

see: http://cran.r-project.org/doc/manuals/R-intro.html

Uncategorized

No 3d charts in Excel? try R

January 2nd, 2013

 

I wanted to plot a set of data by 3 dimensions.

I wanted to plot I/O read latency by MB/s throughput by number of concurrent readers. Seemed simple. Well it turns out there is no good way to do it in Excel. Sure Excel has 3d charts but, attention, the z axis is treated like rows and not values. For example

Note that the z axis, “users” had 3 values marked  on the axis. Those 3 values are 1,16 and 64. Notice that  16 is as far from 1 as 64 is from 16, ie the distance is not proportional to the value.

There is a free plug-in for Excel called Excel3Dscatterplot, but the data is hard to read, for example

Luckily R saves the day. With R there are a number of easy ways to graph 3d data.

Let’s take a data set that represents network I/O showing  the send size in KB (s_KB),  number of concurrent sender threads (thrds), the MB send throughput (s_MB/s) and then avg, min and max latency (mn_ms, avg_ms, max_ms).

s_KB ,thrds ,mn_ms ,avg_ms , max_ms , s_MB/s
     ,    1 ,  .02 ,   .05 ,   3.06 ,   .010 
     ,    8 ,  .04 ,   .12 ,   5.27 ,   .024 
     ,   64 ,  .20 ,   .82 ,  12.03 ,   .025 
   8 ,    1 ,  .03 ,   .06 ,   8.47 , 70.529 
   8 ,    8 ,  .04 ,   .17 ,  21.64 ,150.389 
   8 ,   64 ,  .23 ,  1.31 ,  20.50 , 48.604 
  32 ,    1 ,  .06 ,   .10 ,   1.82 ,149.405 
  32 ,    8 ,  .07 ,   .32 ,  16.78 ,321.392 
  32 ,   64 ,  .18 ,  5.32 , 380.02 , 73.652 
 128 ,    1 ,  .03 ,   .28 ,   2.01 ,226.457 
 128 ,    8 ,  .01 ,   .80 ,  54.78 ,557.196 
 128 ,   64 ,  .06 , 11.77 ,  77.96 ,180.176
1024 ,    1 ,  .01 ,  1.49 ,   5.76 ,335.587
1024 ,    8 ,  .01 ,  5.35 , 118.48 ,726.876
1024 ,   64 ,  .02 , 40.50 , 221.59 ,714.162

to plot this in R, first start R up (see http://scs.math.yorku.ca/index.php/R:_Getting_started_with_R#Installing_R )

find out where R’s working directory is

   getwd()

go to a directory with my data and r files

   setwd("C:/Users/Kyle/Documents/GitHub/nio")

list files

   dir()

load and plot
   nio <- read.csv("perfibmd1_loopback_short.csv")

   nio
s_KB thrds mn_ms avg_ms max_ms  s_MB.s
1    NA     1  0.02   0.05   3.06   0.010
2    NA     8  0.04   0.12   5.27   0.024
3    NA    64  0.20   0.82  12.03   0.025
4     8     1  0.03   0.06   8.47  70.529
5     8     8  0.04   0.17  21.64 150.389
6     8    64  0.23   1.31  20.50  48.604
7    32     1  0.06   0.10   1.82 149.405
8    32     8  0.07   0.32  16.78 321.392
9    32    64  0.18   5.32 380.02  73.652
10  128     1  0.03   0.28   2.01 226.457
11  128     8  0.01   0.80  54.78 557.196
12  128    64  0.06  11.77  77.96 180.176
13 1024     1  0.01   1.49   5.76 335.587
14 1024     8  0.01   5.35 118.48 726.876
15 1024    64  0.02  40.50 221.59 714.162


library(scatterplot3d)
MBs=nio[,'s_MB.s']
threads=nio[,'thrds']
IOsize=nio[,'s_KB']
s3d <-scatterplot3d(IOsize,threads,MBs)

 

That’s a bit simple, but we can add some easy improvements. Here the dots are made solid, with color highlighting and a drop line to the bottom plane

s3d <-scatterplot3d(IOsize,threads,MBs,pch=16, highlight.3d=TRUE,
type="h", main="3D Scatterplot")
In this test I’m running five different IO sizes 1 byte, 8k, 32k, 128k and 1024k. It would be nice to draw a line for each I/O size by increasing number of threads. There is no way to use 3 coordinates to identify a point, but there is a conversion routine that converts 3 coordinate point to two
s$xyz.convert(x,y,z)
There are 5 sets of points, i.e. one set per I/O size. Each set had 3 points, i.e. one point for each different number of concurrent threads 1,8,64.
The code loops through the 5 sets and draw a line through the 3 points:

x=nio[,'IOsize']
y=nio[,'threads']
z=nio[,'s_MB.s']
s <- scatterplot3d(x,y,z,xlab="IOsize", ylab="threads", zlab="MBs")
   for ( i in  1:5  )  {
     j=i*3
     p1 <- s$xyz.convert(x[j-2],y[j-2],z[j-2])
     p2 <- s$xyz.convert(x[j-1],y[j-1],z[j-1])
     p3 <- s$xyz.convert(x[j],y[j],z[j])
     segments(p1$x,p1$y,p2$x,p2$y,lwd=2,col=i)
     segments(p2$x,p2$y,p3$x,p3$y,lwd=2,col=i)
  }

It’s also easy to rotate the data. Here are two larger datasets

example

m=matrix(c(
 0 ,    1 , 0.02 ,  0.06 ,   4.72 ,   0 ,  0.008  ,  0.000  ,99.19 , 0.67 , 0.05 , 0.09 , 0.00 , 0.00 , 0.00 , 0.00 , 0.06 , 0.08 ,    1 ,
   0 ,    2 , 0.03 ,  0.05 ,   2.35 ,   0 ,  0.020  ,  0.000  ,99.49 , 0.39 , 0.07 , 0.05 , 0.00 , 0.00 , 0.00 , 0.00 , 0.04 , 0.07 ,    0 ,
   0 ,    4 , 0.03 ,  0.07 ,  14.27 ,   0 ,  0.026  ,  0.000  ,98.33 , 1.18 , 0.15 , 0.24 , 0.06 , 0.04 , 0.00 , 0.00 , 0.05 , 0.07 ,    0 ,
   0 ,    8 , 0.03 ,  0.13 ,  34.08 ,   0 ,  0.024  ,  0.000  ,93.49 , 4.91 , 0.51 , 0.66 , 0.13 , 0.31 , 0.00 , 0.00 , 0.05 , 0.14 ,    0 ,
   0 ,   16 , 0.04 ,  0.26 ,  31.27 ,   0 ,  0.025  ,  0.000  ,56.29 ,42.31 , 0.16 , 0.57 , 0.31 , 0.30 , 0.06 , 0.00 , 0.11 , 0.16 ,    0 ,
   0 ,   32 , 0.10 ,  0.35 ,   3.57 ,   0 ,  0.029  ,  0.000  ,39.26 ,57.80 , 0.71 , 1.08 , 0.49 , 0.66 , 0.00 , 0.00 , 0.18 , 0.43 ,    0 ,
   0 ,   64 , 0.16 ,  0.70 ,  39.93 ,   0 ,  0.032  ,  0.000  , 1.28 ,91.69 , 3.18 , 1.64 , 1.25 , 0.37 , 0.59 , 0.00 , 0.33 , 0.70 ,    0 ,
   8 ,    1 , 0.02 ,  0.07 ,   8.17 ,   0 , 59.668  ,  0.000  ,93.41 , 6.34 , 0.13 , 0.11 , 0.01 , 0.00 , 0.00 , 0.00 , 0.05 , 0.10 ,    0 ,
   8 ,    2 , 0.03 ,  0.07 ,   9.75 ,   0 ,111.486  ,  0.000  ,98.18 , 1.35 , 0.24 , 0.18 , 0.04 , 0.00 , 0.00 , 0.00 , 0.06 , 0.08 ,    0 ,
   8 ,    4 , 0.03 ,  0.09 ,  39.08 ,   0 ,154.623  ,  0.000  ,97.09 , 1.86 , 0.47 , 0.47 , 0.07 , 0.04 , 0.00 , 0.00 , 0.06 , 0.08 ,    0 ,
   8 ,    8 , 0.05 ,  0.15 ,  18.48 ,   0 ,171.580  ,  0.000  ,78.14 ,19.78 , 0.74 , 1.05 , 0.11 , 0.19 , 0.00 , 0.00 , 0.09 , 0.29 ,    0 ,
   8 ,   16 , 0.05 ,  0.36 ,  37.54 ,   0 ,154.778  ,  0.000  ,51.87 ,44.81 , 0.93 , 0.97 , 0.80 , 0.62 , 0.00 , 0.00 , 0.11 , 0.51 ,    0 ,
   8 ,   32 , 0.06 ,  0.62 ,  43.06 ,   0 ,152.741  ,  0.000  ,15.84 ,66.88 ,13.27 , 1.64 , 0.64 , 1.42 , 0.31 , 0.00 , 0.11 , 0.55 ,    0 ,
   8 ,   64 , 0.24 ,  1.07 ,  24.72 ,   0 ,166.518  ,  0.000  , 8.00 ,58.29 ,25.47 , 5.75 , 1.99 , 0.45 , 0.04 , 0.00 , 0.43 , 3.34 ,    0 ,
  32 ,    1 , 0.05 ,  0.11 ,   8.46 ,   0 ,145.238  ,  0.000  ,40.63 ,58.08 , 1.14 , 0.15 , 0.01 , 0.00 , 0.00 , 0.00 , 0.10 , 0.13 ,    0 ,
  32 ,    2 , 0.04 ,  0.12 ,   5.20 ,   0 ,248.208  ,  0.000  ,25.14 ,73.08 , 1.47 , 0.25 , 0.05 , 0.00 , 0.00 , 0.00 , 0.11 , 0.14 ,    0 ,
  32 ,    4 , 0.06 ,  0.19 ,  22.94 ,   0 ,297.655  ,  0.000  ,11.05 ,85.37 , 2.40 , 0.97 , 0.15 , 0.06 , 0.00 , 0.00 , 0.12 , 0.22 ,    0 ,
  32 ,    8 , 0.05 ,  0.32 ,  21.12 ,   0 ,360.777  ,  0.000  , 5.75 ,87.25 , 3.21 , 3.06 , 0.45 , 0.26 , 0.00 , 0.00 , 0.12 , 0.83 ,    0 ,
  32 ,   16 , 0.09 ,  0.62 ,  22.54 ,   0 ,362.603  ,  0.000  , 3.22 ,81.87 , 6.48 , 6.21 , 1.15 , 1.06 , 0.00 , 0.00 , 0.25 , 1.88 ,    0 ,
  32 ,   32 , 0.08 ,  1.14 ,  40.08 ,   0 ,369.589  ,  0.000  , 1.85 ,71.97 ,14.28 , 8.46 , 1.24 , 1.95 , 0.25 , 0.00 , 0.42 , 4.55 ,    0 ,
  32 ,   64 , 0.38 ,  2.23 ,  33.70 ,   0 ,319.048  ,  0.000  , 0.00 ,36.12 ,39.33 ,18.56 , 2.61 , 3.31 , 0.07 , 0.00 , 0.68 , 4.97 ,    5 ,
 128 ,    1 , 0.02 ,  0.29 ,   8.71 ,   0 ,212.555  ,  0.000  , 0.65 ,98.08 , 1.05 , 0.19 , 0.02 , 0.00 , 0.00 , 0.00 , 0.28 , 0.34 ,    0 ,
 128 ,    2 , 0.01 ,  0.33 ,   8.54 ,   0 ,376.113  ,  0.000  , 6.95 ,87.98 , 4.28 , 0.76 , 0.02 , 0.00 , 0.00 , 0.00 , 0.28 , 0.64 ,    0 ,
 128 ,    4 , 0.01 ,  0.46 ,  35.22 ,   0 ,533.401  ,  0.000  ,31.45 ,62.25 , 5.15 , 1.04 , 0.06 , 0.05 , 0.00 , 0.00 , 0.23 , 0.68 ,    1 ,
 128 ,    8 , 0.01 ,  0.83 ,  89.39 ,   0 ,545.988  ,  0.000  ,23.75 ,59.64 ,10.87 , 4.96 , 0.25 , 0.51 , 0.02 , 0.00 , 0.28 , 0.97 ,    0 ,
 128 ,   16 , 0.01 ,  1.57 ,  54.30 ,   0 ,554.879  ,  0.000  , 8.77 ,54.71 ,19.95 ,14.16 , 1.36 , 0.98 , 0.06 , 0.00 , 0.37 , 2.52 ,    0 ,
 128 ,   32 , 0.04 ,  3.21 ,  86.26 ,   0 ,549.206  ,  0.000  , 2.61 ,41.04 ,26.36 ,23.33 , 3.27 , 3.00 , 0.38 , 0.00 , 0.49 , 5.87 ,    0 ,
 128 ,   64 , 0.07 ,  6.93 , 115.71 ,   0 ,502.197  ,  0.000  , 5.93 ,32.15 ,16.22 ,30.94 , 7.48 , 5.84 , 1.26 , 0.17 , 0.79 ,20.10 ,    0 ,
1024 ,    1 , 0.04 ,  2.57 ,  12.71 ,   0 ,194.751  ,  0.000  ,39.55 ,47.67 , 0.98 ,11.80 , 0.00 , 0.00 , 0.00 , 0.00 , 0.11 , 2.16 ,    0 ,
1024 ,    2 , 0.01 ,  1.97 ,  10.25 ,   0 ,506.611  ,  0.000  ,48.02 ,36.93 , 0.20 ,14.84 , 0.01 , 0.00 , 0.00 , 0.00 , 0.11 , 1.42 ,    0 ,
1024 ,    4 , 0.01 ,  2.73 ,  44.98 ,   0 ,724.303  ,  0.000  ,17.76 ,70.16 , 0.64 ,11.37 , 0.04 , 0.03 , 0.00 , 0.00 , 0.11 , 1.66 ,    1 ,
1024 ,    8 , 0.01 ,  5.27 ,  91.10 ,   0 ,738.061  ,  0.000  ,16.37 ,66.91 , 2.36 ,12.81 , 0.72 , 0.74 , 0.08 , 0.01 , 0.11 , 1.88 ,    0 ,
1024 ,   16 , 0.01 , 10.37 , 157.60 ,   0 ,727.271  ,  0.000  ,13.81 ,58.74 , 8.07 ,14.70 , 2.30 , 2.19 , 0.18 , 0.01 , 0.15 , 4.63 ,    0 ,
1024 ,   32 , 0.01 , 20.61 , 116.30 ,   0 ,726.869  ,  0.000  ,11.06 ,49.24 ,13.88 ,16.86 , 3.71 , 4.41 , 0.62 , 0.22 , 0.26 ,10.11 ,    1 ,
1024 ,   64 , 0.06 , 39.57 , 316.81 ,   0 ,721.422  ,  0.000  , 2.40 ,40.93 ,21.18 ,20.12 , 5.73 , 7.25 , 1.47 , 0.91 , 0.56 ,22.30 ,    0

),
nrow=35,ncol=19,byrow=TRUE, dimnames=list(NULL,c(
's_KB','thrds','mn_ms','avg_ms','max_ms','r_KB','s_MB/s','r_MB/s',' <100u','<500u','<1ms','<5ms','<10ms','<50ms','<100m',' <1s',' p50',' p95','retrans'
)))MBs=m[,'s_MB/s']
threads=m[,'thrds']
IOsize=m[,'s_KB']x=IOsize
y=threads
z=MBs

library(scatterplot3d)
s <- scatterplot3d(x,y,z,xlab="IOsize", ylab="threads", zlab="MBs")
nthreads=7
for ( i in  1:5  )  {
  beg=(1+(i-1)*nthreads)
  end=(beg+nthreads-2)
  cat( beg," ",end,"\n")
  for ( j in beg:end ) {
       p1 <- s$xyz.convert(x[j],y[j],z[j])
       p2 <- s$xyz.convert(x[j+1],y[j+1],z[j+1])
       segments(p1$x,p1$y,p2$x,p2$y,lwd=2,col=i)
   }
}

rotating the axis

z=IOsize
x=threads
y=MBs
s <- scatterplot3d(x,y,z,pch=16,highlight.3d=TRUE,type="h",zlab="IOsize", xlab="threads", ylab="MBs")
nthreads=7
for ( i in  1:5  )  {
  beg=(1+(i-1)*nthreads)
  end=(beg+nthreads-2)
  cat( beg," ",end,"\n")
  for ( j in beg:end ) {
       p1 <- s$xyz.convert(x[j],y[j],z[j])
       p2 <- s$xyz.convert(x[j+1],y[j+1],z[j+1])
       segments(p1$x,p1$y,p2$x,p2$y,lwd=2,col=i)
   }
}
There is more to do.  Wanted to hurry up and post and will add more later. Would be good to make the IOsize axis log scale because all the small sizes 1 byte, 8k, 32k and 128k are all bunched together. The library scatterplot3d has log axis scoped but not implemented. The call will take a log argument but not yet use it, thus in the mean time one can run log10 on the values and update the tick labels.
 All that being said, I’m dubious of the use and utility of 3d charts in most situations. Sure graphing a 3d  function where things are changing seamlessly in all directions can be insightful, but for simple data like the above, 2d is probably clearer

Uncategorized

Enqueue – is it a PK, FK or Bitmap Index problem?

November 30th, 2012

If one is seeing  ‘enq: TX – row lock contention’ there could be a lot of reasons. One distinguishing factor is the lock mode. If the lock mode is exclusive (mode 6) then it’s most likely a classic row lock where two sessions are trying to modify the same row. On the other hand if the lock mode is share (mode 4)  it’s typically going to be

  • inserting a unique key when  someone else has already inserted that key but not committed
  • Inserting a foreign when then parent value has been inserted but not committed or deleted and not commited (not to be confused with locks due to un-indexed foreign key which cause a “enq: TM – contention”  wait not a TX wait)
  • bitmap index chunk contention

Now how to tell which of these is happening? Well here is a query on ASH (I’ve commented out some of the useful fields to limit the output)  and a results cheat sheet:

col object for A15
col otype for A10
select
       substr(event,0,20)    lock_name,
       --ash.session_id        waiter,
       --mod(ash.p1,16)        lmode,
       --ash.p2                p2,
       --ash.p3                p3,
       o.object_name           object,
       o.object_type           otype,
       CURRENT_FILE#           filen,
       CURRENT_BLOCK#          blockn,
       --ash.SQL_ID            waiting_sql,
       BLOCKING_SESSION        blocker
       --,ash.xid
from
         v$active_session_history ash,
         all_objects o
where
           event like 'enq: TX%'
   and   mod(ash.p1,16)=4
   and o.object_id (+)= ash.CURRENT_OBJ#
/

uniq index

  OBJECT   OTYPE FILEN BLOCKN BLOCKER
  ------  ------ ----- ------ -------
      -1             0      0     158
      -1             0      0     158
      -1             0      0     158
      -1             0      0     158

FK 

  OBJECT   OTYPE FILEN BLOCKN BLOCKER
  ------  ------ ----- ------ -------
   CHILD   TABLE     1  60954       1
   CHILD   TABLE     1  60954       1
   CHILD   TABLE     1  60954       1

bitmap

  OBJECT   OTYPE FILEN BLOCKN BLOCKER
  ------  ------ ----- ------ -------
      I1   INDEX     0      0     144
      I1   INDEX     0      0     144
      I1   INDEX     0      0     144
      I1   INDEX     0      0     144

Each case has a different footprint.

  • unique key index issue object of “-1”
  • foreign key case has a blocker of “1”
  • bitmap index case as filen and blockn “0”

These cases were run on 10.2.0.3 thus the “footprint” could change on other versions.

The above ASH query and many other useful ASH queries are maintained on GitHub at

https://github.com/khailey/ashmasters

Uncategorized

d3.js

November 29th, 2012

Here is a short video

D3 Show Reel from Mike Bostock on Vimeo.

Here is a longer tutorial video

Data-driven Documents from London Web Standards on Vimeo.

You can go through the actual presentation slides at http://lws.node3.org/#landing-slide
NOTE: these”slides” are active pages. You can click on the pages and interact with them.
They are in live d3. Try this page for example and click on one of the  points in the graph
 
 
This page is a fun one: http://lws.node3.org/#demo-3

Drag your mouse around the page – it eventually draws a picture

Uncategorized

Importance of feedback

November 13th, 2012

 

The human brain works so well because it calculates outcomes as it takes actions feeding back the current results compared to the expected results  which allows immediate corrections if the two are not aligning.

Such a immediate comparison between expected results and actual is an important functionality to include into software in order to have the best results especially when it comes to usability and user friendliness.

Software that doesn’t give any feedback until a job is supposedly finished is setting itself up for at worst failure or at least wasted time and effort. As  Cary Millsap recounts in one of his presentations “imagine practicing your put and after you hit the ball you have wait an hour to find out where your ball went.” That would make it tough to learn how to put better.

Here is a nice video that helps get this point across. The first two examples are visual. Go to 16:50 for a more typical programming code task, binary search.

Bret Victor – Inventing on Principle from CUSEC on Vimeo.

Thanks to John Beresniewicz for the link.

Another good reference on the subject of how the brain works “On Intelligence“. Thanks again to John for introducing me to this book.

Some of the ideas are the ideas that John and I wanted in Oracle Enterprise Manager in 10g. At  the time the means we had were limited.  When John and I first arrived at Oracle to work on Enterprise 10g, the only graphics were PNG images. There was no interactive jquery, no flash, no svg. After much arguing we were able to get SVG in, but still the design work was limited mainly because of political factors. It’s amazing that there is so much political (ie non-empirical) argument about bot creating interactive feedback driven interfaces. Sure they are more work but that work is well worth it.  Time are changing though especially with html5 and jquery and the future looks exciting.

 

Uncategorized

SQL Riddle for Free NoCOUG admission Nov 15

November 9th, 2012

Tom Kyte, Tim Gorman, Kellyn Pot’Vin, and Ben Prusinski are flying in to the Bay Area to present at NoCOUG Nov 15.

I’ll be presenting as well on how to make dozens of multi-terabyte database copies in seconds using various easily available technologies on the market now.

http://www.technicalconferencesolutions.com/pls/caat/caat_abstract_reports.schedule?conference_id=116

NoCOUG is offering free admission to the first 25 people who answer the following SQL riddle:

SQL comes in two distinct flavors—“relational calculus” and “relational algebra.” Without sweating the technical details, let’s just say that the relational calculus flavor is characterized by correlated subqueries—subqueries that refer to outside values—while the relational algebra flavor is characterized by set operations such as JOIN, UNION, MINUS, and INTERSECT. And, as you have probably noticed, these flavors are often mixed. The SQL mini-challenge is to use the pure relational algebra flavor of SQL to list all students who have enrolled in all the courses required by their declared major. Here are the table definitions and sample data. Send your entry to sqlchallenge@nocoug.org. The first 25 correct entries will receive a free admission code to the November 15 conference.

CREATE TABLE students
(
student_id INTEGER NOT NULL,
major_id INTEGER NOT NULL,
CONSTRAINT students_pk
PRIMARY KEY (student_id)
);

INSERT INTO students VALUES (1, 1);
INSERT INTO students VALUES (2, 1);
INSERT INTO students VALUES (3, 1);
INSERT INTO students VALUES (4, 1);

CREATE TABLE requirements
(
major_id INTEGER NOT NULL,
course_id INTEGER NOT NULL,
CONSTRAINT requirements_pk
PRIMARY KEY (major_id, course_id)
);

INSERT INTO requirements VALUES (1, 1);
INSERT INTO requirements VALUES (1, 2);

CREATE TABLE enrollments
(
student_id INTEGER NOT NULL,
course_id INTEGER NOT NULL,
CONSTRAINT enrollments_pk
PRIMARY KEY (student_id, course_id),
CONSTRAINT enrollments_fk1
FOREIGN KEY (student_id) REFERENCES students
);

INSERT INTO enrollments VALUES (1, 1);
INSERT INTO enrollments VALUES (1, 2);
INSERT INTO enrollments VALUES (2, 1);
INSERT INTO enrollments VALUES (3, 3);
INSERT INTO enrollments VALUES (4, 1);
INSERT INTO enrollments VALUES (4, 3);

Here are three solutions using the relational calculus flavor of SQL.

— Select students for whom the count of enrolled required courses equals the count of required courses

SELECT s.student_id
FROM students s
WHERE
(
SELECT COUNT(*)
FROM requirements r, enrollments e
WHERE r.major_id = s.major_id
AND e.student_id = s.student_id
AND e.course_id = r.course_id
) =
(
SELECT COUNT(*)
FROM requirements r
WHERE r.major_id = s.major_id
);

— Use double negation
— Select students such that there does not exist a required course in which they have not enrolled

SELECT s.student_id
FROM students s
WHERE NOT EXISTS
(
SELECT *
FROM requirements r
WHERE r.major_id = s.major_id
AND NOT EXISTS
(
SELECT *
FROM enrollments e
WHERE e.student_id = s.student_id
AND e.course_id = r.course_id
)
);

— Use object-relational techniques
— Select students for whom the set of required courses is a subset of the set of enrolled courses

CREATE TYPE list_type AS TABLE OF INTEGER;
/

SELECT s.student_id
FROM students s
WHERE

CAST(MULTISET(
SELECT r.course_id
FROM requirements r
WHERE r.major_id = s.major_id
) AS list_type)

SUBMULTISET OF

CAST(MULTISET(
SELECT e.course_id
FROM enrollments e
WHERE e.student_id = s.student_id
) AS list_type);

Here is a solution that uses a mixed flavor of SQL. Notice the use of the MINUS operation.

— Select students for whom the set of required courses is a subset of the set of enrolled courses

SELECT s.student_id
FROM students s
WHERE NOT EXISTS
(
SELECT r.course_id
FROM requirements r
WHERE r.major_id = s.major_id

MINUS

SELECT e.course_id
FROM enrollments e
WHERE e.student_id = s.student_id
);

Please forward this message to your friends and colleagues. Our conferences are suitable for anybody with an interest in Oracle Database.

 

 

Uncategorized

Oaktable World 2012 – all presentations and 4 vidoes up

November 8th, 2012

Oaktable world (aka Oracle Closed World) was an another amazing conference. We are already planning 2013

More videos will be posted as we have time to process the footage (help from video editors more than encouraged, email me Kyle.Hailey at  delphix.com )

Here are a few photos from this year:

Thanks everyone for an awesome conference.

Special thanks to conferences patrons

  • Paul Vallee
  • Tim Gorman
  • Kerry Osborne
  • Marco Gralike
  • Jonathan Gennick
  • Marcin Przepiorowski

Without their help the conference wouldn’t have been the same.

Thanks to all the speakers

Oracle “Ted” Talks

– Kyle Hailey
http://dboptimizer.com

Uncategorized

Facebook news feed performance

November 7th, 2012

From the article “Facebook shares some secrets on making MySql scale

“800 million users and handling more than 60 million queries per second” …”4 million row changes per second.”

and that was almost a year ago. Think what it’s like now!

Ever wonder why Facebook limits your friends to 5000?  Does Facebook want to stop people from using it to promote themselves?

Ever see this message “There are no more posts to show right now” on Facebook?

Notice it says “There are no more posts to show right now.”

I got this message when scrolled back in “friends” status updates. After scrolling back a few days I hit this message. The message is strange since  thousands of more status updates that Facebook could have shown me. Why did I run into this message?

Is Facebook just being capricious or dictatorial in how it is used? I don’t know but I think the more likely answer is much more mundane and possibly quite interesting. The reason may be just simple technical limitations.

How could would/should/could status updates be stored on Facebook?

The first thing that comes to mind is something like these tables in a relational database:In the above design there are 3 tables

  • Facebook accounts
    • ID
    • email
    • other fields
  • Friends
    • user id
    • friend’s id (contact id or c_id)
  • Status Updates
    • ID of the account making the update
    • status update
    • date of status update

So if sue@y.com logs onto Facebook, then Facebook needs to go and get the status updates of her friends/contacts. First step is to get a list of friends and second step is to get a list of updates from those friends. In SQL this might look like:

    Select  id, status
    From updates
    where id in (select c_id from contacts where id=2)
    order by date

As the number of friends and status updates increases, then this query is going to take longer and longer. Maybe this is the reason why Facebook limits the number of friends and the history.  How can the response time for  the retreval of updates of friends be kept at constant time ?

First, the home page only has to show, at least initially, something like 20 updates. The above query can be wrapped with a top 20 s0mething like

   select * from (
      Select  id,status
      From updates
      where id in (select c_id from contacts where id=2)
      order by date)
   where rownum < 20;

But really, that’s not going to do much good because the query still has to create the result set before sorting it by date then limiting the output to 20 rows. You could add a date limiter on the updates:

   select * from (
      Select  id,status
      From updates
      where id in (select c_id from contacts where id=2) and
      date <= current_date - 2_days
      order by date)
   where rownum < 20;

Seems facebook has a limit on the number of days returned and the number of friends, but there isn’t AFAIK, a limit on the number of updates that friends can do, so as they do more updates, the query takes longer and longer.

What kind of other design could be used? To speed up the query data could be denormalized a lot or a little. For a small change in the data, the date could be added to the list of friends meaning we can limit updates by the date field in  friends instead of all the updates themselves  as in:
Now the query becomes something like

   Select  status
   From updates
   where id in  (  select c_id from
                    (select c_id from contacts where id=2  order by date)
               where rownum < 20 )
   order by date

Instead of having to select status updates from all the friends, the query just selects the 20 (or less) friends who have had the most recent updates.

Or one could go a step farther such that when you post a status update,  a row gets inserted for each of your friends,  such that every friend has your update associeted with them and then all that has to be done is select the top 20 updates from that list. No joining. And if  indexed, then the rows returned can be precisely limited to those 20 rows. On the other hand this creates an enormous amount of insert data and data redundancy. Maybe have two tables, 1 status updates with a unique id and 2  a table with all friends updates. The second table would have every user and for each user a line that contains the status update ids of all their friends and a timestamp.    So if I wanted status updates for my friends, I just get the last 20 status update ids from this table for me and then get the actual content for 20 status updates. Still this keeps a lot of unnecessary information. On the other hand I don’t need to keep the data for that long – maybe the last couple days and beyond that the system could fall back to some of the join query above.

What other kinds of optimizations could they do ?  What would the pros be of a other methods? What are the cons?

This has already been solved a number of times at a number of places.  I haven’t been involved in any nor am I involved in any of these architectural questions right now, but it’s interesting to think about.

Why does Facebook want to know who your close friends are? Is it because they care or because it helps prioritize what status up dates to denormalize? Why do the limit friends  to 5000? Is it because they really care or is scaling issue?

 

Related Reading:

Twitter

id generation

http://engineering.twitter.com/2010/06/announcing-snowflake.html

http://highscalability.com/blog/2011/12/19/how-twitter-stores-250-million-tweets-a-day-using-mysql.html

Facebook schema

http://www.flickr.com/photos/ikhnaton2/533233247/

Facebook lamp stack

http://itc.conversationsnetwork.org/shows/detail4746.html

how does Facebook do it

http://ask.metafilter.com/82769/How-is-Facebook-doing-its-queries

ebay

http://www.addsimplicity.com/downloads/eBaySDForum2006-11-29.pdf

high scalability

http://highscalability.com/

http://memcached.org/

scaling

http://danga.com/words/2007_06_usenix/usenix.pdf

Flickr

http://radar.oreilly.com/archives/2006/04/database-war-stories-3-flickr.html

Myspace

http://www.baselinemag.com/c/a/Projects-Networks-and-Storage/Inside-MySpacecom/

dealing with stale data

http://www.mnot.net/blog/2007/12/12/stale

Facebook schema

http://upload.wikimedia.org/wikipedia/commons/9/95/Metamodel_of_Facebook.jpg

 

Uncategorized

Oaktable World first 3 vidoes up

October 22nd, 2012

The first 3 videos of talks at Oaktable World are up:

Connor McDonald 11g RAC upgrade drama
http://oaktableworld.com/video-connor-mcdonald-11g-rac-upgrade-drama/

Tanel Poder hacking session:
http://oaktableworld.com/tanel-poder-hacking-session-video/

Tim Gorman on exchange partition
http://oaktableworld.com/video-tim-gorman-on-exchange-partition/

We still have more video material to come but only limited resources to go through and edit them.  Comment here if you are interested in help out edit video. For a list of talks see:

http://oaktableworld.com/

 

 

Uncategorized