
If you use BinaryComparable.getBytes(), only the first BinaryComparable.getLength() bytes of the returned array are valid, so you will need to subset the array to that length.

Otherwise you will find junk from previous rows in your byte array.
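
A minimal sketch of the subsetting follows. To keep it runnable without Hadoop on the classpath, it works on a plain byte[] and a length; the class name ValidBytes and the method validBytes are made up for illustration. With Hadoop you would presumably pass value.getBytes() and value.getLength() instead.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ValidBytes {
    /**
     * Returns a copy of only the valid prefix of a reused backing array.
     * Hypothetical helper: with Hadoop the call would look like
     * validBytes(value.getBytes(), value.getLength()).
     */
    public static byte[] validBytes(byte[] backing, int length) {
        return Arrays.copyOf(backing, length);
    }

    public static void main(String[] args) {
        // Simulate a reused buffer: a long row, then a shorter row written over it.
        byte[] backing = "previous-row".getBytes(StandardCharsets.UTF_8);
        byte[] shortRow = "new".getBytes(StandardCharsets.UTF_8);
        System.arraycopy(shortRow, 0, backing, 0, shortRow.length);

        // The whole array still carries the tail of the previous row.
        System.out.println(new String(backing, StandardCharsets.UTF_8));
        // Trimming to the valid length leaves just the current row.
        System.out.println(new String(validBytes(backing, shortRow.length), StandardCharsets.UTF_8));
    }
}
```

The copy costs an allocation per row; if that matters, passing the array along with an offset of 0 and the length avoids it.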

Might be worth documenting the API, just saying.

http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/BinaryComparable.html#getBytes()

I am getting excited about seeing large data systems employ InfiniBand. Granted, high speed interconnects have always played an important role in high performance computing. What excites me are the implications of remote memory access and the diversification of capability it provides.

For example, suppose I have a large database system that is pretty well balanced in CPU and IO. New use cases arrive that require a very large CPU draw with low IO impact. One option might be to expand the cluster, but doing so would oversubscribe my IO capability at significant cost (IO is usually the most expensive part). Alternatively, with a system configured to use InfiniBand, the software could employ remote memory access to effectively “beam” a parcel of data from my existing cluster to a new high CPU cluster, which can burn through the necessary computation and “beam” back the results. That would enable a cost-effective, high-performance co-processor capability.

What I find compelling is how InfiniBand abstracts this capability into a simple implementation, and the speed at which it can perform this work.

I wouldn’t say InfiniBand is new, but I do think more enterprise-ish parallel computing systems are embracing it (Teradata, Exadata). I wondered what Hadoop might get out of InfiniBand; interestingly enough, some folks already took a look at this (link). What I gleaned is that Hadoop’s software doesn’t embrace the necessary protocols (NIO) to take advantage of InfiniBand from a socket perspective, and it is probably a long way from embracing the remote memory access model.


I was switching a project from blocking IO to non-blocking IO.

For the most part, Java wants you to use ByteBuffer for all the read/write operations.

I don’t know if it is just me but ByteBuffer does not work like I would expect it to.

Say I have a ByteBuffer with 50 bytes of data and a capacity of 100, and I issue the write. The write only sends the first 30 bytes. I would expect that I could then add another 70 bytes of data to the ByteBuffer. However, that does not seem to be the case. It’s like the designers wanted you to empty out the buffer before you added any more to it.

Like I said, maybe it’s just me but seemed counter-intuitive.
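
For what it’s worth, the behavior seems to follow from ByteBuffer tracking a single position and limit: after a partial write the unsent bytes sit in the middle of the buffer, and compact() is the call that moves them to the front and reopens the rest for puts. Below is a minimal sketch of the 50/30 scenario. The partialWrite helper is hypothetical; it stands in for a non-blocking channel write that only accepts 30 bytes.

```java
import java.nio.ByteBuffer;

public class CompactDemo {
    /** Hypothetical stand-in for a channel write that accepts at most max bytes. */
    static int partialWrite(ByteBuffer buf, int max) {
        int n = Math.min(max, buf.remaining());
        buf.position(buf.position() + n); // consume n bytes, as a real write would
        return n;
    }

    /** Returns {bytes sent, bytes still unsent, room after compact, bytes queued at the end}. */
    public static int[] run() {
        ByteBuffer buf = ByteBuffer.allocate(100);
        buf.put(new byte[50]);            // fill mode: position=50, limit=100
        buf.flip();                       // drain mode: position=0, limit=50

        int sent = partialWrite(buf, 30); // position=30, limit=50
        int unsent = buf.remaining();     // 20 bytes not yet written

        buf.compact();                    // unsent bytes move to the front: position=20, limit=100
        int room = buf.remaining();       // space available for new data
        buf.put(new byte[70]);            // the extra 70 bytes fit now: position=90

        buf.flip();                       // back to drain mode for the next write
        return new int[] {sent, unsent, room, buf.remaining()};
    }

    public static void main(String[] args) {
        int[] r = run();
        System.out.println("sent=" + r[0] + " unsent=" + r[1] + " room=" + r[2] + " queued=" + r[3]);
    }
}
```

The pattern is flip() before draining and compact() before filling again; calling clear() instead of compact() would discard the 20 unsent bytes.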

10 hours of certification tests

I signed up for some beta certification tests. I ended up scheduling all three this week. In total, I think I spent just shy of ten hours taking tests over the last three days. Today was especially mind melting. I was ready to quit on question 160 of 230, but I finished, though my focus was diminished by the end.

It would have been nice to take notes on all the things I wasn’t sure of, for the noble purpose of going back and determining the correct behavior, but that’s not how it works.

So I will celebrate tonight and let my brain recharge with some college football tomorrow.

Subquery UDF

Problem

Users keep thinking a SQL UDF runs SQL.

They think the SQL UDF works just like the conceptual lookup.

Select Widget_Id, Get_Widget_Brand_Name(Widget_Id) From Widgets

They don’t think to write it as SQL.

Select Widgets.Widget_Id, WidgetBrands.BrandName
From Widgets Inner Join WidgetBrands on Widgets.Brand_Id = WidgetBrands.Brand_Id;

I think we take SQL for granted in this case, and users want it in a more usable form. Maybe this is the right direction to go.

Propose

I propose a SubqueryUDF. Essentially, you could define the UDF by mapping input parameters to a subquery’s filter/join criteria. The optimizer would unwrap the UDF into the subquery. This functionality would allow a more streamlined way of using SQL. It would reduce total text, increase readability, and appeal to a wide variety of users.

Example

The SQL is submitted as such:

Select Widget_Id, Get_Widget_Brand_Name(Widget_Id) From Widgets

The SQL is treated by the optimizer as such:

Select Widgets.Widget_Id, WidgetBrands.BrandName From Widgets Inner Join WidgetBrands on Widgets.Brand_Id = WidgetBrands.Brand_Id;


Core Database Functions

I work with databases probably more than the next guy. Along those lines, I spent a few minutes composing a list of functions that I think are a healthy starting point for “core functionality.” Some things are so basic that I did not bother to include them (like trim). I think the list below can get the average developer pretty good mileage, and it would be in the interests of most database vendors to support this list.

greatest
least

timestamp_to_epoch
time_to_epoch
date_to_epoch
epoch_to_date
epoch_to_time
epoch_to_timestamp

to_timestamp(value,format)
to_char(value,format)
to_date(value,format)
to_time(value,format)

bitwise_shift_left
bitwise_shift_right
bitwise_and
bitwise_or
bitwise_xor
bitwise_not

ascii
chr

quote_literal
quote_ident

repeat

to_hex
from_hex
to_base64
from_base64

get_byte
set_byte
get_bit
set_bit

md5

sprintf

translate
replace

sleep

row_to_json
json_to_row

get_json_object
set_json_object

row_to_xml
xml_to_row

get_xml_object
set_xml_object

concat
concat_ws

uuid
guid

if

reverse

regex_replace
regex_match

[aggregate]
bit_and
bit_or
bit_xor
bit_not
concat
concat_ws

[table function]
sequence
build_ngrams
top_ngrams
sentences
regex_split
strtok_split
string_split

ggplot2 blog

I see Hadley Wickham has launched a new ggplot2 blog.

I got to meet Hadley at a day-long tutorial session and later shared dinner with him. I am truly impressed by what he has delivered and by how he continues to foster its growth.

If you are not familiar with ggplot2, it is a graphics package for the R project.

Personally, ggplot2 was my gateway into the land of R and I still very much use it almost daily. For me, it provided the building-block type of functionality that allowed me to gradually learn. Its strength for me was how readily I could re-use each building-block that I learned.

All in all, I look forward to seeing ggplot2 grow and I’m thankful for Hadley’s contributions.

Profiling Thread Activity

I’ve been working on a Java program for a little bit.

In an effort to understand what it’s doing, I’ve been monitoring the host network and CPU activity. I know what I’m looking for, but it’s not always clear why I’m not achieving maximum CPU or network consumption.

To help understand, I tried periodic stack dumps using jstack, looking for shared object contention. I found a few items there, so that was good.

I also tried generic Java profiling, which shows which methods are using the largest percentage of CPU. That was semi-helpful, but I still felt like I needed more information.

So I started looking for something that would show me thread wait/monitor/run behavior. I wanted to check my assumed program behavior against what was really going on. What I found to help me in this case was the Java utility jvisualvm. I attached a couple screenshots below.

Ultimately I ended up enabling remote JMX connections, then started up $JDK_HOME/bin/jvisualvm and attached to the running process. From there I went to the thread view and began running tests. The results were very helpful to me in understanding thread behavior, taking into account wait/monitor time.

What I show in these two screenshots is basically the different behavior between data consumers and producers. The measured behavior adequately confirmed the assumed behavior.
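
As a rough complement to the thread view, the same wait/monitor/run states can be sampled from inside the process with the standard java.lang.management API. This is only a sketch, not what jvisualvm does internally; the class name ThreadStateSampler and the one-second sampling interval are assumptions for illustration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;

public class ThreadStateSampler {
    /** Counts live threads by state (RUNNABLE, BLOCKED, WAITING, ...) at this instant. */
    public static Map<Thread.State, Integer> sample() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        for (ThreadInfo info : mx.getThreadInfo(mx.getAllThreadIds())) {
            if (info == null) {
                continue; // the thread ended between the two calls
            }
            counts.merge(info.getThreadState(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) throws InterruptedException {
        // Print a few samples; many BLOCKED threads would hint at monitor contention.
        for (int i = 0; i < 5; i++) {
            System.out.println(sample());
            Thread.sleep(1000); // assumed interval, adjust to taste
        }
    }
}
```

Logging these counts alongside CPU and network numbers gives a coarse timeline of how much of the program is waiting versus running.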

What is your biography?

How would you write your biography in 150 words or less? A simple enough exercise, but I think one that can help gauge where you are in your career relative to where you want to be.

Ok, well that’s that. But I was wondering, what are some incredible yet short biographies? Here’s one:

George Washington: A successful farmer turned general. Led the fledgling American Revolutionary Army to defeat one of the most powerful nations of the time. Helped draft the Constitution then continued on to become the first president of the United States of America.

Sounds pretty awesome and that’s probably just thirty words. My biography on the other hand is not so noble. But I can aspire…


80 Cores Already?

Seems like just last year I was excited to see servers with 24 cores (dual 6-core processors with hyper-threading) and 96GB of RAM.

I was logged onto a server yesterday and checked it out: 80 cores and 1TB of RAM. First I had to make sure that was possible, but it is: four 10-core processors with hyper-threading (4 × 10 × 2 = 80), the new Intel E7.

It just took me by surprise, thinking about the difference one year makes. Wonder what surprises are in store for next year?
