If you use BinaryComparable.getBytes(), only the first BinaryComparable.getLength() bytes of the returned array are valid, so you will need to subset the array to that length.

Otherwise you will find junk from previous rows in your byte array.
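
To make the point concrete, here is a minimal sketch of what goes wrong. The ReusableValue class below is a hypothetical stand-in I wrote for illustration, not the Hadoop class; it only assumes what the post describes, namely that the backing array is reused from row to row and getLength() reports how many bytes are valid.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ReusedBufferDemo {
    // Hypothetical stand-in for a BinaryComparable-style value that reuses
    // its backing array between rows. This is NOT the Hadoop class itself.
    static class ReusableValue {
        private byte[] bytes = new byte[0];
        private int length = 0;

        // Overwrite the value in place; grow the array only when needed.
        void set(String row) {
            byte[] src = row.getBytes(StandardCharsets.UTF_8);
            if (src.length > bytes.length) {
                bytes = new byte[src.length];
            }
            System.arraycopy(src, 0, bytes, 0, src.length);
            length = src.length;
        }

        byte[] getBytes() { return bytes; }  // whole backing array, may hold stale bytes
        int getLength() { return length; }   // number of valid bytes

        // Safe copy: keep only the valid prefix.
        byte[] copyBytes() { return Arrays.copyOf(bytes, length); }
    }

    public static void main(String[] args) {
        ReusableValue v = new ReusableValue();
        v.set("first-row");
        v.set("2nd");

        // Wrong: reads the whole array, including leftovers from the previous row.
        String wrong = new String(v.getBytes(), StandardCharsets.UTF_8);
        // Right: reads only the first getLength() bytes.
        String right = new String(v.getBytes(), 0, v.getLength(), StandardCharsets.UTF_8);

        System.out.println(wrong);  // 2ndst-row
        System.out.println(right);  // 2nd
    }
}
```

With the real classes the same idea applies: either pass the offset and length along with the array, or take a copy trimmed with Arrays.copyOf before holding on to it.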

It might be worth documenting that in the API, just saying.

http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/BinaryComparable.html#getBytes()

I am getting excited about seeing large data systems employ InfiniBand. Granted, high-speed interconnects have always played an important role in high-performance computing. What excites me are the implications of remote memory access and the diversification of capability it provides.

For example, suppose I have a large database system that is well balanced between CPU and IO. New use cases arrive that require a very large CPU draw with low IO impact. One option is to expand the cluster, but doing so would over-provision my IO capability at significant cost (IO is usually the most expensive part). Alternatively, with a system configured to use InfiniBand, the software could employ remote memory access to effectively “beam” a parcel of data from my existing cluster to a new high-CPU cluster, which can burn through the necessary computation and “beam” back the results. That enables a cost-effective, high-performance co-processor capability.

What I find compelling is how InfiniBand abstracts this capability into a simple implementation, and the speed at which it can perform the work.

I wouldn’t say InfiniBand is new, but I do think more enterprise-ish parallel computing systems are embracing it (Teradata, Exadata). I wondered what Hadoop might get out of InfiniBand, and interestingly enough some folks have already taken a look at this (link). What I gleaned is that Hadoop’s software doesn’t embrace the necessary protocols (NIO) to take advantage of InfiniBand from a socket perspective, and it is probably a long way from embracing the remote memory access model.

10 hours of certification tests

I signed up for some beta certification tests and ended up scheduling all three this week. In total, I think I spent just shy of ten hours taking tests over the last three days. Today was especially mind melting. I was ready to quit on question 160 of 230, but I finished, though my focus had faded by the end.

It would have been nice to take notes on all the things I wasn’t sure of, for the noble purpose of going back and determining the correct behavior, but that’s not how it works.

So I will celebrate tonight and let my brain recharge with some college football tomorrow.

Subquery UDF

Problem

Users keep thinking a SQL UDF runs SQL.

They think the SQL UDF works just like the conceptual lookup.

Select Widget_Id, Get_Widget_Brand_Name(Widget_Id) From Widgets

They don’t think to write it as SQL.

Select Widgets.Widget_Id, WidgetBrands.BrandName
From Widgets Inner Join WidgetBrands on Widgets.Brand_Id = WidgetBrands.Brand_Id;

I think we take SQL for granted in this case, while users want it in a more usable form. Maybe this is the right direction to go.

Proposal

I propose a SubqueryUDF. Essentially, you would define the UDF by mapping its input parameters to a subquery’s filter/join criteria. The optimizer would unwrap the UDF into the subquery. This would allow a more streamlined way of using SQL: it would reduce total text, increase readability, and appeal to a wide variety of users.

Example

The SQL is submitted as such:

Select Widget_Id, Get_Widget_Brand_Name(Widget_Id) From Widgets

The SQL is treated by the optimizer as such:

Select Widgets.Widget_Id, WidgetBrands.BrandName From Widgets Inner Join WidgetBrands on Widgets.Brand_Id = WidgetBrands.Brand_Id;
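
As a rough illustration of the unwrapping step, here is a small sketch that expands a UDF call into a correlated scalar subquery by text substitution. Everything in it is an assumption for illustration: the $ARG placeholder, the template text, and the expand method are made up, and a real optimizer would rewrite the parse tree (and could then turn the subquery into the join shown above) rather than edit SQL text. Note that the call passes the qualified Widgets.Widget_Id so the argument is not captured by the alias inside the subquery.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SubqueryUdfSketch {
    // Hypothetical UDF definition: the subquery the function stands for,
    // with $ARG marking where the input parameter is plugged in.
    static final String TEMPLATE =
        "Select wb.BrandName From Widgets w "
        + "Inner Join WidgetBrands wb On w.Brand_Id = wb.Brand_Id "
        + "Where w.Widget_Id = $ARG";

    // Replace every call udfName(arg) in the SQL text with (template),
    // substituting arg for $ARG. Only handles arguments without parentheses.
    static String expand(String sql, String udfName, String template) {
        Pattern call = Pattern.compile(Pattern.quote(udfName) + "\\s*\\(([^()]*)\\)");
        Matcher m = call.matcher(sql);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String subquery = "(" + template.replace("$ARG", m.group(1).trim()) + ")";
            m.appendReplacement(out, Matcher.quoteReplacement(subquery));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String submitted =
            "Select Widgets.Widget_Id, Get_Widget_Brand_Name(Widgets.Widget_Id) From Widgets";
        System.out.println(expand(submitted, "Get_Widget_Brand_Name", TEMPLATE));
    }
}
```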

Core Database Functions

I work with databases probably more than the next guy. Along those lines I spent a few minutes to compose a list of functions that I think are a healthy starting point for “core functionality.” There are some things so basic I did not bother to include them (like trim). I think the list below can get the average developer pretty good mileage and it would be in the interests of most database vendors to support this list.

greatest
least

timestamp_to_epoch
time_to_epoch
date_to_epoch
epoch_to_date
epoch_to_time
epoch_to_timestamp

to_timestamp(value,format)
to_char(value,format)
to_date(value,format)
to_time(value,format)

bitwise_shift_left
bitwise_shift_right
bitwise_and
bitwise_or
bitwise_xor
bitwise_not

ascii
chr

quote_literal
quote_ident

repeat

to_hex
from_hex
to_base64
from_base64

get_byte
set_byte
get_bit
set_bit

md5

sprintf

translate
replace

sleep

row_to_json
json_to_row

get_json_object
set_json_object

row_to_xml
xml_to_row

get_xml_object
set_xml_object

concat
concat_ws

uuid
guid

if

reverse

regex_replace
regex_match

[aggregate]
bit_and
bit_or
bit_xor
bit_not
concat
concat_ws

[table function]
sequence
build_ngrams
top_ngrams
sentences
regex_split
strtok_split
string_split

ggplot2 blog

I see Hadley Wickham has launched a new ggplot2 blog.

I got to meet Hadley at a day-long tutorial session and later shared dinner with him. I am truly impressed by what he has delivered and by how he continues to foster its growth.

If you are not familiar with ggplot2, it is a graphics package for the R project.

Personally, ggplot2 was my gateway into the land of R, and I still use it almost daily. It provided the building-block type of functionality that allowed me to learn gradually. Its strength for me was how readily I could re-use each building block that I learned.

All in all, I look forward to seeing ggplot2 grow and I’m thankful for Hadley’s contributions.

Profiling Thread Activity

I’ve been working on a Java program for a little bit.

In an effort to understand what it’s doing, I’ve been monitoring the host network and CPU activity. I know what I’m looking for, but it’s not always clear why I’m not achieving maximum CPU or network consumption.

To help understand, I tried periodic stack dumps using jstack, looking for shared object contention. I found a few items there, so that was good.

I also tried generic Java profiling, which shows which methods are using the largest percentage of CPU. That was semi-helpful, but I still felt like I needed more information.

So I started looking for something that would show me thread wait/monitor/run behavior. I wanted to verify my assumed program behavior against what was really going on. What I found to help me in this case was the Java utility jvisualvm. I attached a couple of screenshots below.

Ultimately I ended up enabling remote JMX connections, then started up $JDK_HOME/bin/jvisualvm and attached to the running process. From there I went to the thread view and began running tests. The results were very helpful to me in understanding thread behavior, taking into account wait/monitor time.

What I show in these two screenshots is basically the different behavior between data consumers and producers. The measured behavior adequately matched what I had assumed.
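
For a quick look without attaching a tool, the same run/wait/monitor breakdown can be approximated from inside the JVM. This is only a sketch using the standard java.lang.management API, not what jvisualvm does internally; the class name and the one-second sampling interval are arbitrary choices of mine. BLOCKED roughly corresponds to waiting on a monitor, and WAITING/TIMED_WAITING to wait, park, or sleep.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;

class ThreadStateSampler {
    // Take one snapshot of all live threads and count them by state.
    static Map<Thread.State, Integer> sample() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        // false, false: skip locked monitor/synchronizer details to keep sampling cheap.
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            if (info != null) {
                counts.merge(info.getThreadState(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws InterruptedException {
        // Print a few samples, one second apart.
        for (int i = 0; i < 5; i++) {
            System.out.println(sample());
            Thread.sleep(1000);
        }
    }
}
```

Run from a background thread inside the program under test, a steadily high BLOCKED count points at lock contention, while mostly WAITING threads suggest consumers starved for work.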

80 Cores Already?

It seems like just last year I was excited to see servers with 24 cores (dual 6-core processors with hyper-threading) and 96GB of RAM.

I was logged onto a server yesterday and checked it out: 80 cores and 1TB of RAM. At first I had to make sure that was even possible, but it is: four 10-core processors with hyper-threading (4 x 10 x 2 = 80), the new Intel E7.

It just took me by surprise, thinking about the difference one year makes. I wonder what surprises are in store for next year?

No SQL is stupid

I was reading a blog post about why someone went from CouchDB to MySQL. I read this and couldn’t resist…

“No SQL. It’s 2012, and most queries are run from code rather than by a human sitting at a console. Why are we still querying our databases by constructing strings of code in a language most closely related to freaking COBOL, which after being constructed have to be parsed for every single query”

Why would anyone abstract a common computational process into a domain-specific language?

Has this guy ever seen a COBOL program? A two-file merge COBOL program does not look like an inner join in SQL.

Most databases cache parsed statements, so a repeated query isn’t parsed from scratch every time. Perhaps the author has never heard of a prepared statement.
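
For anyone who hasn’t seen one, a prepared statement looks roughly like this in JDBC. This is a sketch only: the WidgetBrands table and the brandNames helper are hypothetical names for illustration, and whether the parsed statement is actually reused on the server depends on the driver and the database.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

class PreparedLookup {
    // The SQL text is fixed; only the bound parameter changes between executions.
    static final String SQL = "Select BrandName From WidgetBrands Where Brand_Id = ?";

    // Prepare once, then execute many times with different parameter values.
    static List<String> brandNames(Connection conn, int[] brandIds) throws SQLException {
        List<String> names = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(SQL)) {
            for (int id : brandIds) {
                ps.setInt(1, id);  // bind the value; no string concatenation
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        names.add(rs.getString(1));
                    }
                }
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // No database is assumed here; just show the statement text.
        System.out.println(SQL);
    }
}
```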

So I’m picking on this person, perhaps to some degree out of context, but ignorance breeds stupidity and I’ve suffered enough stupidity today.

Hadoop Secure Impersonation

As I mentioned in a previous post, I’ve been working on some import/export functionality.

One side of the fence is a database and another is HDFS.

I have an authenticated user on the database who is authorized to access some data and wants it on HDFS owned by him in his home directory. How do I propagate the authentication/authorization?

Turns out that Hadoop does support a secure impersonation feature. In some sense it’s kind of close to how this database supports a proxy impersonation.

In essence, we will configure the Hadoop cluster to recognize the UNIX process account running the middle-ware component as a super-user. We will further specify the IPs that the proxy requests can originate from. Lastly, we specify which groups of users the proxy super-user can impersonate. Of course, the middle-ware components will need to invoke the security proxy.
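
The cluster-side part boils down to two proxy-user properties in core-site.xml. The sketch below just builds them as key/value pairs so the naming pattern is visible; the account name, IPs, and group are made-up examples, and the comment about the client side describes how I understand the Hadoop API rather than tested code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ProxyUserSettings {
    // Build the Hadoop proxy-user properties for one middle-ware account.
    // hosts:  comma-separated hosts/IPs the proxy requests may originate from
    // groups: comma-separated groups whose members may be impersonated
    static Map<String, String> settings(String account, String hosts, String groups) {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("hadoop.proxyuser." + account + ".hosts", hosts);
        conf.put("hadoop.proxyuser." + account + ".groups", groups);
        return conf;
    }

    public static void main(String[] args) {
        // Hypothetical values for illustration only.
        settings("etlsvc", "10.0.0.5,10.0.0.6", "analysts")
            .forEach((k, v) -> System.out.println(k + " = " + v));

        // On the middle-ware side, the Hadoop client would then wrap its HDFS
        // calls in something like:
        //   UserGroupInformation.createProxyUser(endUser,
        //       UserGroupInformation.getLoginUser()).doAs(...)
        // which needs the Hadoop libraries and is not shown here.
    }
}
```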

So that’s how it’s supposed to work. I am in the process of setting up an environment for testing. I hope things go smoothly, but I already know that once the Hadoop cluster moves to Kerberos this will break until the middle-ware process account moves to Active Directory, which will be an issue.
