Category: tech


I am converting some primitive servlet-based web pages into something at least partially better.

I was advised to use Bootstrap which has been good.

I decided to take a small step from servlets and went to JSP pages using JSTL. So far so good.

I hit my first hurdle around pagination. I ended up using DisplayTag to help with that. And you guessed it, so far so good.

It took me some time to combine all the elements for the pagination. Here’s my screenshot. I’ll add the code to the end of the post.

[screenshot]

I ended up saving my query result as a session object to help speed things up. Hopefully I don’t regret that later.
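In case it’s useful, here’s a rough sketch of what that session caching could look like (the class and method names are mine, not from the actual app):

import java.util.List;
import javax.servlet.http.HttpServletRequest;

// Minimal sketch: stash the query result in the HTTP session so DisplayTag
// can page and sort it on later requests without re-running the query.
public class LogResultCache {

    /** Holder bean; ${sessionScope.logs.rows} in the JSP resolves to getRows(). */
    public static class LogResult {
        private final List<?> rows;
        public LogResult(List<?> rows) { this.rows = rows; }
        public List<?> getRows() { return rows; }
    }

    public static void cache(HttpServletRequest request, List<?> rows) {
        request.getSession().setAttribute("logs", new LogResult(rows));
    }
}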

My last unsolved challenge is form processing around a bunch of optional search criteria. Will see how that goes.

Hopefully this will save someone some time if they are in similar position.

<display:table
    name="${sessionScope.logs.rows}"
    pagesize="${rowLimit}"
    excludedParams="refresh"
    class="table table-bordered table-condensed table-hover table-striped">

    <display:setProperty name="paging.banner.no_items_found"><div class="pagination">No {0} found.</div></display:setProperty>
    <display:setProperty name="paging.banner.one_item_found"><div class="pagination">One {0} found.</div></display:setProperty>
    <display:setProperty name="paging.banner.all_items_found"><div class="pagination">{0} {1} found, displaying all {2}.</div></display:setProperty>
    <display:setProperty name="paging.banner.some_items_found"><div class="pagination">{0} {1} found, displaying {2} to {3}.</div></display:setProperty>
    <display:setProperty name="paging.banner.full"><div class="pagination"><ul><li><a href="{1}">First</a></li><li><a href="{2}">&laquo;</a></li><li>{0}</li><li><a href="{3}">&raquo;</a></li><li><a href="{4}">Last</a></li></ul></div></display:setProperty>
    <display:setProperty name="paging.banner.first"><div class="pagination"><ul><li class="disabled"><span>First</span></li><li class="disabled"><span>&laquo;</span></li><li>{0}</li><li><a href="{3}">&raquo;</a></li><li><a href="{4}">Last</a></li></ul></div></display:setProperty>
    <display:setProperty name="paging.banner.last"><div class="pagination"><ul><li><a href="{1}">First</a></li><li><a href="{2}">&laquo;</a></li><li>{0}</li><li class="disabled"><span>&raquo;</span></li><li class="disabled"><span>Last</span></li></ul></div></display:setProperty>
    <display:setProperty name="paging.banner.page.separator"></li><li></display:setProperty>
    <display:setProperty name="paging.banner.page.selected"><span class="active">{0}</span></display:setProperty>
    <display:setProperty name="paging.banner.onepage"><div class="pagination">{0}</div></display:setProperty>

    <display:column property="log_id" title="ID" sortable="true" />
    <display:column property="log_ts" title="Timestamp" sortable="true" />
    <display:column property="log_type" title="Type" sortable="true" class="inputSuccess" />
    <display:column property="source_thread_name" title="Thread" sortable="true" />
    <display:column property="source" title="Source" sortable="true" />
    <display:column property="log_msg" title="Message" sortable="true" />
</display:table>

 

 

So I’m involved in the broader sphere of “interactive queries.” This sphere includes established technologies like MPP databases and less established technologies leveraging Hadoop.

Regarding the Hadoop technologies, it seems Cloudera’s Impala is winning the internal marketing battle. That doesn’t mean much, as the powers that be and recent momentum favor Hortonworks’ Stinger.

As I get more engaged in this, I’m getting a little smarter about what these two products bring. My impression is that Impala is presently more ready; however, Stinger is not far behind.

Both products promote their own columnar format — Parquet and ORCfile. My sense is ORCfile offers more sophistication but I couldn’t really defend that position as I don’t know enough. To me the interesting aspect with these file formats is the encoding of data demographics or statistics the query planner can leverage. Otherwise I expect the compression / columnar effects to be roughly the same.

Both products want to execute queries in a non-MR framework. Impala can do it today, while Stinger needs Hadoop 2.0 and Tez. It seems there will still be work for Impala, though, in supporting itself within Hadoop 2.0. The effectiveness of either solution is unclear to me. I certainly think it is a step in the right direction, but I question whether either solution really covers enough functionality to be intrinsically and effectively leveraged by queries in the wild.

I find it interesting how the inability to do updates or selective deletes is treated as a non-issue. If a database couldn’t do a delete or update, we probably wouldn’t call it a database. But I guess we are talking about interactive queries and not databases. My experience suggests that building rich data sets often involves update and delete statements, though.

On the MPP side, we are arguably oversubscribed and consequently have many opportunities around queuing efficiency. My Hadoop colleagues brush off the potential of this problem on Hadoop. I think an oversubscribed Hadoop system will have just as many problems as an MPP database in terms of queuing efficiency. It’s unclear to me how queues are managed with Tez or Impala (non-MR frameworks).

I think Hawq is an interesting approach. I don’t know much about MapR or Hadapt, but I imagine they might have some novel concepts too. The direction I look for is a single logical platform composed of Hadoop and database technologies. I would suggest the database technologies are not Hive but a technology with MPP database origins. I think this is solvable through intelligent hardware design, a well-designed data integration layer, and query optimization work to include statistics, cost-based plans, predicate pushdown, and caching.

So I intend to get more folks directly comparing their interactive queries between MPP databases and Hadoop. I want to see the data points. I’m curious how the data points will influence their decisions on where to run their interactive queries if at all.

Anyway, some thoughts in this space. And remember…

If you use BinaryComparable.getBytes(), you will need to subset that array to the length in BinaryComparable.getLength().

Otherwise you will find junk from previous rows in your byte array.

Might be worth documenting the API, just saying.

http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/BinaryComparable.html#getBytes()
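A small sketch of what I mean (the wrapper class and names are mine; getBytes() and getLength() are the real API):

import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import org.apache.hadoop.io.BinaryComparable;
import org.apache.hadoop.io.Text;

public class BinaryComparableExample {

    // getBytes() returns the backing buffer, which can be longer than the
    // current value, so copy only the first getLength() bytes.
    public static byte[] valueBytes(BinaryComparable value) {
        return Arrays.copyOf(value.getBytes(), value.getLength());
    }

    public static void main(String[] args) {
        Text t = new Text("a much longer value from a previous row");
        t.set("short".getBytes(StandardCharsets.UTF_8)); // reuses the larger backing buffer
        System.out.println(t.getLength());               // 5
        System.out.println(t.getBytes().length);         // old buffer size, larger than 5
        System.out.println(valueBytes(t).length);        // 5
    }
}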

I am getting excited about seeing large data systems employ InfiniBand. Granted, high-speed interconnects have always played an important role in high-performance computing. What’s exciting for me are the implications of remote memory access and the diversification of capability it provides.

For example, suppose I have a large database system that is pretty well balanced in CPU and IO. New use cases arrive that require a very large CPU draw with low IO impact. One option might be to expand the cluster, but doing so would oversubscribe my IO capability at significant cost (IO is usually the most expensive part). Alternatively, with a system configured to use InfiniBand, the software could employ remote memory access to effectively “beam” a parcel of data from my existing cluster to a new high-CPU cluster, which can burn through the necessary computation and “beam” back the results, enabling a cost-effective, high-performance co-processor capability.

What makes this compelling to me is how InfiniBand abstracts this capability into a simple implementation, and the speed at which it can perform this work.

I wouldn’t say InfiniBand is new, but I do think more enterprise-ish parallel computing systems are embracing it (Teradata, Exadata). I wondered what Hadoop might get out of InfiniBand; interestingly enough, some folks have already taken a look at this (link). What I gleaned is that Hadoop’s software doesn’t embrace the necessary protocols (NIO) to take advantage of InfiniBand from a socket perspective, and it is probably a long way away from embracing the remote memory access model.

10 hours of certification tests

I signed up for some beta certification tests and ended up scheduling all three this week. In total, I think I spent just shy of ten hours test taking over the last three days. Today was especially mind melting. I was ready to quit at question 160 of 230, but I finished, though my focus was diminished by the end.

It would have been nice to take notes on all the things I wasn’t sure of, for the noble purpose of going back and determining the correct behavior, but that’s not how it works.

So I will celebrate tonight and let my brain recharge with some college football tomorrow.

Subquery UDF

Problem

Users keep thinking a SQL UDF runs SQL.

They think the SQL UDF works just like the conceptual lookup.

Select Widget_Id, Get_Widget_Brand_Name(Widget_Id) From Widgets

They don’t think to write it as SQL.

Select Widgets.Widget_Id, WidgetBrands.BrandName
From Widgets Inner Join WidgetBrands on Widgets.Brand_Id = WidgetBrands.Brand_Id;

I think we take SQL for granted in this case, and we want it in a more usable form. Maybe this is the right direction to go.

Proposal

I propose a SubqueryUDF. Essentially, you could define the UDF by mapping input parameters to a subquery’s filter/join criteria. The optimizer would unwrap the UDF into the subquery. This functionality would allow a more streamlined way of using SQL. It would reduce total text, increase readability, and appeal to a wide variety of users.

Example

The SQL is submitted as such:

Select Widget_Id, Get_Widget_Brand_Name(Widget_Id) From Widgets

The SQL is treated by the optimizer as such:

Select Widgets.Widget_Id, WidgetBrands.BrandName From Widgets Inner Join WidgetBrands on Widgets.Brand_Id = WidgetBrands.Brand_Id;

 

Core Database Functions

I work with databases probably more than the next guy. Along those lines, I spent a few minutes composing a list of functions that I think are a healthy starting point for “core functionality.” There are some things so basic I did not bother to include them (like trim). I think the list below can get the average developer pretty good mileage, and it would be in the interests of most database vendors to support this list.

greatest
least

timestamp_to_epoch
time_to_epoch
date_to_epoch
epoch_to_date
epoch_to_time
epoch_to_timestamp

to_timestamp(value,format)
to_char(value,format)
to_date(value,format)
to_time(value,format)

bitwise_shift_left
bitwise_shift_right
bitwise_and
bitwise_or
bitwise_xor
bitwise_not

ascii
chr

quote_literal
quote_ident

repeat

to_hex
from_hex
to_base64
from_base64

get_byte
set_byte
get_bit
set_bit

md5

sprintf

translate
replace

sleep

row_to_json
json_to_row

get_json_object
set_json_object

row_to_xml
xml_to_row

get_xml_object
set_xml_object

concat
concat_ws

uuid
guid

if

reverse

 

regex_replace

regex_match

[aggregate]
bit_and
bit_or
bit_xor
bit_not
concat
concat_ws

[table function]
sequence
build_ngrams
top_ngrams
sentences
regex_split
strtok_split
string_split

ggplot2 blog

I see Hadley Wickham has launched a new ggplot2 blog.

I got to meet Hadley for a day long tutorial session and later share dinner. I am truly impressed by what he has delivered and by how he continues to foster its growth.

If you are not familiar with ggplot2, it is a graphics package for the R project.

Personally, ggplot2 was my gateway into the land of R, and I still use it almost daily. For me, it provided the building-block type of functionality that allowed me to gradually learn. Its strength for me was how readily I could reuse each building block I learned.

All in all, I look forward to seeing ggplot2 grow and I’m thankful for Hadley’s contributions.

Profiling Thread Activity

I’ve been working on a Java program for a little bit.

In an effort to understand what it’s doing, I’ve been monitoring the host’s network and CPU activity. I know what I’m looking for, but it’s not always clear why I’m not achieving maximum CPU or network consumption.

To help understand, I tried periodic stack dumps using jstack, looking for shared-object contention. I found a few items there, so that was good.
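Something as simple as this works for the periodic dumps (the pid and interval are just placeholders):

# Dump the stacks of a running JVM every five seconds for later review.
while true; do
  jstack 12345 >> stack-dumps.txt
  sleep 5
done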

I also tried generic Java profiling, which shows which methods are using the largest percentage of CPU. That was semi-helpful, but I still felt like I needed more information.

So I started looking for something that would show me thread wait/monitor/run behavior. I wanted to verify my assumed program behavior against what was really going on. What I found to help me in this case was the Java utility jvisualvm. I’ve attached a couple of screenshots below.

Ultimately I ended up enabling remote JMX connections, then started up $JDK_HOME/bin/jvisualvm and attached to the running process. From there I went to the thread view and began running tests. The results were very helpful to me in understanding thread behavior, taking into account wait/monitor time.
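For reference, the kind of JVM flags I mean look like this (the port and jar name are placeholders, and skipping auth/SSL is only sensible on a trusted test box):

# Illustrative flags to allow a remote, unauthenticated JMX connection
# that jvisualvm can attach to.
java \
  -Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=9010 \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false \
  -jar my-program.jar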

What I show in these two screenshots is basically the different behavior between data consumers and producers. It adequately verified my assumed behavior against the measured behavior.

80 Cores Already?

Seems like just last year I was excited to see servers with 24 cores (dual 6-core processors with hyper-threading) and 96 GB of RAM.

I was logged onto a server yesterday and checked it out: 80 cores and 1 TB of RAM. First I had to make sure that was even possible, but it is: four 10-core processors with hyper-threading, the new Intel E7.

It just took me by surprise, thinking about the difference one year makes. I wonder what surprises are in store for next year.
