Kentucky

For those who don’t know, we’ve moved to Kentucky to follow work and ultimately take the next steps toward a career in Data Science. So far we’ve been very impressed by the local upbeat attitude and general “niceness” of Lexington. The city definitely has a small town, family-oriented feel while offering a lot of entertainment. Beyond the living conditions, work as a Data Warehouse Analyst has been very fruitful. I was able to jump right in and contribute to the release spiral, authoring an ETL+Sharepoint synchronization solution using SSIS in SQL 2008 R2. SSIS is a fantastic product coming from an exclusively T-SQL project. Now we are in the throes of requirements gathering, data modeling, and determining areas for refactoring which of course means meetings, meetings, meetings. Agile is a great methodology but I’m learning just how quickly stakeholder meetings can generate a deluge of teleconferences!

Spring is (perhaps) right around the corner, and I am excited to see Kentucky’s beauty. I’m hopeful Spring will have sprung by next weekend when I hit my first Kentucky trail.

Happy trails and good hunting.

32-Bit Lamentations

The migration from 32-Bit SQL Server to 64-Bit has completed after ~6 months of red tape (government contracts, arg), cryptic SMTP errors, and a hardware failure that caused the entire server to be rebuilt and reapportioned. Now for the real work – to see what weird differences the data warehouse can illuminate!

The most apparent difference is a positive impact on ETL due to more efficient handling of the larger, ~2Gb debug logs as well as the massive index building that occurs in the temp and look up tables for all the logs we process. That equates to 2 hour gains in the ETL+reporting processes, in turn allowing expansion of the data warehouse for future log types.

One peculiar benefit of 64-bit is the proper(?) handling of wide columns defined in BCP format files when doing a bulk import. We occasionally have log entries from attempted buffer overflow attacks which can exceed various ETL parameters. Over time we have found a good balance between efficient data type lengths and leaving enough room for such attacks, but exceptions do arise. In this particular case I noticed a data source causing SQL Error 4864, “bulk load data conversion error” on both 64-bit and 32-bit servers. After some research I decided to check the BCP format file used in this import and noticed a SQLCHAR with 8000 length. Usually 8000 length is a default large value in our settings, so I decided to increase it to 999,999,999 on both servers. Curiously this alleviated the error on the 64-bit server but not on the 32-bit server! I presume this may have to do with the rampant memory mishandling issues reported in 32-bit SQL 2005, but I am surprised 32-bit wasn’t able to handle longer strings on import. Please enlighten me if you know the specifics, it is appreciated!

I plan to update this post with any other oddities that manifest while subjecting SQL 2005 to the “extreme” conditions our data sources generate. Feel free to share your findings in comments below.

Children and Children’s

Fatherhood – what a ride! Our amazing daughter entered the world in July and has kept us busy since then. We had a minor health scare that involved a trip to Children’s hospital, but all of the doctors’ concerns have resolved.

My work has shifted gears, or perhaps evolved, into being more of a manager of the stats my data warehouse produces. The closest ITIL equivalent to my role is the Service Level Manager position. Basically I watch over and resolve any service level metric infractions. I’m not sure how excited I am about the uptick in meetings and paperwork, but I feel this is the natural progression for the work I’ve done and is a testament to the maturity of the data warehouse. Not that technical work is completely out of the picture; I plan to work with BIRT here and there over the next year, as well as expand the data warehouse to additional consumers.

I don’t expect many posts over the next 6 months as I am focusing on extracurricular, de-stressing activities like hiking more sections of the Pinhoti trail or getting back into soccer.

Until next time…

Early Spring / New Tasks

A new year, many new tasks, and another life change. First and foremost, we’ll be welcoming baby #2 into the world sometime early August! We’re definitely excited and are working hard to get our house sold and all the Bowdens’ moved asap. Beside the insanity of moving a toddler and pregnant wife in the middle of what will likely be an epic-hot summer, I’ve got quite a few professional goals for this year.

First, to migrate and complete documentation of the data warehouse I work on, taking the system from a hodge-podge of TSQL scripts and jobs to robust SSIS packages. Second, to finish studying for MCTS: SQL Developer (really!) and get the cert under my belt. Third, to position the data warehouse to take advantage of SQL 2008/2010 which is arriving soon in our environment. All of this on top of my usual deliverables, of course.

Right now I’ve added some more SPC tasks on my plate, to top it off! I’m really enjoying doing linear regressions and forecasting/trend analysis with SQL. Moreover, seeing a real-world example play out nicely with the aggregations I’ve set up has been very encouraging as I grow as a data warehouse developer.

Sec+, BI vs. Dev

Well I managed to finally get another Project Euler done just in time to claim more than 1 problem per year! Now that my work transition to Lockheed has settled down and my CompTIA Security+ cert is under my belt, I “should” have more time to work on Project Euler problems. One great thing about Lockheed is access to Books 24×7, which contains all the MCTS exam study materials. Browsing through them I quickly realized the BI cert I was pursuing is definitely not the most closely aligned with my database development tasks – the database developer cert is, duh! So I’m going to be heads down to get MCTS: SQL Development done early next year, and move on to the MCITP after that.

Merry Christmas to you all in case I don’t make another post before 2012!

Project Euler #7

Problem 7 asks: “What is the 10,001st prime number?”

I spent a lot of time trying to work out a way to determine a range of numbers small enough to yeild any nth prime number, and started to write up the paragraphs that follow. What I found was that primes are tricky and will betray you! Joking aside, I did find some interesting “nth prime” approximation information that is worth sharing:

There are quite a few ways to go about approximating the number of primes less than a sufficiently large number. Check out Wolfram’s entry on prime counting functions: http://mathworld.wolfram.com/PrimeCountingFunction.html. If we know some n, then we know approximately how many primes are less than n by the following approximation

\dfrac{n}{\ln(n)}\, < \pi(n).

This is the flip side to what we really need to know, and try as I might I could not get the algebra to work out to determine which n fulfilled n/ln(n) = 10,001. However, I was pretty certain this was the idea of what I was looking for, so I searched again for known prime inequalities. Again, Wolfram served up useful info: http://functions.wolfram.com/NumberTheoryFunctions/Prime/29/0002/. This is the relationship I was looking for:

n(\ln(n) + \ln(\ln(n) - 1)\, < P_n < \, n(\ln(n) + \ln(\ln(n))).

Using this inequality we can narrow down our search for the 10,001st prime to a manageable window of possibilities.

Unfortunately the window of primes was quite large no matter which inequality I used (there are tighter bounds known), and since I hadn’t iterated to get my shortcut I had no way of knowing which prime was the 10001st. Nonetheless, knowing the upper bound of the 10001st prime will let me use an algorithm from an early Project Euler problem – the Euclid/Erastothenes Sieve! Codepad is throwing timeout errors, but you can see the code, sans results, here: http://codepad.org/UbhpCbQW

Update: I made some minor code changes that turned into significant performance boosts, but Codepad is still timing out. Here’s the new paste: http://codepad.org/jTHJ5gkY

Changing Hats

The usual contractor game of musical chairs has come and gone, leaving me thankfully employed but with a new company. Security+ and MCITP: SQL BI Developer certification testing is on the horizon now that things have settled down. So for something technical: today I was tuning an ETL process and discovered some major performance increases by changing the import method from using OPENROWSET BULK to BULK INSERT. My testing log (1.3G) imported with BULK IMPORT in about two thirds the time of the old OPENROWSET method. Admittedly, the gain was somewhat fabricated – why in the world was I trying to use a batch-less importer (OPENROWSET) to import 1.3G files?! When run times skyrocketed, I checked the application log to find numerous errors related to excessive paging. After reading some articles about performance tuning on MSDN I decided BULK IMPORT’s batching feature was what I needed to tame the excessively large files.