Wednesday, 16 September 2020

Power BI - Slowly Changing Dimension Markers, while Auto Date/Time is On

While I was working on a Power BI model, the time it took to refresh increased drastically. I’ve been working on loading historical records into a Power BI model as a Type 2 Slowly Changing Dimension. In this post I talk about why using the special null replacements commonly used in Type 2 Slowly Changing Dimensions can be a hazard with default Power BI settings.

Type 2 Slowly Changing Dimensions

If you’re not familiar with the term Type 2 Slowly Changing Dimension (SCD), it’s a modelling technique used to record copies of data as it changes over time. In a Type 2 SCD, when a change is made in the original data source, the existing row in the copy is not overwritten. Instead, the copy carries two additional columns, a “start date” and an “end date” for each version of the row. When an update is made, the latest existing row is flagged as no longer active by setting its end date, and a new row is added with the current date as its start date, holding the new state of the data.

This lets us identify what the data in a row was at a certain time, and when changes were made.
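
As a rough illustration of the mechanism described above, here is a minimal Python sketch of a Type 2 update. The CustomerVersion class and the apply_change function are hypothetical names made up for this example, and the open-ended row is marked with None (a null end date) as an assumption.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class CustomerVersion:
    """One version of a customer row (hypothetical example table)."""
    customer_id: int
    region: str
    start_date: date
    end_date: Optional[date]  # None while this version is the active one


def apply_change(history: List[CustomerVersion], customer_id: int,
                 new_region: str, changed_on: date) -> List[CustomerVersion]:
    """Close the active version of the customer and append the new version."""
    updated = []
    for row in history:
        if row.customer_id == customer_id and row.end_date is None:
            # The old row is kept, not overwritten; only its end date is set.
            row = replace(row, end_date=changed_on)
        updated.append(row)
    # The new state of the data becomes the active version.
    updated.append(CustomerVersion(customer_id, new_region, changed_on, None))
    return updated
```

Calling apply_change on a one-row history returns two rows: the original row with its end date filled in, and a new active row starting on the change date.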

Because we often want to be able to filter against both the start and end dates, for example “which customers currently live in Southland”, we don’t want a null start date or end date.

In SQL Server, I use 31st December, 9999 as the latest row’s finish date, and 1st January, 1900 as the earliest row’s start date. This means I can filter for versions of the rows that were active at a certain date and/or time.
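
To show why the marker dates make filtering simpler, here is a small Python sketch. The names fill_markers, is_active, regions_on and the sample VERSIONS rows are hypothetical, and treating the start date as inclusive and the end date as exclusive is an assumption rather than something stated above.

```python
from datetime import date

MIN_DATE = date(1900, 1, 1)    # stands in for a null start date
MAX_DATE = date(9999, 12, 31)  # stands in for a null end date


def fill_markers(start, end):
    """Replace null (None) dates with the marker dates."""
    return (start or MIN_DATE, end or MAX_DATE)


def is_active(start, end, as_of):
    """True if the version covers as_of (start inclusive, end exclusive)."""
    start, end = fill_markers(start, end)
    return start <= as_of < end


# Hypothetical versions of one customer: (region, start date, end date).
VERSIONS = [
    ("Otago", None, date(2020, 10, 1)),
    ("Southland", date(2020, 10, 1), None),
]


def regions_on(as_of):
    """Regions of the versions that were active on the given date."""
    return [region for region, start, end in VERSIONS
            if is_active(start, end, as_of)]
```

With the markers in place, a single range comparison is enough, and no separate null check is needed for the earliest or the current row.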

Implementing a Type 2 Slowly Changing Dimension in Power BI

I didn’t think much of adding a Power Query step to replace null start and end date times with 01/01/1900 and 31/12/9999 respectively.

I later found the refresh time for my Power BI model felt unacceptably long on my development workstation. It’s a big model, and I put it down to recently added tables.

I’m starting to get to grips with DAX Studio, so I used its View Metrics feature to see whether there were any columns in my model I could trim down in precision or distinct values. I was surprised to find that the local date tables, whose names start with “LocalDateTable_”, consumed most of the model size. The number of unique values in two of them was huge: the ones behind the start and end dates of the Type 2 SCD table.

DAX Studio told me the minimum value for the Date column of the local date table was 01/01/1900, and the maximum was 31/12/9999, and it had a very large number of rows.

An example


  1. Create a table like below.
  2. Use Replace Values to replace the nulls with the minimum / maximum dates
  3. Select Save and Apply.

Note that before the above replace values step is added, the data file is tiny (45KB). After, it is much larger (26MB).



Whenever we have a date or datetime column in Power BI, and have the "Auto Date/Time" feature turned on, Power BI creates a separate hidden date table for each of those columns. This is so that time intelligence, and visuals that show dates, function properly.

In my screenshot above, once the nulls are replaced with the special values, the dates range from 01/01/1900 - 10/01/2020 in one column, and from 10/01/2020 - 12/31/9999 in the second. In response, Power BI created two date tables, one with 44,195 rows and another with 2,914,635 rows. Each holds one row per day for every full calendar year that its column’s values span.
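
The two row counts can be checked with a little date arithmetic. The Python sketch below assumes each hidden date table runs from 1 January of the earliest year in its column to 31 December of the latest year, with one row per day; auto_date_table_rows is a hypothetical helper name, not a Power BI function.

```python
from datetime import date


def auto_date_table_rows(min_value: date, max_value: date) -> int:
    """Rows in a table with one row per day across whole calendar years."""
    first_day = date(min_value.year, 1, 1)
    last_day = date(max_value.year, 12, 31)
    return (last_day - first_day).days + 1


# Only the year of the 2020 boundary date matters for the result.
start_rows = auto_date_table_rows(date(1900, 1, 1), date(2020, 10, 1))
end_rows = auto_date_table_rows(date(2020, 10, 1), date(9999, 12, 31))

print(start_rows)  # 44195
print(end_rows)    # 2914635
```

Both results match the figures reported for the two local date tables, which is why a single 31/12/9999 marker is enough to produce a table of nearly three million rows.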

I used the below query to get these figures.

EVALUATE SUMMARIZE('LocalDateTable_5f4576c5-f0ec-4966-8499-2c0f7a8b3505',
    "Rows", COUNT('LocalDateTable_5f4576c5-f0ec-4966-8499-2c0f7a8b3505'[Date]),
    "Min Date", MIN('LocalDateTable_5f4576c5-f0ec-4966-8499-2c0f7a8b3505'[Date]),
    "Max Date", MAX('LocalDateTable_5f4576c5-f0ec-4966-8499-2c0f7a8b3505'[Date])
)


The two local date tables held 196.94806MB of data, according to the “Col Size” column in the View Metrics (VertiPaq Analyzer) feature of DAX Studio.

Here's the before shot,


Here's the after shot.


One question I do still have: I can't see why my table's “Version #” column is so large in the first screenshot, but so much smaller in the second. The "StartDate" and "EndDate" columns from the same table don't show that behavior. In both the before and after, this column holds only two rows, with the values “a” and “b”.

With the couple of million additional rows in those local date tables, the loading time of this basic dataset also went up from 3 to 9 seconds.

Interpretation

Okay, I’ve heard that large data models should have a custom date table, but I’ve never seen a reason why. Here it was made very clear to me - using a local date table for a large range of values will slow down my reports. I expect adding a lot of date columns will too.

Response

In response to the above, I’ve added a custom date table, marked it as the date table, and set up relationships to it from my date columns. I’m also becoming more fluent in DAX, so I feel more comfortable using a single role-playing date table, and activating the inactive relationships within my DAX expressions so each query uses the right one.

I’m still going to keep using the local date tables when I’m working with a simple dataset without large ranges of dates.

More Information

I’ve since discovered a great video on this by Guy in a Cube, “REDUCE Power BI dataset size by 60% with ONE CHECKBOX???”. See that video for information on how to turn off Auto Date/Time functionality for a report, or for all new reports you create in Power BI Desktop.

Tuesday, 14 April 2020

T-SQL Tuesday #125 - Unit testing databases - How I Found Value in Unit Testing a Complex Database Procedure


I’m writing this post as a response to Hamish Watson’s T-SQL Tuesday #125 invitation. Hamish’s question is this: is unit testing worth implementing in databases?

I'd like to write specifically about two places I’ve considered implementing unit testing: a complex data cleanup procedure, and the mass of stored procedures behind SQL Server Reporting Services reports. I believe that unit testing is absolutely required for a large complex process, and encourages several good practices as a side-effect. In the day-to-day development of the stored procedures behind SSRS reports, I’m less sure of the value of a comprehensive suite of unit tests.

When I started working as a junior SSRS developer nine years ago, I didn’t have any notion that unit testing was a thing. The reports ran just fine, and I fixed them where any errors occurred. There was no need to write a separate piece of code after I’d written a report to validate that it worked. Because the queries were hard-coded into the reports, there was also no way to do any sort of database unit testing, so that wasn’t a concern.

Below I’ve tried to document the phases of my relationship with unit testing my database code, and why I don’t write unit tests for the stored procedures behind SSRS reports.

Phase #1. Discovery, and my response

I’ve worked with application developers who chose not to unit test. When I learned about the importance of unit testing in software development from the outside world - a talk by Hamish Watson, YouTube, blog posts etc. - I believed it was a given that developers should be writing unit tests. I felt like the developers I worked alongside were not doing their jobs properly. I remember one instance where testing would have detected that a feature was nonfunctional early, instead of the days or weeks it took to identify. In my newfound understanding of unit testing, they were supposed to be in a position to show me how to write unit tests.

I remember one short conversation between myself and that team about unit testing and situational awareness, where one of the developers looked at me perplexed and maybe slightly disgusted. I don't remember exactly what was said, but looking back I expect the way I communicated was not helpful or, on reflection, fair.

At some point, I realised that I needed to stop beating the drum for my conviction that they should be writing unit tests while I, as a database and reports developer, didn’t need to. I suspect that was in 2017: my Google search history shows that I watched “TDD for those who don't need it” in that year, a talk I highly recommend, and one which challenges the mentality I had.

Phase #2. Failure managing a complex data cleansing project.

In 2017 I was tasked with a massive data quality improvement cleanup project. There were many duplicate records in a master table for one of our systems, with many dependencies. Several of them were in tables for areas of that system that I wasn’t familiar with, areas which I barely understood. This level of complexity was far outside of my previous experience.

First I started designing a process that merged duplicate records from the bottom up, using a batch-based approach so that it was fast: fixing all rows in each of the lowest-level tables first, then working up. When I realised that sometimes the parent objects wouldn’t exist under the entry I wanted to merge a duplicate into, I stopped, threw out the bottom-up progress, and started working top-down.

Because I wrote much of the procedure top-down, it took a long time to see results and get any outside feedback. If one of the statements was poorly coded, or a nuance had been overlooked, I could cause significant damage and might not notice for days, weeks, or months after the problem had been introduced, most likely only through reports from the end users who knew the data best.

It was seriously painful for those around me who were waiting on that project to complete. The problem caused duplicate rows to show on the UI, prevented a process change, and required the developers to implement some complex workarounds that I didn’t understand to hide the data quality issue during the months that I was working on this. The flip-flop in architecture and the delays were, in part, the result of knowing how serious this code was, and how unlikely it was that all of my code would just work when I eventually hit that execute button. Not being able to hold a picture of the whole system in my head, due to its size, added to this anxiety.

During the development of this system, I became a fan of the television show Can’t Pay We’ll Take it Away, as the task that I’d already sunk so much company time into was becoming chronically behind schedule. I also put in some serious overtime to try to get that project complete. I need to note that I was driven by a significant moral obligation to get this done rather than by fear, and the overtime was not requested by my employer.

Phase #3. Regaining control using tSQLt.

I remember working on this project on the afternoon of December 25th, 2018, and realising I had been doing the same thing on that afternoon a year earlier. It no longer seemed cool, at all. I needed to find a better way to work: something more sustainable, that would help me make verifiable progress. I recognised that unit testing could provide that. I need to note that this project wasn’t the only thing I did that year; it was mostly sidelined by other tasks during business hours.

After watching the Pluralsight course Unit Testing T-SQL Code with tSQLt by Dave Green, I implemented tSQLt tests for all of the components of the data cleanup procedures. Some of the tests were written using TDD, and others after the fact. The tests surfaced a good crop of mistakes early on in procedure development, and I fixed those until the tests all passed. I love how easy tSQLt makes it to build test suites, and appreciated being able to see actual vs expected variations in the test outcome.

More significantly, writing unit tests gave me the confidence that I needed when it came to clicking that green triangle on the execute button. Knowing that I had thoroughly tested the individual components gave me some confidence that the procedure would function as a whole.

tSQLt doesn’t support linked servers, so I did need to move databases and remap synonyms to references in those databases, and there were some loopback issues that came up later in dry runs, but those were the limited final pieces on the end of the testing process.

During that project I also learned the importance of modularization, in my case using inline TVFs, to reduce both code complexity and lines of code. This helped speed up the development of both the unit tests and the actual stored procedures. Breaking complex stored procedures into smaller, more testable blocks of code is also a good pattern that the work of implementing unit testing reinforced for me.

I still chose to run the merge procedures before daylight, and over several days, but none of the code I had written and tested needed to be redone, and none of the critical failures I was concerned about eventuated. I was glad to see the project complete without significant data damage, if you ignore the serious schedule overrun, and a minor incident around which fields to keep in one table for a merge conflict.

Phase #4. Post-project, returning to business as usual

I work with over 200 SQL Server Reporting Services reports. The volume of reports, and the wide range of other responsibilities my team is tasked with, has resulted in limited capacity for ownership; instead we fall back on taking tickets and responding with a report, without a good understanding of the reports that have already been produced. This has led to some unhelpfully specific reports, some essentially duplicate, and some reports that have not fulfilled their need for a long time, but still get enough user hits for them to show as in use and avoid being pruned away. In other terms, the reports have been regarded as somewhat disposable.

A side-effect of this is that the SQL queries behind these reports are hardcoded into the reports. At 2019’s SQL Saturday in Christchurch, a panel with Hamish Watson and Greg Low advocated unanimously for getting SQL out of hardcoded queries in SSRS reports, and into stored procedures, so they can be validated. We’ve invested some work in pulling the queries out into their own stored procedures, and getting those stored procedures under source control. Now we’re down to a very nice 0 Errors, and a few warnings in our source control project. We now get a warning if we reference an object that doesn’t exist. It is an easy way to make a change to where an external synonym points, and get quick notice if that object isn’t recognised. This helps me have some confidence the reports we do have under source control will run when requested.

Writing unit tests for all of the 200+ stored procedures behind these reports, as we interact with them, seems like unnecessary overhead. It would tell us if the output changes unexpectedly, but building enough test data to handle the different edge cases and paths in the reports seems like a lot of resource that we don’t have right now. The SSDT database project warns about invalid object references, which I have found to be of great value for detecting broken reports.

We don’t often have report failures causing us to scramble, or reports needing rework due to poor code quality or invalid references. Users are somewhat happy to tell us when reports aren’t behaving in the way they expected, and these deviations are usually due to a difference in understanding between the user and the person who wrote the report.

I do believe knowing if code is in error is a key element of situational awareness, alongside good monitoring infrastructure. We have our reporting stored procedures in an SSDT project with no errors; this doesn’t test that the procedures are functional and operating correctly, as unit tests would, but it gets most of the way there with little effort needed to maintain. Adding unit tests would tell us if a stored procedure has deviated from known good behavior, given a known input and output, so I do see some value there.

Why I do not have unit tests for the stored procedures behind my SSRS reports

We are not facing an issue regarding the reliability of the existing report code. The issue we do have is that we don’t have coherence across our enterprise reporting solutions. Users don’t know where to find reports, or which ones to trust for their particular need. That is where we need to be investing resources right now - learning what needs each report fulfills, and learning how we can better facilitate our users’ need to find the information they’re looking for, across our various solutions.

Back when I was starting out as a database and report developer, I had nothing else on my plate. I could focus just on writing reports, so maybe I could have implemented unit testing then. Now, I’m learning how to be a somewhat acceptable manager to my direct report, improving workflow, optimising use of the BI/reporting solutions, producing BI work, and managing other database-related priorities. I am also now responsible for influencing the patterns of those who work around me, so maybe I should be writing unit tests to encourage them.


Summary

I’ve seen the wonders of unit testing with a big-bang complex project, but in the slow burn of day-to-day report development, I feel that the competing priorities need my focus more. I’d like to do unit testing of my reporting stored procedures, but I don’t think implementing this is my highest need right now.

Sunday, 29 March 2020

Getting C# Jupyter Notebooks running on Ubuntu 18.04

Why C# on Jupyter Notebooks?

I’ve often heard of Jupyter, or Jupyter-like, notebooks. I've briefly evaluated the SQL Server notebooks in Azure Data Studio, and I often hear of notebooks being used in data science.

The primary notebook platform, Jupyter Notebook, only supports Python out of the box, and I haven’t yet invested in learning that language. This has previously been a stumbling block to me learning more about notebooks.

At a recent SQL Saturday precon, notebooks were heavily used as a part of the training material, and I got to experience enough to understand the benefit of using notebooks: annotated, bite-size, interactive, hands-on documentation.

I've been planning an introductory presentation on Azure Cosmos DB for our web developers, and Jupyter notebooks seem to be a great fit for that demo, with the exception that Jupyter notebooks use Python by default, not C#, the developers' native language. Why not use Visual Studio / VS Code? I don't want the web developers to be judging my console application development abilities; I want them to see how the Cosmos DB client can be interacted with in C#.

Azure Cosmos DB has a C# Notebook sample, but that appears to be broken at present, and I can’t work out why. The idea is simple enough though - it is just a notebook running C#. After a little searching on Google, I found that the .NET team have put out instructions on getting a C# kernel (runner) installed in standard Jupyter Notebooks. Here are the steps I followed to get that up and running.

Steps to create a C# Jupyter Notebooks environment running on Ubuntu:

Download and Install Anaconda

  • Download the Anaconda installer from https://www.anaconda.com/distribution/#linux (I chose the Python 3.7 installer) 
  • Make the script executable: Right-click on the .sh installer file, and select Properties > Permissions > Allow executing file as program. 
  • Open Terminal, navigate to the folder containing the Anaconda installer, and execute the .sh file 
  • I ran the two commands given as optional in the Anaconda installer. 
  • Run jupyter notebook from Terminal. All being well, Jupyter Notebook will load.
  • Select New, and note that only “Python 3” is an option. Additional engines, called “kernels” can be installed to add additional languages to this list. 
  • In Terminal, press CTRL+C to kill Jupyter Notebook.

Install dotnet interactive and add C# and PowerShell support to Jupyter Notebooks

  • Install the .NET Core 3.1 SDK by following the “Register Microsoft Key and Feed” and “Install the .NET Core SDK” steps at “Install .NET Core on Ubuntu 18.04 package manager - .NET Core”
  • Run the command to install dotnet interactive listed at dotnet/interactive
  • Run the command to add the C# and PowerShell kernels to Jupyter Notebooks ( dotnet interactive jupyter install )
  • Check the C# and PowerShell kernels have been added with the command jupyter kernelspec list
  • Run the command jupyter notebook from Terminal again
  • Select the New menu item, and you should now see “.NET (C#)” as an option. Select this.
  • Enter Console.WriteLine("Hello from C#!"); into the box beside In [ ]:, and click the Run button.
  • The In [ ]: text should change to In [ * ]: to indicate that the block is now running.

You now have a C# Jupyter Notebook running in Ubuntu. To try PowerShell, switch to PowerShell and replace that sample with Write-Host("Hello from PowerShell!");
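
The kernel check in the steps above can also be scripted from the Python that ships with Anaconda. This is a minimal sketch: it assumes the JSON printed by jupyter kernelspec list --json has a top-level "kernelspecs" object whose entries carry a "spec" with a "display_name", and the “.NET (PowerShell)” display name is an assumption modelled on the “.NET (C#)” name seen in the New menu. The function names are hypothetical.

```python
import json
import subprocess


def installed_display_names(kernelspec_json):
    """Display names of every kernel described in the kernelspec JSON text."""
    specs = json.loads(kernelspec_json)["kernelspecs"]
    return sorted(info["spec"]["display_name"] for info in specs.values())


def missing_kernels(kernelspec_json,
                    expected=(".NET (C#)", ".NET (PowerShell)")):
    """Expected display names that are not installed (names are assumptions)."""
    names = installed_display_names(kernelspec_json)
    return [name for name in expected if name not in names]


def read_kernelspecs():
    """Run the jupyter CLI and return its JSON output as text."""
    result = subprocess.run(
        ["jupyter", "kernelspec", "list", "--json"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout
```

On the machine where Jupyter is installed, missing_kernels(read_kernelspecs()) returns an empty list when both kernels are registered.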

There we have it - below is a screenshot of the C# tutorial “Tutorial: Build a .NET console app to manage data in Azure Cosmos DB SQL API account” running successfully from a Jupyter notebook.