Move Over SRE, We Need PRE First

The Toyota factory in San Antonio has a visitor center that showcases the Toyota idea of the "Corporate Athlete." New employees work through simple skill challenges, such as threading ropes through dowels in a set pattern within 10 seconds, and receive paid gym time to build the muscles the work requires.

The center also points out that Toyota's approach to automation is to pair humans with robots and to minimize the use of isolated robots.

Deming pointed out in the 1980s that computers and robots can produce higher-quality products, but at a higher cost than advanced quality frameworks like the Toyota Production System (TPS). Consider that a low-end Tesla is about $40K brand new versus a low-end Toyota at about $20K brand new; that price gap is what Deming was talking about.

The real secret to Toyota quality, and to others who have matched it, like Ford in the early 2000s and Hyundai in the decade after that, is "Prevention over Inspection." People talk about this concept but don't quite understand it. It is the same idea as the "Shift Left" movement in IT, which arrived decades later. Deming taught Toyota to think of every repeatable process step within a value stream as its own system (System Thinking!) with inputs, transformational throughputs, and outputs. Each repeatable step requires its own quality structure to be risk-managed, so that known past defects can be checked for and detected wherever they might reappear. This is the true application of "Prevention over Inspection" and "Shift Left." When understood and applied correctly, the Hidden Factory of waste is detected at the point of entry and eliminated altogether.
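
To make that concrete in software terms, here is a minimal sketch of a per-step quality structure, assuming nothing about TPS or The Stable Framework itself; the class, step, and check names are invented for illustration. Each repeatable step carries its own checks for defects that have escaped before, applied at the point of entry and again before handoff:

# Illustrative sketch only: a repeatable process step modeled as its own small
# system, with entry and exit checks for known past defects. Names are
# hypothetical and not part of TPS or The Stable Framework.
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional

Check = Callable[[Any], Optional[str]]  # returns a defect description, or None if clean

@dataclass
class ProcessStep:
    name: str
    transform: Callable[[Any], Any]                           # the value-adding work of the step
    entry_checks: List[Check] = field(default_factory=list)   # checks for known input defects
    exit_checks: List[Check] = field(default_factory=list)    # checks for known output defects

    def run(self, work_item: Any) -> Any:
        # Prevention over inspection: stop known defects at the point of entry.
        for check in self.entry_checks:
            defect = check(work_item)
            if defect:
                raise ValueError(f"{self.name}: rejected input - {defect}")
        output = self.transform(work_item)
        # Verify the step's own output before handing it downstream.
        for check in self.exit_checks:
            defect = check(output)
            if defect:
                raise ValueError(f"{self.name}: defect caught before handoff - {defect}")
        return output

# Example: a step whose past escapes included empty work orders. Whenever a new
# defect escapes this step, a check for it is added so it cannot pass silently again.
step = ProcessStep(
    name="enter work order",
    transform=lambda order: {**order, "status": "entered"},
    entry_checks=[lambda order: "empty order" if not order.get("items") else None],
)
print(step.run({"items": ["brake pad"], "status": "new"}))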

There is a lot of talk about SRE (Site Reliability Engineering), which also misses the point above about Prevention over Inspection. There are SRE platform apps today that show you problems after the fact--when you hit them--instead of helping you avoid them altogether. The only way to actually achieve SRE is through PRE (Process Reliability Engineering), which precedes SRE and will lead to high levels of SRE when done correctly.

Taken together, we call this SPRE, or Site and Process Reliability Engineering. The Stable Framework will help you achieve SPRE.

The Stable Framework™: Empowering Information Technology Organizations to Shift Left

In today's fast-paced and competitive digital landscape, information technology organizations are constantly seeking ways to improve software quality, accelerate time-to-market, and enhance customer satisfaction. One powerful tool that enables organizations to achieve these objectives is The Stable Framework™. This new framework supports information technology organizations in adopting a "Shift Left" approach to process quality, empowering them to integrate early, test effectively, and deliver reliable software with greater efficiency.

Understanding Shift Left:

Shift Left is a software development paradigm that emphasizes early involvement of key stakeholders, such as developers, testers, security experts, and operations teams, in the development process. It involves moving tasks that were traditionally performed later in the development lifecycle closer to the beginning. The goal is to detect and address issues as early as possible, reducing the risk of costly and time-consuming fixes later in the process. We call these time-consuming downstream fixes the "Hidden Factory."

Leveraging The Stable Framework™ for Shift Left:

1. Comprehensive Monitoring and Alerting:

Using System Thinking, The Stable Framework provides organizations with a robust error detection and remediation system. By continuously monitoring assets and workflow, organizations can identify anomalies or vulnerabilities before they impact the value stream. The framework's process-based toolsets enable the appropriate teams to take immediate action, investigate the root cause, and resolve issues before they escalate.

2. Integrated Testing Environment:

To effectively shift left, organizations require an integrated testing environment that allows for early and continuous testing. The Stable Framework guides practitioners to identify repeatable process steps and create quality structures around each one. This enables organizations to perform rigorous testing throughout the development process, ensuring that bugs and issues are caught early and reducing the likelihood of critical defects reaching the later stages of development.
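
As a hypothetical illustration of such a quality structure, a defect that once escaped a step can be captured as a permanent automated test that runs on every build; the function, values, and defect below are invented for the example:

# Hypothetical example: a past defect ("discounts over 100% were accepted")
# captured as a permanent regression test so it is caught at the earliest step
# of every future release instead of in production.
import pytest

def apply_discount(price: float, percent: float) -> float:
    # Illustrative stand-in for the real business logic under test.
    if not 0 <= percent <= 100:
        raise ValueError("discount percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)

def test_rejects_discount_over_100_percent():
    # Encodes the known past defect as an always-on check.
    with pytest.raises(ValueError):
        apply_discount(50.00, 150)

def test_normal_discount_still_works():
    assert apply_discount(50.00, 10) == 45.00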

3. Incident Management and Root Cause Analysis:

In a shift left approach, incident management and root cause analysis are essential for early issue resolution. The Stable Framework provides incident management features that enable organizations to track, document, and collaborate on incidents from the early stages. By centralizing incident management in a Process Asset Library within the framework, teams can quickly identify patterns and root causes, allowing them to implement fixes and prevent similar incidents from occurring in the future. This transfers tribal knowledge into institutional knowledge.
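
As a minimal sketch of that idea (the record fields and data below are invented, not the framework's actual schema), incidents recorded in one shared store can be grouped by root cause so recurring patterns stand out:

# Minimal sketch: incidents recorded in one shared store and grouped by root
# cause so recurring patterns become visible. Fields and data are illustrative.
from collections import Counter
from dataclasses import dataclass
from typing import List

@dataclass
class Incident:
    identifier: str
    summary: str
    root_cause: str                      # filled in after root cause analysis

def recurring_root_causes(incidents: List[Incident], threshold: int = 2) -> List[str]:
    # Root causes seen `threshold` or more times are candidates for a permanent
    # process check rather than another one-off fix.
    counts = Counter(i.root_cause for i in incidents)
    return [cause for cause, count in counts.items() if count >= threshold]

incidents = [
    Incident("INC-101", "Release rolled back", "missing config value"),
    Incident("INC-107", "Login outage", "missing config value"),
    Incident("INC-112", "Report job failed", "expired certificate"),
]
print(recurring_root_causes(incidents))  # ['missing config value']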

4. Collaboration and Knowledge Sharing:

Effective collaboration and knowledge sharing are vital for successful shift left implementation. The Stable Framework facilitates collaboration through its shared Process Asset Library, institutional knowledge, and performance console. Teams can collaborate in real-time, share insights, and leverage collective expertise to address challenges early on. The framework also allows organizations to build a centralized knowledge base, capturing best practices, incident learnings, and troubleshooting guides, promoting knowledge sharing and continual improvement.

Conclusion:

The Stable Framework™ serves as a valuable asset for information technology organizations aiming to adopt a shift left approach. By leveraging its comprehensive process definition and improvement system, asset recovery models, incident management capabilities, automation features, and collaboration tools, organizations can effectively shift left and achieve superior software quality, reduced time-to-market, and increased customer satisfaction, positioning them for success in today's rapidly evolving digital landscape.

The Stable Framework™: Empowering Organizations with Process Reliability Engineering (PRE) to achieve Site Reliability Engineering (SRE)

In today's digital landscape, where uptime and performance are critical, organizations are increasingly turning to Site Reliability Engineering (SRE) to ensure the reliability and stability of their systems. SRE combines software engineering and operations principles to create scalable, reliable systems and performant workflows. SRE, when done right, requires Process Reliability Engineering (PRE). One essential tool that helps organizations implement PRE, and therefore SRE, effectively is The Stable Framework™. This article explores how The Stable Framework™ gives organizations the ability to shift left and focus on upstream process quality to achieve their SRE goals.

1. Building Resilient Infrastructure:

The Stable Framework serves as a robust foundation for organizations aiming to build resilient infrastructure. It provides a comprehensive set of best practices, tools, and guidelines for designing, deploying, and managing reliable systems, all of which are PRE functions. By following the principles outlined in the framework, organizations can enhance the reliability and stability of their infrastructure.

2. Monitoring and Alerting:

Monitoring and alerting are crucial aspects of SRE, allowing organizations to proactively identify and respond to incidents. The Stable Framework™ offers advanced service and application monitoring capabilities, enabling organizations to gather real-time insights into system health and service performance. By implementing the monitoring practices outlined in the framework, organizations can detect issues where they occur and take corrective action before they escalate or cascade downstream, where they become more expensive to fix. We call this unnecessary downstream flow the "Hidden Factory."
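
As a simple, generic illustration, and not the framework's actual monitoring toolset, detecting an issue where it occurs can be as basic as evaluating each metric sample against a known limit the moment it arrives; the service, metrics, and thresholds below are hypothetical:

# Simple, generic illustration of detecting an issue where it occurs: each
# metric sample is evaluated against a known limit as soon as it arrives.
from dataclasses import dataclass
from typing import Optional

@dataclass
class MetricSample:
    service: str
    name: str
    value: float

# Hypothetical thresholds; in practice these would come from service level objectives.
THRESHOLDS = {
    ("checkout", "p95_latency_ms"): 500.0,
    ("checkout", "error_rate"): 0.01,
}

def evaluate(sample: MetricSample) -> Optional[str]:
    # Return an alert message if the sample breaches its threshold, else None.
    limit = THRESHOLDS.get((sample.service, sample.name))
    if limit is not None and sample.value > limit:
        return f"ALERT {sample.service}/{sample.name}: {sample.value} exceeds {limit}"
    return None

print(evaluate(MetricSample("checkout", "p95_latency_ms", 730.0)))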

3. Incident Management and Root Cause Analysis:

When incidents occur, efficient incident management and root cause analysis are vital for minimizing downtime and ensuring a swift recovery. The Stable Framework facilitates effective incident management by providing a toolset for incident tracking, collaboration, and resolution. It allows teams to streamline their incident response processes, maintain clear communication channels, and track the status of ongoing incidents. Additionally, the framework offers tools for conducting thorough root cause analysis, enabling organizations to identify the underlying issues that lead to incidents and implement preventive measures to avoid similar occurrences in the future.

4. Performance Optimization through Continual Improvement:

As organizations grow, performance stability becomes a critical challenge. The Stable Framework offers process guidance for continual improvement and performance optimization techniques. By leveraging these recommendations, organizations can effectively scale their systems to handle increased traffic and ensure optimal performance under varying workloads.

5. Automation and Tooling:

Automation plays a pivotal role in SRE, reducing manual toil and enabling efficient operations. The Stable Framework promotes the use of automation and system-thinking, step-based quality management. Automation practices such as configuration management, deployment pipelines, and infrastructure provisioning streamline operations and reduce the risk of human error.
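
Here is a minimal sketch of that idea: a pipeline of automated, verifiable steps that stops at the first failure instead of letting an error cascade downstream. The step names are invented stand-ins for real configuration management, build, and provisioning tools.

# Minimal sketch of a deployment pipeline as an ordered list of automated,
# verifiable steps replacing manual toil. Step names are illustrative only.
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], bool]]  # (description, action that reports success)

def run_pipeline(steps: List[Step]) -> bool:
    for description, action in steps:
        print(f"-> {description}")
        if not action():
            # Stop at the point of failure instead of letting the error cascade.
            print(f"FAILED: {description}")
            return False
    return True

pipeline: List[Step] = [
    ("validate configuration", lambda: True),
    ("run automated tests", lambda: True),
    ("provision infrastructure", lambda: True),
    ("deploy release", lambda: True),
    ("verify health checks", lambda: True),
]

run_pipeline(pipeline)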

6. Collaboration and Communication:

Successful SRE implementation requires strong collaboration and communication within and between teams. The Stable Framework emphasizes establishing effective communication channels, incident response coordination, and cross-functional collaboration. By adhering to these principles, organizations can foster a culture of shared responsibility and collaboration, ensuring smooth operations and rapid incident resolution.

Conclusion

The Stable Framework™ serves as a valuable resource for organizations embracing Site Reliability Engineering. By following the practices outlined in the framework, organizations can build resilient infrastructure, enhance monitoring and alerting capabilities, effectively manage incidents, optimize scalability and performance, automate operations, and foster collaboration. Implementing the Stable Framework empowers organizations to achieve their SRE goals, delivering reliable, highly available systems that meet the expectations of their users in today's demanding digital landscape.

 

The Hidden Factory in IT

The Hidden Factory is everything your group does over again because it didn't go right the first time around.

This ranges from re-doing a failed multi-year project, to re-pushing a production release which had some minor issues the first time around. Sometimes these activities are called "Fire Fighting."

Most groups I talk to tell me that about 35% of their teams' efforts are lost to this problem.

Someone must pay for this, and it's very expensive. Higher prices, lower wages, and lower shareholder dividends are some of the ways the Hidden Factory shows up. In addition, the opportunity cost of not reaching your project monetization goals roughly 33% faster means you left money and customers on the table.
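
To make that concrete, here is a back-of-the-envelope calculation. The team size and loaded cost per person are assumptions for illustration; the 35% rework figure is the one quoted above.

# Back-of-the-envelope estimate of the Hidden Factory. The team size and loaded
# cost per person are assumptions; 35% is the rework figure quoted above.
team_size = 20
loaded_cost_per_person = 150_000   # assumed annual fully loaded cost, USD
rework_fraction = 0.35             # share of effort lost to rework

annual_payroll = team_size * loaded_cost_per_person
hidden_factory_cost = annual_payroll * rework_fraction

print(f"Annual payroll:             ${annual_payroll:,.0f}")
print(f"Lost to the Hidden Factory: ${hidden_factory_cost:,.0f} per year")

# If 35% of all effort is rework, a fixed amount of useful work takes
# 1 / (1 - 0.35) of the calendar time it should; eliminating the rework lets
# the same goals be reached about 35% sooner -- roughly the third noted above.
print(f"Schedule recoverable:       {rework_fraction:.0%} of calendar time")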

The Stable Framework™ is a performance management framework for IT designed to give IT departments the tools needed to tame this wild Hidden Factory beast and bring the fire-fighting down to nearly zero, where it should be.

Read more about it here

Mike Berry

 

What are the Best Project Management Methodologies and Practices?

There are several project management methodologies and practices to choose from, and the best approach depends on the specific needs and goals of the project. Here are some of the most popular project management methodologies and practices:

 

  1. Agile: Agile is a flexible, iterative approach to project management that emphasizes collaboration, adaptability, and delivering value to the customer. Agile methodologies include Scrum, Kanban, and Lean.

  2. Waterfall: Waterfall is a linear, sequential approach to project management that involves completing each phase of the project before moving on to the next. It's a more traditional approach and is useful for projects where the requirements are well-defined and unlikely to change.

  3. Stable: The Stable Framework™ is an Operational Excellence model for project management and operations that can be combined with Agile, or can be performed stand-alone.

  4. PRINCE2: PRINCE2 is a project management methodology that provides a structured approach to managing projects, including defined roles and responsibilities, a focus on the business case, and a step-by-step approach to project delivery.

  5. PMI's PMBOK: The Project Management Body of Knowledge (PMBOK) is a framework developed by the Project Management Institute (PMI) that provides guidelines for managing projects across a range of industries and project types.

  6. OPPM: The One Page Project Manager is a spreadsheet-based approach to Project Management.

  7. Six Sigma: Six Sigma is a data-driven methodology that focuses on improving processes and reducing defects in products and services. It's often used in manufacturing and other industries where quality control is critical.

In addition to these methodologies, there are several project management practices that can help ensure project success, including:

  • Defining clear project objectives and deliverables
  • Establishing effective communication channels and regular project status updates
  • Assigning roles and responsibilities to team members
  • Developing a comprehensive project plan and schedule
  • Identifying and managing risks throughout the project
  • Monitoring and controlling the project's progress against the plan

Ultimately, the best project management methodology and practices will depend on the specific needs and goals of your project. It's important to assess the unique requirements of the project and choose the approach that's best suited to meet those needs.

Anatomy of an Execution Plan

Have you been challenged with performing a high-risk task like upgrading a prominent server, for example?

Here's an execution plan template that you can use to guide you.

I. Executive Summary
Brief overview of intended event.

II. Review of Discovery
Details of what efforts were made to research what is listed in the following sections. Meetings, vendor consultations, online resources, and conventional wisdom can be included.

III. Pre-Upgrade Procedures
Steps identified to be taken before the event.

IV. Upgrade Procedures
Steps identified to be taken during the event.

V. Post-Upgrade Procedures
Steps identified to be taken after the event.

VI. Test Plan
Verification procedures to confirm the event was a success.  This section should define the success criteria.

VII. Rollback Plan
In case the worst happens, what to do.

VIII. Situational Awareness Plan
After-the-event steps to validate the success of the event with the system's business users. This includes two-way communication between your group and the business users: announcing the success and providing contact information in case there is still a problem.

IX. Risk-Management plan
A plan listing risks associated with the steps above and recommendations as to how to lower those risks.

X. Schedule
If the event spans many hours or days, you may want to draft a schedule for the benefit of all involved. Include on the schedule the 'rollback point,' which would be the latest time a rollback could be successfully performed. Your success criteria would have to be met by this point to avoid a rollback.

Be sure the Execution Plan is in a checklist format, not a bullet-list format. Require participants in the event to 'check' completed checklist items and sign off on the sections they are responsible for.

For critical, high-risk areas (e.g., setting up replication), you may want to require two individuals to perform the checklist steps and sign their names when that section is complete.
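
Here is a minimal sketch of that checklist idea in code form, with invented item text and names; high-risk items require two signatures before they count as complete:

# Illustrative sketch of an execution plan as a signable checklist rather than
# a bullet list. The item text and names are hypothetical.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ChecklistItem:
    text: str
    high_risk: bool = False              # high-risk items require two signatures
    signatures: List[str] = field(default_factory=list)

    def sign(self, name: str) -> None:
        self.signatures.append(name)

    @property
    def complete(self) -> bool:
        required = 2 if self.high_risk else 1
        return len(self.signatures) >= required

plan = [
    ChecklistItem("Confirm full backup completed"),
    ChecklistItem("Set up replication", high_risk=True),
    ChecklistItem("Run post-upgrade test plan"),
]

plan[0].sign("A. Admin")
plan[1].sign("A. Admin")
plan[1].sign("B. Engineer")              # second signature for the high-risk step

for item in plan:
    status = "done" if item.complete else "PENDING"
    print(f"[{status}] {item.text} - signed by {', '.join(item.signatures) or 'nobody'}")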

If you like, add a 'lessons learned' section to be completed later, and keep a copy of the execution plan for historical purposes.

Mike J. Berry
www.RedRockResearch.com

Excellence over Heroics

I value Excellence over Heroics.

'Excellence' can be defined as "the crisp execution of established procedures."  Think about that for a minute.

Do you know of a software development shop where several prominent developers often stay up late into the night, or come in regularly over the weekend to solve high-profile problems, or put out urgent mission-critical fires?

The thrill of delivering when the whole company's reputation is at stake can be addictive.  I remember once staying up 37 hours in-a-row to deliver an EDI package for a bankers convention.  I was successful, delivering the application just before it was to be demo'd.  I went home and slept for 24 hours straight afterwards.

The problem with 'Heroics' is that the hero is compensating for the effects of a broken process.  Think about that for a minute.

If heroes are needed to make a software development project successful, then really something upstream is broken.

Most problems requiring heroics at the end of a project stem from improper effort estimations, inability to control scope, inadequate project tracking transparency, mismanaged Q/A scheduling, unnecessary gold-plating, or inadequate communication between the development team and the project users/stakeholders.

A well-organized development group hums along like a well-oiled machine.  Proper project scoping, analysis, design deconstruction, estimating, tracking, and healthy communication between development and the users/stakeholders will bring the excellence that trumps heroics.

Hey, I hear that Microsoft is looking for some Heroes.

Mike Berry
www.RedRockResearch.com

The Three P's of a Quality Management System

A Quality Management System, sometimes referred to as a Total Quality Management (TQM) System, is a simple concept that will dramatically improve software production quality over time.

Companies that don't have a quality system commonly find themselves reacting to production and support issues caused by omitted steps.

A simple rule of thumb is to ask yourself how many fires your development team has put out this month.  If any come to mind, then chances are you don't have a proper quality management system in place, and should read on...

I remember early in my career I struggled to get my employees to follow our procedures.  Whenever we'd encounter a production problem with our software, it would inevitably be a result of someone not having completely followed an established procedure.

We would have a big discussion about what should have happened, and about how "we can't forget to do that next time," yet we'd experience the same omission later.

I would get frustrated because I could never seem to find a way to get my team accountable for following our established procedures--until I discovered the "Quality Management System."

A Quality Management System has the following three elements (the Three P's!):


  1. Process (documented--most of us have processes or procedures we are supposed to follow.)

  2. Proof (a separate checklist, or "receipt" that the process was followed for each software release.)

  3. Process-Improvement (a discussion, and then an addition or adjustment to the documented process.)


Most companies have an established--and hopefully documented--software development process.  (If you don't, you can download one for Waterfall or Agile from my website.)  This is the first 'P' and should be in place at every established development shop.

A great question to ask the team is "How do you know the process was followed for each release?"  This is where you may get the deer in the headlights response.  This is the second 'P' and is the piece missing from most software development shops.

Think of this 'Proof' document as a checklist accompanying each software release.  The checklist would include every major step in the documented process, names of team members performing specific functions, and locations of final source code, test scripts, install files, etc.  The checklist would also require a series of quality checks.  For example: Were requirements signed off by the customer, stakeholder, tester, and developer?  Was the help file updated with the new release number and appropriate functionality?  Was the source code checked in?  Where is it located?

As problems occur, items would be added to the checklist so that the product is protected against a similar failure in the future.

The governing principle here is that a particular problem might broadside the development team once, but after the process is improved, that problem should never occur again.

For example, you might have a stored procedure that goes into production without a "Go" statement at the end.  After the error is discovered, and fixed in production, your team should have a discussion and conclude that a checkbox needs to be added to the quality document stating "All Stored Procedures Confirmed to have 'Go' at the end."

From that point on, whenever a stored procedure is moved into production, the developer presenting it must check for 'Go' statements at the end and then sign their name at the bottom of the checklist.
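
If you want to go a step further, that checklist item can itself be automated. The sketch below, with an assumed release folder path, scans the release's .sql files and flags any whose last non-blank line is not a GO batch separator:

# Hypothetical automation of the checklist item "All Stored Procedures Confirmed
# to have 'Go' at the end": scan the release's .sql files and flag any whose
# last non-blank line is not a GO batch separator. The folder path is assumed.
from pathlib import Path
from typing import List

def missing_trailing_go(release_dir: str) -> List[str]:
    offenders = []
    for sql_file in Path(release_dir).glob("*.sql"):
        lines = [line.strip() for line in sql_file.read_text().splitlines() if line.strip()]
        if not lines or lines[-1].upper() != "GO":
            offenders.append(sql_file.name)
    return offenders

if __name__ == "__main__":
    problems = missing_trailing_go("release/stored_procedures")
    if problems:
        print("Checklist item fails for:", ", ".join(problems))
    else:
        print("All stored procedures end with GO - the checklist item can be signed off.")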

This is the difference between process improvement, and hope.  Many companies view process improvement as a discussion and some verbal affirmations.  What they are really doing is "hoping."

Actually, the "act" of process improvement is physically altering a written process or procedure.  This is the real definition of process improvement--the third 'P.'

The final endpoint of a quality management system is to achieve excellence.  I've heard excellence defined once as "Crisp execution of established procedures."

You can't have excellence without procedures, proof, and process-improvement.

Mike J. Berry
www.RedRockResearch.com