Sunday, April 26, 2015

Ignored OS Component Shows Itself

It was a new feature that was pushed into production in a short time: two days. I designed the tests for it and tested it. In parallel, fellow testers tested it too, because I wanted them to see the shortfalls of my tests and the risks. All of a sudden, close to one month after the release, one noon the test team received a High Priority message indicating a production problem, and it was in this feature.

Fellow testers took it up, switched to the production environment, and noticed it. I asked whether it was reproducible consistently, and it was. The next step was to switch to a different client machine and environment, boot into the production environment, and see whether it was reproducible there.

Now it was not reproducible on a few client machines, but it was observed on a few others. In the meantime, I observed the server logs for around an hour and did not see anything different from what I had been seeing for the last month. I pushed the production branch to the test environment, ran the sequences, and noticed the problem in the staging environment as well.
This was a major clue for me. It indicated a change on the client machines, those with users as well as those with programmers and testers: some showed the problem and some did not, with the same code.
A few people walked to my desk and asked, "Why was this missed? We told the testing team it is a major change and it should be tested thoroughly." I had to say, "Wait, I have no clue for now other than the first clue I have." The testers seated around me fell silent. The Technical Architect came to the desk immediately and told everyone that this was a well-tested feature, that he had tested each line of code of this feature, and that he was very confident in it. I thanked him on behalf of the testers. "Then why the problem now?" was the question, to which I had to say, "Wait for a day, I'm looking into it. It is a problem, it has to be fixed, and there is no way to live with it for now."

I'm in no mood or state of mind to tell people what testing is and what quality is, because I can invest the same time in testing and do something useful for the product, the team, and myself. I pick selectively the stage where I have to stand up and talk, and when I have to ignore it and continue the work.

Below are the questions I asked the Technical Architect:
  1. Did we release code other than what we tested? I heard, "No."
  2. Are you sure that no other code commit was made in the RC that went to production? I heard, "The tested code."
  3. Are you sure that no other packages and utilities were changed in this release? I heard, "The changes are the same ones the test team was informed of in the labels."
  4. As I said, I see no change or difference on the server, and the suspicion is on the client now. Any thoughts on it? I heard, "The same changes on the client interface, and no changes there."
I looked into the code commit label of the production release build and everything looked the same, which indicated no code changes in the build that was pushed to production. I confirmed the same to the teams. Now, studying the client machines, I started comparing the machines that showed the problem with those that did not.
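The comparison of machines that show the problem with machines that do not can be sketched in a few lines of Python. This is only a minimal illustration, not the tooling used in this investigation: the attribute names (`os_version`, `storage_component_hash`, `browser`) and all their values are hypothetical, and it assumes a snapshot of each client machine can be collected as simple key/value pairs. The idea is to keep only the attributes whose values never overlap between the two groups.

```python
from typing import Dict, List, Tuple

# A snapshot maps an attribute name to the value observed on one machine.
# All attribute names and values below are hypothetical.
Snapshot = Dict[str, str]


def separating_attributes(
    failing: List[Snapshot], passing: List[Snapshot]
) -> Dict[str, Tuple[List[str], List[str]]]:
    """Return attributes whose values never overlap between the two groups."""
    keys = set()
    for snap in failing + passing:
        keys.update(snap)

    suspects = {}
    for key in sorted(keys):
        fail_values = {snap.get(key, "<missing>") for snap in failing}
        pass_values = {snap.get(key, "<missing>") for snap in passing}
        # An attribute is a suspect only if no value is shared by both groups.
        if fail_values and pass_values and fail_values.isdisjoint(pass_values):
            suspects[key] = (sorted(fail_values), sorted(pass_values))
    return suspects


failing_machines = [
    {"os_version": "10.2", "storage_component_hash": "b7e1", "browser": "A"},
    {"os_version": "10.2", "storage_component_hash": "b7e1", "browser": "B"},
]
passing_machines = [
    {"os_version": "10.2", "storage_component_hash": "a3c9", "browser": "A"},
    {"os_version": "10.2", "storage_component_hash": "a3c9", "browser": "B"},
]

suspects = separating_attributes(failing_machines, passing_machines)
print(suspects)
```

In this made-up data the reported OS version is identical on every machine, so only the component fingerprint separates the groups. A comparison like this narrows the list of suspects; it does not prove the cause, which still has to be confirmed with tests.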

There was no minor or major update information for the OS or its associated information on the client machines. The next thing to explore was whether the problem was reproducible on client machines with a lower OS version and with the latest, yet-to-be-released version. I went to the Technical Architect and we discussed a suspect. There was a silence, and we decided to start working on it right away.
During test design, a risk had been flagged that could come to the product from one component the product uses on the client machine. This component is part of the client machine's OS and controls most of the products installed on the client machine. At test design time I read the beta development changes and bugs of the component, and the product was also tested with the latest beta installed; but it was still a beta, and changes could come in until its code and architecture design were frozen.
One of the major changes to this component was in its architecture and design: how it saves data on the client machine. But the product could not wait until this component was frozen, and it went out, setting aside the words of the Test Design, on the claim that there would be no impact from this on our product. The Test Design's risk list clearly mentioned this, but it was called an 'acceptable risk': the cost is known to us and we have agreed to it.
The test team communicated a release plan design for the same, seeing the risk to the product, but the business could not wait for it. The result was observed within one month.
The cost had been considered bearable, but not anymore: the user would not be able to use the client interface at all, which means no service from server to user, and the user is blocked. I continued exploring; five hours had already passed, and I had leads but had not yet isolated the cause.

Monitoring closely with many more tests, I isolated the cause to the component of the client machine's OS. This had earlier been set aside as acceptable, though it was mentioned in the Test Design's risk list. But the question I got was why only some users were facing this problem and not all.
The client OS vendor was pushing the change to this component to users in batches across the globe, not in one go. The users whose component had been updated were getting blocked in the product when they performed that sequence of operations. And the question was: will all users do that? Yes and no; it depends on the mindset of the user. It was left to the business to make an informed decision now: if the business sees that the user is important, then it decides what should be done.
Learning about the change in the client OS component was not easy, nor was identifying it as the source of the problem, because it was pushed as an update to the client OS: the OS did not show any update, but it had changed. Test design happens all the time while testing; it does not happen only at the beginning. It is an ongoing activity. In this case, I took a bit of time, since I knew the architecture of the product and the client OS communication process, and knew that the component's future changes were in beta and what their impact could be. Moreover, it was a major release, and one slightly incorrect move could create trouble for users and the business.

I reconfirmed the cause: the client OS component's architecture and the changes it makes to the client interface. The product was fixed to handle it, and the fix went to production in a day.

The learning I shared with the teams and my fellow testers from this test investigation:
  • We always design tests when we test, but it also helps to do so while learning the context of changes happening outside the product.
  • There is no single good or right time to design tests. Every time is a time for designing tests. In this context, it needed a chunk of time, as it was a major change to the product in a short time. Hence I set time aside separately for it.
  • We test to learn the change and its impact as well. Learning about changes around our product is always useful.
  • Knowing the system architecture and the supporting platform's architecture is an advantage in testing.
  • Maintaining a record of changes in the system and its associated systems is useful, no matter whether the software is simple or complex.
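The last point, keeping a record of changes in the system and its associated systems, can be as small as a list that is queried when a problem shows up. Below is a minimal sketch in Python; the `ChangeRecord` fields, the `changes_before` helper, and every date and description are invented for illustration and are not taken from this investigation.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List


@dataclass(frozen=True)
class ChangeRecord:
    when: date
    system: str              # e.g. "product" or "client OS component"
    description: str
    announced_to_user: bool  # False for silent updates


def changes_before(
    records: List[ChangeRecord], incident: date, window_days: int
) -> List[ChangeRecord]:
    """Return the changes recorded in the window before an incident, oldest first."""
    start = incident - timedelta(days=window_days)
    in_window = (r for r in records if start <= r.when <= incident)
    return sorted(in_window, key=lambda r: r.when)


# Hypothetical entries; the dates and descriptions are made up.
records = [
    ChangeRecord(date(2015, 1, 5), "server", "log rotation policy changed", True),
    ChangeRecord(date(2015, 4, 10), "client OS component", "staged rollout of storage change", False),
    ChangeRecord(date(2015, 3, 20), "product", "new feature released", True),
]

recent = changes_before(records, incident=date(2015, 4, 20), window_days=45)
for record in recent:
    print(record.when, record.system, record.description)
```

Recording changes in associated systems next to the product's own changes is the point of the sketch: a silent update outside the product then shows up in the same list as the product's release.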

Saturday, April 25, 2015

Blog Framing the Isolation Tests and Test Investigation

Unusual behavior in software and in its associated systems is easy to identify at times. But the source that brought in that unusual behavior will not be straightforward to learn and isolate. When this happens at a critical time of the release, or after a critical release, imagining the uneasy condition of the stakeholders and the user base using that software product can be scary for the business. No stakeholder wants to get here. Testing helps to isolate such behavior when it is supported.

Last night I joined my friends who have built a product in their start-up and wanted the Release Candidate tested for their critical release. While I was testing along with the testers, I was asked, "There will be no problems with this release, Ravi?" I could understand my friend's concern and I said, "There will be problems; let us keep calm to receive them and fix them. I don't know what problems can come up outside the context we have tested, or within it." We learned of a problem and went on to test for it before they pushed the patch that had the fix: the content was not being zipped when data was sent from server to client.

After the test sessions, I went with the team for food. I asked myself, "Why should I not work with the testing team here and tell them what I tested?" Most times, I work individually on tasks though I'm part of a team. I keep updating the team on what testing I'm doing and about the tests. This helps me see how I could have done it better and whether I missed any test that actually mattered in the context.

I started sharing with the testers what I did while we had food. During this talk, I thought, let me write about such tests that are about isolating the cause of unusual behavior. Keeping confidential information out of the blog posts as far as the context allows, I will write up such cases from now on. From the next post, such testing stories will be labeled under the 'Investigation' label of my blog.