Debugging and Troubleshooting
- How I diagnosed complex issues, identified root causes, and worked with developers to resolve them. (including the tools and techniques).
Example Google Assistant:
-
Issue: The German Google Assistant had more bugs then the French. German is closer to English than French. It should be the other way around.
-
Diagnosis:
- Initial Data Gathering: The first challenge was to understand the issue. I started by:
- Analyzing bugs (by feature): I carefully reviewed bugs, looking for common patterns (e.g., TTS, trigger issue, device type, Google account settings, location, linked services). My triage playbook experience came in handy here. (I helped creating and writing the triage section in Google’s wiki).
- German Beta tester Feedback: The next step was in-depth conversations. I used internal resources to connect to them in an attempt to find a common nominator.
- German history is not French history. Google Assitant triggered old German fear.
- How does AI learn: The next step was learning how AI learns (I started with youtube videos till I knew enough to talk to/ask developers).
- Root Cause Identification: I identified a lack of data: there was significant less user data since Germans tended to avoid talking to the Google Assistant.
- Collaboration with German Beta tester lead:
- I presented my findings in the team meeting.
- Initial Data Gathering: The first challenge was to understand the issue. I started by:
-
Resolution:
- New devices, new incentives and new testers. Resulted in resolving 90% of backlog, and 85% less bugs overall all German features.
Example Google Assistant: “My Day” Daily Brief Unexpected Content Issue
-
Issue: The Google Assistant’s “My Day” Daily Brief feature was intermittently delivering incorrect or unexpected content (e.g., wrong news headlines, outdated calendar information, missing reminders) to users. This was a complex issue because it was not consistently reproducible and seemed to be affecting different users (Munich and Mountain View) in different ways.
-
Diagnosis:
- Initial Reproducibility and Data Gathering: The first challenge was to reproduce the issue reliably. I started by:
- Analyzing User Reports (Triage): I carefully reviewed bug reports, looking for common patterns in their setups (e.g., device type, Google account settings, location, linked services).
- Reproducing on Different Devices: I tested the “My Day” feature across a variety of Google Assistant-enabled devices (smart speakers, phones, smart displays) to see if the issue was device-specific.
- Varied Network Conditions: Testing under different network conditions (Wi-Fi, cellular, different signal strengths) was important, as network latency and data transfer issues could be a factor.
- Log Analysis: The next step was in-depth log analysis. I used:
- ADB (Android Debug Bridge): To pull logs from Android-based devices. I looked for errors, warnings, and unusual patterns around the time the “My Day” brief was generated.
- Google’s Internal Logging Tools: Accessing and analyzing Google’s internal logging systems for the Assistant was crucial. I focused on logs related to:
- API Calls: Examining the requests and responses between the Assistant and the various services providing the Daily Brief content (e.g., News APIs, Calendar APIs, Reminders APIs).
- Data Processing: Analyzing logs related to how the Assistant processed and aggregated the data from different sources.
- Comparison: Then utilized my comparison testing experience to examine differences in log outputs between successful runs and failed runs.
- Root Cause Identification: After extensive log analysis, I identified a race condition in the data aggregation process. The Assistant was attempting to deliver the Daily Brief before it had fully retrieved all the data from external services. In some cases, the Calendar API was slow to respond, leading to missing calendar entries. In other cases, the News API was returning an error, resulting in incomplete news headlines.
- Collaboration with Developers:
- Clear and Concise Bug Report: I wrote a detailed bug report, including steps to reproduce the issue, device information, relevant log snippets, and a clear explanation of the root cause and the race condition. Leveraged of course my bug report writing handbook skills.
- Direct Communication: I collaborated closely with the development team, providing additional logs and answering their questions. I suggested a solution involving adding proper synchronization mechanisms to ensure that all data was fully retrieved before the Daily Brief was generated. (Idea I found on stackoverflow.com)
- Initial Reproducibility and Data Gathering: The first challenge was to reproduce the issue reliably. I started by:
-
Resolution: I think, the developers implemented a fix that included adding synchronization primitives to ensure complete data retrieval, but I am not sure. After the fix was deployed, I verified that the issue was resolved through rigorous testing.
-
Tools and Techniques:
- ADB (Android Debug Bridge)
- Google’s Internal Logging Tools Bugganizer
- Log Parsing and Analysis (grep, regular expressions, custom scripts)
- Reproducibility Testing
- Comparison Testing
- Clear Communication and Bug Reporting
Example Oculus 3: Unexpected Power Drain During VR Gaming
-
Issue: Beta testers were reporting significantly shorter battery life than expected on the Oculus3 during prolonged VR gaming sessions. This was a critical issue, as battery life is a key factor in user satisfaction with VR headsets.
-
Diagnosis:
- Power Measurement Setup: Used your hardware and power testing experience to:
- Established a controlled environment for power measurements using external power supplies and data acquisition systems.
- Monitored voltage, current, and power consumption of the Oculus 3 under various gaming scenarios. (was my idea but instead we would simply collect the data/logs from the device)
- Reproducibility and Scenario Definition: The goal was to create reproducible scenarios. I did the following:
- Selected a few graphically intensive VR games and videos: Games that would likely push the GPU and CPU to their limits. Additionally used Youtube wide screen.
- Defined standardized test sessions: Each session involved running a specific game for a set duration (e.g., 30 minutes, 1 hour) while collecting data about the power consumption nad temperature.
- Different Levels of System Usage: Experimented with varying CPU/GPU usage using different graphical settings.
- Power Consumption Profiling:
- Detailed Monitoring: Used specialized power analysis tools to create detailed power consumption profiles of the Oculus 3 during the gaming sessions. This involved tracking the power usage of different components (CPU, GPU, display, sensors) over time in the .
- Root Cause Identification: The power profiles revealed that a specific rendering technique used in certain VR games was causing unusually high GPU power consumption. The GPU was running at near-maximum clock speed even when the scene complexity didn’t warrant it, leading to excessive power drain.
- Collaboration with Developers:
- Sharing Power Profiles: Shared the detailed power consumption profiles and observations with the Oculus software developers.
- Testing Fixes: Power Management System (PMS) in a System on a Chip (SoC) was adopted which allowed me testing different fixes provided by developers.
- Power Measurement Setup: Used your hardware and power testing experience to:
-
Resolution: The developers implemented a PMS optimization that dynamically adjusted the moment when it triggers the recharging.
- Tools and Techniques:
- Hardware Power Measurement Tools: Multimeters, Oscilloscopes, Power Analyzers (developers used them, I had only access to the logs).
- Software Profiling Tools: (If available within the Oculus SDK or platform)
- ADB (for pulling system logs and monitoring CPU/GPU usage)
- VR Gaming Expertise
- Data Analysis and Visualization (to analyze power consumption data)
- Clear Communication and Presentation of Findings
DRAFT!
Testing AR/VR/MR
- Framework/application testing
- Performance and stress testing
- Test coverage and test process/methods
- ADB
- Automation using Python
Testing Power
once my home lab is ready, I will give this a try as well.
Testing Audio
Chip-level audio testing
- I2S_SPI_I2C
- SoX & install
- SoX & Cmds
- TCL
- venv devices I tested as a QA Engineer: 1 Google Assistant (e2e) testing Feature Regression Triage Comparison (deepDives) Dogfood Actions (Date Questions) DONE DONE DONE DONE My Day (DaylyBrief) DONE DONE DONE DONE Actions (Device Questions) DONE DONE DONE DONE Music (Youtube, Spotify, Pandora) DONE DONE DONE DONE News (Sports, NPR) DONE DONE DONE DONE Remote Casting (Youtube, Netflix) DONE DONE DONE DONE Actions (Misc) DONE DONE DONE DONE Radio, Podcast (Questions) DONE DONE DONE DONE Productivity Actions (Alarm, Timer, Reminder) DONE DONE DONE DONE Answers (Nutrition. Pronounce, Spelling, Web answers) DONE DONE DONE DONE Personal Answers (Calendar. Photos, Flights, and more) DONE DONE DONE DONE
- Contributed massively to the Google Assistant triage playbook.
- Created the handbooks for
- onboarding,
- flashing Google Assistant,
- writing bug reports,
- and so on. 2 Google Assistant with wearables (verification testing) Headphones, Watches, Loudspeakers, and more 3 Alexa (verification testing) Screens, Monitors, Loudspeakers, Headphones, Watches, and more 4 Google Embedded infotainment system GM, Ford, Volvo, and more 5 Fitbit 6 Pixel 5G power performance testing Feature 12V device 8V device Triage AirplaneRockbottom DONE DONE DONE Bouncy Ball DONE DONE DONE Chrome Browsing Wifi DONE DONE DONE Chrome Browsing scroll DONE DONE DONE GFX DONE DONE DONE Keepress OOB DONE DONE DONE LocalYoutube Music DONE DONE DONE Partial Wakelock DONE DONE DONE Photos (4K60FPSH265) DONE DONE DONE Photos (AVI) DONE DONE DONE Photos (MFC) DONE DONE DONE Static Display DONE DONE DONE Youtube Wifi DONE DONE DONE Youtube stream DONE DONE DONE Pixel-thermal (99 degC) Power Stats (mWs of Cellular, Modem, CPU, etc) Core » Wake Lock System > Foreground or App Wakeups Hardware Brightness GPS (Signal Quality) Audio Battery Plugged Charging Connectivity [3GPP protocol 5G/4G] Phone signal Mobile Radio Data connection Conn Change Wifi Wifi scan Data connectivity Cellular network type Cellular Rx signal strength Wifi Rx signal strength Wifi app scan Kernel Wakeup reasons CPU Usage by App Mobile Radio Activity per app App Wakeup Alarms