Google Professional Machine Learning Engineer

Practice Test #3

Simulate the real exam experience with 50 questions and a 120-minute time limit. Practice with AI-verified answers and detailed explanations.

50 Questions · 120 Minutes · 700/1000 Passing Score


Practice Questions

Question 1

You deployed a TensorFlow recommendation model to a Vertex AI Prediction endpoint in us-central1 with autoscaling enabled. Over the last week, you observed sustained traffic of ~1,200 requests per minute (about 20 RPS) during business hours, which is 2x higher than your original estimate, and you need to keep P95 latency under 150 ms during future surges. You want the endpoint to scale efficiently to handle this higher baseline and upcoming spikes without causing user-visible latency. What should you do?

Deploying a second model to the same endpoint is mainly for A/B testing, canarying, or gradual rollouts via traffic splitting. It does not inherently guarantee more serving capacity or lower P95 latency unless you also increase total replicas. It also adds operational complexity (two model versions, monitoring two deployments) without directly addressing the need for warm baseline capacity.

Correct. Setting minReplicaCount ensures a baseline number of replicas is always running and ready to serve, preventing cold-start/warm-up delays and reducing queueing during predictable business-hour load. This is the standard approach to protect tail latency (P95) when traffic is consistently higher than expected. Autoscaling can still add replicas above the minimum for future surges.

Increasing the target utilization percentage delays scale-out, meaning each replica runs hotter before new replicas are added. That can increase queueing and push P95 latency above the 150 ms SLO during spikes. While it may reduce cost by using fewer replicas, it conflicts with the requirement to avoid user-visible latency during surges and is generally the opposite of what you want for strict latency targets.

Switching to GPU-accelerated machines can reduce inference time for compute-heavy models, but it’s not the first lever for a scaling/latency issue caused by insufficient warm capacity. GPUs increase cost and may have quota/availability constraints in us-central1. If the model already meets latency when enough CPU replicas are warm, GPUs won’t solve autoscaling reaction time or cold-start effects.

Question Analysis

Core Concept: This question tests Vertex AI Prediction online serving autoscaling behavior and how to meet latency SLOs under a higher steady-state load. Key knobs are minimum/maximum replica counts and autoscaling signals (utilization/QPS), which directly affect cold-start risk and tail latency.

Why the Answer is Correct: With a sustained new baseline (~20 RPS during business hours) that is 2x the original estimate, relying purely on reactive autoscaling can cause periods where too few replicas are available, leading to queueing and cold-start/warm-up delays that inflate P95 latency. Setting minReplicaCount to match the new baseline ensures enough replicas are always provisioned (“warm”) to absorb steady traffic and small surges immediately. Autoscaling can still add replicas for larger spikes, but the floor prevents user-visible latency while new replicas start.

Key Features / Best Practices:
- Configure minReplicaCount based on observed steady-state throughput per replica and latency headroom. Use load testing to determine safe RPS/replica at P95 < 150 ms.
- Keep maxReplicaCount high enough for anticipated surges; ensure quotas (CPU/GPU, regional) support it.
- Monitor endpoint metrics (latency percentiles, replica count, utilization, request backlog/errors) and adjust.
- This aligns with Google Cloud Architecture Framework reliability and performance principles: provision for predictable load, autoscale for variability, and design to meet SLOs.

Common Misconceptions: It’s tempting to “scale later” (option C) to save cost, but that worsens latency during spikes. Similarly, adding another model (option A) doesn’t increase capacity unless it results in more replicas, and it complicates routing/versioning. GPUs (option D) can reduce per-request latency for some models, but they don’t address cold-start and may be unnecessary/costly for a TensorFlow recommender that already meets latency when adequately provisioned.
Exam Tips: For online prediction, when you see sustained baseline traffic plus strict tail-latency targets, think “set min replicas to cover baseline” and “autoscale for spikes.” Use GPUs only when profiling shows compute-bound inference and CPU can’t meet latency at reasonable replica counts. Always consider warm capacity, scaling reaction time, and quotas in the chosen region.
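As a rough illustration of the sizing logic above, a minimal sketch; the ~20 RPS baseline comes from the scenario, but the per-replica capacity and headroom figures are made-up assumptions you would replace with load-test results:

```python
import math

def min_replicas(baseline_rps, safe_rps_per_replica, headroom=1.3):
    """Warm-replica floor: enough always-on replicas to absorb the steady
    baseline plus some headroom, so autoscaling only handles larger spikes."""
    return math.ceil(baseline_rps * headroom / safe_rps_per_replica)

# Hypothetical figures: ~20 RPS business-hour baseline (from the scenario) and
# a load-tested ~8 RPS per replica at P95 < 150 ms (assumed; measure yours).
print(min_replicas(20, 8))  # ceil(26 / 8) = 4
```

The resulting number would then be set as the deployment's minimum replica count (e.g., the min_replica_count argument when deploying with the Vertex AI Python SDK), with the maximum kept high enough for anticipated surges.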

Question 2

You are part of a data science team at a ride‑sharing platform and need to train and compare multiple TensorFlow models on Vertex AI using 850 million labeled trip records (≈2.3 TB) stored in a BigQuery table; training will run on 4–8 workers and you want to minimize data‑ingestion bottlenecks while ensuring the pipeline remains scalable and repeatable. What should you do?

Loading 2.3 TB into a pandas DataFrame is not feasible (memory limits, single-node bottleneck) and does not scale to 4–8 distributed workers. tf.data.Dataset.from_tensor_slices() is appropriate for small in-memory datasets or prototypes, not production-scale training. This approach also makes repeatability and fault tolerance difficult because the dataset must be rebuilt in memory each run.

Exporting to CSV in Cloud Storage improves decoupling from BigQuery, but CSV is inefficient for ML training at this scale. Text parsing is CPU-heavy, files are larger than binary formats, and schema/typing issues are common. While tf.data.TextLineDataset can be parallelized, overall throughput is typically worse than TFRecord, increasing the risk of input bottlenecks on multi-worker training.

Correct. Sharded TFRecords in Cloud Storage are a best-practice format for high-throughput TensorFlow training. Sharding (e.g., 1–2 GB) enables parallel reads across workers, reduces single-file contention, and supports repeatable experiments by reusing the same immutable dataset version. Using TFRecordDataset with parallel interleave and prefetch overlaps I/O and compute, minimizing data-ingestion bottlenecks and improving scalability.

Streaming directly from BigQuery during training can create ingestion bottlenecks due to BigQuery read throughput limits, concurrency/quotas, and per-request overhead, especially with multiple workers. It also couples training stability to BigQuery availability and query performance, reducing repeatability. While TensorFlow I/O can work for smaller datasets or experimentation, for TB-scale multi-worker training it is generally safer to materialize to Cloud Storage in an efficient format.

Question Analysis

Core concept: This question tests scalable input pipelines for distributed TensorFlow training on Vertex AI. The key is decoupling training from the source system (BigQuery) and using an efficient, parallelizable file format and tf.data best practices to avoid input bottlenecks at multi-worker scale.

Why the answer is correct: With 850M rows (~2.3 TB) and 4–8 workers, streaming directly from BigQuery or materializing into a single-node structure will bottleneck on network, per-request overhead, and/or BigQuery concurrency limits. Sharded TFRecords in Cloud Storage are a standard, repeatable “training-ready” dataset format: they enable high-throughput sequential reads, easy parallelization across workers, and deterministic reuse across experiments. Proper sharding (e.g., 1–2 GB) balances metadata overhead (too many small files) against parallelism (too few large files). Using tf.data.TFRecordDataset with parallel interleave, map, and prefetch allows overlapping I/O and compute, maximizing accelerator/CPU utilization.

Key features / best practices:
- Store training data in Cloud Storage in a binary, splittable format (TFRecord) with compression (e.g., GZIP) when appropriate.
- Use many shards and let each worker read different shards (via file patterns and dataset sharding options) to reduce contention.
- Use tf.data optimizations: parallel_interleave (or Dataset.interleave with num_parallel_calls), map with AUTOTUNE, prefetch(AUTOTUNE), and optionally cache only when it fits.
- Make the pipeline repeatable: a one-time (or scheduled) export/transform step from BigQuery to TFRecords can be orchestrated (e.g., Vertex AI Pipelines / Dataflow) and versioned.

Common misconceptions:
- “Directly read from BigQuery” sounds convenient, but it couples training throughput to BigQuery read performance, quotas, and transient query/streaming behavior, which is risky at scale.
- “CSV is universal,” but it is inefficient: large text parsing overhead, larger storage footprint, and slower input pipelines.
- “Load into pandas” is a common prototype pattern but fails for multi-terabyte datasets and distributed training.

Exam tips: For large-scale training on Vertex AI, prefer Cloud Storage + TFRecords (or similarly efficient formats) with tf.data performance patterns. Choose architectures that separate data preparation from training, support multi-worker parallel reads, and minimize per-record parsing overhead. When you see TB-scale data and multiple workers, avoid pandas and avoid text formats unless explicitly required.
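As a back-of-the-envelope illustration of the sharding guidance above, a minimal sketch; the 1.5 GB target and the `trips-*` filename pattern are assumptions, not from the scenario:

```python
import math

def plan_shards(total_bytes, target_shard_bytes=int(1.5e9)):
    """Split a dataset export into ~1.5 GB TFRecord shards so 4-8 workers
    can each read a disjoint subset of files in parallel."""
    n = math.ceil(total_bytes / target_shard_bytes)
    names = [f"trips-{i:05d}-of-{n:05d}.tfrecord" for i in range(n)]
    return n, names

n_shards, names = plan_shards(int(2.3e12))  # ~2.3 TB, as in the scenario
print(n_shards, names[0])  # 1534 trips-00000-of-01534.tfrecord
```

At training time each worker would then list its subset of these files and read them with tf.data.TFRecordDataset plus interleave/prefetch, as described above.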

Question 3

You are designing a TensorFlow Extended (TFX) pipeline with standard TFX components for a global media-streaming platform that analyzes user interaction logs; the pipeline includes feature engineering and data validation steps, and after promotion to production it must process up to 120 TB of historical clickstream data per day stored in BigQuery across 12 daily partitions (with an additional 2 TB ingested each day); you need the preprocessing steps to scale efficiently, automatically publish metrics and parameters to Vertex AI Experiments, and track all artifacts with Vertex ML Metadata. How should you configure the pipeline run?

Vertex AI Pipelines is the right orchestrator, but configuring distributed Vertex AI Training jobs addresses model training scale, not the heavy preprocessing/validation workload. TFX preprocessing components (Transform, StatisticsGen, ExampleValidator) are Beam-based and need a scalable Beam runner. Without explicitly running Beam on Dataflow, preprocessing may run locally or on limited resources, becoming the bottleneck at 120 TB/day.

Correct. This matches the intended architecture: orchestrate the standard TFX pipeline with Vertex AI Pipelines (managed orchestration + ML Metadata integration) and configure Apache Beam pipeline_args so Beam-based components execute on Dataflow. Dataflow provides autoscaling and distributed processing suitable for TB-scale BigQuery partitions. This best satisfies efficient scaling for preprocessing while keeping Vertex AI Experiments/MLMD integration aligned with Vertex AI Pipelines runs.

Dataproc can run Spark/Hadoop workloads and can host custom orchestration, but using a Beam TFX orchestrator on Dataproc is not the standard managed path for TFX on Google Cloud. It increases operational overhead (cluster lifecycle, scaling, upgrades) and weakens the “automatic” integration story with Vertex AI Pipelines, Experiments, and ML Metadata compared to the native Vertex AI Pipelines + Dataflow runner approach.

Running the TFX orchestrator itself on Dataflow is not the typical or recommended pattern. Dataflow is designed to execute Beam pipelines (the data processing steps), not to serve as the primary orchestrator for multi-step ML pipelines with artifact lineage and experiment tracking. This option also risks losing the managed orchestration, UI, and MLMD-first integration that Vertex AI Pipelines provides.

Question Analysis

Core concept: This question tests how to run a standard TFX pipeline on Google Cloud so that (1) large-scale preprocessing/validation can elastically scale, and (2) the run is natively integrated with Vertex AI for orchestration, Experiments tracking, and ML Metadata artifact lineage.

Why the answer is correct: Vertex AI Pipelines is the managed orchestration layer for TFX on Google Cloud and integrates with Vertex ML Metadata for artifact tracking. For preprocessing and validation at the stated scale (up to ~120 TB/day across partitions in BigQuery), the critical requirement is to execute TFX’s Beam-based components (e.g., ExampleGen with BigQuery, StatisticsGen, SchemaGen, ExampleValidator, Transform) on a scalable distributed runner. Configuring Apache Beam pipeline arguments to use the Dataflow runner is the standard, best-practice approach: Dataflow provides autoscaling, parallelism, and managed execution for Beam, which is exactly what these components use under the hood.

Key features / configurations:
- Orchestrate with Vertex AI Pipelines (managed Kubeflow Pipelines) to get MLMD integration and reproducible runs.
- Set Beam pipeline_args for Dataflow (runner=DataflowRunner, project, region, temp_location, staging_location, service_account, network/subnetwork if needed, worker settings, autoscaling). This ensures Transform/validation steps scale to TB-scale data.
- Use the Vertex AI Pipelines/TFX integration to publish run parameters and metrics; Experiments tracking is best achieved when the pipeline steps log metrics/params to Vertex AI (often via built-in integrations or custom components) while MLMD captures artifacts and lineage.

Common misconceptions: Option A sounds plausible because “distributed training” is important, but the bottleneck described is preprocessing/validation over massive BigQuery data, not model training. Distributed training does not automatically scale Beam-based data processing. Options C and D propose using Beam orchestrators directly on Dataproc/Dataflow, but that bypasses the primary requirement of using standard TFX components with Vertex AI Pipelines’ managed orchestration and tight MLMD/Experiments integration.

Exam tips:
- For TFX on Google Cloud: Vertex AI Pipelines is the orchestrator; Dataflow is the scalable runner for Beam-based TFX components.
- When you see TB-scale feature engineering/validation, think “Beam on Dataflow,” not “bigger training jobs.”
- Map requirements to layers: orchestration (Vertex AI Pipelines), data processing (Dataflow), metadata/lineage (Vertex ML Metadata), and experiment tracking (Vertex AI Experiments).
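A minimal sketch of the Beam arguments described above, with hypothetical project, region, and bucket names; in TFX such a list is typically passed to the pipeline via its beam_pipeline_args parameter (verify flag names against your Beam/TFX versions):

```python
# Hypothetical project/region/bucket values; the flags themselves are
# standard Dataflow runner options for Apache Beam pipelines.
beam_pipeline_args = [
    "--runner=DataflowRunner",
    "--project=my-project",              # assumption: your GCP project ID
    "--region=us-central1",              # assumption: your Dataflow region
    "--temp_location=gs://my-bucket/tmp",
    "--staging_location=gs://my-bucket/staging",
]
print(beam_pipeline_args[0])  # --runner=DataflowRunner
```

With these arguments set, Beam-based components (ExampleGen, StatisticsGen, ExampleValidator, Transform) execute on Dataflow with autoscaling rather than on the pipeline task's local resources.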

Question 4

You are part of an operations team managing a fleet of 250 refrigerated delivery trucks. Each truck’s refrigeration unit streams telemetry at 10-second intervals, including compressor current (A), condenser coil temperature (°C), discharge pressure (kPa), and vibration RMS (g), resulting in roughly 14 months of historical data per truck. No breakdowns or incident events have been hand-labeled yet. Management asks for a predictive maintenance solution that can detect potential refrigeration unit failures with at least a 24-hour lead time so that routes can be rescheduled. What should you do first?

Correct. Forecasting-based anomaly detection works without labeled failures by learning normal temporal behavior and alerting on large residuals. It provides immediate value, supports a 24-hour lead time by forecasting ahead, and creates a pipeline to surface candidate incidents for later confirmation/labeling. It is a standard first step in predictive maintenance when only telemetry exists.

Not best as a first step for a predictive maintenance solution. Pure heuristics can be deployed quickly, but they are brittle, require constant tuning across trucks and seasons, and often generate many false positives/negatives. They also don’t leverage temporal dynamics well and may not provide reliable 24-hour early warning beyond simple threshold breaches.

Tempting, but risky. Training on heuristic-generated labels usually causes the model to replicate the heuristic rather than learn true failure precursors. This can create a false sense of model quality because offline metrics reflect the heuristic, not real failures. It can be useful later as weak supervision, but only after establishing validation with real confirmed events.

Impractical and not necessary as a first step. Manually labeling 14 months of high-frequency telemetry for 250 trucks is expensive, slow, and ambiguous without clear definitions of “failure” and lead-time windows. A better approach is to start with unsupervised detection to triage and prioritize a much smaller set of segments for expert review.

Question Analysis

Core concept: This question tests how to start a predictive maintenance initiative when you have abundant telemetry but no labeled failure events. In this situation, the first practical ML approach is typically unsupervised or self-supervised anomaly detection/forecasting on time-series, rather than supervised classification.

Why the answer is correct: With no hand-labeled breakdowns, you cannot directly train a supervised “failure in 24 hours” classifier. A strong first step is to build a time-series forecasting baseline per signal (or multivariate) and alert on statistically significant residuals (actual minus predicted). This creates an initial detection capability and, importantly, a mechanism to generate candidate incidents for investigation and future labeling. It also aligns with the business requirement (24-hour lead time): you can forecast 24+ hours ahead and flag trajectories likely to exceed normal operating envelopes.

Key features / best practices: Use a holdout period per truck and evaluate residual distributions to set thresholds that control false positives. Consider seasonality (ambient temperature, route patterns) and per-asset normalization. In Google Cloud, this is commonly implemented with Vertex AI (custom training or AutoML tabular/time-series where applicable), plus a feature store or BigQuery for historical aggregation, and Cloud Monitoring/Alerting for operationalization. Architecturally, start with a simple, explainable baseline (Architecture Framework: reliability and operational excellence) and iterate.

Common misconceptions: Rule-based heuristics (B) can be tempting because they are fast, but they are brittle and often miss multivariate patterns and drift. Using heuristics to create labels then training a model (C) risks “learning the heuristic,” not true failures, producing overconfident models with poor real-world performance. Manual labeling at full scale (D) is usually prohibitively expensive and slow, especially without clear failure definitions.

Exam tips: When labels are missing, prefer unsupervised/self-supervised approaches first (forecasting, reconstruction error, clustering) to bootstrap a feedback loop for labeling. Look for answers that create an iterative path: baseline detection → collect confirmed events → improve to supervised prediction. Also consider operational constraints: cost, time-to-value, and maintainability.
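As a toy illustration of the residual-thresholding idea above: all numbers are synthetic, and in practice the predictions would come from a trained forecasting model with thresholds tuned per truck and season.

```python
from statistics import mean, stdev

def residual_anomalies(actual, predicted, z_threshold=3.0):
    """Flag points whose forecast residual is a large outlier relative to the
    residual distribution (a simple proxy for 'abnormal' telemetry)."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    mu, sigma = mean(residuals), stdev(residuals)
    return [i for i, r in enumerate(residuals)
            if sigma > 0 and abs(r - mu) / sigma > z_threshold]

# Synthetic compressor-current readings: the forecast is flat at 10 A and the
# actuals contain one large excursion at index 10.
predicted = [10.0] * 12
actual = [10.1, 9.9, 10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.2, 9.8, 16.0, 10.1]
print(residual_anomalies(actual, predicted))  # [10]
```

Each flagged segment becomes a candidate incident for expert review, bootstrapping the labeled dataset needed for a later supervised model.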

Question 5

A logistics platform has trained three versions of an ETA prediction model (v1, v2, v3), imported them into Vertex AI Model Registry, and deployed them to a single online prediction endpoint; you expect about 120,000 prediction requests per day and want to run a 7-day A/B/n test by initially routing 50%/25%/25% of traffic to v1/v2/v3 while tracking per-version accuracy and p95 latency with the least engineering overhead. What should you do to identify the best-performing model using the simplest approach?

Correct. Vertex AI Endpoints support deploying multiple model versions and configuring weighted traffic splitting (e.g., 50/25/25) for A/B/n tests. You can monitor serving metrics like p95 latency via Cloud Monitoring integrations and use prediction logging to attribute requests to each deployed model/version for accuracy analysis once ground truth is available. This is the simplest managed approach with minimal custom infrastructure.

Incorrect. GKE plus Traffic Director can do sophisticated traffic management, but it introduces significant engineering overhead: cluster operations, service mesh/proxy configuration, scaling, and security hardening. It also duplicates capabilities already provided by Vertex AI Endpoints for multi-model deployments and traffic splitting. For an exam scenario emphasizing simplicity and existing Vertex AI deployment, this is over-architected.

Incorrect as the “simplest approach.” Exporting logs and building Looker Studio dashboards can work for analysis, especially for accuracy, but it adds extra steps and ongoing maintenance (log sinks, schema management, dashboard upkeep). It also doesn’t address traffic splitting by itself; you’d still need a routing mechanism. Vertex AI’s built-in endpoint traffic splitting and monitoring reduce the need for custom dashboards for core serving metrics.

Incorrect. Cloud Run traffic splitting across revisions is useful for web services, but it requires packaging and operating your own model server per version, handling autoscaling behavior, and ensuring consistent model loading and performance. Since the models are already in Vertex AI Model Registry and deployed to a Vertex endpoint, moving to Cloud Run increases engineering effort and loses Vertex AI’s purpose-built model deployment and management features.

Question Analysis

Core Concept: This question tests Vertex AI online prediction deployment patterns: using a single Endpoint with multiple deployed models (or model versions) and Endpoint traffic splitting to run A/B/n experiments, while observing operational metrics (latency) and outcome metrics (accuracy) with minimal custom infrastructure.

Why the Answer is Correct: Option A uses Vertex AI Endpoint weighted traffic splitting to route 50%/25%/25% to v1/v2/v3. This is the simplest, lowest-overhead approach because traffic management is a first-class feature of Vertex AI Endpoints: you can deploy multiple models to one endpoint and set per-deployed-model traffic percentages. For p95 latency, Vertex AI integrates with Cloud Monitoring/Cloud Logging to provide request/response and serving metrics without building a custom serving stack. For per-version accuracy, you can attribute predictions to the deployed model ID/version via prediction logging and compare against ground truth once labels arrive (common in ETA problems), minimizing engineering by leveraging managed logging/monitoring rather than bespoke routing layers.

Key Features / Best Practices:
- Vertex AI Endpoint trafficSplit: configure weights per deployed model for A/B/n testing and adjust gradually (supports canary-style rollouts).
- Prediction logging: enable request/response logging (with appropriate privacy controls) to join predictions with later-arriving actual ETAs for accuracy by model version.
- Cloud Monitoring metrics: monitor latency distributions (including p95), error rates, and throughput per endpoint and (via labels/deployed_model_id) per deployment.
- Architecture Framework alignment: operational excellence (managed serving, fewer moving parts), reliability (Google-managed autoscaling), and cost optimization (no extra clusters/services).

Common Misconceptions: It’s tempting to assume you must use GKE/Traffic Director or Cloud Run traffic splitting for A/B tests. Those are valid general-purpose patterns, but they add unnecessary components when Vertex AI already provides model-level traffic splitting and managed observability. Another misconception is that “accuracy monitoring” must be fully automated in Vertex AI; in practice, accuracy often requires joining predictions with ground truth later, but Vertex AI logging makes that straightforward without building a custom router.

Exam Tips: When the question says “least engineering overhead” and the models are already in Vertex AI Model Registry and deployed to a single endpoint, prefer native Vertex AI Endpoint capabilities (multiple deployments + traffic split + managed monitoring/logging). Reach for GKE/Cloud Run/Traffic Director only when you need custom serving logic, non-Vertex runtimes, or advanced mesh features beyond Vertex AI’s managed serving.
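A minimal sketch of the weighted-split configuration above; the deployed-model IDs are hypothetical, and the sum-to-100 check mirrors how Vertex endpoint traffic splits are expressed as percentages:

```python
def make_traffic_split(weights):
    """Validate an A/B/n split: non-negative integer percentages summing
    to 100, keyed by deployed-model ID (IDs here are hypothetical)."""
    assert all(isinstance(w, int) and w >= 0 for w in weights.values())
    assert sum(weights.values()) == 100, "traffic percentages must sum to 100"
    return dict(weights)

split = make_traffic_split({"dm-v1": 50, "dm-v2": 25, "dm-v3": 25})
print(split)  # {'dm-v1': 50, 'dm-v2': 25, 'dm-v3': 25}
```

A map like this is what you would supply when updating the endpoint's traffic split via the Vertex AI SDK or gcloud CLI (check the current parameter/flag names against the reference documentation).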


Question 6

Your supply chain analytics team plans to run 180 training jobs per day for 10 days (3 feature sets × 4 model architectures × 15 hyperparameter grids) using containerized trainers; they must log per-run metrics (AUC, F1, and loss) with timestamps and be able to query trends over time (for example, 7-day rolling averages and the top 10 configurations by mean F1 in the last 30 days) via an API while minimizing manual effort. Which approach should they use to track and report these experiments?

Vertex AI Pipelines is strong for orchestration and repeatability, but it is not an analytics datastore. While pipeline runs can record metrics, querying complex trends (7-day rolling averages, top-N by mean over 30 days across many dimensions) via the Pipelines API is awkward and not what it’s optimized for. You would typically export pipeline metadata to BigQuery or another store for analytics rather than rely on the Pipelines API for reporting.

Correct. Vertex AI Training runs the containerized trainers at scale, and BigQuery is ideal for storing per-run metrics with timestamps and rich metadata. BigQuery SQL supports window functions for rolling averages, time filtering (last 30 days), and ranking (top 10 configurations). Partitioning by time and clustering by configuration fields minimizes cost and improves performance. The BigQuery API enables programmatic access with minimal manual effort.

Cloud Monitoring custom metrics are designed for operational telemetry, dashboards, and alerting, not deep experiment analytics. High-cardinality labels (feature set, architecture, hyperparameter grid, run_id) can hit quotas and become expensive or unwieldy. Querying “top 10 configurations by mean F1 in last 30 days” and computing rolling averages across many dimensions is more natural in BigQuery than in Monitoring’s time-series model.

Workbench notebooks plus Google Sheets is highly manual and does not scale to 180 runs/day with reliable, consistent logging. Sheets lacks strong schema enforcement, lineage tracking, and efficient analytical querying for rolling windows and top-N across large datasets. It also introduces collaboration and data integrity risks. This option violates the requirement to minimize manual effort and is not aligned with production-grade experiment tracking.

Question Analysis

Core Concept: This question tests experiment tracking and time-series analytics for ML training runs at scale. The key needs are: (1) automated capture of per-run metrics with timestamps, (2) flexible querying for trends (rolling averages, top-N over time windows), and (3) API access with minimal manual effort.

Why the Answer is Correct: Option B best fits because BigQuery is purpose-built for analytical queries over large, append-only datasets and supports SQL window functions for rolling averages, time-bounded aggregations (e.g., last 30 days), and ranking (top 10 configurations by mean F1). Writing one row per run (run_id, feature_set, architecture, hyperparameter_grid_id, timestamp, auc, f1, loss, plus any metadata) enables straightforward trend analysis and serving results via the BigQuery API (or a thin service layer). This aligns with the Google Cloud Architecture Framework: operational excellence (repeatable ingestion), performance efficiency (columnar analytics), and reliability (durable storage).

Key Features / Best Practices: Use Vertex AI Training for containerized custom training jobs and emit metrics at the end of each run (or periodically) to BigQuery via a lightweight client or Cloud Logging sink + Dataflow/BigQuery. Partition the table by date/timestamp and cluster by model/feature identifiers to control cost and improve query performance. Enforce a consistent schema and include run lineage fields (dataset version, code version, container image digest) for reproducibility.

Common Misconceptions: Cloud Monitoring (option C) is excellent for operational monitoring and alerting but is not ideal for complex analytical queries like rolling averages across many dimensions and top-N by mean over long windows; it also has cardinality and quota considerations for custom metrics. Vertex AI Pipelines (option A) orchestrates workflows, but the Pipelines API is not intended as an analytics store for arbitrary metric trend queries. Sheets (option D) is manual and not scalable.

Exam Tips: When you see “rolling averages,” “top-N over last X days,” and “query via API,” think “analytics database” (BigQuery) rather than monitoring systems. Use Vertex AI for execution, but store experiment results in a query-optimized system. Also watch for metric cardinality and cost: BigQuery partitioning/clustering is a common exam best practice for time-series experiment tables.
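To make the query patterns above concrete, a sketch with assumed table and column names (`my-project.experiments.runs`, `config_id`, `f1`, `run_ts` are illustrative, not from the scenario), plus a tiny local mirror of the top-N logic:

```python
# Illustrative SQL: top 10 configurations by mean F1 over the last 30 days.
TOP_CONFIGS_SQL = """
SELECT config_id, AVG(f1) AS mean_f1
FROM `my-project.experiments.runs`
WHERE run_ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY config_id
ORDER BY mean_f1 DESC
LIMIT 10
"""

# Illustrative SQL: 7-day rolling F1 average per configuration via a window frame.
ROLLING_F1_SQL = """
SELECT config_id, DATE(run_ts) AS run_date,
       AVG(f1) OVER (
         PARTITION BY config_id
         ORDER BY UNIX_DATE(DATE(run_ts))
         RANGE BETWEEN 6 PRECEDING AND CURRENT ROW
       ) AS f1_7d_avg
FROM `my-project.experiments.runs`
"""

def top_configs(rows, n=10):
    """Local mirror of TOP_CONFIGS_SQL: rows are (config_id, f1) pairs."""
    totals = {}
    for cfg, f1 in rows:
        s, c = totals.get(cfg, (0.0, 0))
        totals[cfg] = (s + f1, c + 1)
    means = [(cfg, s / c) for cfg, (s, c) in totals.items()]
    return sorted(means, key=lambda t: -t[1])[:n]

print(top_configs([("a", 0.9), ("a", 0.88), ("b", 0.7)], n=1))  # "a" ranks first
```

In production the SQL would be issued through the BigQuery client API; the Python mirror just shows that the ranking logic is a plain group-by-then-sort.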

Question 7

You are the lead ML engineer at a smart-meter analytics company; you trained a TensorFlow model that flags consumption anomalies, and each day at 01:00 UTC your ETL job writes the previous day’s readings (~1.5 million records, ~60 GB) as newline-delimited JSON to Cloud Storage under prefixes like gs://meter-prod/daily/2025-08-31/*.jsonl; you need to run inference over the entire daily batch with minimal manual intervention and do not require low-latency, per-request responses; what should you do?

Correct. Vertex AI Batch Prediction is purpose-built for offline scoring at scale. It can read JSONL from a Cloud Storage URI/prefix (matching the daily folder pattern) and write sharded prediction outputs back to Cloud Storage. It’s managed, scales horizontally, supports large datasets, and integrates cleanly with schedulers/orchestrators for daily runs, minimizing manual intervention and operational overhead.

Running scheduled inference on Compute Engine VMs is feasible but requires you to manage instance sizing, autoscaling, retries, logging, patching, and failure handling. You also need to build the data download/streaming logic and ensure throughput for 60 GB daily. This increases operational burden and is less aligned with managed best practices compared to Vertex AI batch prediction.

Triggering a Cloud Function per file write creates an event-driven fan-out that can be hard to control for large daily batches. Cloud Functions have execution time and resource constraints and are not ideal for heavy, long-running inference workloads. You’d also need aggregation/coordination across many files and handle retries/idempotency carefully, increasing complexity versus a single managed batch job.

Online prediction endpoints are optimized for low-latency, per-request serving. Invoking the endpoint per record for 1.5 million records adds significant request overhead, can run into quota/QPS limits, and is typically more expensive than batch prediction for large offline workloads. It also introduces unnecessary complexity (client batching, retries, throttling) when latency is not required.

Question Analysis

Core concept: This question tests selecting the right serving pattern for non-interactive, high-throughput inference. In Google Cloud, Vertex AI Batch Prediction is designed for offline/batch scoring from Cloud Storage (or BigQuery) and for writing outputs back to Cloud Storage/BigQuery without building and operating custom infrastructure.

Why the answer is correct: You have a daily, large batch (~60 GB, ~1.5M records) landing in Cloud Storage under a predictable prefix, with no low-latency requirement. Vertex AI batch prediction can read newline-delimited JSON from a GCS URI/prefix, scale compute for distributed inference, and write predictions to a destination GCS path. This minimizes manual intervention: you schedule a recurring batch prediction job (commonly via Cloud Scheduler + Cloud Functions/Cloud Run, or via Vertex AI Pipelines/Workflows) that points at the day's prefix and runs automatically. It aligns with the Google Cloud Architecture Framework by improving operational excellence (managed service), reliability (retries, managed execution), and cost optimization (ephemeral compute rather than always-on endpoints).

Key features / configurations:
- Input: GCS source with JSONL; specify the instances format and schema as needed.
- Output: GCS destination prefix; Vertex AI writes sharded prediction files.
- Scaling: managed worker pool; choose machine types/accelerators if the model benefits.
- Orchestration: schedule the job daily at 01:00 UTC; parameterize the date-based prefix.
- Monitoring: job status, logs, and metrics via Vertex AI and Cloud Logging.

Common misconceptions: It is tempting to deploy an online endpoint (D) because it is "serving," but per-record calls add overhead, can hit QPS quotas, and cost more for large batches. Event-driven per-file Cloud Functions (C) seem automated, but functions are not suited to long-running, compute-heavy inference and can create uncontrolled parallelism and operational complexity. Custom VMs (B) can work but defeat the managed-service intent and increase maintenance burden.

Exam tips: When you see "daily batch," "GCS prefix," "no low latency," and "minimal ops," default to Vertex AI Batch Prediction. Use online prediction only for interactive, low-latency use cases. For automation, pair batch prediction with Cloud Scheduler/Workflows/Vertex AI Pipelines rather than building bespoke VM fleets.

Question 8

You are fine-tuning a Vision Transformer classifier on 1.2 million 224x224 product images using Keras; on a single NVIDIA T4 GPU with a global batch size of 64, each epoch takes about 90 minutes, and you have already enabled tf.data prefetch(AUTOTUNE), caching, and mixed precision. You switch to a VM with 4 T4 GPUs and wrap model creation/training with tf.distribute.MirroredStrategy, making no other changes and keeping the global batch size at 64; however, the epoch time remains ~90 minutes and per-GPU utilization hovers at 30–40%. Disk throughput and input pipeline profiling show no bottlenecks. What should you do to reduce wall-clock training time?

Using experimental_distribute_dataset (or the modern strategy.distribute_datasets_from_function) is mainly relevant when you need the strategy to shard/distribute input across replicas. With MirroredStrategy and model.fit, Keras typically handles distribution automatically. Since profiling shows no input bottleneck and GPUs are underutilized due to small per-replica work, changing dataset distribution won’t materially reduce epoch time.

A custom training loop (GradientTape) can provide flexibility, but it does not inherently increase GPU utilization or reduce all-reduce overhead. If the root issue is that each replica only processes batch=16, the GPUs will still be underfed. Custom loops can also introduce extra Python overhead unless carefully graph-compiled, so this is not the best first action for performance.

Moving to TPU with TPUStrategy can accelerate training, but it’s not the targeted fix for the observed symptom. TPUs also require sufficiently large per-core batch sizes to be efficient, and migration adds operational complexity (TPU-compatible ops, input pipeline changes, XLA behavior). The question asks what to do now on 4 GPUs where utilization is low; batch scaling is the direct remedy.

Increasing the global batch size increases per-replica batch size under MirroredStrategy, raising compute per step and improving GPU occupancy. This amortizes fixed overheads (kernel launches, framework overhead, gradient all-reduce) and typically yields near-linear speedup when the model is compute-bound. After scaling batch size, also scale/tune learning rate (often linear scaling + warmup) to preserve convergence.

Question Analysis

Core Concept: This question tests distributed training performance with tf.distribute.MirroredStrategy (data-parallel synchronous training) and how global batch size affects GPU utilization and step time. In synchronous multi-GPU training, each step splits the global batch across replicas, runs forward/backward on each GPU, then performs an all-reduce to aggregate gradients.

Why the Answer is Correct: With 4 GPUs and a fixed global batch size of 64, the per-replica batch becomes 16. For a ViT at 224x224, that per-step workload can be too small to saturate each T4, especially with mixed precision (faster math) and relatively high per-step overhead (kernel launches, framework overhead, and all-reduce). The result is low utilization (30–40%) and little to no speedup, matching the symptom that epoch time stays ~90 minutes. Increasing the global batch size increases the per-replica batch (e.g., global 256 -> per-replica 64), improving arithmetic intensity and amortizing overhead and communication, which typically reduces wall-clock time per epoch.

Key Features / Best Practices:
- MirroredStrategy expects you to scale the global batch size with the number of replicas to keep the per-replica batch roughly constant. A common rule is: new_global_batch = old_global_batch * num_replicas.
- After increasing batch size, adjust the learning rate (often the linear scaling rule) and consider warmup to maintain convergence.
- Ensure you are not inadvertently limiting parallelism with small steps or too-frequent host/device sync points (e.g., overly chatty callbacks), but the primary lever here is batch sizing.
- This aligns with Google Cloud Architecture Framework performance principles: maximize accelerator utilization and reduce per-step overhead.

Common Misconceptions:
- It is tempting to blame the input pipeline and reach for dataset distribution APIs, but profiling already ruled out I/O bottlenecks.
- A custom training loop rarely fixes underutilization caused by too-small per-replica batches; it can even reduce performance if not carefully optimized.
- Switching to TPU does not address the root cause; TPUs also need sufficiently large per-core batch sizes and face similar overhead/communication patterns.

Exam Tips: When multi-GPU speedup is poor and input is not the bottleneck, check (1) per-replica batch size, (2) all-reduce/communication overhead, and (3) per-step overhead. For MirroredStrategy, scaling the global batch size (and tuning the learning rate accordingly) is the most common and expected fix.
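A minimal sketch of the batch-size and learning-rate scaling discussed above. The arithmetic helper is plain Python; `train_distributed` shows where it plugs into a Keras `model.fit` workflow under MirroredStrategy. The dataset and model builder arguments are hypothetical placeholders, and the linear LR scaling rule is an assumed (common, not universal) convergence fix.

```python
def scaled_hyperparams(base_global_batch: int, base_lr: float, num_replicas: int):
    """Scale a single-GPU recipe to N replicas: grow the global batch so the
    per-replica batch stays constant, and apply the linear LR scaling rule."""
    global_batch = base_global_batch * num_replicas   # 64 -> 256 on 4 GPUs
    per_replica = global_batch // num_replicas        # stays 64 per GPU
    learning_rate = base_lr * num_replicas            # linear scaling rule
    return global_batch, per_replica, learning_rate


def train_distributed(make_dataset, make_model, epochs: int = 10):
    # Imported here so scaled_hyperparams stays usable without TensorFlow.
    import tensorflow as tf

    strategy = tf.distribute.MirroredStrategy()
    global_batch, _, lr = scaled_hyperparams(
        base_global_batch=64, base_lr=1e-3,
        num_replicas=strategy.num_replicas_in_sync,
    )
    ds = make_dataset(global_batch).prefetch(tf.data.AUTOTUNE)
    with strategy.scope():  # variables and optimizer created under the strategy
        model = make_model()
        model.compile(
            optimizer=tf.keras.optimizers.Adam(lr),
            loss="sparse_categorical_crossentropy",
        )
    # Keras splits each global batch across the replicas automatically.
    return model.fit(ds, epochs=epochs)


gb, per, lr = scaled_hyperparams(64, 1e-3, 4)
print(gb, per, f"{lr:.4f}")  # 256 64 0.0040
```

With 4 replicas the per-replica batch returns to 64, the value that already kept a single T4 busy, which is exactly the occupancy fix the analysis calls for; in practice you would also add LR warmup when the scaled rate is large.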

Question 9

You work for an online marketplace that must automatically flag product photos containing restricted brand logos; each image belongs to exactly one class (logo present vs. not present). You trained a convolutional neural network, deployed a model version to Vertex AI Prediction, and attached a model evaluation job to that version. At a softmax decision threshold of 0.50, the evaluation reports precision = 0.71, but the business requires precision >= 0.90. To increase precision by changing only the final layer softmax threshold, what should happen as a consequence of your adjustment?

Increasing the threshold to improve precision generally does not increase recall. Recall depends on capturing as many true positives as possible. A higher threshold makes the classifier stricter, so some borderline true positives will be classified as negative, reducing TP and increasing FN. That typically lowers recall rather than raising it, assuming the underlying model is unchanged.

Correct. To raise precision from 0.71 toward >= 0.90 by only adjusting the softmax threshold, you typically increase the threshold so the model predicts “logo present” less often. This reduces false positives (improving precision) but increases false negatives, which decreases recall (TP/(TP+FN)). This is the classic precision–recall tradeoff for a fixed classifier.

Raising the threshold to increase precision usually decreases the number of false positives, not increases it. False positives occur when negative images are predicted as positive. A stricter threshold means fewer positives are predicted overall, so fewer negative examples will cross the threshold and be incorrectly flagged. Increasing false positives would generally reduce precision, the opposite of the goal.

Adjusting the threshold upward to increase precision typically increases false negatives, not decreases them. False negatives are positives predicted as negative; when the threshold is higher, some true logo images with moderate confidence will fall below the cutoff and be missed. Decreasing false negatives would usually require lowering the threshold, which tends to increase recall but can hurt precision.

Question Analysis

Core concept: This question tests classification thresholding and the precision–recall tradeoff for a binary classifier (logo present vs. not present). In Vertex AI Model Evaluation, metrics like precision and recall are computed from the confusion matrix at a chosen decision threshold (e.g., softmax probability >= 0.50 predicts "logo present").

Why the answer is correct: Precision = TP / (TP + FP). The business requires precision >= 0.90, higher than the current 0.71. If you are allowed to change only the final-layer softmax threshold (not retrain the model), the primary lever is to make the classifier more conservative about predicting the positive class ("logo present"). That means increasing the threshold above 0.50 so fewer images are labeled positive. As the threshold increases, false positives typically decrease faster than true positives, which raises precision. However, because fewer examples are predicted positive, some true positives will now fall below the threshold and be predicted negative, increasing false negatives and therefore decreasing recall (recall = TP / (TP + FN)). Thus, the expected consequence is decreased recall.

Key features / best practices: Vertex AI evaluation supports threshold-based metrics and curves (precision-recall curve, ROC curve) to select an operating point aligned with business goals. For high-precision requirements (e.g., compliance or restricted content), it is common to accept lower recall and route uncertain cases to human review. This aligns with the Google Cloud Architecture Framework's emphasis on designing for business requirements and risk management.

Common misconceptions: People often assume "improving precision" also improves recall, but with a fixed model, raising the threshold usually trades recall for precision. Another confusion is thinking threshold changes alter the model's learned parameters; they do not: only the decision rule changes.

Exam tips: Memorize how threshold shifts affect FP/FN: raising the threshold reduces predicted positives → FP down, FN up → precision up, recall down. Use PR curves to justify threshold selection, especially when the positive class is rare or the cost of false positives is high.
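The threshold mechanics above can be checked on a toy score set (the labels and scores below are invented for illustration): raising the cutoff removes false positives faster than true positives, so precision rises while recall falls.

```python
def precision_recall(examples, threshold):
    """Compute precision and recall when 'positive' means score >= threshold.
    `examples` is a list of (label, score) pairs, label 1 = logo present."""
    tp = sum(1 for y, s in examples if y == 1 and s >= threshold)
    fp = sum(1 for y, s in examples if y == 0 and s >= threshold)
    fn = sum(1 for y, s in examples if y == 1 and s < threshold)
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical model scores: five true logos (label 1), five non-logos (label 0).
examples = [
    (1, 0.95), (1, 0.85), (1, 0.65), (1, 0.55), (1, 0.45),
    (0, 0.75), (0, 0.60), (0, 0.40), (0, 0.30), (0, 0.20),
]

for t in (0.5, 0.8):
    p, r = precision_recall(examples, t)
    print(f"threshold={t}: precision={p:.2f} recall={r:.2f}")
# threshold=0.5: precision=0.67 recall=0.80
# threshold=0.8: precision=1.00 recall=0.40
```

Moving the cutoff from 0.5 to 0.8 drops both borderline negatives (the two false positives) and one moderate-confidence true logo, reproducing the exam's expected outcome: precision up, recall down, model weights untouched.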

Question 10

You work for a nationwide e-commerce marketplace. After receiving approval to collect the necessary customer behavior data, you trained a Vertex AI AutoML Tabular model to predict the probability that an order will be returned within 30 days. You deployed the model to online prediction, and it serves about 200,000 predictions per day. Seasonal promotions and marketing campaigns may change how features such as discount_rate, shipping_speed, and product_category interact, which could degrade accuracy over time. You want to be alerted if feature interactions change and to understand which features drive the predictions, while keeping monitoring costs low. What should you do?

Feature drift monitoring with sampling rate 1 (100%) will detect shifts in input feature distributions, but it won’t directly explain which features are driving predictions or capture changes in model reliance/interaction effects. It is also the most expensive option because it logs and analyzes all online predictions (~200k/day), increasing storage and monitoring costs. Weekly cadence helps, but full sampling still conflicts with “keep monitoring costs low.”

This option improves cost by sampling only 10% of predictions and running weekly, which is a good cost-control pattern. However, it still focuses on feature drift (input distribution changes) rather than how the model uses features. If feature interactions change without large marginal distribution shifts, feature drift may not alert appropriately, and it does not satisfy the requirement to understand which features drive predictions.

Feature attribution drift monitoring aligns with the need to understand which features drive predictions and to detect changes in model reliance over time. However, sampling rate 1 is unnecessarily costly for a high-volume online endpoint. Logging and computing attributions for every prediction increases monitoring and storage costs significantly. Weekly frequency reduces job runs, but full sampling still violates the “keep monitoring costs low” requirement.

Feature attribution drift monitoring directly addresses changing feature interactions and the need to understand drivers of predictions by tracking shifts in feature attributions over time. Using a 0.1 sampling rate reduces logging and monitoring costs substantially while still providing a large sample size given 200,000 predictions/day. Weekly monitoring is a reasonable cadence for campaign/seasonality-driven changes and further controls cost.

Question Analysis

Core Concept: This question tests Vertex AI Model Monitoring for online prediction, specifically the difference between feature drift and feature attribution drift, and how to control monitoring cost via sampling rate and monitoring frequency.

Why the Answer is Correct: The business concern is that "feature interactions" change due to seasonality and campaigns, degrading accuracy. Pure feature drift detects changes in the distribution of input features (e.g., discount_rate values shifting), but it does not directly tell you whether the model's reliance on features (including interaction effects learned by the model) has changed. Feature attribution drift monitoring tracks changes in feature attributions (e.g., how much discount_rate vs. shipping_speed contributes to predictions over time). This better matches the requirement to "understand which features drive the predictions" and to be alerted when the relationship between features and predictions changes. To keep monitoring costs low while serving ~200,000 predictions/day, you should not log and monitor 100% of requests. A sampling rate of 0.1 (10%) reduces the volume of logged predictions and monitoring computations by ~10x while still providing enough signal for weekly trend detection in a high-traffic system.

Key Features / Configurations:
- Vertex AI Model Monitoring supports drift monitoring on input features and on feature attributions.
- Attribution drift is especially useful when the model's decision logic changes due to changing interactions, even if marginal feature distributions don't shift dramatically.
- The sampling rate controls what fraction of prediction requests are logged and used for monitoring; lower sampling reduces BigQuery/logging/storage and monitoring job costs.
- A weekly monitoring frequency is reasonable for seasonal/campaign-driven shifts and further reduces cost compared to daily.

Common Misconceptions:
- Choosing feature drift because "features change" is tempting, but it misses the explicit need to understand drivers of predictions and interaction/importance changes.
- Setting sampling to 1 seems "more accurate," but it is unnecessarily expensive at this scale and not required for weekly alerting.

Exam Tips: When the prompt mentions "which features drive predictions," "model reliance," or "interactions changing," think feature attribution monitoring (and attribution drift). When cost is a constraint for high-QPS/volume endpoints, prefer lower sampling rates and an appropriate monitoring cadence rather than monitoring every request.
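A configuration sketch of the chosen setup, assuming the google-cloud-aiplatform Python SDK; the project, endpoint, thresholds, and alert email are placeholders, and the exact helper names should be verified against your installed SDK version. The small arithmetic helper shows why a 0.1 rate keeps weekly analysis cheap yet well powered.

```python
def sampled_predictions_per_week(daily_requests: int, sample_rate: float) -> int:
    """How many prediction records a given sampling rate logs per weekly run."""
    return int(daily_requests * sample_rate * 7)


def create_attribution_drift_monitor(endpoint_resource_name: str):
    # Imported here so the helper above stays usable without the SDK installed.
    from google.cloud import aiplatform
    from google.cloud.aiplatform import model_monitoring

    aiplatform.init(project="my-project", location="us-central1")
    endpoint = aiplatform.Endpoint(endpoint_resource_name)

    return aiplatform.ModelDeploymentMonitoringJob.create(
        display_name="returns-model-attribution-drift",
        endpoint=endpoint,
        # Log 10% of ~200k daily requests (~20k/day) to control cost.
        logging_sampling_strategy=model_monitoring.RandomSampleConfig(
            sample_rate=0.1
        ),
        # Run the monitoring analysis weekly (interval is expressed in hours).
        schedule_config=model_monitoring.ScheduleConfig(monitor_interval=168),
        objective_configs=model_monitoring.ObjectiveConfig(
            # Alert on shifts in feature *attributions*, not just raw inputs.
            drift_detection_config=model_monitoring.DriftDetectionConfig(
                attribute_drift_thresholds={
                    "discount_rate": 0.3,
                    "shipping_speed": 0.3,
                    "product_category": 0.3,
                }
            ),
            explanation_config=model_monitoring.ExplanationConfig(),
        ),
        alert_config=model_monitoring.EmailAlertConfig(
            user_emails=["ml-team@example.com"]
        ),
    )


print(sampled_predictions_per_week(200_000, 0.1))  # 140000
```

Even at 10% sampling, a weekly analysis window still sees on the order of 140,000 records, far more than needed to detect attribution shifts, which is why full sampling buys little beyond extra cost here.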

