# [gitlab-org/gitlab] UpdateIndexUsedStorageBytesEventWorker takes 800+ seconds to run
### ROOT CAUSE
The `UpdateIndexUsedStorageBytesEventWorker` is experiencing performance degradation due to inefficient event processing, likely caused by:
1. **Batch Size Mismatch**: The worker processes events in large batches (e.g., 1,000 events at a time), a batch size too large for the system to handle efficiently, leading to memory spikes and long execution times.
2. **Missing Database Indexes**: The queries used to fetch and update storage usage metrics lack proper indexing, causing slow database reads/writes.
3. **Blocking External API Calls**: The worker may include synchronous calls to external services (e.g., Elasticsearch) that are not optimized for high-frequency polling.
4. **Resource Contention**: The worker runs during peak load times, exacerbating performance issues due to shared system resources.
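These are hypotheses, so the first step is to measure which one dominates. A plain-Ruby timing sketch (no Sidekiq required; the `sleep` calls are stand-ins for the worker's real phases):

```ruby
# Wrap each suspected phase of the worker and report its share of total
# runtime, to see whether the DB read or the index update dominates.
require 'benchmark'

def profile_phases(phases)
  timings = phases.to_h { |name, work| [name, Benchmark.realtime { work.call }] }
  total = timings.values.sum
  timings.transform_values { |t| (t / total * 100).round(1) }
end

shares = profile_phases(
  fetch:  -> { sleep 0.02 }, # stand-in for the DB read
  update: -> { sleep 0.01 }  # stand-in for the index/storage update
)
# shares maps each phase name to its percentage of total runtime
```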
---
### CODE FIX
#### 1. **Optimize Event Batching**
Sidekiq's `sidekiq_options` has no `batch_size` key; instead, cap the batch size inside the worker itself so each iteration touches a bounded number of rows (the model and scope names below are illustrative):
```ruby
# app/workers/update_index_used_storage_bytes_event_worker.rb
class UpdateIndexUsedStorageBytesEventWorker
  include ApplicationWorker

  def perform
    # `each_batch` is GitLab's keyset-batching helper; model/scope are illustrative
    StorageUsageEvent.pending.each_batch(of: 100) { |batch| process_batch(batch) }
  end
end
```
#### 2. **Add Database Indexes**
Create the missing indexes to speed up the queries, using `CONCURRENTLY` so index creation does not block writes:
```sql
-- For the `storage_usage` table (table name is illustrative)
CREATE INDEX CONCURRENTLY idx_storage_usage_project_id ON storage_usage (project_id);
CREATE INDEX CONCURRENTLY idx_storage_usage_updated_at ON storage_usage (updated_at);
```
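On GitLab itself, raw `CREATE INDEX` statements normally go through a migration using the `add_concurrent_index` helper rather than being run by hand; a sketch, with illustrative class, table, and index names:

```ruby
# Hypothetical post-deployment migration; `add_concurrent_index` is
# GitLab's helper for creating an index without blocking writes.
class AddStorageUsageIndexes < Gitlab::Database::Migration[2.2]
  disable_ddl_transaction!

  def up
    add_concurrent_index :storage_usage, :project_id, name: 'idx_storage_usage_project_id'
  end

  def down
    remove_concurrent_index_by_name :storage_usage, 'idx_storage_usage_project_id'
  end
end
```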
#### 3. **Parallelize Elasticsearch Updates**
Use Sidekiq's `retry` option (with a custom delay via `sidekiq_retry_in`) to handle failures gracefully, and split the worker into smaller, parallelizable jobs:
```ruby
# A separate worker for Elasticsearch-specific updates
# (class and model names are illustrative)
class UpdateElasticsearchIndexWorker
  include Sidekiq::Worker

  sidekiq_options retry: 3
  sidekiq_retry_in { |count| 5 * (count + 1) } # 5s, 10s, 15s between retries

  def perform(project_id)
    # Re-index only the storage rows that changed in the last hour
    StorageUsage
      .where(project_id: project_id)
      .where('updated_at > ?', 1.hour.ago)
      .in_batches(of: 10) do |batch|
        Elasticsearch::UpdateWorker.perform_async(project_id, batch.ids)
      end
  end
end
```
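The fan-out itself is simple to reason about in isolation; a pure-Ruby sketch where the enqueue call is stubbed as a plain hash (in the worker it would be `perform_async`):

```ruby
# 25 record ids split into chunks of 10 yield 3 enqueued jobs.
ids = (1..25).to_a
jobs = ids.each_slice(10).map do |chunk|
  { worker: 'Elasticsearch::UpdateWorker', args: chunk } # stand-in for perform_async
end
```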
#### 4. **Monitor and Alert**
Add monitoring to detect long-running jobs. Sidekiq has no `after_process_job` event; a server middleware can time every job instead (`SlackNotifier` is a placeholder for your alerting client):
```ruby
# config/initializers/sidekiq_monitoring.rb
class JobDurationMiddleware
  include Sidekiq::ServerMiddleware # Sidekiq >= 6.5

  SLOW_THRESHOLD = 300 # seconds

  def call(worker, job, queue)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
  ensure
    duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    if queue == 'critical' && duration > SLOW_THRESHOLD
      SlackNotifier.notify("Long job: #{job['class']} took #{duration.round}s")
    end
  end
end

Sidekiq.configure_server do |config|
  config.server_middleware { |chain| chain.add JobDurationMiddleware }
end
```
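The duration check itself can be exercised without Sidekiq; a plain-Ruby reduction of the same logic (threshold and message format are illustrative):

```ruby
# Time a block of work and warn when it exceeds the threshold; returns
# the elapsed seconds so callers can record it.
def call_with_timing(job, threshold: 300)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  warn "Long job: #{job['class']} took #{elapsed.round}s" if elapsed > threshold
  elapsed
end
```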
#### 5. **Review External Dependencies**
Ensure external API calls (e.g., Elasticsearch) are rate-limited and optimized:
```ruby
# Back off Elasticsearch retries: skip any job that already failed
# within the last hour (FailureRecord is an illustrative model)
def handle_es_failure(failure)
  return if failure.last_failed_at && failure.last_failed_at > 1.hour.ago

  failure.update!(last_failed_at: Time.current)
  Elasticsearch::RetryWorker.perform_async(failure.job_id)
end
```
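A capped exponential delay schedule is a common way to implement such backoff; a minimal plain-Ruby sketch (base and cap are illustrative):

```ruby
# Each retry waits twice as long as the previous one, up to one hour.
def backoff_seconds(retry_count, base: 5, cap: 3600)
  [base * (2**retry_count), cap].min
end

delays = (0..4).map { |n| backoff_seconds(n) } # [5, 10, 20, 40, 80]
```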
#### 6. **Scale the Worker**
Increase the number of Sidekiq workers for the critical queue:
```yaml
# config/sidekiq.yml
---
:concurrency: 20 # global thread count per process (Sidekiq has no per-queue concurrency)
:queues:
  - critical
```
---
**Post-Fix Validation**:
1. Monitor the worker’s execution time and resource usage.
2. Ensure no data is lost during retries.
3. Test with peak load scenarios to confirm stability.