

September 2024

Optimizing Server Fleet Management with Linear Programming: Our Huawei Tech Arena Journey

In today’s fast-paced world, managing server resources efficiently across geographically distributed data centers is critical for companies that rely on cloud computing. During the Huawei 2024 Ireland Tech Arena hackathon, our team took on the challenge of developing a solution to manage server fleets at scale, ensuring cost-efficiency, high performance, and adaptability to fluctuating demand. Solutions were evaluated with a scoring system, and scores were submitted to a competitive leaderboard for ranking.

Our approach used linear programming (LP), a powerful optimization technique that let us make data-driven decisions about server distribution, acquisition, and energy management across the data centers. We implemented the LP model with Python’s PuLP library.
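As a minimal illustration of the PuLP pattern our model followed (the numbers here are toy values, not competition data):

```python
# Minimal PuLP pattern: declare a problem, add an objective and
# constraints with `+=`, then solve. Toy numbers for illustration only.
import pulp

prob = pulp.LpProblem("toy_fleet", pulp.LpMaximize)
servers = pulp.LpVariable("servers", lowBound=0, cat="Integer")

prob += 5 * servers        # objective: 5 profit units per server
prob += 2 * servers <= 10  # capacity: each server uses 2 of 10 slots

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print(pulp.LpStatus[prob.status], servers.value())  # → Optimal 5.0
```

The real model had thousands of variables and constraints, but every one of them was added with the same `prob +=` pattern.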

Problem Definition

We were tasked with managing multiple data centers, each with a distinct capacity and latency sensitivity. We had to decide how to move servers between data centers, add new servers, and remove or hold existing ones, all while meeting varying levels of demand over time. The key constraints revolved around:

Demand Satisfaction: Each data center had a unique demand for server types, which had to be fulfilled based on latency requirements.

Capacity Management: We needed to ensure that the servers deployed in any data center did not exceed its slot capacity.

Cost Minimization: Minimizing energy consumption and server purchase/removal costs was essential to ensure profitability.

Our LP Model

Our solution centered around a linear programming model that created decision variables for moving, adding, holding, and removing servers. Here’s a breakdown of how we structured the model:

1. Decision Variables

We introduced variables to capture decisions such as:

Move Variables: These indicate the number of servers moved between data centers, accounting for latency sensitivity.

Add Variables: Represent how many servers of each type should be added to each data center.

Remove Variables: Capture the number of servers removed from data centers when they are no longer needed.

Hold Variables: Denote the number of servers retained in each data center.
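In PuLP terms, the four variable families looked roughly like this; the data-center names, server types, and horizon below are placeholders, not the actual competition data:

```python
import pulp

# Hypothetical index sets; the real model read these from the input data.
datacenters = ["DC1", "DC2"]
server_types = ["CPU.S1", "GPU.S1"]
time_steps = range(1, 4)

idx = (datacenters, server_types, time_steps)

# add[d][s][t]: servers of type s purchased for data center d at time t
add = pulp.LpVariable.dicts("add", idx, lowBound=0, cat="Integer")
# remove[d][s][t]: servers dismissed when no longer needed
remove = pulp.LpVariable.dicts("remove", idx, lowBound=0, cat="Integer")
# hold[d][s][t]: servers retained in the data center
hold = pulp.LpVariable.dicts("hold", idx, lowBound=0, cat="Integer")
# move[src][dst][s][t]: servers shipped between data centers
move = pulp.LpVariable.dicts(
    "move", (datacenters, datacenters, server_types, time_steps),
    lowBound=0, cat="Integer")
```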

2. Constraints

We formulated several constraints to ensure the model was realistic and adhered to the problem’s requirements:

Capacity Constraints: These ensured that the total number of servers (in terms of their slot size) in each data center did not exceed the available slots.

Stock Balance: The stock of servers in each data center at any given time step was balanced by the previous time step’s stock, plus additions and incoming moves, minus removals and outgoing moves.

Demand Constraints: We guaranteed that the sum of added, moved, and held servers in any data center would not exceed the predicted demand.
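The three constraint families can be sketched for a toy instance as follows; the slot capacities and demand figures are invented, and maximizing total stock stands in for the real profit objective:

```python
import pulp

datacenters = ["DC1", "DC2"]
server_types = ["CPU.S1"]
T = range(1, 4)
slots = {"DC1": 100, "DC2": 60}   # hypothetical slot capacities
slot_size = {"CPU.S1": 2}         # slots consumed per server
demand = {(d, t): 20 for d in datacenters for t in T}  # flat toy demand

prob = pulp.LpProblem("constraints_sketch", pulp.LpMaximize)
idx = (datacenters, server_types, T)
stock = pulp.LpVariable.dicts("stock", idx, lowBound=0, cat="Integer")
add = pulp.LpVariable.dicts("add", idx, lowBound=0, cat="Integer")
remove = pulp.LpVariable.dicts("remove", idx, lowBound=0, cat="Integer")
move = pulp.LpVariable.dicts(
    "move", (datacenters, datacenters, server_types, T),
    lowBound=0, cat="Integer")

# Stand-in objective so the model is solvable: deploy as much as allowed.
prob += pulp.lpSum(stock[d][s][t] for d in datacenters
                   for s in server_types for t in T)

for d in datacenters:
    for t in T:
        # Capacity: slots used by the stock never exceed the available slots.
        prob += pulp.lpSum(slot_size[s] * stock[d][s][t]
                           for s in server_types) <= slots[d]
        for s in server_types:
            # Stock balance: previous stock + additions + incoming moves,
            # minus removals and outgoing moves.
            prev = stock[d][s][t - 1] if t > 1 else 0
            inc = pulp.lpSum(move[o][d][s][t] for o in datacenters if o != d)
            out = pulp.lpSum(move[d][o][s][t] for o in datacenters if o != d)
            prob += stock[d][s][t] == (prev + add[d][s][t] + inc
                                       - remove[d][s][t] - out)
            # Demand: deployed servers never exceed predicted demand.
            prob += stock[d][s][t] <= demand[d, t]

prob.solve(pulp.PULP_CBC_CMD(msg=False))
```

With demand binding at 20 servers per data center per step, the solver fills every data center exactly to demand.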

3. Objective Function

Our objective function was designed to maximize overall profit by considering several factors, including:

Server Profitability: Each server type had a specific profit margin based on its capacity and energy consumption in different latency-sensitive environments.

Energy Consumption: We considered the energy costs associated with running servers in each data center, using the average energy cost across data centers as a baseline.

Server Lifetime Expectancy: We introduced a weighting factor for the remaining lifespan of each server, prioritizing servers with longer remaining lifetimes (tracked with a queue) to avoid unnecessary replacements.
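A hedged sketch of such an objective; the margins, energy prices, and lifetime weights below are invented, and the real model also differentiated by latency class:

```python
import pulp

datacenters = ["DC1", "DC2"]
server_types = ["CPU.S1", "GPU.S1"]
profit = {"CPU.S1": 10.0, "GPU.S1": 40.0}     # toy margin per deployed server
energy = {"DC1": 3.0, "DC2": 5.0}             # toy energy cost per server
life_weight = {"CPU.S1": 0.5, "GPU.S1": 1.0}  # remaining-lifetime weight

prob = pulp.LpProblem("objective_sketch", pulp.LpMaximize)
stock = pulp.LpVariable.dicts(
    "stock", (datacenters, server_types), lowBound=0, upBound=10, cat="Integer")

# Maximize lifetime-weighted margin minus energy cost over the fleet.
prob += pulp.lpSum(
    (life_weight[s] * profit[s] - energy[d]) * stock[d][s]
    for d in datacenters for s in server_types)

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print(pulp.value(prob.objective))  # → 740.0
```

Note how the lifetime weight discounts the margin of older server generations, so the solver naturally steers deployments toward servers with more life left.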

Key Features of Our Solution

Dynamic Server Allocation: The LP model dynamically allocated servers based on predicted demand, ensuring that we minimized server idle times and avoided unnecessary server purchases or removals.

Elastic Demand Handling: We incorporated an elastic demand system that adjusted server allocation based on fluctuations in demand. If a data center had an excess capacity of servers, we reallocated them to other data centers with higher demand.

Profit Maximization: The solution maximized profit by balancing server movement and energy costs, ensuring that any moves between data centers resulted in a net gain in profitability.

Scalability: The model was designed to scale with the addition of more data centers, server types, and time steps, making it applicable for real-world server fleet management.

Challenges and Solutions

1. Balancing Server Movements and Energy Costs

One of the biggest challenges we faced was determining whether it was more profitable to move servers between data centers or to purchase new ones. By incorporating latency sensitivities and regional energy costs into the model, we could calculate whether moving a server would result in a net profit increase. This helped us prioritize moves to locations with higher profit potential.
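The break-even reasoning reduces to simple arithmetic, which the model encoded in its coefficients; all figures below are made up:

```python
def move_gain(profit_dst, energy_dst, profit_src, energy_src, move_cost):
    """Net profit change from relocating one server: the destination's
    net margin must beat the origin's by more than the moving cost."""
    return (profit_dst - energy_dst) - (profit_src - energy_src) - move_cost

# A higher-margin, latency-sensitive region can justify the move:
print(move_gain(12.0, 4.0, 9.0, 3.0, 1.0))   # → 1.0  (move is profitable)
# ...while a cheaper origin with the same shipping cost does not:
print(move_gain(12.0, 4.0, 10.0, 2.0, 1.0))  # → -1.0 (hold instead)
```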

2. Meeting Fluctuating Demand

With the demand changing at every time step, our model had to adjust server allocations accordingly. We incorporated predicted demand into the constraints and allowed the model to make proactive decisions about server purchases and removals, ensuring we always met the required service levels.

3. Constraint Challenges

Another major challenge was managing the sheer number of constraints. With so many interacting, it was difficult to ensure they didn’t contradict one another; loose constraints lowered our scores, and refining them required both mathematical skill and critical thinking.

Conclusion

By leveraging linear programming, we developed a robust and adaptable solution for managing large-scale server fleets. Our approach optimized server allocation, reduced energy costs, and met demand across multiple data centers.

What was particularly motivating was that the team who won the competition also utilized linear programming.

Although we placed 10th, the experience was incredibly rewarding. We learned about a range of other ways to tackle this problem, such as dynamic programming, greedy algorithms, and simulated annealing, as well as tools for retrieving and exploring the data efficiently, such as LLMs and Tableau. The knowledge gained from this competition will undoubtedly prove valuable in future projects dealing with large-scale cloud infrastructure and resource optimization.
