This product helps you explore and analyse large datasets.

You can provide any JSON-based data URL. The server ingests the data, and you can then chat with it.
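
As an illustration only (the exact schema the server expects is not specified here), a dataset URL might return JSON along these lines:

    [
      {"id": 1, "title": "Q1 sales report", "text": "Revenue grew 12% quarter over quarter ..."},
      {"id": 2, "title": "Support summary", "text": "Most tickets in March concerned login issues ..."}
    ]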

Overview

The BV Inference Stress Server is a powerful tool designed to simulate high volumes of inference requests, helping you evaluate the performance of your server. By supporting multiple concurrent requests, it allows for stress testing, ensuring that your infrastructure can handle demanding workloads. This makes it especially useful for testing AI model deployments in real-world scenarios, giving you confidence that your systems can manage the traffic they will face in production.

One of the standout features of the BV Inference Stress Server is its customizability. Users can fine-tune parameters like batch sizes, request rates, and concurrency levels to match specific test conditions. Along with this, the system offers real-time resource utilization monitoring, tracking CPU, GPU, memory, and network usage during testing. This is critical for identifying any performance bottlenecks and helps in optimizing hardware usage.
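
For intuition, here is a minimal sketch of what such a stress test looks like: a pool of concurrent workers sending batched requests at a throttled rate and recording latencies. It is not the server's actual implementation or API; the endpoint URL, payload fields, and parameter values are placeholders.

    # Illustrative stress-test loop; ENDPOINT, the payload layout, and the
    # parameter values below are placeholders, not the BV server's real API.
    import json
    import time
    import urllib.request
    from concurrent.futures import ThreadPoolExecutor

    ENDPOINT = "http://localhost:8000/infer"  # placeholder inference endpoint
    CONCURRENCY = 8                           # number of concurrent workers
    BATCH_SIZE = 4                            # queries bundled into one request
    REQUESTS_PER_WORKER = 25                  # total load = CONCURRENCY * REQUESTS_PER_WORKER
    PACING_DELAY_S = 0.05                     # crude per-worker request-rate control

    def send_batches(worker_id: int) -> list[float]:
        """Send a fixed number of batched requests and record each latency."""
        latencies = []
        for i in range(REQUESTS_PER_WORKER):
            payload = json.dumps({
                "queries": [f"worker {worker_id} query {i}-{j}" for j in range(BATCH_SIZE)]
            }).encode()
            req = urllib.request.Request(
                ENDPOINT, data=payload, headers={"Content-Type": "application/json"}
            )
            start = time.perf_counter()
            with urllib.request.urlopen(req) as resp:
                resp.read()
            latencies.append(time.perf_counter() - start)
            time.sleep(PACING_DELAY_S)
        return latencies

    if __name__ == "__main__":
        with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
            per_worker = list(pool.map(send_batches, range(CONCURRENCY)))
        latencies = [t for worker in per_worker for t in worker]
        print(f"requests: {len(latencies)}, "
              f"mean latency: {sum(latencies) / len(latencies):.3f}s")

In practice, resource monitoring (CPU, GPU, memory, network) runs alongside such a loop so utilization can be correlated with the measured latencies.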

Additionally, the BV Inference Stress Server supports scalability analysis, allowing you to see how your server performs under increased load. It can benchmark different hardware configurations, helping you identify the best setup for your needs. Automated report generation logs performance metrics and test results, providing structured insights for further optimization. With multi-model support, API-based execution, and the flexibility to deploy in both cloud and on-premise environments, this tool is versatile and powerful for improving AI infrastructure, ensuring stable deployments, and optimizing cost-efficiency.

Demo Video

Usage Guide

Image 1
  1. Set the number of queues used to serve your queries to the LLM. The more queues you configure, the more GPU memory is used; each queue handles one query at a time.
  2. Install models from the available list. Models that are already installed show an Installed tag.
  3. You can uninstall models from the list of installed models.
  4. Optionally, provide your dataset in JSON form.
  5. Set the probe time interval for the GPU bandwidth monitor; a shorter interval gives finer-grained metrics.
  6. Single query runs one query at a time; batch query runs multiple queries in one go and requires a CSV file. An example set of queries based on the popular show Pokemon is shown after this list.
  7. Select the LLM call type. There are two options: ollama call and direct call. Ollama call uses the ollama library, which handles LLM models. Direct call leverages LlamaIndex to manage interactions with LLM models and enables GPU bandwidth profiling through NCU; by incorporating GPU layers, it gives enhanced control over the process.
  8. This option appears only when direct call is selected. If Profile GPU bandwidth is enabled, the run returns an additional CSV file containing the bandwidth profiling data.
  9. Select the model from the list of installed models.
  10. Enter the query you want to ask the LLM, or upload a CSV file if batch query is selected.
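
The batch-query CSV referenced in point 6 can look like the following. This is only an illustrative sketch of Pokemon-themed queries; the exact column header the tool expects is not specified here, so a single query column is assumed.

    query
    Who is Ash Ketchum's first Pokemon?
    Which moves are super effective against Water-type Pokemon?
    What does Pikachu evolve into when exposed to a Thunder Stone?
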
Image 2
  1. run: Run the query with inputs
  2. install: Install available models
  3. ls: List available models
  4. uninstall: Remove installed models
  5. show: Show the output
  6. sessions: Show all the sessions
  7. rm: Remove all sessions or a specific session
  8. config: Set the session deletion period
  9. --verbose: Check the process logs
  10. help [command]: Display help for a command
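
A hypothetical command-line session using these commands might look like the following. The executable name (shown here as bv-stress) and the argument forms are assumptions; only the subcommand names above come from the tool itself.

    bv-stress ls                  # list available models
    bv-stress install <model>     # install a model from the available list
    bv-stress run                 # run a query with the configured inputs
    bv-stress sessions            # list all sessions
    bv-stress show                # show the output of the run
    bv-stress help run            # display help for the run command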