Llama3.2-1B
Jaqpot Team@jaqpot-team
4 months ago
OPENAI_LLM

Notice: This model is currently down for maintenance

Llama3.2-1B Model

A lightweight version of the Llama language model with 1 billion parameters. This model offers a balance between performance and computational requirements, making it suitable for tasks like text generation, analysis, and basic language understanding. It performs well on general language tasks while being more efficient to run than larger Llama variants.

The model can handle:

  • Text generation
  • Basic question answering
  • Content summarization
  • Simple analysis tasks
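
As a quick illustration of these tasks, here is a minimal sketch that runs the model locally with the Hugging Face transformers pipeline. The meta-llama/Llama-3.2-1B checkpoint name is an assumption on our part and is separate from the deployed service:

from transformers import pipeline

# A sketch, assuming the meta-llama/Llama-3.2-1B checkpoint on the
# Hugging Face Hub (gated; requires accepting Meta's license).
generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.2-1B",
    device_map="auto",
)

# A summarization-style prompt; question answering and simple
# analysis tasks follow the same pattern.
result = generator(
    "Summarize in one sentence: large language models predict the next token.",
    max_new_tokens=64,
    do_sample=True,
    temperature=0.1,
)
print(result[0]["generated_text"])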

Best suited for users who need decent language model capabilities but have limited computational resources or need faster inference times.

Note: Performance may be lower compared to larger Llama models, but it offers good results for many common language tasks.

Here's the Python code we used to deploy this model as a service, using FastAPI as a wrapper:

import asyncio
from queue import Queue
from threading import Thread

import joblib
import pandas as pd
import torch
from fastapi.responses import StreamingResponse
from jaqpot_api_client.models.prediction_request import PredictionRequest

from src.streamer import CustomStreamer


class ModelService:
    def __init__(self):
        # Load the serialized model and tokenizer produced at training time
        self.model = joblib.load('model.pkl')
        self.tokenizer = joblib.load('tokenizer.pkl')

    def infer(self, request: PredictionRequest) -> StreamingResponse:
        # Convert the input list to a DataFrame and take the prompt
        # from the most recent row
        input_data = pd.DataFrame(request.dataset.input)
        input_row = input_data.iloc[-1]
        prompt = input_row['prompt']
        return StreamingResponse(
            self.response_generator(prompt),
            media_type='text/event-stream',
        )

    # The generation process
    def start_generation(self, query: str, streamer):
        prompt = f"""{query}"""
        device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model = self.model.to(device)
        inputs = self.tokenizer([prompt], return_tensors="pt").to(device)
        generation_kwargs = dict(
            inputs,
            streamer=streamer,
            max_new_tokens=4096,
            temperature=0.1,
        )
        # Run generation on a background thread so tokens can be
        # consumed from the queue while they are being produced
        thread = Thread(target=self.model.generate, kwargs=generation_kwargs)
        thread.start()

    # Generation initiator and response server
    async def response_generator(self, query: str):
        streamer_queue = Queue()
        streamer = CustomStreamer(streamer_queue, self.tokenizer, True)
        self.start_generation(query, streamer)
        while True:
            value = streamer_queue.get()
            # A None sentinel signals that generation has finished
            if value is None:
                break
            yield value
            streamer_queue.task_done()
            await asyncio.sleep(0.1)
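
The CustomStreamer imported from src.streamer is not shown above. Here is a minimal sketch of one possible implementation, assuming it subclasses transformers.TextStreamer with the (queue, tokenizer, skip_prompt) signature used in ModelService, and emits the None sentinel that response_generator's loop expects:

from queue import Queue
from transformers import TextStreamer


class CustomStreamer(TextStreamer):
    # A sketch: pushes decoded text chunks onto a queue as they are
    # generated, so a separate coroutine can stream them to the client.
    def __init__(self, queue: Queue, tokenizer, skip_prompt: bool):
        super().__init__(tokenizer, skip_prompt=skip_prompt)
        self.queue = queue

    def on_finalized_text(self, text: str, stream_end: bool = False):
        # Hand each finalized chunk to the consumer coroutine
        self.queue.put(text)
        if stream_end:
            # Sentinel consumed by response_generator to stop its loop
            self.queue.put(None)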
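
To expose ModelService over HTTP, the service can be mounted on a FastAPI route. The app setup and /infer path below are assumptions for illustration, not the exact Jaqpot wiring; this also assumes PredictionRequest is a Pydantic model that FastAPI can parse as a request body:

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from jaqpot_api_client.models.prediction_request import PredictionRequest

app = FastAPI()
service = ModelService()


@app.post("/infer")  # hypothetical route; the actual path may differ
def infer(request: PredictionRequest) -> StreamingResponse:
    # Delegate to the service, which returns a streaming response
    return service.infer(request)

With the app running locally, the stream could then be consumed with something like curl -N -X POST http://localhost:8000/infer (again, a hypothetical endpoint), reading chunks as they arrive rather than waiting for the full completion.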