Updated LLM prompts and System message

2025-07-02 13:20:42 +02:00
parent 6b9b5a60e9
commit f192cba1f8
4 changed files with 72 additions and 66 deletions
--- a/app/Services/FileTools/VideoDescriptor/OCRLLMVideoDescriptor.php
+++ b/app/Services/FileTools/VideoDescriptor/OCRLLMVideoDescriptor.php
@@ -88,46 +88,35 @@ Please analyze the image carefully and provide a description focusing purely on
        // Step 5: Ask an LLM to describe the video based on the combined descriptions
        $llmDescription = $this->llm->generate(
            config('llm.models.chat.name'),
-            static::DESCRIPTION_PROMPT . $combinedDescription . "\n\nBased only on these frame analyses, please provide:
+            static::DESCRIPTION_PROMPT . $combinedDescription . "\n\nYou are analyzing an Instagram Reel (a short-form video). You have received multiple frames from this reel. For each frame, you are given:

-     A single, concise description that captures the main action or theme occurring in the reel across all frames.
-     Identify and describe any joke or humorous element present in the video if you can discern one.
+1.  A **screenshot number** (e.g., `Screenshot : 3`).
+2.  The approximate **timestamp in seconds** within the video where that frame occurs.
+3.  An **OCR result**: text extracted directly from the frame image, which may contain OCR errors or unusual characters.
+4.  An **LLM description** of that specific frame, produced by another LLM (the `LLM Description`).

+Your task is to synthesize a single, coherent description of the entire reel. Use all the information (screenshot number, timestamp, OCR result, and LLM description), but be aware that individual frame descriptions may be inaccurate due to poor image quality or interpretation errors. Look for consistency across multiple frames.

-Important Considerations
+Analyze the sequence of events, character(s), setting, style (e.g., fast cuts, slow-motion), narrative structure (if any), humor, and joke elements throughout the video based on these frame-by-frame inputs. Pay special attention to identifying if there's an underlying joke or humorous concept running through the reel.

-     Remember that most videos are of poor quality; frame descriptions might be inaccurate, vague, or contradictory due to blurriness or fast cuts.
-     Your task is synthesis: focus on the overall impression and sequence, not perfecting each individual piece of information. Some details mentioned in one analysis may simply be incorrect or misidentified from another perspective.
-
-
-Analyze all provided frames (separated by --- for clarity) to understand what's happening. Then, synthesize this understanding into point 1 above and identify the joke if present as per point 2.",
+Based on your analysis, write a concise description that captures the essence of this Instagram Reel as a whole. Format your output strictly as JSON with only the `answer` field containing this synthesized summary.",
            outputFormat: '{"type": "object", "properties": {"answer": {"type": "string"}}, "required": ["answer"]}',
-            systemMessage: "You are an expert social media content analyst specializing in interpreting Instagram Reels. Your primary function is to generate a comprehensive description and identify any underlying humor or joke in a given video sequence. You will be provided with individual frame analyses, each containing:
+            systemMessage: "You are an AI assistant specialized in analyzing video content, particularly short-form videos like Instagram Reels. Your task is to synthesize a single description for the entire video based on sequential information provided from its screenshots and associated text data (OCR results).

-     Screenshot Number: The sequential number of the frame.
-     Timestamp: When that specific frame occurs within the reel.
-     OCR Text Result: Raw text extracted from the image content using OCR (Optical Character Recognition), which may contain errors or misinterpretations (\"may appear\" descriptions).
-     LLM Description of Screenshot: A textual interpretation of what's visible in the frame, based on previous LLM processing.
+Your response must strictly follow this JSON format:
+{\"answer\": \"<your final synthesized video description here as a string>\"}

+## Rules
+1.  Analyze all provided inputs: screenshot number, timestamp, OCR result snippet, and LLM description for each frame.
+2.  The core goal is to produce one concise, coherent, and engaging video description that captures the essence of the entire reel (\"the whole thing\").
+3.  Individual frame descriptions can be inaccurate or contradictory (e.g., an object changes drastically between frames). Prefer details that are consistent across multiple frames, and discard a detail only when a clear majority of frames contradicts it.
+4.  Do not generate separate JSON objects for each screenshot; only produce one final `answer` string summarizing the video as a whole at the end of your reasoning.
+5.  Pay special attention to identifying any underlying joke, humor, or satirical element present in the reel based on the collective information.

-Please note:
-
-     The individual frame analyses can be inconsistent due to low video quality (e.g., blurriness) or rapid scene changes where details are hard to distinguish.
-     Your task is not to perfect each frame description but to understand the overall sequence and likely narrative, focusing on identifying any joke, irony, absurdity, or humorous transformation occurring across these frames.
-
-
-Your response should be structured as follows:
-
-     Overall Video Description: Provide a concise summary of what happens in the reel based on the combined information from all the provided screenshots.
-     Humor/Joke Identification (If Applicable): If you can discern any joke or humorous element, explicitly state it and explain how the sequence of frames contributes to this.
-
-
-Instructions for Synthesis:
-
-     Focus on identifying recurring elements, main subject(s), consistent actions/actions that seem unlikely (potential contradiction).
-     Look for patterns where details change rapidly or absurdly.
-     Prioritize information from descriptions over relying solely on OCR text if the description seems more plausible. Ignore minor inconsistencies between frames unless they clearly contradict a central theme or joke premise.
-     Be ready to point out where the humor lies, which might involve unexpected changes, wordplay captured by OCR errors in the context of the visual action described, absurdity, or irony.",
+## Output Constraints
+-   Your response **MUST** be ONLY valid JSON conforming to the structure: {\"type\": \"object\", \"properties\": {\"answer\": {\"type\": \"string\"}}, \"required\": [\"answer\"]}.
+-   Only fill the `answer` field. Do not include any other text or explanations outside this JSON structure.
+-   The `answer` string should be a self-contained description of the video, suitable for presenting it to a user on a platform like Instagram or YouTube Shorts.",
            keepAlive: true,
            shouldThink: config('llm.models.chat.shouldThink')
        );
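
Both the prompt and the system message now require the reply to be a single JSON object with a string `answer` field, matching `outputFormat`. A minimal sketch of how a caller might check that contract before using the description is shown below. It assumes that `generate()` returns the raw JSON reply as a string, which this diff does not show, and `parseAnswer` is a hypothetical helper that is not part of this commit.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of this commit): extract the `answer`
 * string from a raw model reply that should match the outputFormat schema
 * {"type": "object", "properties": {"answer": {"type": "string"}}, "required": ["answer"]}.
 *
 * @throws JsonException            if the reply is not valid JSON
 * @throws UnexpectedValueException if the reply has no string `answer` field
 */
function parseAnswer(string $raw): string
{
    // Decode into an associative array; malformed JSON raises JsonException.
    $decoded = json_decode($raw, true, 512, JSON_THROW_ON_ERROR);

    // The schema requires an object with a string `answer` property.
    if (!is_array($decoded) || !isset($decoded['answer']) || !is_string($decoded['answer'])) {
        throw new UnexpectedValueException('Reply has no string "answer" field.');
    }

    // Strip surrounding whitespace the model may have added.
    return trim($decoded['answer']);
}
```

Under that assumption, the call site could pass the return value of `generate()` through `parseAnswer()` and treat either exception as a signal to retry or to fall back to the per-frame descriptions.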