Updated LLM prompts and System message

2025-07-02 13:20:42 +02:00
parent 6b9b5a60e9
commit f192cba1f8
4 changed files with 72 additions and 66 deletions
--- a/LLMPrompts.md
+++ b/LLMPrompts.md
@@ -10,9 +10,11 @@ This method comes from the idea that the best way to prompt engineer is to ask t

 # Prompts

-Starting sentence is usually : 
+Starting sentence is usually :
+
 ```
 I’m using some LLM and I would need a prompt and a system message for every use case I will give you.
+I’m using structured JSON output provided by the OpenAI API. The output structure is a simple {"type": "object", "properties": {"answer": {"type": "string"}}, "required": ["answer"]}, so only “answer” can be filled. For the input, everything will be given in the prompt. Give me the system message and prompt separately, preferably in text format.
 ```

 ## Instagram
@@ -20,24 +22,41 @@ I’m using some LLM and I would need a prompt and a system message for every us
 ### Instagram Reel caption generation

 ```
-I’m using some LLM and I would need a prompt, a system message and an output format for every use case I will give you.
+I’m using some LLM and I would need a prompt and a system message for every use case I will give you.
 The first one is when I’m trying to generate a caption for an instagram Reel. For the moment, I can give the LLM the original instagram reel caption that was downloaded from, and a description by an LLM of the video, or the joke behind it.
-The caption must be short and well placed with the reel. For example, if the reel is funny, the caption must be short and funny, while still relating to the reel. The caption must not be describint the video like the LLM description does
+The caption must be short and fit the reel well. For example, if the reel is funny, the caption must be short and funny, while still relating to the reel. The caption must not describe the video the way the LLM description does (for example, this bad example describes the content of the video instead of writing a caption based on the given description : “Three animated friends chilling in the woods at night until someone's phone inevitably starts ringing somewhere nearby... 😅🌲✨” or this one : “This reel shows me trying to make my sad texts shorter with ChatGPT, but it just frustrates me more! 😅😂”).  
+It also shouldn’t begin with something like ‘This reel…’. For example, this is a bad output : “This reel hilariously mocks every awkward fan reaction to those intense DCU movie scenes. 🎭 #DCFanDrama”
 The LLM can add some appropriate hashtags if it wants to and seem appropriate.
-Sometimes, the original caption will credit the original author, most of the times on twitter like (“credit : t/twitteruser”). Those credit can appear in the generated caption too, But I don’t want any instagram account mention (“@instagramUser”) because most of the time it’s to incite to subscribe to the downloaded reel account. The use of emoji is encouraged, but not too much and it has to not look stupid or too.
+Sometimes, the original caption will credit the original author, most of the time on Twitter, like (“credit : t/twitteruser”). Those credits can appear in the generated caption too, but I don’t want any Instagram account mention (“@instagramUser”) because usually it’s there to push people to follow the account the reel was downloaded from (like “Seen me already ? follow me @instagramUser”). I don’t want long credits either, just a simple “credit tt/twitteraccount” is enough. Not like this bad example : “Credited via the brilliant mind at tt/batinterface!…”  
+The use of emoji is encouraged, but in moderation, and it must not look stupid.  
+  
+When using it, I encountered some problems like this one :  
+“Credit to: [Original Creator] for this hilarious video game scene where the characters look suspiciously like Kermit the Frog! 😂”. The [Original creator] is not filled in, I don’t even know if the original caption had one. 
+  
+Some captions are just lame and feel like a Facebook post. The intended audience here is young.
 ```

 ## Video Descriptor

-I’m using some LLM and I would need a prompt and a system message for every use case I will give you.  
-
-The LLM here will be used to describe an Instagram Reel (video). Each screenshot of that video will be described using an LLM, prompt, system message and output format. The description of all the screenshots will be given to this LLM that will try to recreate the video based on the description of the screenshots, and describe the video.  
-The required prompt here is for the LLM that will compile the description into one and try to understand the video and describe it. I’m particularly interested in the joke behind the reel if there is one.
-
-This is an example of a screenshot description by an LLM : “The image shows a close-up of a person's hands holding what appears to be a brown object with a plastic covering, possibly food wrapped in paper or foil. There is also a small portion visible at the top right corner, which seems to be a red and white label. The focus of the image is on the hands holding the object.”
-
-Most of the description won’t make sense, so some details should be omitted. For example, one screenshot description could say the main subject is a car, and another one 3 seconds later in the video could say the main subject is a cat. You could say the car transformed into a cat, but it would be safer to assume that one of the description is wrong and the main characted was a cat all along the video because another description in the video also says the main subject is a cat. 
-It is safe to say that most analysed videos will be of bad quality. which means the screenshots description can vary a lot
+```
+I’m using some LLM and I would need a prompt and a system message for every use case I will give you.
+I’m using structured JSON output provided by the OpenAI API. The output structure is a simple {"type": "object", "properties": {"answer": {"type": "string"}}, "required": ["answer"]}, so only “answer” can be filled. For the input, everything will be given in the prompt. Give me the system message and prompt separately, preferably in text format.
+The LLM here will be used to describe an Instagram Reel (video). Each screenshot of that video will be described using an LLM, prompt, system message and output format. The description of all the screenshots will be given to this LLM that will try to recreate the video based on the description of the screenshots, and describe the video.
+The required prompt here is for the LLM that will compile the description into one and try to understand the video and describe it. I’m particularly interested in the joke behind the reel if there is one.   
+  
+This is an example of a screenshot description by an LLM : “The image shows a close-up of a person's hands holding what appears to be a brown object with a plastic covering, possibly food wrapped in paper or foil. There is also a small portion visible at the top right corner, which seems to be a red and white label. The focus of the image is on the hands holding the object.”   
+  
+The information I can give in the prompt is the list of screenshots and, for each one :  
+  
+     The screenshot number
+     The timestamp in the video of when the screenshot is taken
+     An OCR result (may contain some weird characters, the OCR output is not filtered or cleansed)
+     The LLM description of the screenshot  
+Most of the descriptions won’t make sense, so some details should be omitted. For example, one screenshot description could say the main subject is a car, and another one 3 seconds later in the video could say the main subject is a cat. You could say the car transformed into a cat, but it would be safer to assume that one of the descriptions is wrong and the main character was a cat throughout the video, because another description in the video also says the main subject is a cat.
+It is safe to say that most analysed videos will be of bad quality, which means the screenshot descriptions can vary a lot.  
+  
+Text found by OCR and the screenshot descriptions can be carried over into the final video description if they seem coherent.
+```

 ### Screenshot descriptor

--- a/app/Browser/Jobs/InstagramRepost/InstagramRepostJob.php
+++ b/app/Browser/Jobs/InstagramRepost/InstagramRepostJob.php
@@ -402,32 +402,30 @@ class InstagramRepostJob extends BrowserJob implements ShouldBeUniqueUntilProces
        $llmAnswer = $this->openAPIPrompt->generate(
            config('llm.models.chat.name'),
            "Original Caption: {$originalDescription}
-Video Description/Directive: {$reelDescription}",
+llm_description: {$reelDescription}",
            [],
            outputFormat: '{"type": "object", "properties": {"answer": {"type": "string"}}, "required": ["answer"]}',
-            systemMessage: "You are an AI assistant specialized in creating engaging and concise Instagram Reel captions. Your primary task is to transform the provided original caption (often from Twitter) and description/directions into a fresh, unique, but still relevant caption for Instagram Reels format.
+            systemMessage: "You are an expert Instagram caption writer. Your primary goal is to create short, engaging, concise captions for social media reels that capture the fun or relatability of the content without simply describing it like a transcript summary.

-Key instructions:
-1.  **Analyze Input:** You will receive two things: an *original reel caption* (usually starting with \"credit:\" or mentioning a Twitter handle like `t/TwitterUser`), and either a *video description* or explicit directions about the joke/idea behind the video.
-2.  **Transform, Don't Reproduce:** Your output must be significantly different from the original provided caption. It should capture the essence of the content described but phrase it anew – often with humor if appropriate.
-3.  **Keep it Short & Punchy:** Instagram Reels thrive on quick engagement. Prioritize brevity (ideally under two lines, or three lines max) and impact. Make sure your caption is concise enough for fast-scroll viewing.
-4.  **Maintain the Core Idea:** The new caption must directly relate to the video's content/direction/joke without simply restating it like a description would. Focus on what makes the reel *interesting* or *funny* in its own right.
-5.  **Preserve Original Credit (Optional):** If an explicit \"credit\" line is provided, you may incorporate this into your new caption naturally, perhaps using `(via...)` or similar phrasing if it fits well and doesn't sound awkward. **Do not** include any original Instagram account mentions (@handles). They are often intended for promotion which isn't our goal.
-6.  **Use Emoji Judiciously:** Incorporate relevant emojis to enhance the tone (funny, relatable, etc.) or add visual interest. Use them purposefully and in moderation – they should complement the caption, not overwhelm it.
-7.  **Add Hashtags (Optional but Recommended):** Generate a few relevant Instagram hashtags automatically at the end of your output to increase visibility. Keep these organic to the content and avoid forcing irrelevant tags.
+Captions must:
+1.  Be brief and punchy.
+2.  Capture the essence or mood of the video.
+3.  Relate directly to the provided description (if available) or the core concept if no specific LLM description is given, but avoid copying phrasing awkwardly.
+4.  Encourage engagement relevant to the platform's algorithm (e.g., asking a question related to the joke/scene).
+5.  Optionally include relevant hashtags at the end (#hashtagsOnly), chosen appropriately for the reel's content or vibe. Use common tags if no specific ones are provided, but avoid overly generic ones unless fitting.
+6.  If credit information is provided in the input (e.g., `credit: twitteruser`), acknowledge it minimally within the caption text *using only that source*. Do not invent any account handles (`@`) or platform prefixes (`tt/`). Use phrases like \"Credited to...\" or simply insert the credited name if appropriate, but don't force it unless the core concept naturally includes attribution. If no credit is provided, do not mention a specific creator.

-Your response structure is as follows:
-   The generated caption (your core answer).
-   Then, if you generate any hashtags, list them on the next line(s) prefixed with `#`.
+**Do Not:**
+*   Start captions with 'This reel...' or similar intros.
+*   Describe the video content directly (replacing the LLM description role).
+*   Include platform-specific mentions (`tt/`, `@`) unless naturally part of the credited source's name format itself and used minimally as instructed for credit handling.
+*   Use overly complex sentences, slang that doesn't fit (#hashtags can be used), or long-winded explanations. Keep it to 1-3 short lines maximum.

-Example Input Structure:
-Original Caption: credit: t/otherhandle This banana is looking fly today!
-Video Description/Directive: A man walks into a store holding a banana and wearing sunglasses. He looks around confidently before leaving.
+**Emojis:**
+*   Feel free to use emojis in moderation (e.g., 😂, 🤣, 😜, 👀) to add visual flair and emotion.
+*   They should enhance the caption but not be the *main* focus. Avoid excessive or random emojis that look unprofessional.

-Your answer should only contain the generated caption, and optionally hashtags if relevant.
-
-Remember to be creative and ensure the generated caption feels like something you would see naturally on an Instagram Reel. Aim for personality and relevance.
-",
+Your response format must strictly adhere to JSON with only one required field: `answer`. Provide ONLY the generated caption string in this `answer` field, no explanations, markdown formatting, or other text.",
            keepAlive: true,
            shouldThink: config('llm.models.chat.shouldThink')
        );
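
Note on the structured output above: the system message now asks for JSON with a single `answer` field, so the caller has to decode that field before posting the caption. The following is a minimal sketch of that step, not code from this commit. It assumes `generate()` returns the raw JSON string (the diff does not show its return type), and both `extractAnswer()` and `stripInstagramMentions()` are hypothetical helper names. The second helper is a defensive post-filter for the "no `@` mentions" rule, in case the model ignores the instruction.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of the commit): pulls the `answer` string out
 * of a raw JSON response shaped like {"answer": "..."}.
 * Returns null when the payload is not valid JSON, has no string `answer`,
 * or the answer is empty after trimming.
 */
function extractAnswer(string $rawJson): ?string
{
    $decoded = json_decode($rawJson, true);
    if (!is_array($decoded) || !isset($decoded['answer']) || !is_string($decoded['answer'])) {
        return null;
    }

    $answer = trim($decoded['answer']);

    return $answer === '' ? null : $answer;
}

/**
 * Hypothetical post-filter: removes @mentions that are not part of a longer
 * word (so an address like a@b is left alone), then collapses the doubled
 * spaces left behind.
 */
function stripInstagramMentions(string $caption): string
{
    $withoutMentions = preg_replace('/(?<![\w@])@[A-Za-z0-9._]+/', '', $caption) ?? $caption;
    $collapsed = preg_replace('/[ \t]{2,}/', ' ', $withoutMentions) ?? $withoutMentions;

    return trim($collapsed);
}
```

Returning `null` instead of throwing leaves it to the job to decide whether to retry the generation or skip the repost.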
--- a/app/Services/AIPrompt/OpenAPIPrompt.php
+++ b/app/Services/AIPrompt/OpenAPIPrompt.php
@@ -13,7 +13,7 @@ class OpenAPIPrompt implements IAIPrompt
    private ?string $token = null;

    public function __construct(?string $host = null) {
-        //dd($host ?? config('llm.api.host'));
+        //dd($host ?? config('llm.api.host')); // DEBUG TODO: is this null? If so it throws an error, because $host is typed as a non-nullable string
        $this->host = $host ?? config('llm.api.host');
        if (config('llm.api.token')) {
            $this->token = config('llm.api.token');
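
Note on the DEBUG TODO above: if both the constructor argument and `config('llm.api.host')` are null, assigning the result to a non-nullable string property raises a `TypeError`. One possible way to fail with a clearer message is sketched below. `resolveHost()` is a hypothetical function that is not part of this commit, and the host values in the usage lines are placeholders.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: picks the explicit host if given, otherwise the
 * configured one, and fails early with a descriptive message when neither
 * is usable.
 */
function resolveHost(?string $explicitHost, ?string $configuredHost): string
{
    $host = $explicitHost ?? $configuredHost;

    if ($host === null || trim($host) === '') {
        throw new InvalidArgumentException(
            'LLM API host is not configured (llm.api.host is null or empty).'
        );
    }

    return $host;
}

// Usage with placeholder values:
echo resolveHost(null, 'http://example.test'), "\n";
```

In the constructor this would be called as `resolveHost($host, config('llm.api.host'))`, assuming the config value is either a string or null.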
--- a/app/Services/FileTools/VideoDescriptor/OCRLLMVideoDescriptor.php
+++ b/app/Services/FileTools/VideoDescriptor/OCRLLMVideoDescriptor.php
@@ -88,46 +88,35 @@ Please analyze the image carefully and provide a description focusing purely on
        // Step 5: Ask an LLM to describe the video based on the combined descriptions
        $llmDescription = $this->llm->generate(
            config('llm.models.chat.name'),
-            static::DESCRIPTION_PROMPT . $combinedDescription . "\n\nBased only on these frame analyses, please provide:
+            static::DESCRIPTION_PROMPT . $combinedDescription . "\n\nYou are analyzing an Instagram Reel (a short-form video). You have received multiple frames from this reel. For each frame:

-     A single, concise description that captures the main action or theme occurring in the reel across all frames.
-     Identify and describe any joke or humorous element present in the video if you can discern one.
+1.  A **screenshot number** is given (e.g., `Screenshot : 3`).
+2.  The approximate **timestamp in seconds** within the video where that frame occurs.
+3.  An **OCR result** which contains text extracted directly from an image of this frame, potentially including OCR errors or unusual characters.
+4.  A description provided by another LLM for that specific frame (the `LLM Description`).

+Your task is to synthesize a single, coherent video description summarizing the entire reel (`the whole thing`). Use all the information (screenshot number, timestamp, OCR, and llm_description) but be aware that individual descriptions may be inaccurate due to poor image quality or interpretation errors. Look for consistency across multiple frames.

-Important Considerations
+Analyze the sequence of events, character(s), setting, style (e.g., fast cuts, slow-motion), narrative structure (if any), humor, and joke elements throughout the video based on these frame-by-frame inputs. Pay special attention to identifying if there's an underlying joke or humorous concept running through the reel.

-     Remember that most videos are of poor quality; frame descriptions might be inaccurate, vague, or contradictory due to blurriness or fast cuts.
-     Your task is synthesis: focus on the overall impression and sequence, not perfecting each individual piece of information. Some details mentioned in one analysis may simply be incorrect or misidentified from another perspective.
-
-
-Analyze all provided frames (separated by --- for clarity) to understand what's happening. Then, synthesize this understanding into point 1 above and identify the joke if present as per point 2.",
+Based on your analysis, write a concise description (`the whole thing`) that captures the essence of this Instagram Reel. Format your output strictly as JSON with only the `answer` field containing this synthesized summary.",
            outputFormat: '{"type": "object", "properties": {"answer": {"type": "string"}}, "required": ["answer"]}',
-            systemMessage: "You are an expert social media content analyst specializing in interpreting Instagram Reels. Your primary function is to generate a comprehensive description and identify any underlying humor or joke in a given video sequence. You will be provided with individual frame analyses, each containing:
+            systemMessage: "You are an AI assistant specialized in analyzing video content, particularly short-form videos like Instagram Reels. Your task is to synthesize a single description for the entire video based on sequential information provided from its screenshots and associated text data (OCR results).

-     Screenshot Number: The sequential number of the frame.
-     Timestamp: When that specific frame occurs within the reel.
-     OCR Text Result: Raw text extracted from the image content using OCR (Optical Character Recognition), which may contain errors or misinterpretations (\"may appear\" descriptions).
-     LLM Description of Screenshot: A textual interpretation of what's visible in the frame, based on previous LLM processing.
+Your response must strictly follow this JSON format:
+{\"answer\": \"<your final synthesized video description here as a string>\"}

+## Rules
+1.  Analyze all provided inputs: screenshot number, timestamp, OCR result snippet, and LLM description for each frame.
+2.  The core goal is to produce one concise, coherent, and engaging video description that captures the essence of the entire reel (\"the whole thing\").
+3.  Individual frame descriptions can be inaccurate or contradictory (e.g., object changes drastically between frames). Prioritize consistency across multiple frames unless strongly contradicted by a clear majority.
+4.  Do not generate separate JSON objects for each screenshot; only produce one final `answer` string summarizing the video as a whole at the end of your reasoning.
+5.  Pay special attention to identifying any underlying joke, humor, or satirical element present in the reel based on the collective information.

-Please note:
-
-     The individual frame analyses can be inconsistent due to low video quality (e.g., blurriness) or rapid scene changes where details are hard to distinguish.
-     Your task is not to perfect each frame description but to understand the overall sequence and likely narrative, focusing on identifying any joke, irony, absurdity, or humorous transformation occurring across these frames.
-
-
-Your response should be structured as follows:
-
-     Overall Video Description: Provide a concise summary of what happens in the reel based on the combined information from all the provided screenshots.
-     Humor/Joke Identification (If Applicable): If you can discern any joke or humorous element, explicitly state it and explain how the sequence of frames contributes to this.
-
-
-Instructions for Synthesis:
-
-     Focus on identifying recurring elements, main subject(s), consistent actions/actions that seem unlikely (potential contradiction).
-     Look for patterns where details change rapidly or absurdly.
-     Prioritize information from descriptions over relying solely on OCR text if the description seems more plausible. Ignore minor inconsistencies between frames unless they clearly contradict a central theme or joke premise.
-     Be ready to point out where the humor lies, which might involve unexpected changes, wordplay captured by OCR errors in the context of the visual action described, absurdity, or irony.",
+## Output Constraints
+-   Your response **MUST** be ONLY valid JSON conforming to the structure: {\"type\": \"object\", \"properties\": {\"answer\": {\"type\": \"string\"}}, \"required\": [\"answer\"]}.
+-   Only fill the `answer` field. Do not include any other text or explanations outside this JSON structure.
+-   The `answer` string should be a comprehensive description of the video, suitable for representing it to another user on a platform like Instagram/YouTube Shorts.",
            keepAlive: true,
            shouldThink: config('llm.models.chat.shouldThink')
        );
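
Note on the prompt above: it expects each frame to carry a screenshot number, a timestamp, an OCR result and an LLM description. The diff does not show how `$combinedDescription` is built, so the following is only a sketch of one layout that matches those four fields. The function name, the array keys (`timestamp`, `ocr`, `description`), the field labels other than `Screenshot :`, and the `---` separator are all assumptions.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical builder for the per-frame text block fed to the LLM.
 *
 * @param array<int, array{timestamp: float, ocr: string, description: string}> $frames
 */
function buildCombinedDescription(array $frames): string
{
    $blocks = [];

    foreach (array_values($frames) as $index => $frame) {
        $ocr = trim($frame['ocr']);

        $blocks[] = implode("\n", [
            'Screenshot : ' . ($index + 1),
            // One decimal place keeps timestamps uniform (0.0, 1.5, ...).
            'Timestamp (s) : ' . number_format($frame['timestamp'], 1, '.', ''),
            // Make an empty OCR result explicit rather than leaving a blank field.
            'OCR result : ' . ($ocr === '' ? '(none)' : $ocr),
            'LLM Description : ' . trim($frame['description']),
        ]);
    }

    // A separator line between frames so the model can tell them apart.
    return implode("\n---\n", $blocks);
}

// Usage with made-up frame data:
echo buildCombinedDescription([
    ['timestamp' => 0.0, 'ocr' => '', 'description' => 'A cat.'],
    ['timestamp' => 1.5, 'ocr' => 'HELLO', 'description' => 'A cat again.'],
]), "\n";
```

Marking empty OCR results as `(none)` is a design choice so the model does not read a blank field as missing data.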