Added video transcription

2025-07-04 12:36:03 +02:00
parent a00c5ba4b8
commit f0e52147e4
8 changed files with 321 additions and 41 deletions
--- a/LLMPrompts.md
+++ b/LLMPrompts.md
@@ -12,7 +12,7 @@ This method comes from the idea that the best way to prompt engineer is to ask t

 Starting sentence is usually :

-```
+```text
 I’m using some LLM and I would need a prompt and a system message for every use case I will give you.
 I’m using structured JSON output provided by the openAI API. The output structure is a simple {"type": "object", "properties": {"answer": {"type": "string"}}, "required": ["answer"]}, so only “answer” can be filled. For the input, everything will be given in the prompt. Give me the system message and prompt separately, preferably in text format.
 ```
@@ -21,7 +21,7 @@ I’m using structured JSON output provided by the openAI API. The output struct

 ### Instagram Reel caption generation

-```
+```text
 I’m using some LLM and I would need a prompt, a system message for every use case I will give you.
 The first one is when I’m trying to generate a caption for an instagram Reel. For the moment, I can give the LLM the original instagram reel caption that was downloaded from, and a description by an LLM of the video, or the joke behind it.
 The caption must be short and well placed with the reel. For example, if the reel is funny, the caption must be short and funny, while still relating to the reel. The caption must not be describing the video like the LLM description does (for example this bad example describe the content of the video instead of doing a caption based on the description given : “Three animated friends chilling in the woods at night until someone's phone inevitably starts ringing somewhere nearby... 😅🌲✨” or this one : “This reel shows me trying to make my sad texts shorter with ChatGPT, but it just frustrates me more! 😅😂”).  
@@ -32,35 +32,172 @@ The use of emoji is encouraged, but not too much and it has to not look stupid o
  
 When using it, I encoutered some problems like this one :  
 “Credit to: [Original Creator] for this hilarious video game scene where the characters look suspiciously like Kermit the Frog! 😂”. The [Original creator] is not filled in, I don’t even know if the original caption had one. 
+Also, the result sometimes mentions the OCR. Of course, the final answer shouldn't say anything about the input being wrong, such as the OCR not making sense. The entire text the LLM produces will be set as the caption.
  
 Some caption are just lame and feels like a facebook post. The intended audience here is young.
 ```

 ## Video Descriptor

-```
+```text
 I’m using some LLM and I would need a prompt and a system message for every use case I will give you.
+
 I’m using structured JSON output provided by the openAI API. The output structure is a simple {"type": "object", "properties": {"answer": {"type": "string"}}, "required": ["answer"]}, so only “answer” can be filled. For the input, everything will be given in the prompt. Give me the system message and prompt separately, preferably in text format.
+
 The LLM here will be used to describe an Instagram Reel (video). Each screenshot of that video will be described using an LLM, prompt, system message and output format. The description of all the screenshots will be given to this LLM that will try to recreate the video based on the description of the screenshots, and describe the video.
+
 The required prompt here is for the LLM that will compile the description into one and try to understand the video and describe it. I’m particularly interested in the joke behind the reel if there is one.   
-  
+
 This is an example of a screenshot description by an LLM : “The image shows a close-up of a person's hands holding what appears to be a brown object with a plastic covering, possibly food wrapped in paper or foil. There is also a small portion visible at the top right corner, which seems to be a red and white label. The focus of the image is on the hands holding the object.”   
-  
-The information I can give in the prompts are the screenshots and for each :  
-  
-     The screenshot number
+
+The information given in the prompt is: 
+- An audio transcription of the full video
+- For each of the screenshots :  
+
+     The screenshot number ("Screenshot: 3" for example)
+
     The timestamp in the video of when the screenshot is taken
-     An OCR result (may contain some weird character, the COR is not filtered or cleansed)
+
+     An OCR result (may contain some weird characters, the OCR is not filtered or cleansed); if no text is found, it mentions "No text found"
+
     The LLM description of the screenshot  
-Most of the description won’t make sense, so some details should be omitted. For example, one screenshot description could say the main subject is a car, and another one 3 seconds later in the video could say the main subject is a cat. You could say the car transformed into a cat, but it would be safer to assume that one of the description is wrong and the main characted was a cat all along the video because another description in the video also says the main subject is a cat.
-It is safe to say that most analysed videos will be of bad quality. which means the screenshots description can vary a lot.  
+
+Here is an example of prompt given : 
+\```text
+Audio Transcription: Hey !
+
+Screenshot: 1
+
+
+Timestamp: 0s
+
+
+OCR: See
+
+
+
+fia) oiled 5 genuinely fh
+
+
+LLM Description: 1. Scene Description: The image shows a person standing inside a building, possibly a store or a commercial establishment. The individual is positioned in the middle of the frame and appears to be looking directly at the camera or the viewer. There are other people present as well, but they are not the main focus of this description. 2. Main Subject/Character(s): The primary subject is a person who seems to be trying to communicate with someone off-camera. They appear to be standing in line or waiting, and their posture suggests that they may be impatient or frustrated. 3. Text Description: There is visible text on the image, which reads as follows: 'when my day 1 try to dap me up but he's gonnily do it again when I step out that door love you os 4. Summary: The image captures a moment of frustration or impatience between two individuals in an indoor setting. 5. Joke: It seems there is no joke or humorous element present in this image.
+
+ ◀
+
+
+Screenshot: 2
+
+
+Timestamp: 2s
+
+
+OCR: we v),®
+
+
+
+When my Civ 1 tay to cep
+
+
+
+movphwthes ganuinahy th
+
+
+
+tune)witithatlovejisiand|ps)
+
+
+(o
+
+
+
+W
+
+
+LLM Description: 1. **Scene Description:** The image shows an interior setting, which appears to be a spacious room with a high ceiling and a patterned floor. There is natural light coming in from the upper part of the space. The room seems to have some sort of event or gathering happening within it. 2. **Main Subject/Character(s):** The main subject is a person who is standing, walking across the room. They are wearing dark clothing and have their back turned towards the camera. 3. **Text Description (if any):** There is visible text overlaid on the image which reads 
+
+ ◀
+
+
+Screenshot: 3
+
+
+Timestamp: 4s
+
+
+OCR: a up erent a
+
+
+ae i Li bs
+
+
+
+ 
+
+
+ 
+
+
  
-Found text by OCR and screenshots descriptions can be retrieved to the final video description if it seems coherent.
+
+
+
+  
+
+
+
+ree
+
+
+
+I
+
+
+LLM Description: 1. Scene Description: The image appears to be a smartphone screenshot of a social media post, specifically an Instagram story. There is a person in the foreground, who seems to be outdoors. The individual is standing near a storefront with a visible display window featuring mannequins and merchandise. The sky is clear, suggesting it might be daytime. The background also shows other people walking on the sidewalk, which indicates that this is likely an urban area. The presence of the store and the sidewalk suggest that this scene takes place in a commercial or shopping district. There are texts at the top of the image that appear to be part of Instagram's interface elements. It seems that there might have been some interaction with the post, as indicated by the emoji reactions. The overall setting appears to be an urban environment during daytime. 2. Main Subject/Character(s): The main subject in this image is a person who is standing in front of a storefront. This individual appears to be engaged with their smartphone, possibly viewing or interacting with the post. It's not possible to provide specific details about the person beyond what they are wearing and how they are positioned within the frame. 3. Text Description (if any): The image contains several text elements that include emoji reactions and captions. There are emojis indicating various types of interactions, such as 
+
+ ◀
+
+
+Screenshot: 4
+
+
+Timestamp: 6s
+
+
+OCR: a day 1 try to.dap
+
+
+but he’s Ce in
+
+
+RR OU oS
+
+
+LLM Description: 1. **Scene Description:** The image shows an interior space that appears to be a shopping area or mall, with visible merchandise and store displays. A person is walking through the scene, seemingly captured mid-step while using their phone. The setting suggests a casual, everyday environment. 2. **Main Subject/Character(s):** The main subject is a person in mid-stride, looking down at their cell phone, possibly engaged with it. There are no additional characters or significant interactions depicted. 3. **Text Description (if any):** There is text overlaid on the image which reads, 
+
+ ◀
+
+
+Screenshot: 5
+
+
+Timestamp: 8s
+
+
+OCR: No text found
+
+
+LLM Description: 1. Scene Description: The image depicts a person walking through an indoor shopping mall. It appears to be a public space, with visible storefronts and a ceiling-mounted security camera in the background. There is also an escalator in the scene, which suggests multiple levels to the building. The lighting and layout suggest a modern, clean design typical of contemporary malls. 2. Main Subject/Character(s): A young man is walking through the shopping mall, seemingly alone, with his head down. He appears to be engaged with something he's holding in his hand, possibly a mobile phone, which might imply that he is texting or looking at content on his device. 3. Text Description (if any): There is text overlaid on the image. It reads,
+\```
+
+Most of the descriptions won’t make sense, so some details should be omitted. For example, one screenshot description could say the main subject is a car, and another one 3 seconds later in the video could say the main subject is a cat. You could say the car transformed into a cat, but it would be safer to assume that one of the descriptions is wrong and the main character was a cat throughout the video, because another description in the video also says the main subject is a cat.
+
+It is safe to say that most analysed videos will be of bad quality, which means the screenshot descriptions can vary a lot.  
+
+Text found by OCR and screenshot descriptions can be passed on to the final video description if they seem coherent.
 ```

 ### Screenshot descriptor

-```
+```text
 I’m using some LLM and I would need a prompt, a system message and an output format for every use case I will give you.  
 The first one must describe a screenshot from a video. Each screenshot of that video will be described using the same LLM, prompt, system message and output format. The description of all the screenshots will be given to another LLM that will try to recreate the video based on the description of the screenshots, and describe the video.  
 The required prompt here is the one that describes a screenshot. The LLM will only be given the screenshot as input information. I need the LLM to describe the given screenshot. No need to specify that it is a screenshot. The LLM description must include specify the scene, the character or the main subject, the text present on the screenshots, most of the time it will be caption added after video editing, that may use emojis.  
--- a/app/Providers/VideoDescriptorServiceProvider.php
+++ b/app/Providers/VideoDescriptorServiceProvider.php
@@ -4,6 +4,7 @@ namespace App\Providers;

 use App\Services\AIPrompt\OpenAPIPrompt;
 use App\Services\FileTools\OCR\IImageOCR;
+use App\Services\FileTools\Transcription\IAudioTranscriptor;
 use App\Services\FileTools\VideoDescriptor\IVideoDescriptor;
 use Illuminate\Support\ServiceProvider;

@@ -24,6 +25,11 @@ class VideoDescriptorServiceProvider extends ServiceProvider

        // Register the VideoDescriptor service
        $this->app->singleton(\App\Browser\Jobs\InstagramRepost\ReelDescriptor::class);
+
+        // Audio transcription service
+        $this->app->singleton(IAudioTranscriptor::class, function ($app) {
+            return new \App\Services\FileTools\Transcription\OpenAIAPIAudioTranscriptor();
+        });
    }

    /**
--- a/app/Services/FileTools/Transcription/IAudioTranscriptor.php
+++ b/app/Services/FileTools/Transcription/IAudioTranscriptor.php
@@ -0,0 +1,14 @@
+<?php
+
+namespace App\Services\FileTools\Transcription;
+
+interface IAudioTranscriptor
+{
+    /**
+     * Perform transcription on the given audio file.
+     *
+     * @param string $filePath The path to the audio file to be transcribed.
+     * @return string|null The transcribed text from the audio file, or null if transcription fails.
+     */
+    public function transcribe(string $filePath): ?string;
+}
--- a/app/Services/FileTools/Transcription/OpenAIAPIAudioTranscriptor.php
+++ b/app/Services/FileTools/Transcription/OpenAIAPIAudioTranscriptor.php
@@ -0,0 +1,52 @@
+<?php
+
+namespace App\Services\FileTools\Transcription;
+
+use Illuminate\Support\Facades\Log;
+
+class OpenAIAPIAudioTranscriptor implements IAudioTranscriptor
+{
+
+    private function getHeaders(): array
+    {
+        $token = config('llm.api.transcription.token');
+        // Only send an Authorization header when a token is configured;
+        // cURL sets the multipart Content-Type itself for CURLFile uploads.
+        return $token ? ['Authorization: Bearer ' . $token] : [];
+    }
+
+    /**
+     * @inheritDoc
+     */
+    public function transcribe(string $filePath): ?string
+    {
+        if (!file_exists($filePath)) {
+            Log::error("File not found: {$filePath}");
+            return null;
+        }
+
+        // Make a call to the API with curl
+        // Example of working curl command:
+        // curl -s "SPEACHES_BASE_URL/v1/audio/transcriptions" -F "file=@/home/ninluc/Downloads/memeSalto/m19.mp3" -F "model=MODEL_ID"
+        $curl = curl_init();
+        curl_setopt_array($curl, [
+            CURLOPT_URL => config('llm.api.transcription.host') . '/audio/transcriptions',
+            CURLOPT_RETURNTRANSFER => true,
+            CURLOPT_POST => true,
+            CURLOPT_POSTFIELDS => [
+                'file' => new \CURLFile($filePath),
+                'model' => config('llm.models.transcription.name'),
+            ],
+            CURLOPT_HTTPHEADER => $this->getHeaders(),
+        ]);
+        $response = curl_exec($curl);
+        $httpCode = curl_getinfo($curl, CURLINFO_HTTP_CODE);
+        curl_close($curl);
+        if ($httpCode !== 200) {
+            Log::error("Error during transcription: HTTP code {$httpCode} : {$response}");
+            return null;
+        }
+        $responseData = json_decode($response, true);
+        return $responseData['text'] ?? null;
+    }
+}
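
Because callers depend only on `IAudioTranscriptor`, the HTTP-backed class can be swapped for a test double. The following is a minimal sketch, not part of this commit: `FakeAudioTranscriptor` is a hypothetical name, and the interface is redeclared locally (without its namespace) only so the snippet runs on its own.

```php
<?php

// Local copy of the contract so the sketch runs standalone; in the app the
// real interface lives in App\Services\FileTools\Transcription.
interface IAudioTranscriptor
{
    public function transcribe(string $filePath): ?string;
}

// Hypothetical test double: returns a canned transcription, or null when the
// file does not exist, mirroring the failure mode of the real implementation.
final class FakeAudioTranscriptor implements IAudioTranscriptor
{
    public function __construct(private string $cannedText)
    {
    }

    public function transcribe(string $filePath): ?string
    {
        return file_exists($filePath) ? $this->cannedText : null;
    }
}

$fake = new FakeAudioTranscriptor('Hey !');
$existing = tempnam(sys_get_temp_dir(), 'audio');
$missing = $existing . '.missing';

$hit = $fake->transcribe($existing);   // canned text, the file exists
$miss = $fake->transcribe($missing);   // null, the file does not exist
unlink($existing);
```

In a test, such a double could presumably be bound in the container the same way the provider binds the real class, with `singleton(IAudioTranscriptor::class, ...)`.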
--- a/app/Services/FileTools/VideoDescriptor/AbstractLLMVideoDescriptor.php
+++ b/app/Services/FileTools/VideoDescriptor/AbstractLLMVideoDescriptor.php
@@ -56,4 +56,33 @@ abstract class AbstractLLMVideoDescriptor implements IVideoDescriptor
        }
        return $array;
    }
+
+    /**
+     * Extract audio from the video file.
+     * Using ffmpeg to extract audio from the video file.
+     * The audio will be saved in a temporary directory as an MP3 file.
+     * If the audio extraction fails, it will return null.
+     * @param string $filePath
+     * @return string|null
+     */
+    protected function extractAudioFromVideo(string $filePath): ?string
+    {
+        $tempDir = sys_get_temp_dir() . '/video_audio';
+        if (!is_dir($tempDir)) {
+            mkdir($tempDir, 0777, true);
+        }
+        else {
+            // Clear the directory if it already exists
+            array_map('unlink', glob($tempDir . '/*'));
+        }
+
+        $outputFile = $tempDir . '/audio.mp3';
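+        // -y overwrites stale output; -vn skips the video stream
+        $command = "ffmpeg -y -i " . escapeshellarg($filePath) . " -vn " . escapeshellarg($outputFile);
+        exec($command, $output, $exitCode);
+        if ($exitCode === 0 && file_exists($outputFile)) {
+            return $outputFile;
+        }
+        return null;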
+    }
 }
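
The extraction step boils down to building one shell-safe ffmpeg command. Here is a minimal sketch of that part in isolation, assuming ffmpeg is on the PATH; `buildAudioExtractionCommand` is a hypothetical helper that does not exist in this commit, and the paths are placeholders. `-y` (overwrite output) and `-vn` (drop the video stream) are standard ffmpeg options.

```php
<?php

// Hypothetical helper: builds the ffmpeg command without running it.
// escapeshellarg() quotes each path so spaces or quotes cannot break the command.
function buildAudioExtractionCommand(string $videoPath, string $audioPath): string
{
    return sprintf(
        'ffmpeg -y -i %s -vn %s',
        escapeshellarg($videoPath),
        escapeshellarg($audioPath)
    );
}

// Placeholder paths; note the space in the input file name.
$command = buildAudioExtractionCommand('/tmp/my reel.mp4', '/tmp/video_audio/audio.mp3');
```

Running the command with `exec($command, $output, $exitCode)` and checking `$exitCode === 0` gives a clearer failure signal than only testing whether the output file exists.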
--- a/app/Services/FileTools/VideoDescriptor/OCRLLMVideoDescriptor.php
+++ b/app/Services/FileTools/VideoDescriptor/OCRLLMVideoDescriptor.php
@@ -4,12 +4,14 @@ namespace App\Services\FileTools\VideoDescriptor;

 use App\Services\AIPrompt\OpenAPIPrompt;
 use App\Services\FileTools\OCR\IImageOCR;
+use App\Services\FileTools\Transcription\IAudioTranscriptor;

 class OCRLLMVideoDescriptor extends AbstractLLMVideoDescriptor implements IVideoDescriptor
 {
    public const DESCRIPTION_PROMPT = "Analyze this Video sequence. You are given information for each individual screenshot/analysis from the video:";

-    public function __construct(public IImageOCR $ocr, public OpenAPIPrompt $llm) {
+    public function __construct(public IImageOCR $ocr, public OpenAPIPrompt $llm, public IAudioTranscriptor $audioTranscriptor)
+    {
    }

    public function getDescription(string $filePath): ?string
@@ -24,6 +26,14 @@ class OCRLLMVideoDescriptor extends AbstractLLMVideoDescriptor implements IVideo

        // Step 1: Cut video into screenshots
        $screenshots = $this->cutVideoIntoScreenshots($filePath);
+        $audio = $this->extractAudioFromVideo($filePath);
+
+        // Audio transcription
+        $audioTranscription = null;
+        if (isset($audio)) {
+            $audioTranscription = $this->audioTranscriptor->transcribe($audio);
+            dump($audioTranscription); // DEBUG
+        }

        if (empty($screenshots)) {
            throw new \Exception("No screenshots were generated from the video {$filePath}.");
@@ -69,6 +79,17 @@ Please analyze the image carefully and provide a description focusing purely on

        // Step 4: Combine the descriptions of all screenshots into a single description
        $combinedDescription = '';
+        // Add full-video information
+        // Audio transcription (skipped when extraction or transcription failed)
+        if (isset($audioTranscription)) {
+            $combinedDescription .= "Audio Transcription: {$audioTranscription}\n";
+        }
+
+        if (!empty($combinedDescription)) {
+            $combinedDescription .= "\n";
+        }
+
+        // Add screenshots descriptions
        $screenshotCount = 0;
        foreach ($screenshots as $values) {
            $screenshot = $values['screenshot'];
@@ -85,38 +106,41 @@ Please analyze the image carefully and provide a description focusing purely on
        }
        $combinedDescription = trim($combinedDescription);

+        dump($combinedDescription); // DEBUG
+
        // Step 5: Ask an LLM to describe the video based on the combined descriptions
        $llmDescription = $this->llm->generate(
            config('llm.models.chat.name'),
-            static::DESCRIPTION_PROMPT . $combinedDescription . "\n\nYou are analyzing an Instagram Reel (a short-form video). You have received multiple frames from this reel. For each frame:
-
-1.  A **screenshot number** is given (e.g., `Screenshot : 3`).
-2.  The approximate **timestamp in seconds** within the video where that frame occurs.
-3.  An **OCR result** which contains text extracted directly from an image of this frame, potentially including OCR errors or unusual characters.
-4.  A description provided by another LLM for that specific frame (the `LLM Description`).
-
-Your task is to synthesize a single, coherent video description summarizing the entire reel (`the whole thing`). Use all the information (screenshot number, timestamp, OCR, and llm_description) but be aware that individual descriptions may be inaccurate due to poor image quality or interpretation errors. Look for consistency across multiple frames.
-
-Analyze the sequence of events, character(s), setting, style (e.g., fast cuts, slow-motion), narrative structure (if any), humor, and joke elements throughout the video based on these frame-by-frame inputs. Pay special attention to identifying if there's an underlying joke or humorous concept running through the reel.
-
-Based on your analysis, write a concise description (`the whole thing`) that captures the essence of this Instagram Reel. Format your output strictly as JSON with only the `answer` field containing this synthesized summary.",
+            static::DESCRIPTION_PROMPT . $combinedDescription,
            outputFormat: '{"type": "object", "properties": {"answer": {"type": "string"}}, "required": ["answer"]}',
-            systemMessage: "You are an AI assistant specialized in analyzing video content, particularly short-form videos like Instagram Reels. Your task is to synthesize a single description for the entire video based on sequential information provided from its screenshots and associated text data (OCR results).
+            systemMessage: "You are an expert social media content analyst specializing in Instagram Reels. Your task is to synthesize descriptions and OCR findings from multiple screenshots of a single video reel into a single, concise, and accurate overall description of the video's content, style, and potential humor.

-Your response must strictly follow this JSON format:
-{\"answer\": \"<your final synthesized video description here as a string>\"}
+Your input will consist of:
+1.  An audio transcription of the entire video.
+2.  Multiple entries containing:
+    -   Screenshot number (e.g., \"Screenshot: 1\")
+    -   Timestamp (in seconds) indicating its position in the reel
+    -   Raw OCR text from that specific screenshot, which may contain errors or unusual characters but should be interpreted for content relevance.
+    -   A description of the image content generated by an LLM for that screenshot.

-## Rules
-1.  Analyze all provided inputs: screenshot number, timestamp, OCR result snippet, and LLM description for each frame.
-2.  The core goal is to produce one concise, coherent, and engaging video description that captures the essence of the entire reel (\"the whole thing\").
-3.  Individual frame descriptions can be inaccurate or contradictory (e.g., object changes drastically between frames). Prioritize consistency across multiple frames unless strongly contradicted by a clear majority.
-4.  Do not generate separate JSON objects for each screenshot; only produce one final `answer` string summarizing the video as a whole at the end of your reasoning.
-5.  Pay special attention to identifying any underlying joke, humor, or satirical element present in the reel based on the collective information.
+The descriptions provided by the LLM for individual screenshots are often inconsistent with adjacent frames and might not capture subtle humor accurately. The raw OCR text can sometimes provide direct quotes relevant to the context, even if misspelled or partially recognized.

-## Output Constraints
-   Your response **MUST** be ONLY valid JSON conforming to the structure: {\"type\": \"object\", \"properties\": {\"answer\": {\"type\": \"string\"}}, \"required\": [\"answer\"]}.
-   Only fill the `answer` field. Do not include any other text or explanations outside this JSON structure.
-   The `answer` string should be a comprehensive description of the video, suitable for representing it to another user on a platform like Instagram/YouTube Shorts.",
+Your response must be in **exactly** the following JSON format:
+```json
+{
+  \"answer\": \"{your synthesized description here}\"
+}
+```
+Please follow these instructions carefully:
+
+     Analyze All Data: Consider both the audio transcription and all the screenshot data (OCR text and descriptions) together.
+     Synthesize Coherently: Create a single, flowing narrative that describes the main subject(s), actions, setting, transitions, sound/music, and overall style of the video reel based on the most consistent or contextually supported information across its frames.
+     Handle Inconsistencies: Assume that individual screenshot analyses might contain errors (especially with OCR) or be limited in scope. Do not rely solely on one frame's description contradicting another unless strongly supported by context and multiple data points converge to a different understanding or the inconsistency is clearly part of a joke requiring literal interpretation.
+     Focus on Repeated Elements: Pay close attention to subjects, actions, objects, text content (especially from OCR), sounds/words mentioned in the transcription, and visual styles that repeat across multiple frames, as this indicates continuity or recurring themes/humor.
+     Identify Joke/Humor: Actively look for elements within the combined data that suggest a joke, satire, absurdity, irony, sarcasm, clever wordplay (from OCR/transcription), or unexpected humor. This includes inconsistent descriptions if they are clearly intended as part of a gag, visual puns, audio-visual mismatches mentioned in the transcription, or any content designed for comedic effect.
+     Prioritize Core Content: Base your description primarily on the core subject and action within the reel (as identified repeatedly across frames). Use details from individual screenshots to flesh out specific moments only if they fit this narrative context.
+     Filter Minor Details: Ignore highly variable or insignificant details that appear inconsistent unless they are clearly integral to the joke or overall theme (e.g., slight variations in background color might be acceptable, but a consistent change is important).
+     Output Requirement: Your response must contain only valid JSON with an object having exactly one property answer of type string. Do not output any other text, explanations, lists, or code outside this JSON structure.",
            keepAlive: true,
            shouldThink: config('llm.models.chat.shouldThink')
        );
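
The text handed to the LLM is the audio transcription followed by one block per screenshot. The sketch below shows how such a prompt body could be assembled. The field layout is an assumption taken from the example in `LLMPrompts.md`, and `buildCombinedDescription` and the `$frames` array shape are hypothetical, not the code from this commit.

```php
<?php

/**
 * Hypothetical helper: joins the optional transcription and the per-frame
 * data into one prompt body.
 *
 * @param array<int, array{timestamp: int, ocr: string, description: string}> $frames
 */
function buildCombinedDescription(?string $transcription, array $frames): string
{
    $parts = [];
    // Skip the audio line entirely when no transcription is available.
    if ($transcription !== null && $transcription !== '') {
        $parts[] = "Audio Transcription: {$transcription}";
    }
    foreach (array_values($frames) as $index => $frame) {
        $parts[] = implode("\n", [
            'Screenshot: ' . ($index + 1),
            "Timestamp: {$frame['timestamp']}s",
            "OCR: {$frame['ocr']}",
            "LLM Description: {$frame['description']}",
        ]);
    }
    // Blank line between the transcription and each screenshot block.
    return implode("\n\n", $parts);
}

$frames = [
    ['timestamp' => 0, 'ocr' => 'No text found', 'description' => 'A person walks through a mall.'],
];
$withAudio = buildCombinedDescription('Hey !', $frames);
$withoutAudio = buildCombinedDescription(null, $frames);
```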
--- a/config/llm.php
+++ b/config/llm.php
@@ -16,6 +16,20 @@ return [
         * Null if not used
         */
        'token' => env('LLM_API_TOKEN', null),
+
+        'transcription' => [
+            /**
+             * Host for the OpenAI transcription API.
+             * This should be the base URL of the OpenAI transcription API you are using with the API version (v1)
+             */
+            'host' => env('TRANSC_API_HOST_URL', null),
+
+            /**
+             * Token for authenticating with the OpenAI transcription API.
+             * Null if not used
+             */
+            'token' => env('TRANSC_API_TOKEN', null),
+        ],
    ],

    /**
@@ -39,5 +53,9 @@ return [
            'name' => env('LLM_VISION_MODEL', null),
            'shouldThink' => env('LLM_VISION_MODEL_THINK', false),
        ],
+
+        'transcription' => [
+            'name' => env('TRANSC_TRANSCRIPTION_MODEL', null)
+        ],
    ]
 ];
--- a/undetectedChromedriver/seleniumChromedriverDockerfile
+++ b/undetectedChromedriver/seleniumChromedriverDockerfile
@@ -6,8 +6,8 @@ COPY ./chromedriver /bin/chromedriver
 #RUN mkdir -p /home/seluser/profile/

 ENV TZ=Europe/Brussels
-# 15 minutes session timeout
-ENV SE_OPTS="--session-timeout 900"
+# 30 minutes session timeout
+ENV SE_OPTS="--session-timeout 1800"

 HEALTHCHECK --interval=30s --timeout=10s --retries=3 CMD curl -s http://localhost:4444/wd/hub/status | jq -e '.value.ready == true' || exit 1