LLM reel caption and video description + Refactor in services

This commit is contained in:
2025-06-30 16:14:29 +02:00
parent 21abbcdff5
commit 228d67a48d
20 changed files with 575 additions and 151 deletions

LLMPrompts.md Normal file
View File

@ -0,0 +1,52 @@
# What is this file?
This file records the user prompts used to obtain, directly from an LLM, the prompts that are then used to generate answers.
For example, for the Instagram reel caption generation, this file lists the prompt that asks the LLM to produce
the prompt, system message and output format that the caption-generation code will use.
This method comes from the idea that the best way to engineer a prompt is to ask the target model to generate it itself.
# Prompts
The starting sentence is usually:
```
I'm using some LLM and I would need a prompt and a system message for every use case I will give you.
```
## Instagram
### Instagram Reel caption generation
```
I'm using some LLM and I would need a prompt, a system message and an output format for every use case I will give you.
The first one is for generating a caption for an Instagram Reel. For the moment, I can give the LLM the original caption of the reel it was downloaded from, and an LLM-written description of the video, or of the joke behind it.
The caption must be short and fit the reel. For example, if the reel is funny, the caption must be short and funny, while still relating to the reel. The caption must not describe the video the way the LLM description does.
The LLM can add some hashtags if they seem appropriate.
Sometimes the original caption credits the original author, most of the time on Twitter (e.g. "credit : t/twitteruser"). Those credits can appear in the generated caption too, but I don't want any Instagram account mention ("@instagramUser"), because most of the time it is there to push viewers to follow the account the reel was downloaded from. The use of emoji is encouraged, but not too much, and it must not look stupid or overdone.
```
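The no-@mentions-but-keep-credits rule above is also enforced in code by the pipeline's `RemoveAccountsReferenceStep`. As a rough illustration of the idea only (a Python sketch, not the actual Laravel pipeline step), stripping Instagram handles while leaving Twitter-style credits intact could look like:

```python
import re

def strip_instagram_mentions(caption: str) -> str:
    """Remove @handles (usually promotion) but keep 'credit : t/user' style credits."""
    # Drop @mentions such as "@instagramUser"; "t/user" credits don't match.
    cleaned = re.sub(r"@\w+", "", caption)
    # Collapse the double spaces left behind by the removal, trim the ends.
    return re.sub(r"\s{2,}", " ", cleaned).strip()

caption = "credit : t/twitteruser follow @instagramUser for more"
print(strip_instagram_mentions(caption))  # → credit : t/twitteruser follow for more
```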
## Video Descriptor
```
I'm using some LLM and I would need a prompt and a system message for every use case I will give you.
The LLM here will be used to describe an Instagram Reel (video). Each screenshot of that video is described by an LLM given a prompt, system message and output format. The descriptions of all the screenshots are then given to this LLM, which will try to reconstruct the video from the screenshot descriptions and describe it.
The prompt required here is for the LLM that compiles the descriptions into one, tries to understand the video, and describes it. I'm particularly interested in the joke behind the reel if there is one.
This is an example of a screenshot description by an LLM: "The image shows a close-up of a person's hands holding what appears to be a brown object with a plastic covering, possibly food wrapped in paper or foil. There is also a small portion visible at the top right corner, which seems to be a red and white label. The focus of the image is on the hands holding the object."
Most of the descriptions won't make sense, so some details should be omitted. For example, one screenshot description could say the main subject is a car, and another one 3 seconds later in the video could say the main subject is a cat. You could say the car transformed into a cat, but it is safer to assume that one of the descriptions is wrong and the main character was a cat throughout the video, because another description also says the main subject is a cat.
It is safe to say that most analysed videos will be of bad quality, which means the screenshot descriptions can vary a lot.
```
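The car-vs-cat reasoning above is essentially a majority vote over the per-frame subjects: treat outliers as misidentifications rather than on-screen transformations. A minimal sketch of that heuristic (Python, illustrative only — the actual synthesis is done by the LLM, not by code like this):

```python
from collections import Counter

def dominant_subject(frame_subjects: list[str]) -> str:
    """Pick the subject most frame descriptions agree on,
    assuming disagreeing frames are misdescribed, not real scene changes."""
    counts = Counter(frame_subjects)
    return counts.most_common(1)[0][0]

# One frame says "car", the rest say "cat": assume the "car" frame was wrong.
print(dominant_subject(["cat", "car", "cat", "cat", "cat"]))  # → cat
```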
### Screenshot descriptor
```
I'm using some LLM and I would need a prompt, a system message and an output format for every use case I will give you.
The first one must describe a screenshot from a video. Each screenshot of that video will be described using the same LLM, prompt, system message and output format. The descriptions of all the screenshots will then be given to another LLM that will try to reconstruct the video from them and describe it.
The prompt required here is the one that describes a screenshot. The LLM will only be given the screenshot as input. I need the LLM to describe the given screenshot; no need to specify that it is a screenshot. The description must specify the scene, the character or main subject, and the text present on the screenshot (most of the time captions added during video editing, possibly containing emojis).
The LLM used here is llava:7b-v1.6-mistral-q4_1; it is not the best for text generation, but it is very powerful when using its vision capability.
```
The last part is personal; I included it because I gave this prompt to another LLM than the one used, since llava wouldn't give me a good prompt.

View File

@ -33,7 +33,7 @@ abstract class BrowserJob implements ShouldQueue
public int $jobId;
public $timeout = 500;
public $timeout = 300; // 5 minutes
public function __construct(int $jobId)
{
@ -53,6 +53,7 @@ abstract class BrowserJob implements ShouldQueue
$this->browse(function (Browser $browser) use ($callback, &$log) {
try {
$browser->driver->manage()->timeouts()->implicitlyWait(20);
$log = $callback($browser);
// } catch (Exception $e) {
// $browser->screenshot("failure-{$this->jobId}");
@ -160,7 +161,7 @@ abstract class BrowserJob implements ShouldQueue
'--disable-setuid-sandbox',
'--whitelisted-ips=""',
'--disable-dev-shm-usage',
'--user-data-dir=/home/seluser/profile/',
'--user-data-dir=/home/seluser/profile/chrome/', // it seems Selenium doesn't like Docker mounting a volume on the exact same folder ("session not created: probably user data directory is already in use")
])->all());
return RemoteWebDriver::create(
@ -168,7 +169,12 @@ abstract class BrowserJob implements ShouldQueue
DesiredCapabilities::chrome()->setCapability(
ChromeOptions::CAPABILITY,
$options
),
)
->setCapability('timeouts', [
'implicit' => 20000, // 20 seconds
'pageLoad' => 300000, // 5 minutes
'script' => 30000, // 30 seconds
]),
4000,
$this->timeout * 1000
);

View File

@ -17,11 +17,14 @@ use Illuminate\Contracts\Queue\ShouldBeUniqueUntilProcessing;
use Illuminate\Support\Collection;
use Illuminate\Support\Facades\Log;
use Laravel\Dusk\Browser;
use App\Services\AIPrompt\OpenAPIPrompt;
class InstagramRepostJob extends BrowserJob implements ShouldBeUniqueUntilProcessing
{
// === CONFIGURATION ===
public $timeout = 1800; // 30 minutes
private const APPROXIMATIVE_RUNNING_MINUTES = 2;
private Collection $jobInfos;
@ -29,6 +32,10 @@ class InstagramRepostJob extends BrowserJob implements ShouldBeUniqueUntilProces
protected IInstagramVideoDownloader $videoDownloader;
protected ReelDescriptor $ReelDescriptor;
protected OpenAPIPrompt $openAPIPrompt;
protected string $downloadFolder = "app/Browser/downloads/InstagramRepost/";
/**
@ -40,12 +47,14 @@ class InstagramRepostJob extends BrowserJob implements ShouldBeUniqueUntilProces
*/
protected InstagramDescriptionPipeline $descriptionPipeline;
public function __construct($jobId = 4)
public function __construct($jobId = 4, ReelDescriptor $ReelDescriptor = null, OpenAPIPrompt $openAPIPrompt = null)
{
parent::__construct($jobId);
$this->downloadFolder = base_path($this->downloadFolder);
$this->videoDownloader = new YTDLPDownloader();
$this->ReelDescriptor = $ReelDescriptor ?? app(ReelDescriptor::class);
$this->openAPIPrompt = $openAPIPrompt ?? app(OpenAPIPrompt::class);
$this->descriptionPipeline = new InstagramDescriptionPipeline([
// Add steps to the pipeline here
new DescriptionPipeline\RemoveAccountsReferenceStep(),
@ -152,13 +161,17 @@ class InstagramRepostJob extends BrowserJob implements ShouldBeUniqueUntilProces
*/
$downloadedReels = [];
foreach ($toDownloadReels as $repost) {
$downloadInfos = $this->downloadReel(
$browser,
$repost
);
$downloadedReels[] = [
$repost,
$this->downloadReel(
$browser,
$repost
)
$downloadInfos
];
$this->describeReel($repost, $downloadInfos);
}
$this->jobRun->addArtifact(new JobArtifact([
@ -278,6 +291,15 @@ class InstagramRepostJob extends BrowserJob implements ShouldBeUniqueUntilProces
return $videoInfo;
}
protected function describeReel(InstagramRepost $reel, IInstagramVideo $videoInfo): void
{
// Set the video description to the reel description
$reel->video_description = $this->ReelDescriptor->getDescription($videoInfo->getFilename());
$reel->save();
Log::info("Reel description set: {$reel->reel_id} - {$reel->video_description}");
}
protected function repostReel(Browser $browser, InstagramRepost $reel, IInstagramVideo $videoInfo): bool
{
try {
@ -317,16 +339,17 @@ class InstagramRepostJob extends BrowserJob implements ShouldBeUniqueUntilProces
$this->clickNext($browser); // Skip cover photo and trim
// Add a caption
$captionText = $this->descriptionPipeline->process($videoInfo->getDescription());
$captionText = $this->descriptionPipeline->process($this->getReelCaption($reel, $videoInfo));
$this->pasteText($browser, $captionText, 'div[contenteditable]');
sleep(2); // Wait for the caption to be added
if (config("app.environment") !== "local") { // Don't share the post in local environment
if (config("app.env") !== "local") { // Don't share the post in local environment
$this->clickNext($browser); // Share the post
}
sleep(5); // Wait for the post to be completed
sleep(7); // Wait for the post to be completed
$this->removePopups($browser);
// Check if the post was successful
try {
@ -360,6 +383,56 @@ class InstagramRepostJob extends BrowserJob implements ShouldBeUniqueUntilProces
}
}
private function getReelCaption(InstagramRepost $reel, IInstagramVideo $videoInfo): string
{
if (isset($reel->instagram_caption)) {
return $reel->instagram_caption;
}
// Get the reel description from the database or the video info
$reelDescription = $reel->video_description;
$originalDescription = $videoInfo->getDescription();
$llmAnswer = $this->openAPIPrompt->generate(
config('llm.models.chat.name'),
"Original Caption: {$originalDescription}
Video Description/Directive: {$reelDescription}",
[],
outputFormat: '{"type": "object", "properties": {"answer": {"type": "string"}}, "required": ["answer"]}',
systemMessage: "You are an AI assistant specialized in creating engaging and concise Instagram Reel captions. Your primary task is to transform the provided original caption (often from Twitter) and description/directions into a fresh, unique, but still relevant caption for Instagram Reels format.
Key instructions:
1. **Analyze Input:** You will receive two things: an *original reel caption* (usually starting with \"credit:\" or mentioning a Twitter handle like `t/TwitterUser`), and either a *video description* or explicit directions about the joke/idea behind the video.
2. **Transform, Don't Reproduce:** Your output must be significantly different from the original provided caption. It should capture the essence of the content described but phrase it anew, often with humor if appropriate.
3. **Keep it Short & Punchy:** Instagram Reels thrive on quick engagement. Prioritize brevity (ideally under two lines, or three lines max) and impact. Make sure your caption is concise enough for fast-scroll viewing.
4. **Maintain the Core Idea:** The new caption must directly relate to the video's content/direction/joke without simply restating it like a description would. Focus on what makes the reel *interesting* or *funny* in its own right.
5. **Preserve Original Credit (Optional):** If an explicit \"credit\" line is provided, you may incorporate this into your new caption naturally, perhaps using `(via...)` or similar phrasing if it fits well and doesn't sound awkward. **Do not** include any original Instagram account mentions (@handles). They are often intended for promotion which isn't our goal.
6. **Use Emoji Judiciously:** Incorporate relevant emojis to enhance the tone (funny, relatable, etc.) or add visual interest. Use them purposefully and in moderation; they should complement the caption, not overwhelm it.
7. **Add Hashtags (Optional but Recommended):** Generate a few relevant Instagram hashtags automatically at the end of your output to increase visibility. Keep these organic to the content and avoid forcing irrelevant tags.
Your response structure is as follows:
- The generated caption (your core answer).
- Then, if you generate any hashtags, list them on the next line(s) prefixed with `#`.
Example Input Structure:
Original Caption: credit: t/otherhandle This banana is looking fly today!
Video Description/Directive: A man walks into a store holding a banana and wearing sunglasses. He looks around confidently before leaving.
Your answer should only contain the generated caption, and optionally hashtags if relevant.
Remember to be creative and ensure the generated caption feels like something you would see naturally on an Instagram Reel. Aim for personality and relevance.
",
keepAlive: true,
shouldThink: config('llm.models.chat.shouldThink')
);
$llmAnswer = json_decode($llmAnswer, true)['answer'] ?? null;
if ($llmAnswer !== null) {
$reel->instagram_caption = $llmAnswer;
$reel->save();
Log::info("Reel caption generated: {$reel->reel_id} - {$llmAnswer}");
}
return $llmAnswer ?? $videoInfo->getDescription(); // Fall back to the original description so the string return type holds when decoding fails
}
private function clickNext(Browser $browser) {
$nextButton = $browser->driver->findElement(WebDriverBy::xpath('//div[contains(text(), "Next") or contains(text(), "Share")]'));
$nextButton->click();

View File

@ -1,12 +0,0 @@
<?php
namespace App\Browser\Jobs\InstagramRepost;
class OCRLLMReelDescriptor extends \App\FileTools\VideoDescriptor\OCRLLMVideoDescriptor
{
public const DESCRIPTION_PROMPT = "Describe the Instagram reel based on the screenshots. Each screenshot has a timestamp of when in the video the screenshot was taken, an OCR result and a description of the screenshot by an LLM. Do not specify that it is a reel, just try to describe the video and most importantly the joke behind it if there is one. The description must have a maximum of 500 words.\n";
public function __construct() {
parent::__construct();
}
}

View File

@ -0,0 +1,11 @@
<?php
namespace App\Browser\Jobs\InstagramRepost;
use App\Services\AIPrompt\OpenAPIPrompt;
use App\Services\FileTools\OCR\IImageOCR;
class ReelDescriptor extends \App\Services\FileTools\VideoDescriptor\OCRLLMVideoDescriptor
{
public const DESCRIPTION_PROMPT = "Analyze this Instagram Reel sequence. You are given information for each individual screenshot/analysis from the video:";
}

View File

@ -1,117 +0,0 @@
<?php
namespace App\FileTools\VideoDescriptor;
use App\AIPrompt\IAIPrompt;
use App\AIPrompt\OpenAPIPrompt;
use App\FileTools\OCR\IImageOCR;
use App\FileTools\OCR\TesseractImageOCR;
use Log;
class OCRLLMVideoDescriptor implements IVideoDescriptor
{
private IImageOCR $ocr;
private IAIPrompt $llm; // LLM That can visualize images and generate descriptions
public const DESCRIPTION_PROMPT = "Describe the video based on the screenshots. Each screenshot has a timestamp of when in the video the screenshot was taken, an OCR result and a description of the screenshot by an LLM. Do not specify that it is a video, just describe the video. The description must have a maximum of 500 words.\n";
public function __construct() {
$this->ocr = new TesseractImageOCR();
$this->llm = new OpenAPIPrompt();
}
public function getDescription(string $filePath): ?string
{
/*
1. Cut videos in screenshots
2. Use OCR to extract text from screenshots
3. Use LLM to generate a description of the screenshot
4. Combine the descriptions of all screenshots into a single description
5. Ask an LLM to describe the video
*/
// Step 1: Cut video into screenshots
$screenshots = $this->cutVideoIntoScreenshots($filePath);
if (empty($screenshots)) {
throw new \Exception("No screenshots were generated from the video {$filePath}.");
}
// Step 2 & 3: Use OCR to extract text and LLM to get description from screenshots
$descriptions = [];
foreach ($screenshots as $screenshot) {
$descriptions[$screenshot] = [];
$ocrDescription = $this->ocr->performOCR($screenshot);
$ocrDescription = empty($ocrDescription) ? 'No text found' : $ocrDescription;
$descriptions[$screenshot]['ocr'] = $ocrDescription;
$llmDescription = $this->llm->generate(
config('llm.models.vision.name'),
"Describe the content of this screenshot from a video. Do not specify that it is a screenshot, just describe the content.",
images: [$screenshot],
outputFormat: '{"type": "object", "properties": {"answer": {"type": "string"}}, "required": ["answer"]}',
systemMessage: "The user will ask something. Give your direct answer to that.",
keepAlive: $screenshot != end($screenshots), // Keep alive for all but the last screenshot
shouldThink: config('llm.models.vision.shouldThink')
);
$descriptions[$screenshot]['text'] = json_decode($llmDescription, true)['answer'] ?? 'No description generated';
}
// HERE COULD BE SOME INTERMEDIATE PROCESSING OF DESCRIPTIONS
// Step 4: Combine the descriptions of all screenshots into a single description
$combinedDescription = '';
$screenshotCount = 0;
foreach ($descriptions as $screenshot => $description) {
$screenshotCount++;
$combinedDescription .= "Screenshot: {$screenshotCount}\n";
$combinedDescription .= "Timestamp: {$screenshotCount}s\n"; // TODO Cut the video in smaller parts when the video is short
$combinedDescription .= "OCR: {$description['ocr']}\n";
$combinedDescription .= "LLM Description: {$description['text']}\n\n";
}
$combinedDescription = trim($combinedDescription);
// Step 5: Ask an LLM to describe the video based on the combined descriptions
$llmDescription = $this->llm->generate(
config('llm.models.chat.name'),
self::DESCRIPTION_PROMPT . $combinedDescription,
outputFormat: '{"type": "object", "properties": {"answer": {"type": "string"}}, "required": ["answer"]}',
systemMessage: "The user will ask something. Give your direct answer to that.",
keepAlive: true,
shouldThink: config('llm.models.chat.shouldThink')
);
$llmDescription = json_decode($llmDescription, true)['answer'] ?? null;
if (empty($llmDescription)) {
$llmDescription = null;
}
return $llmDescription;
}
/**
* Cut the video into screenshots.
* Using ffmpeg to cut the video into screenshots at regular intervals.
* The screenshots will be saved in a temporary directory.
* @param string $filePath
* @return void
*/
private function cutVideoIntoScreenshots(string $filePath): array
{
$tempDir = sys_get_temp_dir() . '/video_screenshots';
if (!is_dir($tempDir)) {
mkdir($tempDir, 0777, true);
}
Log::info("Cutting video into screenshots: $filePath");
$outputPattern = $tempDir . '/screenshot_%d.png';
$command = "ffmpeg -i " . escapeshellarg($filePath) . " -vf fps=1 " . escapeshellarg($outputPattern);
exec($command);
// Collect all screenshots
$screenshots = glob($tempDir . '/screenshot_*.png');
return $screenshots;
}
}

View File

@ -0,0 +1,27 @@
<?php
namespace App\Providers;
use App\Services\AIPrompt\OpenAPIPrompt;
use Illuminate\Support\ServiceProvider;
class AIPromptServiceProvider extends ServiceProvider
{
/**
* Register services.
*/
public function register(): void
{
$this->app->singleton(OpenAPIPrompt::class, function ($app) {
return new OpenAPIPrompt();
});
}
/**
* Bootstrap services.
*/
public function boot(): void
{
//
}
}

View File

@ -0,0 +1,27 @@
<?php
namespace App\Providers;
use App\Services\FileTools\OCR\IImageOCR;
use Illuminate\Support\ServiceProvider;
class ImageOCRServiceProvider extends ServiceProvider
{
/**
* Register services.
*/
public function register(): void
{
$this->app->singleton(IImageOCR::class, function ($app) {
return new \App\Services\FileTools\OCR\TesseractImageOCR();
});
}
/**
* Bootstrap services.
*/
public function boot(): void
{
//
}
}

View File

@ -0,0 +1,36 @@
<?php
namespace App\Providers;
use App\Services\AIPrompt\OpenAPIPrompt;
use App\Services\FileTools\OCR\IImageOCR;
use App\Services\FileTools\VideoDescriptor\IVideoDescriptor;
use Illuminate\Support\ServiceProvider;
class VideoDescriptorServiceProvider extends ServiceProvider
{
/**
* Register services.
*/
public function register(): void
{
// Register the VideoDescriptor service
$this->app->singleton(IVideoDescriptor::class, function ($app) {
return new \App\Services\FileTools\VideoDescriptor\LLMFullVideoDescriptor(
$app->make(IImageOCR::class),
$app->make(OpenAPIPrompt::class)
);
});
// Register the VideoDescriptor service
$this->app->singleton(\App\Browser\Jobs\InstagramRepost\ReelDescriptor::class);
}
/**
* Bootstrap services.
*/
public function boot(): void
{
//
}
}

View File

@ -1,6 +1,6 @@
<?php
namespace App\AIPrompt;
namespace App\Services\AIPrompt;
interface IAIPrompt
{

View File

@ -1,6 +1,8 @@
<?php
namespace App\AIPrompt;
namespace App\Services\AIPrompt;
use Uri;
/**
* Use OpenAI API to get answers from a model.
@ -8,15 +10,20 @@ namespace App\AIPrompt;
class OpenAPIPrompt implements IAIPrompt
{
private string $host;
private ?string $token = null;
public function __construct(string $host = null) {
$this->host = $host ?? config('llm.host');
$this->host = $host ?? config('llm.api.host');
if (config('llm.api.token')) {
$this->token = config('llm.api.token');
}
}
private function getHeaders(): array
{
return [
'Content-Type' => 'application/json'
'Authorization: ' . ($this->token ? 'Bearer ' . $this->token : ''),
'Content-Type: application/json',
];
}
@ -39,6 +46,7 @@ class OpenAPIPrompt implements IAIPrompt
$response = curl_exec($ch);
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
if ($httpCode !== 200) {
throw new \Exception("Error calling OpenAI API: HTTP $httpCode - $response");
}
@ -88,7 +96,7 @@ class OpenAPIPrompt implements IAIPrompt
Important
It's important to instruct the model to use JSON in the prompt. Otherwise, the model may generate large amounts of whitespace.
**It's important to instruct the model to use JSON in the prompt. Otherwise, the model may generate large amounts of whitespace.**
*/
// Transform the images to base64

View File

@ -1,6 +1,6 @@
<?php
namespace App\FileTools\OCR;
namespace App\Services\FileTools\OCR;
interface IImageOCR
{

View File

@ -1,6 +1,6 @@
<?php
namespace App\FileTools\OCR;
namespace App\Services\FileTools\OCR;
use thiagoalessio\TesseractOCR\TesseractOCR;
class TesseractImageOCR implements IImageOCR

View File

@ -0,0 +1,59 @@
<?php
namespace App\Services\FileTools\VideoDescriptor;
use App\Services\FileTools\VideoDescriptor\IVideoDescriptor;
use Log;
abstract class AbstractLLMVideoDescriptor implements IVideoDescriptor
{
public const MAX_FRAMES = 5;
abstract public function getDescription(string $filePath): ?string;
/**
* Cut the video into screenshots.
* Using ffmpeg to cut the video into screenshots at regular intervals.
* The screenshots will be saved in a temporary directory.
* @param string $filePath
* @return array array with timestamps as key and screenshot file paths as values.
*/
protected function cutVideoIntoScreenshots(string $filePath): array
{
$tempDir = sys_get_temp_dir() . '/video_screenshots';
if (!is_dir($tempDir)) {
mkdir($tempDir, 0777, true);
}
else {
// Clear the directory if it already exists
array_map('unlink', glob($tempDir . '/*'));
}
Log::info("Cutting video into screenshots: $filePath");
$videoDuration = shell_exec("ffprobe -v error -show_entries format=duration -of csv=p=0 " . escapeshellarg($filePath));
if (empty($videoDuration)) { // shell_exec() returns null or false on failure
Log::error("Failed to get video duration for file: $filePath");
return [];
}
$videoDuration = floatval($videoDuration);
$framesInterval = ceil($videoDuration / self::MAX_FRAMES);
$fps = 1/$framesInterval; // Frames per second for the screenshots
$outputPattern = $tempDir . '/screenshot_%d.png';
$command = "ffmpeg -i " . escapeshellarg($filePath) . " -vf fps={$fps} " . escapeshellarg($outputPattern);
exec($command);
// Collect all screenshots
$screenshots = glob($tempDir . '/screenshot_*.png');
$array = [];
foreach ($screenshots as $screenshot) {
$array[] = [
"screenshot" => $screenshot,
"timestamp" => floor(sizeof($array) * $framesInterval),
];
}
return $array;
}
}
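`AbstractLLMVideoDescriptor` caps a video at `MAX_FRAMES` screenshots by rounding the inter-frame interval up, so short videos still get at most one frame per second and long videos get evenly spaced frames. The arithmetic, checked in isolation (a Python sketch of the PHP above; the exact frame count ffmpeg emits can vary slightly):

```python
import math

MAX_FRAMES = 5  # mirrors AbstractLLMVideoDescriptor::MAX_FRAMES

def frame_plan(duration_s: float) -> tuple[float, list[int]]:
    """Return the ffmpeg fps filter value and the approximate
    timestamp of each captured frame, one per interval start."""
    interval = math.ceil(duration_s / MAX_FRAMES)  # whole seconds between frames
    fps = 1 / interval                             # passed to ffmpeg's fps= filter
    n_frames = math.floor(duration_s / interval) + 1
    timestamps = [i * interval for i in range(n_frames)]
    return fps, timestamps

fps, timestamps = frame_plan(12.3)
print(fps, timestamps)  # → 0.3333333333333333 [0, 3, 6, 9, 12]
```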

View File

@ -1,6 +1,6 @@
<?php
namespace App\FileTools\VideoDescriptor;
namespace App\Services\FileTools\VideoDescriptor;
interface IVideoDescriptor
{

View File

@ -0,0 +1,64 @@
<?php
namespace App\Services\FileTools\VideoDescriptor;
use App\Services\AIPrompt\OpenAPIPrompt;
use App\Services\FileTools\OCR\IImageOCR;
class LLMFullVideoDescriptor extends AbstractLLMVideoDescriptor implements IVideoDescriptor
{
public const DESCRIPTION_PROMPT = "Describe the video based on the screenshots. Each screenshot has a timestamp of when in the video the screenshot was taken. Do not specify that it is a video, just describe the video. Do not describe the screenshots one by one; try to make sense of all the screenshots together. What could the video be about? What caption is attached to the video? Is it a meme? If yes, what is the joke? Be as descriptive as possible without exceeding 5000 words.\n";
public function __construct(public IImageOCR $ocr, public OpenAPIPrompt $llm) {
}
public function getDescription(string $filePath): ?string
{
/*
1. Cut videos in screenshots
2. Ask an LLM to describe the video with all the screenshots
*/
// Step 1: Cut video into screenshots
$screenshots = $this->cutVideoIntoScreenshots($filePath);
if (empty($screenshots)) {
throw new \Exception("No screenshots were generated from the video {$filePath}.");
}
// Gather the timestamp and OCR text of every screenshot into a single block
$combinedDescription = '';
$screenshotCount = 0;
foreach ($screenshots as $values) {
$screenshot = $values['screenshot'];
$timestamp = $values['timestamp'];
$screenshotCount++;
$combinedDescription .= "Screenshot: {$screenshotCount}\n";
$combinedDescription .= "Timestamp: {$timestamp}s\n"; // TODO Cut the video in smaller parts when the video is short
$ocrDescription = $this->ocr->performOCR($screenshot);
$ocrDescription = empty($ocrDescription) ? 'No text found' : $ocrDescription;
$combinedDescription .= "OCR: {$ocrDescription}\n"; // Perform OCR on the screenshot
$combinedDescription .= "\n";
}
$combinedDescription = trim($combinedDescription);
// Step 2: Ask the LLM to describe the video from the screenshots plus the combined OCR text
$llmDescription = $this->llm->generate(
config('llm.models.vision.name'),
static::DESCRIPTION_PROMPT . $combinedDescription,
images: array_map(function ($screenshot) {return $screenshot["screenshot"];}, $screenshots), // Pass the screenshots to the LLM
outputFormat: '{"type": "object", "properties": {"answer": {"type": "string"}}, "required": ["answer"]}',
systemMessage: "The user will ask something. Give your direct answer to that.",
keepAlive: true,
shouldThink: config('llm.models.vision.shouldThink')
);
$llmDescription = json_decode($llmDescription, true)['answer'] ?? null;
if (empty($llmDescription)) {
$llmDescription = null;
}
return $llmDescription;
}
}

View File

@ -0,0 +1,144 @@
<?php
namespace App\Services\FileTools\VideoDescriptor;
use App\Services\AIPrompt\OpenAPIPrompt;
use App\Services\FileTools\OCR\IImageOCR;
class OCRLLMVideoDescriptor extends AbstractLLMVideoDescriptor implements IVideoDescriptor
{
public const DESCRIPTION_PROMPT = "Analyze this Video sequence. You are given information for each individual screenshot/analysis from the video:";
public function __construct(public IImageOCR $ocr, public OpenAPIPrompt $llm) {
}
public function getDescription(string $filePath): ?string
{
/*
1. Cut videos in screenshots
2. Use OCR to extract text from screenshots
3. Use LLM to generate a description of the screenshot
4. Combine the descriptions of all screenshots into a single description
5. Ask an LLM to describe the video
*/
// Step 1: Cut video into screenshots
$screenshots = $this->cutVideoIntoScreenshots($filePath);
if (empty($screenshots)) {
throw new \Exception("No screenshots were generated from the video {$filePath}.");
}
// Step 2 & 3: Use OCR to extract text and LLM to get description from screenshots
$descriptions = [];
foreach ($screenshots as $values) {
$screenshot = $values['screenshot'];
$timestamp = $values['timestamp'];
$descriptions[$screenshot] = [];
$ocrDescription = $this->ocr->performOCR($screenshot);
$ocrDescription = empty($ocrDescription) ? 'No text found' : $ocrDescription;
$descriptions[$screenshot]['ocr'] = $ocrDescription;
dump($ocrDescription); // DEBUG
$llmDescription = $this->llm->generate(
config('llm.models.vision.name'),
"Describe this image in detail, breaking it down into distinct parts as follows:
1. **Scene Description:** Describe the overall setting and environment of the image (e.g., forest clearing, futuristic city street, medieval castle interior).
2. **Main Subject/Character(s):** Detail what is happening with the primary character or subject present in the frame.
3. **Text Description (if any):** If there are visible text elements (like words, letters, captions), describe them exactly as they appear and note their location relative to other elements. This includes any emojis used in captions, describing their visual appearance and likely meaning.
4. **Summary:** Briefly summarize the key content of the image for clarity.
5. **Joke:** If the image is part of a meme or humorous content, describe the joke or humorous element present in the image. Do not include this part if you are not sure to understand the joke/meme.
Format your response strictly using numbered lines corresponding to these five points (1., 2., 3., 4., 5.). Do not use markdown formatting or extra text outside these lines; simply list them sequentially as plain text output.",
images: [$screenshot],
outputFormat: '{"type": "object", "properties": {"answer": {"type": "string"}}, "required": ["answer"]}',
systemMessage: "You are an image understanding AI specialized in describing visual scenes accurately and concisely. Your task is solely to describe the content of the provided image based on what you can visually perceive.
Please analyze the image carefully and provide a description focusing purely on the visible information without generating any text about concepts, interpretations, or future actions beyond the immediate scene. Describe everything that is clearly depicted.",
keepAlive: $values != end($screenshots), // Keep alive for all but the last screenshot ($screenshots entries are arrays, so compare the whole entry)
shouldThink: config('llm.models.vision.shouldThink')
);
dump($llmDescription); // DEBUG
$descriptions[$screenshot]['text'] = json_decode($llmDescription, true)['answer'] ?? 'No description generated';
}
// HERE COULD BE SOME INTERMEDIATE PROCESSING OF DESCRIPTIONS
// Step 4: Combine the descriptions of all screenshots into a single description
$combinedDescription = '';
$screenshotCount = 0;
foreach ($screenshots as $values) {
$screenshot = $values['screenshot'];
$timestamp = $values['timestamp'];
$screenshotCount++;
$description = $descriptions[$screenshot] ?? [];
$combinedDescription .= "Screenshot: {$screenshotCount}\n";
$combinedDescription .= "Timestamp: {$timestamp}s\n"; // TODO Cut the video in smaller parts when the video is short
$combinedDescription .= "OCR: {$description['ocr']}\n";
$combinedDescription .= "LLM Description: {$description['text']}\n";
$combinedDescription .= "\n";
}
$combinedDescription = trim($combinedDescription);
// Step 5: Ask an LLM to describe the video based on the combined descriptions
$llmDescription = $this->llm->generate(
config('llm.models.chat.name'),
static::DESCRIPTION_PROMPT . $combinedDescription . "\n\nBased only on these frame analyses, please provide:
1. A single, concise description that captures the main action or theme occurring in the reel across all frames.
2. A description of any joke or humorous element present in the video, if you can discern one.
Important considerations:
- Most videos are of poor quality; frame descriptions might be inaccurate, vague, or contradictory due to blurriness or fast cuts.
- Your task is synthesis: focus on the overall impression and sequence rather than perfecting each individual piece of information. Some details mentioned in one analysis may simply be incorrect or misidentified from another perspective.
Analyze all provided frames (separated by --- for clarity) to understand what is happening, then synthesize this understanding into point 1 above and identify the joke, if present, as per point 2.",
outputFormat: '{"type": "object", "properties": {"answer": {"type": "string"}}, "required": ["answer"]}',
systemMessage: "You are an expert social media content analyst specializing in interpreting Instagram Reels. Your primary function is to generate a comprehensive description and identify any underlying humor or joke in a given video sequence. You will be provided with individual frame analyses, each containing:
Screenshot Number: The sequential number of the frame.
Timestamp: When that specific frame occurs within the reel.
OCR Text Result: Raw text extracted from the frame using OCR (Optical Character Recognition), which may contain errors or misinterpretations.
LLM Description of Screenshot: A textual interpretation of what's visible in the frame, based on previous LLM processing.
Please note:
The individual frame analyses can be inconsistent due to low video quality (e.g., blurriness) or rapid scene changes where details are hard to distinguish.
Your task is not to perfect each frame description but to understand the overall sequence and likely narrative, focusing on identifying any joke, irony, absurdity, or humorous transformation occurring across these frames.
Your response should be structured as follows:
Overall Video Description: Provide a concise summary of what happens in the reel based on the combined information from all the provided screenshots.
Humor/Joke Identification (If Applicable): If you can discern any joke or humorous element, explicitly state it and explain how the sequence of frames contributes to this.
Instructions for Synthesis:
Focus on identifying recurring elements, the main subject(s), consistent actions, and actions that seem unlikely (potential contradictions).
Look for patterns where details change rapidly or absurdly.
Prioritize information from descriptions over relying solely on OCR text if the description seems more plausible. Ignore minor inconsistencies between frames unless they clearly contradict a central theme or joke premise.
Be ready to point out where the humor lies, which might involve unexpected changes, wordplay captured by OCR errors in the context of the visual action described, absurdity, or irony.",
keepAlive: true,
shouldThink: config('llm.models.chat.shouldThink')
);
// Treat a missing or empty answer as a failed description
$llmDescription = json_decode($llmDescription, true)['answer'] ?? null;
$llmDescription = $llmDescription ?: null;
dump($llmDescription); // DEBUG
return $llmDescription;
}
}
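For context, here is a hedged sketch of how a video-descriptor service like the one in this diff might be consumed from a queued job. The `VideoDescriptorService` class name, its `describe()` method, and the job itself are assumptions for illustration only; they are not taken from this commit.

```php
<?php

// Hypothetical consumer. VideoDescriptorService and describe() are assumed
// names, not part of this commit.
use App\Services\VideoDescriptorService;

class DescribeReelJob
{
    public function handle(VideoDescriptorService $descriptor): void
    {
        // The service returns the synthesized description, or null when the
        // LLM produced no usable answer.
        $description = $descriptor->describe(storage_path('app/reels/reel.mp4'));

        if ($description === null) {
            // Fall back gracefully rather than continuing with an empty description.
            return;
        }

        // ...persist the description or pass it on to caption generation...
    }
}
```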


@ -1,7 +1,10 @@
<?php
return [
App\Providers\AIPromptServiceProvider::class,
App\Providers\AppServiceProvider::class,
App\Providers\BrowserJobsServiceProvider::class,
App\Providers\ImageOCRServiceProvider::class,
App\Providers\TelescopeServiceProvider::class,
App\Providers\VideoDescriptorServiceProvider::class,
];


@ -2,10 +2,21 @@
return [
/**
* Host for the OpenAI API.
* This should be the base URL of the OpenAI API you are using.
* API configuration
*/
'host' => env('LLM_HOST_URL', null),
'api' => [
/**
* Host for the OpenAI API.
* This should be the base URL of the OpenAI API you are using.
*/
'host' => env('LLM_API_HOST_URL', null),
/**
* Token for authenticating with the OpenAI API.
* Leave as null if the API does not require authentication.
*/
'token' => env('LLM_API_TOKEN', null),
],
/**
* Models configuration.
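Call sites reading the old flat key have to follow the move into the nested `api` group, and deployments need the renamed environment variable (`LLM_HOST_URL` becomes `LLM_API_HOST_URL`). A sketch using Laravel's `config()` helper, with key names taken from this diff:

```php
// Before this refactor:
$host = config('llm.host');

// After: the host moves under the new 'api' group, and a token is available.
$host  = config('llm.api.host');
$token = config('llm.api.token'); // null when no authentication is configured
```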


@ -0,0 +1,32 @@
<?php
use Illuminate\Database\Migrations\Migration;
use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\Schema;
return new class extends Migration
{
/**
* Run the migrations.
*/
public function up(): void
{
Schema::table('instagram_reposts', function (Blueprint $table) {
$table->text('video_description')->nullable()->after('reel_id')
->comment('Description of the video being reposted on Instagram');
$table->text('instagram_caption')->nullable()->after('video_description')
->comment('Caption generated for the Instagram video repost');
});
}
/**
* Reverse the migrations.
*/
public function down(): void
{
Schema::table('instagram_reposts', function (Blueprint $table) {
$table->dropColumn('video_description');
$table->dropColumn('instagram_caption');
});
}
};
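Once this migration runs, a repost row can carry both generated texts. A usage sketch follows; the `InstagramRepost` model name is an assumption based on the `instagram_reposts` table, and the surrounding variables are hypothetical:

```php
// Hypothetical usage — assumes an InstagramRepost Eloquent model mapped to
// the instagram_reposts table touched by this migration.
$repost = InstagramRepost::where('reel_id', $reelId)->firstOrFail();

// Both columns are nullable, so description and caption generation can fail
// independently without blocking the repost row.
$repost->video_description = $videoDescription; // LLM description of the video
$repost->instagram_caption = $generatedCaption; // LLM-generated repost caption
$repost->save();
```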