What is this file?
This file lists the user prompts used to obtain prompts directly from the LLM that will be used to give the answers.
For example, for Instagram Reel caption generation, it lists a prompt that asks the LLM for the prompt, system message and output format to use when generating Reel captions.
This method comes from the idea that the best way to engineer a prompt is to ask the model concerned to generate it directly.
Prompts
The starting sentence is usually:
I’m using some LLM and I would need a prompt and a system message for every use case I will give you.
I’m using the structured JSON output provided by the OpenAI API. The output structure is a simple {"type": "object", "properties": {"answer": {"type": "string"}}, "required": ["answer"]}, so only “answer” can be filled. For the input, everything will be given in the prompt. Give me the system message and prompt separately, preferably in text format.
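On the consuming side, a reply that follows this schema can be unpacked in a few lines. The following is a minimal sketch and not the project's actual code: `ANSWER_SCHEMA` simply restates the structure quoted above, and `extract_answer` is a hypothetical helper name.

```python
import json

# Output structure restated from the starting sentence above.
ANSWER_SCHEMA = {
    "type": "object",
    "properties": {"answer": {"type": "string"}},
    "required": ["answer"],
}


def extract_answer(raw_response: str) -> str:
    """Parse a JSON reply and return its "answer" string (hypothetical helper)."""
    data = json.loads(raw_response)
    if not isinstance(data, dict) or "answer" not in data:
        raise ValueError('reply has no "answer" field')
    answer = data["answer"]
    if not isinstance(answer, str):
        raise ValueError('"answer" must be a string')
    # Surrounding whitespace is trimmed because the text is used as-is downstream.
    return answer.strip()
```

Checking the reply again on the client side is a defensive choice: it keeps a malformed reply from silently becoming the final text.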
Instagram Reel caption generation
I’m using some LLM and I would need a prompt and a system message for every use case I will give you.
The first use case is generating a caption for an Instagram Reel. For the moment, I can give the LLM the original caption of the Reel it was downloaded from, and an LLM-written description of the video or of the joke behind it.
The caption must be short and fit the Reel. For example, if the Reel is funny, the caption must be short and funny while still relating to the Reel. The caption must not describe the video the way the LLM description does. For example, these bad outputs describe the content of the video instead of captioning it based on the given description: “Three animated friends chilling in the woods at night until someone's phone inevitably starts ringing somewhere nearby... 😅🌲✨” and “This reel shows me trying to make my sad texts shorter with ChatGPT, but it just frustrates me more! 😅😂”.
It also shouldn’t begin with something like ‘This reel…’. For example, this is a bad output: “This reel hilariously mocks every awkward fan reaction to those intense DCU movie scenes. 🎭 #DCFanDrama”
The LLM can add a few hashtags if they seem appropriate.
Sometimes the original caption credits the original author, most of the time on Twitter, like (“credit : t/twitteruser”). Those credits can appear in the generated caption too, but I don’t want any Instagram account mention (“@instagramUser”), because it is usually there to push people to follow the account the Reel was downloaded from (like “Seen me already ? follow me @instagramUser”). I don’t want long credits either; a simple “credit tt/twitteraccount” is enough, not something like this bad example: “Credited via the brilliant mind at tt/batinterface!…”
The use of emoji is encouraged, but not too many, and they must not look stupid.
When using it, I encountered some problems like this one:
“Credit to: [Original Creator] for this hilarious video game scene where the characters look suspiciously like Kermit the Frog! 😂”. The [Original Creator] placeholder is not filled in, and I don’t even know if the original caption had a credit.
Also, the result sometimes says something about the OCR. The final answer should not say anything about the input being wrong, such as the OCR not making sense, because the entire text the LLM produces will be set as the caption.
Some captions are just lame and feel like a Facebook post. The intended audience here is young.
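Several of the failure modes listed in this prompt are mechanical enough to be caught after generation. The sketch below is an illustration only, not part of the described workflow: `caption_problems` and the three regular expressions are hypothetical names, and the checks cover just the cases quoted above (a "This reel" opening, an unfilled bracket placeholder, an "@" mention, and talk about the OCR).

```python
import re

PLACEHOLDER = re.compile(r"\[[^\]]*\]")          # e.g. "[Original Creator]"
MENTION = re.compile(r"(?<!\w)@\w+")             # e.g. "@instagramUser"
OCR_TALK = re.compile(r"\bOCR\b", re.IGNORECASE)  # caption comments on its input


def caption_problems(caption: str) -> list[str]:
    """Return the list of rule violations found in a generated caption."""
    problems = []
    # Ignore leading quotes and spaces before testing the opening words.
    if caption.lstrip("“\"' ").lower().startswith("this reel"):
        problems.append("starts with 'This reel'")
    if PLACEHOLDER.search(caption):
        problems.append("unfilled placeholder")
    if MENTION.search(caption):
        problems.append("Instagram mention")
    if OCR_TALK.search(caption):
        problems.append("mentions the OCR")
    return problems
```

A caption that returns a non-empty list could be regenerated or dropped. Subjective criteria such as tone, length and emoji use are left out on purpose, since they are hard to test with simple rules.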
Video Descriptor
I’m using some LLM and I would need a prompt and a system message for every use case I will give you.
I’m using the structured JSON output provided by the OpenAI API. The output structure is a simple {"type": "object", "properties": {"answer": {"type": "string"}}, "required": ["answer"]}, so only “answer” can be filled. For the input, everything will be given in the prompt. Give me the system message and prompt separately, preferably in text format.
The LLM here will be used to describe an Instagram Reel (video). Each screenshot of that video is first described by an LLM with its own prompt, system message and output format. The descriptions of all the screenshots are then given to this LLM, which tries to reconstruct the video from them and describe it.
The required prompt here is for the LLM that will compile the descriptions into one, try to understand the video and describe it. I’m particularly interested in the joke behind the Reel, if there is one.
This is an example of a screenshot description by an LLM: “The image shows a close-up of a person's hands holding what appears to be a brown object with a plastic covering, possibly food wrapped in paper or foil. There is also a small portion visible at the top right corner, which seems to be a red and white label. The focus of the image is on the hands holding the object.”
The information given in the prompt is:
- An audio transcription of the full video
- For each of the screenshots:
  - The screenshot number ("Screenshot: 3" for example)
  - The timestamp in the video at which the screenshot was taken
  - An OCR result (it may contain some weird characters, since the OCR output is not filtered or cleaned); if no text is found, it says "No text found"
  - The LLM description of the screenshot
Here is an example of the prompt given:
\```text
Audio Transcription: Hey !
Screenshot: 1
Timestamp: 0s
OCR: See
fia) oiled 5 genuinely fh
LLM Description: 1. Scene Description: The image shows a person standing inside a building, possibly a store or a commercial establishment. The individual is positioned in the middle of the frame and appears to be looking directly at the camera or the viewer. There are other people present as well, but they are not the main focus of this description. 2. Main Subject/Character(s): The primary subject is a person who seems to be trying to communicate with someone off-camera. They appear to be standing in line or waiting, and their posture suggests that they may be impatient or frustrated. 3. Text Description: There is visible text on the image, which reads as follows: 'when my day 1 try to dap me up but he's gonnily do it again when I step out that door love you os 4. Summary: The image captures a moment of frustration or impatience between two individuals in an indoor setting. 5. Joke: It seems there is no joke or humorous element present in this image.
◀
Screenshot: 2
Timestamp: 2s
OCR: we v),®
When my Civ 1 tay to cep
movphwthes ganuinahy th
tune)witithatlovejisiand|ps)
(o
W
LLM Description: 1. **Scene Description:** The image shows an interior setting, which appears to be a spacious room with a high ceiling and a patterned floor. There is natural light coming in from the upper part of the space. The room seems to have some sort of event or gathering happening within it. 2. **Main Subject/Character(s):** The main subject is a person who is standing, walking across the room. They are wearing dark clothing and have their back turned towards the camera. 3. **Text Description (if any):** There is visible text overlaid on the image which reads
◀
Screenshot: 3
Timestamp: 4s
OCR: a up erent a
ae i Li bs
ree
I
LLM Description: 1. Scene Description: The image appears to be a smartphone screenshot of a social media post, specifically an Instagram story. There is a person in the foreground, who seems to be outdoors. The individual is standing near a storefront with a visible display window featuring mannequins and merchandise. The sky is clear, suggesting it might be daytime. The background also shows other people walking on the sidewalk, which indicates that this is likely an urban area. The presence of the store and the sidewalk suggest that this scene takes place in a commercial or shopping district. There are texts at the top of the image that appear to be part of Instagram's interface elements. It seems that there might have been some interaction with the post, as indicated by the emoji reactions. The overall setting appears to be an urban environment during daytime. 2. Main Subject/Character(s): The main subject in this image is a person who is standing in front of a storefront. This individual appears to be engaged with their smartphone, possibly viewing or interacting with the post. It's not possible to provide specific details about the person beyond what they are wearing and how they are positioned within the frame. 3. Text Description (if any): The image contains several text elements that include emoji reactions and captions. There are emojis indicating various types of interactions, such as
◀
Screenshot: 4
Timestamp: 6s
OCR: a day 1 try to.dap
but he’s Ce in
RR OU oS
LLM Description: 1. **Scene Description:** The image shows an interior space that appears to be a shopping area or mall, with visible merchandise and store displays. A person is walking through the scene, seemingly captured mid-step while using their phone. The setting suggests a casual, everyday environment. 2. **Main Subject/Character(s):** The main subject is a person in mid-stride, looking down at their cell phone, possibly engaged with it. There are no additional characters or significant interactions depicted. 3. **Text Description (if any):** There is text overlaid on the image which reads,
◀
Screenshot: 5
Timestamp: 8s
OCR: No text found
LLM Description: 1. Scene Description: The image depicts a person walking through an indoor shopping mall. It appears to be a public space, with visible storefronts and a ceiling-mounted security camera in the background. There is also an escalator in the scene, which suggests multiple levels to the building. The lighting and layout suggest a modern, clean design typical of contemporary malls. 2. Main Subject/Character(s): A young man is walking through the shopping mall, seemingly alone, with his head down. He appears to be engaged with something he's holding in his hand, possibly a mobile phone, which might imply that he is texting or looking at content on his device. 3. Text Description (if any): There is text overlaid on the image. It reads,
\```
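The layout of the example above can be assembled programmatically. The following is a minimal sketch under stated assumptions: the `Screenshot` dataclass, its field names and `build_video_prompt` are hypothetical, and the sketch leaves out the "◀" lines seen between screenshots in the example because the text does not say where they come from.

```python
from dataclasses import dataclass


@dataclass
class Screenshot:
    """One analysed frame (hypothetical container, field names are assumptions)."""
    number: int
    timestamp_s: int
    ocr_text: str
    description: str


def build_video_prompt(transcription: str, shots: list[Screenshot]) -> str:
    """Join the audio transcription and per-screenshot data into one prompt."""
    lines = [f"Audio Transcription: {transcription}"]
    for shot in shots:
        # An empty OCR result is reported with the fixed phrase from the list above.
        ocr = shot.ocr_text.strip() or "No text found"
        lines += [
            f"Screenshot: {shot.number}",
            f"Timestamp: {shot.timestamp_s}s",
            f"OCR: {ocr}",
            f"LLM Description: {shot.description}",
        ]
    return "\n".join(lines)
```

Keeping the field labels identical for every screenshot gives the compiling LLM a predictable structure, even when the OCR text and descriptions themselves are noisy.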
Most of the descriptions won’t make sense, so some details should be omitted. For example, one screenshot description could say the main subject is a car, and another one 3 seconds later in the video could say the main subject is a cat. You could say the car transformed into a cat, but it would be safer to assume that one of the descriptions is wrong and that the main character was a cat throughout the video, because another description in the video also says the main subject is a cat.
It is safe to say that most analysed videos will be of bad quality, which means the screenshot descriptions can vary a lot.
Text found by the OCR and in the screenshot descriptions can be passed on to the final video description if it seems coherent.
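The car-versus-cat reasoning amounts to a majority vote over the subjects reported by the individual descriptions. In the workflow described here the LLM is expected to do this reasoning itself from the prompt; the sketch below only illustrates the heuristic, and `most_consistent_subject` is a hypothetical name.

```python
from collections import Counter


def most_consistent_subject(subjects: list[str]) -> str:
    """Return the subject named most often across screenshot descriptions."""
    # Normalise case and whitespace so "Cat" and "cat " count as the same subject.
    counts = Counter(s.strip().lower() for s in subjects if s.strip())
    if not counts:
        raise ValueError("no subjects given")
    subject, _ = counts.most_common(1)[0]
    return subject
```

With the subjects "car", "cat" and "Cat", the function returns "cat", matching the conclusion that the single "car" description was the wrong one.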
Screenshot descriptor
I’m using some LLM and I would need a prompt, a system message and an output format for every use case I will give you.
The first one must describe a screenshot from a video. Each screenshot of that video will be described using the same LLM, prompt, system message and output format. The description of all the screenshots will be given to another LLM that will try to recreate the video based on the description of the screenshots, and describe the video.
The required prompt here is the one that describes a screenshot. The LLM will only be given the screenshot as input. I need the LLM to describe the given screenshot, with no need to specify that it is a screenshot. The description must specify the scene, the character or main subject, and the text present on the screenshot; most of the time that text is a caption added during video editing, and it may use emojis.
The LLM used here is llava:7b-v1.6-mistral-q4_1. It is not the best for text generation, but it is very powerful when using its vision capability.
The last part is personal. I included it because I gave this prompt to an LLM other than the one used, because llava wouldn't give me a good prompt.