Power of ChatGPT for Enhanced Video and Image Interpretation
Make ChatGPT understand pictures and videos, and let it enhance and edit photos and video as well.
In recent years, artificial intelligence has made leaps and bounds in solving complex tasks, particularly in the realm of image editing and understanding. This project aims to push the field further by seamlessly integrating ChatGPT and Foundation Models into a single, powerful platform.
This brings together the general interface of ChatGPT, a large language model capable of understanding a diverse range of topics, with the domain expertise of Foundation Models, which offer deep knowledge in specific domains. This unique combination aims to create an AI system that can handle a multitude of tasks effectively and efficiently.
In this article, we will delve into the core functionalities of TaskMatrix, discussing its innovative use of GroundingDINO, segment-anything, and stable diffusion inpainting for image editing, as well as its support for multiple languages and the introduction of templates for enhanced AI collaboration. We will explore how this project harnesses the power of both ChatGPT and Foundation Models to enable users to access and work with visual information in an entirely new way, opening up a world of possibilities for both individual and industry applications.
Imports
# coding: utf-8
import os
import gradio as gr
import random
import torch
import cv2
import re
import uuid
from PIL import Image, ImageDraw, ImageOps, ImageFont
import math
import numpy as np
import argparse
import inspect
import tempfile
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation
from transformers import pipeline, BlipProcessor, BlipForConditionalGeneration, BlipForQuestionAnswering
from transformers import AutoImageProcessor, UperNetForSemanticSegmentation
Several libraries and modules are imported here for purposes such as image processing, semantic segmentation, and visual question answering. The script also defines a number of variables and functions that are used throughout.
The os module provides operating-system services such as reading from and writing to the file system. Gradio is a Python library that makes it easy to build customizable UI components for machine learning models. The random module generates pseudo-random numbers, while torch is PyTorch, which provides GPU-accelerated tensor computation. cv2 is the Python wrapper for OpenCV (Open Source Computer Vision Library) and is used for image processing. The re module provides regular-expression matching, and the uuid module generates UUIDs. PIL (the Python Imaging Library, via Pillow) adds image-handling capabilities, math supplies standard mathematical functions, and NumPy supports large multidimensional arrays and matrices. argparse makes it quick to write command-line interfaces, inspect provides functions for examining live objects (modules, classes, methods, functions, tracebacks, frame objects, and code objects), and tempfile creates temporary files and directories.
The code also imports several classes from Hugging Face Transformers, a state-of-the-art natural language processing library with easy-to-use interfaces to many popular models: CLIPSegProcessor and CLIPSegForImageSegmentation for text-prompted segmentation; the pipeline helper; BlipProcessor, BlipForConditionalGeneration, and BlipForQuestionAnswering for image captioning and visual question answering; and AutoImageProcessor and UperNetForSemanticSegmentation for semantic segmentation.
from diffusers import StableDiffusionPipeline, StableDiffusionInpaintPipeline, StableDiffusionInstructPix2PixPipeline
from diffusers import EulerAncestralDiscreteScheduler
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel, UniPCMultistepScheduler
from controlnet_aux import OpenposeDetector, MLSDdetector, HEDdetector
Two Python packages, diffusers and controlnet_aux, are imported next. From diffusers the code pulls in StableDiffusionPipeline for text-to-image generation, StableDiffusionInpaintPipeline for inpainting, StableDiffusionInstructPix2PixPipeline for instruction-guided image editing, StableDiffusionControlNetPipeline and ControlNetModel for ControlNet-conditioned generation, and the EulerAncestralDiscreteScheduler and UniPCMultistepScheduler classes that schedule the diffusion process. From controlnet_aux it imports auxiliary detectors used to produce ControlNet conditioning images: OpenposeDetector for human pose estimation, MLSDdetector for straight-line (M-LSD) detection, and HEDdetector for holistically-nested edge detection. Together, these give the script the building blocks for diffusion-based image generation, inpainting, and conditional image-to-image translation.
from langchain.agents.initialize import initialize_agent
from langchain.agents.tools import Tool
from langchain.chains.conversation.memory import ConversationBufferMemory
from langchain.llms.openai import OpenAI
Four imports come from the LangChain package.
initialize_agent, from the agents.initialize module, creates and configures a language-model agent, wiring together an LLM, a set of tools, and optional memory.
Tool, from the agents.tools module, wraps a callable together with a name and a description so the agent can decide when to invoke it.
ConversationBufferMemory, from chains.conversation.memory, stores and manages the running conversation so the agent can take previous turns into account.
OpenAI, from llms.openai, wraps the OpenAI API and gives access to OpenAI’s language models.
In short, these imports are used to build a tool-using, conversational language-model agent.
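To make the roles of these pieces concrete, here is a minimal sketch of how they can be wired together. This is an illustration rather than the project’s actual agent setup; it assumes an OPENAI_API_KEY environment variable is set, and the "Echo" tool is a toy stand-in for the real image tools defined later.

llm = OpenAI(temperature=0)
memory = ConversationBufferMemory(memory_key="chat_history", output_key="output")
tools = [Tool(name="Echo",
              description="Repeats the input text back to the user.",
              func=lambda text: text)]
agent = initialize_agent(
    tools, llm,
    agent="conversational-react-description",
    memory=memory,
    verbose=True)
print(agent.run("Please echo: hello"))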
# Grounding DINO
import groundingdino.datasets.transforms as T
from groundingdino.models import build_model
from groundingdino.util import box_ops
from groundingdino.util.slconfig import SLConfig
from groundingdino.util.utils import clean_state_dict, get_phrases_from_posmap
Next, the groundingdino package is imported.
datasets.transforms provides a set of transforms (aliased here as T) for preparing image data for the model.
build_model, from the models module, constructs and returns a Grounding DINO model from a configuration.
From the util package the code imports box_ops, a module of utility functions for bounding-box operations, and SLConfig, a configuration class used to load and manage the model’s configuration settings.
From util.utils it imports clean_state_dict, which normalizes a PyTorch state dictionary before it is loaded into the model, and get_phrases_from_posmap, which recovers the text phrases that correspond to each predicted box from the model’s positional map.
In general, these imports are used to load and run Grounding DINO, a model that performs text-conditioned (“grounded”) object detection: given an image and a caption, it returns bounding boxes for the phrases in the caption.
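As a rough sketch of how these pieces fit together, a Grounding DINO checkpoint can be loaded along the following lines. The config and checkpoint file names below are placeholders for whatever files you have downloaded.

args = SLConfig.fromfile("path/to/GroundingDINO_config.py")  # model configuration (placeholder path)
args.device = "cuda" if torch.cuda.is_available() else "cpu"
model = build_model(args)  # build the detector from the config
checkpoint = torch.load("path/to/groundingdino_checkpoint.pth", map_location="cpu")
model.load_state_dict(clean_state_dict(checkpoint["model"]), strict=False)
model.eval()  # ready for grounded detection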
# segment anything
from segment_anything import build_sam, SamPredictor, SamAutomaticMaskGenerator
import cv2
import numpy as np
import matplotlib.pyplot as plt
import wget
Several more modules and packages are imported here to support image segmentation.
From segment_anything the code imports build_sam, a helper that constructs a Segment Anything Model (SAM) from a checkpoint, along with SamPredictor, which wraps the model for prompt-based mask prediction, and SamAutomaticMaskGenerator, which generates masks for an entire image automatically.
cv2 (OpenCV) and numpy (imported as np) are brought in again for image manipulation and array math, matplotlib.pyplot (as plt) is available for plotting and visualizing image data, and wget provides a simple way to download files, such as the SAM checkpoint, from the Internet.
Taken together, these imports provide everything needed to build and run a SAM-based segmentation model, along with the image processing, visualization, and download utilities that support it.
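Here is a brief, illustrative sketch of how SAM is typically used once the checkpoint has been downloaded; the image file name and the point coordinates are placeholders.

sam_checkpoint = "sam_vit_h_4b8939.pth"  # assumed to have been downloaded, e.g. with wget
sam = build_sam(checkpoint=sam_checkpoint)
sam.to(device="cuda" if torch.cuda.is_available() else "cpu")
predictor = SamPredictor(sam)
image = cv2.cvtColor(cv2.imread("photo.png"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)
masks, scores, logits = predictor.predict(
    point_coords=np.array([[320, 240]]),  # a single foreground click (illustrative)
    point_labels=np.array([1]),
    multimask_output=True)
print(masks.shape, scores)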
def seed_everything(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    return seed
seed_everything is a small helper that takes a single seed parameter and uses it to seed every source of randomness the project relies on, so that results are reproducible across runs.
It first calls random.seed(seed), which seeds Python’s built-in random number generator, and then np.random.seed(seed), which seeds NumPy’s generator.
Next it calls torch.manual_seed(seed) to seed PyTorch’s CPU random number generator, and torch.cuda.manual_seed_all(seed) to seed the CUDA generators on all available GPUs.
Finally it returns the seed that was passed in. The return value is mainly a convenience, confirming which seed was set; the important effect is that subsequent random operations in Python, NumPy, and PyTorch become deterministic for a given seed.
def prompts(name, description):
    def decorator(func):
        func.name = name
        func.description = description
        return func
    return decorator
prompts is a decorator factory: it takes two parameters, name and description, and returns a decorator.
The inner decorator takes a single parameter, func, the function being decorated. It attaches the name and description values to that function as attributes (func.name and func.description) and then returns the function unchanged.
In other words, decorating a function with @prompts(...) does not alter its behavior; it simply tags the function with metadata describing what the tool does and what input it expects. Later in the script, this metadata is what the language-model agent reads when deciding which tool to call, and it also helps other developers understand each tool’s purpose.
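A quick illustration of the decorator; the tool below is made up purely for demonstration.

@prompts(name="Shout Text",
         description="useful when you want to convert text to upper case. "
                     "The input to this tool should be a string. ")
def shout(text):
    return text.upper()

print(shout.name)         # "Shout Text"
print(shout.description)  # the description string above
print(shout("hello"))     # "HELLO"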
Visual ChatGPT
def blend_gt2pt(old_image, new_image, sigma=0.15, steps=100):
    new_size = new_image.size
    old_size = old_image.size
    easy_img = np.array(new_image)
    gt_img_array = np.array(old_image)
    pos_w = (new_size[0] - old_size[0]) // 2
    pos_h = (new_size[1] - old_size[1]) // 2
    kernel_h = cv2.getGaussianKernel(old_size[1], old_size[1] * sigma)
    kernel_w = cv2.getGaussianKernel(old_size[0], old_size[0] * sigma)
    kernel = np.multiply(kernel_h, np.transpose(kernel_w))
    kernel[steps:-steps, steps:-steps] = 1
    kernel[:steps, :steps] = kernel[:steps, :steps] / kernel[steps - 1, steps - 1]
    kernel[:steps, -steps:] = kernel[:steps, -steps:] / kernel[steps - 1, -(steps)]
    kernel[-steps:, :steps] = kernel[-steps:, :steps] / kernel[-steps, steps - 1]
    kernel[-steps:, -steps:] = kernel[-steps:, -steps:] / kernel[-steps, -steps]
    kernel = np.expand_dims(kernel, 2)
    kernel = np.repeat(kernel, 3, 2)
    weight = np.linspace(0, 1, steps)
    top = np.expand_dims(weight, 1)
    top = np.repeat(top, old_size[0] - 2 * steps, 1)
    top = np.expand_dims(top, 2)
    top = np.repeat(top, 3, 2)
    weight = np.linspace(1, 0, steps)
    down = np.expand_dims(weight, 1)
    down = np.repeat(down, old_size[0] - 2 * steps, 1)
    down = np.expand_dims(down, 2)
    down = np.repeat(down, 3, 2)
    weight = np.linspace(0, 1, steps)
    left = np.expand_dims(weight, 0)
    left = np.repeat(left, old_size[1] - 2 * steps, 0)
    left = np.expand_dims(left, 2)
    left = np.repeat(left, 3, 2)
    weight = np.linspace(1, 0, steps)
    right = np.expand_dims(weight, 0)
    right = np.repeat(right, old_size[1] - 2 * steps, 0)
    right = np.expand_dims(right, 2)
    right = np.repeat(right, 3, 2)
    kernel[:steps, steps:-steps] = top
    kernel[-steps:, steps:-steps] = down
    kernel[steps:-steps, :steps] = left
    kernel[steps:-steps, -steps:] = right
    pt_gt_img = easy_img[pos_h:pos_h + old_size[1], pos_w:pos_w + old_size[0]]
    gaussian_gt_img = kernel * gt_img_array + (1 - kernel) * pt_gt_img  # gt img with blur img
    gaussian_gt_img = gaussian_gt_img.astype(np.int64)
    easy_img[pos_h:pos_h + old_size[1], pos_w:pos_w + old_size[0]] = gaussian_gt_img
    gaussian_img = Image.fromarray(easy_img)
    return gaussian_img
The Python function blend_gt2pt takes four parameters: old_image (the ground-truth image), new_image (the predicted, usually larger image), sigma, and steps.
Its job is to paste the ground-truth image into the centre of the predicted image while feathering the seam, so the two blend smoothly rather than meeting at a hard edge. This is useful when a model generates an enlarged or outpainted version of an image and you want to keep the original pixels in the middle.
The function first records the sizes of both inputs and converts them to NumPy arrays, then computes the offsets (pos_w, pos_h) needed to centre the old image inside the new one.
It builds a 2-D Gaussian kernel with OpenCV’s cv2.getGaussianKernel, sized to the old image, and normalizes its corners; it then constructs four linear weight ramps (top, down, left, and right) and writes them into the kernel’s borders, so the blending weight fades from the edges toward the interior.
Finally, it blends the ground-truth array with the corresponding crop of the new image using this kernel as a per-pixel weight, writes the blended patch back into the new image, converts the result back to a PIL image with Image.fromarray, and returns it.
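For example (file names and sizes are illustrative), the helper can be used to paste an original image back into the centre of a larger generated canvas:

old = Image.open("image/original.png").resize((512, 512))      # ground-truth image
new = Image.open("image/outpainted.png").resize((1024, 1024))  # larger generated image
blended = blend_gt2pt(old, new)                                 # old pixels in the centre, feathered seams
blended.save("image/blended.png")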
def cut_dialogue_history(history_memory, keep_last_n_words=500):
    if history_memory is None or len(history_memory) == 0:
        return history_memory
    tokens = history_memory.split()
    n_tokens = len(tokens)
    print(f"history_memory:{history_memory}, n_tokens: {n_tokens}")
    if n_tokens < keep_last_n_words:
        return history_memory
    paragraphs = history_memory.split('\n')
    last_n_tokens = n_tokens
    while last_n_tokens >= keep_last_n_words:
        last_n_tokens -= len(paragraphs[0].split(' '))
        paragraphs = paragraphs[1:]
    return '\n' + '\n'.join(paragraphs)
cut_dialogue_history takes two parameters: history_memory, the dialogue history as a single string, and keep_last_n_words, the approximate number of words to keep (500 by default).
Its purpose is to trim a long dialogue history down to a manageable length by keeping only roughly the last keep_last_n_words words. This matters in chatbot-style applications, where the amount of context passed to the model at each turn has to be bounded.
The function first checks whether history_memory is None or empty; if so, it returns the input unchanged.
It then splits the history into whitespace-separated tokens and counts them. If the count is already below keep_last_n_words, the input is again returned unchanged.
Otherwise, it splits the history into paragraphs (on newline characters) and drops paragraphs from the beginning of the list until the remaining token count falls below the limit. The surviving paragraphs are joined back together with newlines and returned.
In short, this helper keeps the most recent part of the conversation, so the agent can focus on the most relevant context without exceeding its input budget.
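A small, illustrative example of what the trimming does:

history = "\n".join(f"Human: message {i}\nAI: reply {i}" for i in range(50))
trimmed = cut_dialogue_history(history, keep_last_n_words=60)
print(len(history.split()), len(trimmed.split()))  # 300 -> roughly 60, keeping only the latest turns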
def get_new_image_name(org_img_name, func_name="update"):
    head_tail = os.path.split(org_img_name)
    head = head_tail[0]
    tail = head_tail[1]
    name_split = tail.split('.')[0].split('_')
    this_new_uuid = str(uuid.uuid4())[:4]
    if len(name_split) == 1:
        most_org_file_name = name_split[0]
    else:
        assert len(name_split) == 4
        most_org_file_name = name_split[3]
    recent_prev_file_name = name_split[0]
    new_file_name = f'{this_new_uuid}_{func_name}_{recent_prev_file_name}_{most_org_file_name}.png'
    return os.path.join(head, new_file_name)
get_new_image_name takes two parameters: org_img_name, the path of the image being modified, and func_name, the name of the operation that produced the new version ("update" by default).
Its purpose is to generate a new, unique filename for a modified image that still records where the image came from. This is useful when a chain of tools repeatedly edits the same picture and each intermediate result needs to be kept rather than overwritten.
The function first splits the original path into its directory (head) and filename (tail) components, then splits the filename stem on underscores.
If the stem has a single part, the image is an original upload and that part becomes the “original” name; otherwise the stem is expected to have exactly four parts (uuid_funcname_previous_original), and the last part is kept as the original name while the first part, the previous file’s short UUID, becomes the most recent predecessor.
Finally, it prepends a fresh four-character UUID and the func_name, producing a filename of the form uuid_funcname_previous_original.png, and returns it joined with the original directory. Encoding the provenance in the filename this way keeps every version of an image distinct and traceable.
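An illustrative chain of calls (the UUID prefixes are random, so the exact names will differ):

first = get_new_image_name("image/abc123.png", func_name="pix2pix")
# e.g. "image/1f3a_pix2pix_abc123_abc123.png"
second = get_new_image_name(first, func_name="edge")
# e.g. "image/9c2d_edge_1f3a_abc123.png"  -- new uuid, new operation, previous uuid, original name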
class InstructPix2Pix:
    def __init__(self, device):
        print(f"Initializing InstructPix2Pix to {device}")
        self.device = device
        self.torch_dtype = torch.float16 if 'cuda' in device else torch.float32
        self.pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained("timbrooks/instruct-pix2pix",
                                                                           safety_checker=None,
                                                                           torch_dtype=self.torch_dtype).to(device)
        self.pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(self.pipe.scheduler.config)
InstructPix2Pix is a class whose __init__ method sets up three attributes: device, torch_dtype, and pipe.
The class wraps the pre-trained InstructPix2Pix model, a diffusion-based model for instruction-guided image editing: given an image and a natural-language instruction, it produces an edited version of the image (for example restyling it, changing colours, or altering objects).
The constructor stores the device string and chooses the computation precision from it: torch.float16 if the device string contains ‘cuda’ (i.e. a GPU is used), otherwise torch.float32.
It then loads the pipe attribute as a StableDiffusionInstructPix2PixPipeline from the timbrooks/instruct-pix2pix checkpoint, with the safety checker disabled and the chosen dtype, and moves it to the device.
Finally, the pipeline’s scheduler is replaced with an EulerAncestralDiscreteScheduler built from the existing scheduler’s configuration; this scheduler controls how the diffusion steps are taken during editing.
Altogether, the class provides a convenient interface for running instruction-guided image editing on a chosen device with a pre-trained model.
@prompts(name="Instruct Image Using Text",
description="useful when you want to the style of the image to be like the text. "
"like: make it look like a painting. or make it like a robot. "
"The input to this tool should be a comma separated string of two, "
"representing the image_path and the text. ")
This decorator registers the method as a tool: the agent is told that the tool takes a comma-separated string containing an image path and a text instruction, and that it uses the pre-trained InstructPix2Pix model to produce a new image in the style described by the text (for example, “make it look like a painting”).
    def inference(self, inputs):
        """Change style of image."""
        print("===>Starting InstructPix2Pix Inference")
        image_path, text = inputs.split(",")[0], ','.join(inputs.split(',')[1:])
        original_image = Image.open(image_path)
        image = self.pipe(text, image=original_image, num_inference_steps=40, image_guidance_scale=1.2).images[0]
        updated_image_path = get_new_image_name(image_path, func_name="pix2pix")
        image.save(updated_image_path)
        print(f"\nProcessed InstructPix2Pix, Input Image: {image_path}, Instruct Text: {text}, "
              f"Output Image: {updated_image_path}")
        return updated_image_path
inference is the InstructPix2Pix tool’s entry point. It takes a single argument, inputs, which contains the image path and the text instruction separated by a comma.
The method first prints a message marking the start of inference, then splits the input string into the image path and the instruction and loads the original image with PIL’s Image.open(). The instruction and the image are passed to the StableDiffusionInstructPix2PixPipeline, with num_inference_steps controlling the number of diffusion steps and image_guidance_scale controlling how closely the output should follow the original image. The resulting image is saved under a new name produced by get_new_image_name() (with func_name="pix2pix"), a summary is printed, and the new image path is returned.
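Usage looks roughly like this (the file name is illustrative; a CUDA device and the downloaded model weights are assumed):

editor = InstructPix2Pix("cuda:0")
out_path = editor.inference("image/abc123.png, make it look like a watercolor painting")
print(out_path)  # path of the edited image, e.g. "image/7b2e_pix2pix_abc123_abc123.png"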
class Text2Image:
    def __init__(self, device):
        print(f"Initializing Text2Image to {device}")
        self.device = device
        self.torch_dtype = torch.float16 if 'cuda' in device else torch.float32
        self.pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5",
                                                            torch_dtype=self.torch_dtype)
        self.pipe.to(device)
        self.a_prompt = 'best quality, extremely detailed'
        self.n_prompt = 'longbody, lowres, bad anatomy, bad hands, missing fingers, extra digit, ' \
                        'fewer digits, cropped, worst quality, low quality'

    @prompts(name="Generate Image From User Input Text",
             description="useful when you want to generate an image from a user input text and save it to a file. "
                         "like: generate an image of an object or something, or generate an image that includes some objects. "
                         "The input to this tool should be a string, representing the text used to generate image. ")
    def inference(self, text):
        image_filename = os.path.join('image', f"{str(uuid.uuid4())[:8]}.png")
        prompt = text + ', ' + self.a_prompt
        image = self.pipe(prompt, negative_prompt=self.n_prompt).images[0]
        image.save(image_filename)
        print(
            f"\nProcessed Text2Image, Input Text: {text}, Output Image: {image_filename}")
        return image_filename
Text2Image generates an image from a text prompt. The __init__ method loads the pre-trained runwayml/stable-diffusion-v1-5 StableDiffusionPipeline, moves it to the requested device, and defines two default prompts: a_prompt, a positive quality booster, and n_prompt, a negative prompt listing artefacts to avoid. The inference method builds the final prompt by appending a_prompt to the user’s text, passes n_prompt separately as the negative prompt, generates the image, saves it under a random filename in the image directory, and returns that filename. The prompts decorator gives the method a name and description so the agent knows when to use it.
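A short usage sketch (the tool writes into a local image folder, so that directory must exist; device and weights as before):

os.makedirs('image', exist_ok=True)  # the tool saves into a local "image" directory
t2i = Text2Image("cuda:0")
path = t2i.inference("a cozy cabin in a snowy forest at dusk")
print(path)  # e.g. "image/3f9a2c1d.png"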
class ImageCaptioning:
    def __init__(self, device):
        print(f"Initializing ImageCaptioning to {device}")
        self.device = device
        self.torch_dtype = torch.float16 if 'cuda' in device else torch.float32
        self.processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
        self.model = BlipForConditionalGeneration.from_pretrained(
            "Salesforce/blip-image-captioning-base", torch_dtype=self.torch_dtype).to(self.device)

    @prompts(name="Get Photo Description",
             description="useful when you want to know what is inside the photo. receives image_path as input. "
                         "The input to this tool should be a string, representing the image_path. ")
    def inference(self, image_path):
        inputs = self.processor(Image.open(image_path), return_tensors="pt").to(self.device, self.torch_dtype)
        out = self.model.generate(**inputs)
        captions = self.processor.decode(out[0], skip_special_tokens=True)
        print(f"\nProcessed ImageCaptioning, Input Image: {image_path}, Output Text: {captions}")
        return captions
ImageCaptioning generates natural-language descriptions of images. The __init__ method stores the device, picks the dtype, loads a BlipProcessor and the BlipForConditionalGeneration model from the Salesforce/blip-image-captioning-base checkpoint, and moves the model to the device. The inference method takes an image path, preprocesses the image with the BlipProcessor, runs the model’s generate method, decodes the output tokens into a caption, prints a summary, and returns the caption. As with the other tools, the prompts decorator supplies the name and description the agent uses to decide when to call it.
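These decorated methods are what ultimately get exposed to the LangChain agent. A simplified sketch of how a tool like this can be registered follows; the project’s main script does something along these lines, iterating over every decorated method it finds.

captioner = ImageCaptioning("cuda:0")
caption_tool = Tool(name=captioner.inference.name,                # "Get Photo Description"
                    description=captioner.inference.description,  # text from the @prompts decorator
                    func=captioner.inference)
agent = initialize_agent([caption_tool], OpenAI(temperature=0),
                         agent="conversational-react-description",
                         memory=ConversationBufferMemory(memory_key="chat_history", output_key="output"),
                         verbose=True)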
class Image2Canny:
    def __init__(self, device):
        print("Initializing Image2Canny")
        self.low_threshold = 100
        self.high_threshold = 200

    @prompts(name="Edge Detection On Image",
             description="useful when you want to detect the edge of the image. "
                         "like: detect the edges of this image, or canny detection on image, "
                         "or perform edge detection on this image, or detect the canny image of this image. "
                         "The input to this tool should be a string, representing the image_path")
    def inference(self, inputs):
        image = Image.open(inputs)
        image = np.array(image)
        canny = cv2.Canny(image, self.low_threshold, self.high_threshold)
        canny = canny[:, :, None]
        canny = np.concatenate([canny, canny, canny], axis=2)
        canny = Image.fromarray(canny)
        updated_image_path = get_new_image_name(inputs, func_name="edge")
        canny.save(updated_image_path)
        print(f"\nProcessed Image2Canny, Input Image: {inputs}, Output Text: {updated_image_path}")
        return updated_image_path
Image2Canny runs Canny edge detection on an image: given an image path, it detects the edges and saves them as a new image, whose path it returns.
The constructor simply stores the default Canny thresholds, 100 for low_threshold and 200 for high_threshold (the device argument is accepted but not needed, since OpenCV’s Canny runs on the CPU).
The inference method is decorated with @prompts, marking it as the tool to use when edge detection is requested.
Inside inference, the image is opened with PIL’s Image.open and converted to a NumPy array, cv2.Canny is applied with the two thresholds, and the single-channel edge map is stacked into three channels so it can be saved as an ordinary RGB image with Image.fromarray. A new filename is produced with get_new_image_name (func_name="edge"), the edge image is saved there, and the path is printed and returned.
class CannyText2Image:
    def __init__(self, device):
        print(f"Initializing CannyText2Image to {device}")
        self.torch_dtype = torch.float16 if 'cuda' in device else torch.float32
        self.controlnet = ControlNetModel.from_pretrained("fusing/stable-diffusion-v1-5-controlnet-canny",
                                                          torch_dtype=self.torch_dtype)
        self.pipe = StableDiffusionControlNetPipeline.from_pretrained(
            "runwayml/stable-diffusion-v1-5", controlnet=self.controlnet, safety_checker=None,
            torch_dtype=self.torch_dtype)
        self.pipe.scheduler = UniPCMultistepScheduler.from_config(self.pipe.scheduler.config)
        self.pipe.to(device)
        self.seed = -1
        self.a_prompt = 'best quality, extremely detailed'
        self.n_prompt = 'longbody, lowres, bad anatomy, bad hands, missing fingers, extra digit, ' \
                        'fewer digits, cropped, worst quality, low quality'
The CannyText2Image constructor takes a device parameter and initializes the class’s state.
torch_dtype is set to torch.float16 if the device string contains “cuda” (i.e. a GPU is available) and torch.float32 otherwise.
controlnet is loaded as a pre-trained ControlNetModel from the “fusing/stable-diffusion-v1-5-controlnet-canny” checkpoint, which conditions generation on Canny edge maps.
pipe is a StableDiffusionControlNetPipeline built from the pre-trained “runwayml/stable-diffusion-v1-5” checkpoint, parameterized with that controlnet, with the safety checker disabled and the chosen torch_dtype.
The pipeline’s scheduler is replaced with a UniPCMultistepScheduler built from the existing scheduler’s configuration, and the pipeline is moved to the device.
Finally, seed is initialized to -1 as a placeholder (a real seed is drawn at inference time), and a_prompt and n_prompt hold the default positive and negative prompts.
@prompts(name="Generate Image Condition On Canny Image",
description="useful when you want to generate a new real image from both the user description and a canny image."
" like: generate a real image of a object or something from this canny image,"
" or generate a new real image of a object or something from this edge image. "
"The input to this tool should be a comma separated string of two, "
"representing the image_path and the user description. ")
Here the @prompts decorator registers the inference method as the tool “Generate Image Condition On Canny Image”. Its input is a single comma-separated string: the first part is the path to a Canny edge image, and the rest is the user’s description. Given those, the method generates a new realistic image that matches both the description and the edge layout.
    def inference(self, inputs):
        image_path, instruct_text = inputs.split(",")[0], ','.join(inputs.split(',')[1:])
        image = Image.open(image_path)
        self.seed = random.randint(0, 65535)
        seed_everything(self.seed)
        prompt = f'{instruct_text}, {self.a_prompt}'
        image = self.pipe(prompt, image, num_inference_steps=20, eta=0.0, negative_prompt=self.n_prompt,
                          guidance_scale=9.0).images[0]
        updated_image_path = get_new_image_name(image_path, func_name="canny2image")
        image.save(updated_image_path)
        print(f"\nProcessed CannyText2Image, Input Canny: {image_path}, Input Text: {instruct_text}, "
              f"Output Text: {updated_image_path}")
        return updated_image_path
The inference method of CannyText2Image takes a single comma-separated string: the first part is the path to the Canny edge image, and the remainder is the user’s description.
Inside the method, the edge image is opened with Image.open, a fresh random seed is drawn and applied via seed_everything, and the prompt is built by appending the default positive prompt self.a_prompt to the user description. The prompt, the edge image, and parameters such as num_inference_steps, eta, guidance_scale, and the negative prompt are then passed to the StableDiffusionControlNetPipeline.
The pipeline generates a new realistic image consistent with both the edge map and the description. The result is saved to a new filename produced by get_new_image_name with func_name="canny2image", and the method returns the path of the saved image.
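Together, Image2Canny and CannyText2Image form a natural two-step chain, which is exactly the kind of sequence the agent composes on its own. An illustrative manual run (file names are hypothetical):

device = "cuda:0"
edge_tool = Image2Canny(device)
canny_tool = CannyText2Image(device)

edge_path = edge_tool.inference("image/abc123.png")                         # extract the edge map
new_path = canny_tool.inference(f"{edge_path}, a red sports car at night")  # redraw guided by those edges
print(new_path)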
class Image2Line:
    def __init__(self, device):
        print("Initializing Image2Line")
        self.detector = MLSDdetector.from_pretrained('lllyasviel/ControlNet')
Image2Line uses the pre-trained MLSDdetector (Mobile Line Segment Detection) from the lllyasviel/ControlNet repository. The constructor accepts a device argument for consistency with the other tools, although the detector itself does not require it.
In __init__, the pre-trained detector is loaded with from_pretrained and assigned to the class’s detector attribute.
The class’s job is simply to detect straight line segments in an image using this pre-trained model.
@prompts(name="Line Detection On Image",
description="useful when you want to detect the straight line of the image. "
"like: detect the straight lines of this image, or straight line detection on image, "
"or perform straight line detection on this image, or detect the straight line image of this image. "
"The input to this tool should be a string, representing the image_path")
In the Image2Line class, this prompt decorator is used to define the input format for the inference method. In addition to naming and describing the prompt, it also specifies what kind of input string should be passed to the method. It expects a single string argument that represents the path to the image file that will be used for line detection.
    def inference(self, inputs):
        image = Image.open(inputs)
        mlsd = self.detector(image)
        updated_image_path = get_new_image_name(inputs, func_name="line-of")
        mlsd.save(updated_image_path)
        print(f"\nProcessed Image2Line, Input Image: {inputs}, Output Line: {updated_image_path}")
        return updated_image_path
This inference method belongs to the Image2Line class. It takes one input, the path to an image file.
The image is opened with Image.open() and stored in the image variable. The MLSD detector initialized in the constructor is then called on the image (via its __call__() method), and the resulting line-segment image is stored in mlsd.
A new filename is generated with get_new_image_name(), using the input path and the suffix "line-of", and stored in updated_image_path.
Finally, the mlsd image is saved to that path with its save() method, a message is printed showing the input image path and the output line-image path, and the new path is returned.
class LineText2Image:
    def __init__(self, device):
        print(f"Initializing LineText2Image to {device}")
        self.torch_dtype = torch.float16 if 'cuda' in device else torch.float32
        self.controlnet = ControlNetModel.from_pretrained("fusing/stable-diffusion-v1-5-controlnet-mlsd",
                                                          torch_dtype=self.torch_dtype)
        self.pipe = StableDiffusionControlNetPipeline.from_pretrained(
            "runwayml/stable-diffusion-v1-5", controlnet=self.controlnet, safety_checker=None,
            torch_dtype=self.torch_dtype
        )
        self.pipe.scheduler = UniPCMultistepScheduler.from_config(self.pipe.scheduler.config)
        self.pipe.to(device)
        self.seed = -1
        self.a_prompt = 'best quality, extremely detailed'
        self.n_prompt = 'longbody, lowres, bad anatomy, bad hands, missing fingers, extra digit, ' \
                        'fewer digits, cropped, worst quality, low quality'

    @prompts(name="Generate Image Condition On Line Image",
             description="useful when you want to generate a new real image from both the user description "
                         "and a straight line image. "
                         "like: generate a real image of a object or something from this straight line image, "
                         "or generate a new real image of a object or something from this straight lines. "
                         "The input to this tool should be a comma separated string of two, "
                         "representing the image_path and the user description. ")
    def inference(self, inputs):
        image_path, instruct_text = inputs.split(",")[0], ','.join(inputs.split(',')[1:])
        image = Image.open(image_path)
        self.seed = random.randint(0, 65535)
        seed_everything(self.seed)
        prompt = f'{instruct_text}, {self.a_prompt}'
        image = self.pipe(prompt, image, num_inference_steps=20, eta=0.0, negative_prompt=self.n_prompt,
                          guidance_scale=9.0).images[0]
        updated_image_path = get_new_image_name(image_path, func_name="line2image")
        image.save(updated_image_path)
        print(f"\nProcessed LineText2Image, Input Line: {image_path}, Input Text: {instruct_text}, "
              f"Output Text: {updated_image_path}")
        return updated_image_path
The constructor first prints a message noting which device the class is being initialized on. It sets torch_dtype to torch.float16 when the device is CUDA-enabled and torch.float32 otherwise.
It loads the pre-trained ControlNetModel for straight-line conditioning from the “fusing/stable-diffusion-v1-5-controlnet-mlsd” checkpoint with from_pretrained; this model conditions generation on line-segment images.
That ControlNetModel instance is passed as the controlnet argument to a StableDiffusionControlNetPipeline built from the pre-trained “runwayml/stable-diffusion-v1-5” checkpoint. This pipeline creates a new image from a user description combined with a line image.
The pipeline’s scheduler is replaced with a UniPCMultistepScheduler configured from the existing scheduler, and the pipeline is moved to the device.
seed is initialized to -1 as a placeholder; the actual random seed is drawn later, at inference time.
a_prompt holds the default positive prompt and n_prompt the default negative prompt.
The inference method, registered with the @prompts decorator, expects two inputs separated by a comma: the line-image path and the user description. It opens the image with PIL’s Image module, draws a random seed, and calls the seed_everything helper defined earlier to make the run reproducible.
The user description, combined with the positive prompt, is passed to the StableDiffusionControlNetPipeline along with the line image and the remaining generation parameters; the output image is saved to a path produced by get_new_image_name (func_name="line2image"), and that path is returned.
In short, LineText2Image uses pre-trained models to generate a realistic image from a user description and a straight-line image.