Selected Projects
Below is a non-exhaustive list of my non-research projects. You can also check out a complete list of my projects here.
Agentic-RAG Story Generation with Multimodal GenAI [Code]
- Designed an advanced story generation system that leverages agentic multimodal GenAI to generate engaging & meaningful stories from user-uploaded images. It seamlessly integrates retrieval-based reasoning with generative AI using Large Vision-Language Models and vector search to craft immersive narratives.
- The system supports multiple data modalities (image and text), RAG-based retrieval for narrative coherence, agentic AI-driven decision-making, the InternVL2-40B model, and Text-to-Speech audio narration.
- Libraries/Framework: Streamlit, vLLM Kernels, ChromaDB, LangChain, Cloudinary API, pyttsx3 (Text-to-Speech), and LangGraph
Multi-Round VLM-powered Multimodal Conversational AI Navigation Bot [Code]
- An advanced AI chatbot that enables users to upload images and ask questions (primarily Navigation-oriented) via text or audio, receiving real-time responses in both formats.
- Key features include:
  - image upload and analysis
  - speech-to-text conversion using the Google SpeechRecognition API
  - integration of visual, text, and audio data for comprehensive interactions
  - maintenance of conversation context across multiple turns
  - real-time responses powered by Vision-Language multimodal models
- Libraries/Framework: Streamlit, vLLM Kernels, Google SpeechRecognition API, RunPod, pyttsx3 (Text-to-Speech), and OpenAI API
Visual Contrastive Learning-based Few-shot Image Classification [Code]
- Defined a custom contrastive loss and trained a few-shot version of Siamese Networks to do n-way k-shot image classification by mapping the image similarity task into a fully-supervised classification learning task.
- Libraries/Framework: NumPy, Matplotlib, PyTorch, and TorchVision
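For illustration only, here is a minimal NumPy sketch of the standard margin-based pairwise contrastive loss that is commonly used with Siamese networks. The project's custom loss is not described in detail here, so the function name `contrastive_loss`, the Euclidean distance, and the `margin` value are assumptions rather than the actual implementation.

```python
import numpy as np

def contrastive_loss(emb_a, emb_b, same_class, margin=1.0):
    """Standard pairwise contrastive loss (illustrative, not the project's custom loss).

    emb_a, emb_b: arrays of shape (batch, dim) with the two branch embeddings.
    same_class:   array of shape (batch,), 1 if the pair shares a label, else 0.
    """
    emb_a = np.asarray(emb_a, dtype=float)
    emb_b = np.asarray(emb_b, dtype=float)
    same_class = np.asarray(same_class, dtype=float)

    # Euclidean distance between the two embeddings of each pair.
    dist = np.linalg.norm(emb_a - emb_b, axis=1)

    # Similar pairs are pulled together; dissimilar pairs are pushed
    # apart until they are at least `margin` away from each other.
    pos_term = same_class * dist ** 2
    neg_term = (1.0 - same_class) * np.maximum(margin - dist, 0.0) ** 2
    return float(np.mean(pos_term + neg_term))

# Toy pairs: the first pair is identical, the second is far apart.
a = np.array([[0.0, 0.0], [0.0, 0.0]])
b = np.array([[0.0, 0.0], [3.0, 4.0]])
loss_good = contrastive_loss(a, b, same_class=[1, 0])  # labels agree with distances
loss_bad = contrastive_loss(a, b, same_class=[0, 1])   # labels contradict distances
```

In an n-way k-shot setting, a query image would typically be assigned to the class whose support embeddings lie closest to it under the same distance.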
Molecule Graph Generation [Code]
- Implemented Variational Graph AutoEncoders based on Graph Convolutional Networks to generate new molecular graphs whose statistical distribution resembles that of the molecular graphs used to train the model.
- Libraries/Framework: PyTorch, PyTorch Geometric, NumPy, and NetworkX
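The general shape of a Variational Graph AutoEncoder can be sketched in a few lines of NumPy: a graph-convolutional encoder produces a mean and log-variance per node, a latent code is sampled with the reparameterization trick, and an inner-product decoder turns latent codes into edge probabilities. This sketch assumes a single-layer encoder and random, untrained weights for brevity; the names `vgae_forward`, `w_mu`, and `w_logvar` are hypothetical and do not come from the project.

```python
import numpy as np

def normalize_adjacency(adj):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} used by GCN layers."""
    adj_hat = adj + np.eye(adj.shape[0])
    deg_inv_sqrt = 1.0 / np.sqrt(adj_hat.sum(axis=1))
    return adj_hat * deg_inv_sqrt[:, None] * deg_inv_sqrt[None, :]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def vgae_forward(adj, features, w_mu, w_logvar, rng):
    """One forward pass: encode nodes, sample latents, decode edge probabilities."""
    a_norm = normalize_adjacency(adj)
    mu = a_norm @ features @ w_mu          # per-node latent mean
    logvar = a_norm @ features @ w_logvar  # per-node latent log-variance
    # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I).
    z = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)
    # Inner-product decoder: probability of an edge between every node pair.
    edge_probs = sigmoid(z @ z.T)
    return edge_probs, mu, logvar

rng = np.random.default_rng(0)
# Toy 3-node path graph with one-hot node features (illustrative data).
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
features = np.eye(3)
w_mu = rng.normal(size=(3, 2))
w_logvar = 0.1 * rng.normal(size=(3, 2))
edge_probs, mu, logvar = vgae_forward(adj, features, w_mu, w_logvar, rng)
```

After training, new graphs can be generated by sampling latent codes from the prior and thresholding or sampling from the decoded edge probabilities.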
Human Activity Recognition [Code]
- Developed a Human Activity Recognition system that utilizes a pre-trained 3D convolutional ResNet-34 model to identify activities in videos on a per-frame basis.
- The model was pre-trained on the Kinetics dataset, which includes 400 human activity classes and approximately 300,000 videos.
- The framework can automatically classify video datasets, monitor compliance in food service environments, and oversee patron behavior in bars and restaurants.
- Libraries/Framework: NumPy and OpenCV
Text-to-Image Generation using GANs [Code]
- Implemented a stage-wise StackGAN model capable of producing photo-realistic images conditioned on text descriptions. The generated high-quality images retain the necessary details and vivid object parts.
- Given the text description, the Stage-1 GAN forms the primitive shape and colors of the object. It puts less emphasis on the quality of the image being formed, thereby yielding a low-resolution image.
- The Stage-2 GAN takes the Stage-1 results and the text descriptions as inputs and generates high-resolution images with photo-realistic details. This refinement process can rectify defects in the Stage-1 results and add compelling details.
- Libraries/Framework: Keras, TensorFlow, NumPy, Pandas, and Matplotlib
An Unsupervised Approach to Generate Sentence Embeddings [Code]
- Trained a simple contrastive learning-based framework to perform text similarity, where sentences with similar semantic features attain higher similarity scores.
- Used a pre-trained BERT model to generate two different, yet semantically similar representations for each input sentence with minimal variation.
- Employed a cosine similarity-based contrastive metric to compute the degree of similarity between these latent representations.
- Libraries/Framework: Scikit-learn, TensorFlow, NumPy, Pandas, and Transformers
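As a rough sketch of how a cosine similarity-based contrastive objective can work, the NumPy snippet below computes an InfoNCE-style loss over two views of the same batch of sentences: each sentence's second view is the positive, and the other sentences in the batch act as negatives. The exact loss used in the project is not stated, so the function names, the `temperature` value, and the random stand-in embeddings are assumptions for illustration.

```python
import numpy as np

def cosine_similarity_matrix(x, y):
    """Pairwise cosine similarity between the rows of x and the rows of y."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    return x @ y.T

def info_nce_loss(view_a, view_b, temperature=0.05):
    """InfoNCE-style loss: row i of view_a should match row i of view_b."""
    logits = cosine_similarity_matrix(view_a, view_b) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The diagonal holds the log-probability assigned to the true positive pair.
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
# Random vectors standing in for sentence embeddings (illustrative data).
base = rng.standard_normal((4, 8))
# Second view with minimal variation: positives line up on the diagonal.
aligned = info_nce_loss(base, base + 0.01 * rng.standard_normal((4, 8)))
# Mismatched pairing: each sentence is paired with a different sentence.
shuffled = info_nce_loss(base, base[::-1])
```

The loss is low when the two views of each sentence are more similar to each other than to any other sentence in the batch, which is the behavior the similarity scores are meant to capture.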
Zero-shot Question Answering with Large Language Models [Code]
- Implemented a zero-shot question-answering system that, for each question q with available answer options a, b, and c, computes each option’s score as its log-likelihood under the language model conditioned on the question and then returns the option with the highest score (equivalently, the lowest negative log-likelihood) as the most probable answer to q.
- Libraries/Framework: Transformers, NumPy, and TensorFlow
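The selection rule can be sketched as follows: score every answer option by its log-likelihood given the question and return the highest-scoring one (equivalently, the one with the lowest negative log-likelihood). In this sketch, `pick_answer` and `toy_log_likelihood` are hypothetical names, and the toy scorer is a hand-written stand-in that ignores the question; a real system would sum the language model's token log-probabilities for the answer conditioned on the question.

```python
import math

def pick_answer(question, options, sequence_log_likelihood):
    """Return the option label with the highest log-likelihood, plus all scores.

    options: dict mapping a label (e.g. "a") to the answer text.
    sequence_log_likelihood: callable (question, answer) -> float.
    """
    scores = {
        label: sequence_log_likelihood(question, text)
        for label, text in options.items()
    }
    best = max(scores, key=scores.get)
    return best, scores

# Toy stand-in for a language model: fixed per-token probabilities.
TOY_TOKEN_PROBS = {"paris": 0.6, "rome": 0.3, "berlin": 0.1}

def toy_log_likelihood(question, answer):
    # Sum of token log-probabilities; unknown tokens get a tiny probability.
    # This toy scorer does not actually condition on the question.
    return sum(
        math.log(TOY_TOKEN_PROBS.get(token, 1e-6))
        for token in answer.lower().split()
    )

options = {"a": "Rome", "b": "Paris", "c": "Berlin"}
best, scores = pick_answer(
    "What is the capital of France?", options, toy_log_likelihood
)
```

When the answer options differ greatly in length, dividing each score by the number of answer tokens is a common way to avoid favoring shorter options.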