
Google Summer of Code Final Work Product#

Abstract#

The goal of this project was to implement a Large Language Model (LLM) chatbot that understands the FURY repository, with the aim of lowering the barrier to entry for scientific visualization. Retrieval Augmented Generation (RAG) was used to fetch the necessary context for every user query. Multiple variations were explored, including fine-tuned models alone, a mix of fine-tuning and RAG, and RAG alone. Multiple chunking strategies were also explored for data collection and storage. The models are served to users through a Discord Bot and a GitHub App. All the API endpoints are hosted on HuggingFace Spaces, and Pinecone is used as the database for storing embeddings. Benchmarking, data collection, and testing scripts live in a separate repository.

Proposed Objectives#

The objectives of the GSoC project could be broadly classified as:

  • Figuring out hosting.

    We were constrained by the need to minimize hosting costs, and we managed to complete the whole project on 100% free hosting; the details are under Objectives Completed below.

  • Choosing the technologies to use.

    This covered evaluating runtimes, models, and quantizations, detailed under Objectives Completed below.

  • Work on the backend architecture.

    The backend architecture was heavily shaped by HuggingFace and its limitations. Work here included:

    • Choosing the API architecture.

    • Integrating different models.

    • Improving concurrent requests support.

    • Improving the UX of the endpoints.

  • Work on improving model accuracy.

    This was recurring work that continued through most weeks. It included:

    • Model Benchmarking

    • Data Collection

    • Experiments on Retrieval Augmented Generation

    • Experiments on Fine-Tuning

    • Experiments on Chunking

    • Experiments on Retrieval Quantity

  • Discord Bot integration.

    The work included:

    • Building the Discord Bot.

    • Improving the UX of the bot.

    • Improving the performance of the bot.

  • GitHub App integration.

    The work included:

    • Building the GitHub App integration.

    • Improving the UX of the integration.

Objectives Completed#

  • Figuring out hosting.

    As mentioned, cost was a hard constraint, so we explored different options for free hosting. This took us in interesting directions like Google Colab and Kaggle Notebooks. In the end, HuggingFace proved to be the best fit. Everything is containerized and currently hosted on HuggingFace.

    This also meant that all the subsequent design and architectural choices had to be built around HuggingFace. This caused some challenges with Discord bot hosting, but overall HuggingFace was a solid choice.

    A very detailed blog on hosting is available here.

    The plan is to move all the HuggingFace repositories from my account to FURY’s account, but here I’ll link to my repositories, which are currently active as I write this report.

    • Embeddings Endpoint

      This endpoint converts natural language to embeddings. The model is loaded using HuggingFace SentenceTransformer.
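
      As a rough sketch of what this looks like internally (the model name and route below are illustrative assumptions, not necessarily the deployed configuration):

      # Minimal sketch of an embeddings endpoint on a HuggingFace Space.
      from flask import Flask, request, jsonify
      from sentence_transformers import SentenceTransformer

      app = Flask(__name__)
      model = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical model choice

      @app.route("/embed", methods=["POST"])  # hypothetical route name
      def embed():
          text = request.get_json()["text"]
          vector = model.encode(text).tolist()  # numpy array -> JSON-serializable list
          return jsonify({"embedding": vector})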

    • Ollama Endpoint

      This endpoint is used to communicate with the Ollama models. The perk of using it is that it is more convenient and generally faster. A separate repository was required because a single free HuggingFace Space cannot allocate more than 16 GB RAM and 2 vCPUs; token generation speed would suffer if it were not a separate Space.

    • Database Endpoint

      This endpoint was used to get the K-Nearest (or Approximate) embeddings based on cosine similarity. The parameter K could be passed to adjust it. We used Pinecone as the database.
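
      Very roughly, the retrieval step looks like this (the index name and vector dimension are placeholders; the real setup may differ):

      from pinecone import Pinecone

      pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")
      index = pc.Index("fury-index")  # hypothetical index name

      # The query embedding would normally come from the embeddings endpoint above.
      query_embedding = [0.0] * 384  # placeholder vector of the right dimension
      result = index.query(vector=query_embedding, top_k=3, include_metadata=True)
      for match in result.matches:
          print(match.score, match.metadata.get("source"))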

    • FURY Discord Bot

      The repository for the Discord bot. Threading was required here as a side effect of HuggingFace Spaces: a Space only stays active when there is a live endpoint. The Discord bot did not need an endpoint, but we had to create one to keep the Space activated, so the bot runs on a separate thread while a small web server runs on the main thread.
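
      The workaround looks roughly like this (a simplified sketch; the real bot has its own commands, configuration, and token handling):

      import threading

      import discord
      from flask import Flask

      # Dummy web server so the HuggingFace Space has a live endpoint.
      app = Flask(__name__)

      @app.route("/")
      def alive():
          return "OK"

      class Bot(discord.Client):
          async def on_message(self, message):
              if message.author != self.user and message.content.startswith("!ask"):
                  await message.reply("...")  # call the FURY Engine backend here

      def run_bot():
          intents = discord.Intents.default()
          intents.message_content = True
          Bot(intents=intents).run("DISCORD_TOKEN")  # placeholder token

      # Bot on a background thread, keep-alive server on the main thread.
      threading.Thread(target=run_bot, daemon=True).start()
      app.run(host="0.0.0.0", port=7860)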

    • FURY external cloud endpoints

      This repository orchestrates external APIs from third-party providers like Groq and Gemini. We made it a separate repo to abstract the logic and simplify calling different providers as required. You can hot-swap between multiple LLM models by changing the REST API parameters.
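
      Under the hood, a Groq call is just an OpenAI-style chat completion; a minimal sketch (the API key is a placeholder, the model name matches the example request shown later):

      from groq import Groq

      client = Groq(api_key="YOUR_GROQ_API_KEY")
      completion = client.chat.completions.create(
          model="llama3-70b-8192",
          messages=[{"role": "user", "content": "How do I create a sphere in FURY?"}],
      )
      print(completion.choices[0].message.content)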

    • GitHub App

      Repository for the GitHub application. It receives webhooks from GitHub and acts on them using GraphQL queries.

    • FURY Engine

      This is the main endpoint that both the Discord and GitHub frontend applications hit. It orchestrates all the other endpoints; the architecture is detailed below.

    • FURY Data Parsing/Benchmarking/Testing Repo (GitHub)

      This is a GitHub repository and contains all the parsing, benchmarking and testing scripts.

  • Choosing the technologies to use

    Choosing the technology depended largely on HuggingFace hardware support. We experimented with inferencing LlamaCPP directly, inferencing through Ollama, testing different quantizations, and so on. Phi-3-mini-4k-instruct was chosen initially as the LLM, and we rolled with it through Ollama for a few weeks. But as luck would have it, I discovered Groq, a cloud provider offering free LLM endpoints. We used Groq from then on, and later also integrated Gemini since it has a free tier as well.

    You can hot-swap between a local model, a Groq model, a regular Gemini model, or a fine-tuned Gemini model as you wish through the FURY Engine endpoint. It all integrates cleanly with the Pinecone database outputs and returns a standard API response.

  • Work on the backend architecture

    This is the present backend architecture.

    Present backend architecture

    You only hit the FURY Engine endpoint; the rest are abstracted away. You can tell the engine you need Gemini and it will do that for you. This is also extensible: if you have a new provider, you can add a new endpoint and connect it to FURY Engine.

    A request to the REST endpoint looks like this:

    {
        "query": "Render a cube in fury",
        "llm": "llama3-70b-8192",
        "knn": "3",
        "stream": false
    }
    

    Every output response will look like this:

    {
        "response": "Yes, this is how it would be done python import fury....",
        "references": "1, 2, 3"
    }
    

    So if you do

    curl -X POST https://robinroy03-fury-engine.hf.space/api/groq/generate -H "Content-Type: application/json" -d '{"query": "How do I create a sphere in FURY?", "llm": "llama3-70b-8192", "knn": "3", "stream": false}'

    You’ll get a response from llama3-70b-8192 via Groq. If you hit https://robinroy03-fury-engine.hf.space/api/google/generate instead, you can call Google Gemini models such as gemini-1.5-pro or gemini-1.5-flash. The same applies to Ollama.
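
    The same call from Python (a small sketch; the Space URL may change once the repositories move to FURY’s account):

    import requests

    payload = {
        "query": "How do I create a sphere in FURY?",
        "llm": "gemini-1.5-flash",
        "knn": "3",
        "stream": False,
    }
    response = requests.post(
        "https://robinroy03-fury-engine.hf.space/api/google/generate",
        json=payload,
        timeout=120,
    )
    data = response.json()
    print(data["response"])    # the generated answer
    print(data["references"])  # the retrieved reference chunks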

    A detailed blog on architecture is available here.

  • Work on improving model accuracy

    The initial version had major issues with hallucination and was unable to retrieve relevant context. We fixed them by collecting more data, improving RAG, setting up a benchmark, and so on.

    The initial version used a naive parser to parse code; later my mentors suggested using an AST parser. I chunked the entire repository with it and it performed noticeably better. For model benchmarking, we had two tests: a QnA test and a code test. If the generated code compiles, the model gets one point.
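
    As a rough illustration of AST-based chunking (not the exact parser used; the function and file names here are illustrative), Python's built-in ast module can split a source file into one chunk per top-level function or class:

    import ast

    def chunk_source(path: str) -> list[str]:
        """Split a Python file into one chunk per top-level function/class."""
        source = open(path).read()
        tree = ast.parse(source)
        chunks = []
        for node in tree.body:
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                chunks.append(ast.get_source_segment(source, node))
        return chunks

    # e.g. chunks = chunk_source("fury/actor.py")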

    All the benchmarking, data parsing, and database upsertion scripts are here.
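
    The "one point if the code compiles" check from the code test can be sketched like this (a simplified stand-in for the actual benchmarking script):

    def score_code_answer(code: str) -> int:
        """Award one point if the LLM-generated code at least compiles."""
        try:
            compile(code, "<llm-answer>", "exec")
            return 1
        except SyntaxError:
            return 0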

    We used an image model called moondream2 to validate the output generated by the LLM. Since FURY is a graphics library, we need to judge the rendered image to see whether it is correct or not.

    Fine-tuning was done on Google AI Studio. We fine-tuned using question/answer pairs from Discord and GitHub discussions, and later tried mixing RAG with fine-tuning. A detailed blog on Fine-Tuning is available here.
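
    Once tuned, calling the model is mostly a model-name swap; very roughly (the tuned model ID below is a placeholder, and tuned models may need extra auth setup depending on the account):

    import google.generativeai as genai

    genai.configure(api_key="YOUR_GEMINI_API_KEY")
    # "tunedModels/..." is the ID Google AI Studio assigns after fine-tuning.
    model = genai.GenerativeModel("tunedModels/fury-qa-placeholder")
    response = model.generate_content("How do I create a sphere in FURY?")
    print(response.text)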

    A detailed blog on benchmarking is available here.

    A detailed blog on chunking is available here.

  • Discord Bot integration

    This included building the Discord bot and connecting it to the backend API. As mentioned above, threading was used to get the bot running on the server, but this does not affect any other part of the bot and it works as usual.

    This is what the Discord integration looks like:

    Present Discord Bot UI.

    The code runs! This is the output of the code:

    Output of the code.

    Work was also done on improving the UX of the bot. There are 👍 and 👎 options available for the user to rate the answer. We’ll use those signals to improve the bot further. There are reference links at the bottom that lead to the exact places where the answers are sourced from. You can technically also use the Discord bot as a search engine if you want to.

    Initially, the bot had a sync-over-async problem. It was later fixed, and now multiple people can converse with the bot simultaneously.

  • GitHub App integration

    This included building the GitHub app and figuring out how to set up its UX. GitHub uses GraphQL, but we didn’t use a separate GraphQL library; we used a custom setup to query the GraphQL endpoint. Since we only work with one or two commands, it works well. The code is here.
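
    The custom setup is essentially a plain HTTP POST to GitHub's GraphQL endpoint; a minimal sketch (the token and node ID are placeholders, and the real app authenticates as a GitHub App installation):

    import requests

    GITHUB_GRAPHQL_URL = "https://api.github.com/graphql"

    def post_comment(token: str, subject_id: str, body: str) -> dict:
        """Add a comment to the issue/discussion identified by its GraphQL node ID."""
        mutation = """
        mutation($subjectId: ID!, $body: String!) {
          addComment(input: {subjectId: $subjectId, body: $body}) {
            commentEdge { node { url } }
          }
        }
        """
        response = requests.post(
            GITHUB_GRAPHQL_URL,
            json={"query": mutation, "variables": {"subjectId": subject_id, "body": body}},
            headers={"Authorization": f"Bearer {token}"},
        )
        return response.json()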

    GitHub App UI looks like this:

    Present GitHub App UI.

    It is similar to Discord because the results come from the same backend. Refer to the backend architecture above for reference.

Other Objectives#

  • Improving the LLM output (ongoing)

    This will continue till I’m satisfied; it’s a never-ending journey :) Much of this GSoC was spent setting things up and getting it all to work as one piece. There are tons of new ideas coming up every day for improving LLM accuracy, and I’ll explore the interesting ones.

  • Tests for all endpoints (ongoing)

    It’s important to have tests for all endpoints. Testing includes the following (a small sketch follows the list):

    • Check the endpoints with valid data to see the response. Validate the JSON format.

    • Check the endpoints with incorrect schema and record the response.

    • Test by adjusting parameters like KNN.
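
    A minimal sketch of such a test (pytest-style functions using requests; the URL and schema match the engine described above, but treat this as illustrative rather than the actual test suite):

    import requests

    ENGINE_URL = "https://robinroy03-fury-engine.hf.space/api/groq/generate"

    def test_valid_request_returns_expected_schema():
        payload = {"query": "Render a cube in fury", "llm": "llama3-70b-8192",
                   "knn": "3", "stream": False}
        data = requests.post(ENGINE_URL, json=payload, timeout=120).json()
        # Validate the JSON format of a successful response.
        assert "response" in data and "references" in data

    def test_invalid_schema_is_handled():
        # A request missing required fields should not produce a normal completion.
        bad = requests.post(ENGINE_URL, json={"oops": True}, timeout=60)
        assert bad.status_code != 200 or "response" not in bad.json()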

  • X Bot (Optional Goal, deferred for now)

    I had a talk about this with my mentors. This can be done by plugging the LLM backend into an X bot frontend, but they suggested spending my time improving model accuracy rather than simply adding another frontend for the LLM application.

Other Open Source tasks#

GSoC isn’t only about what I do with my project; it exists alongside the three other cool projects my peers Wachiou, Iñigo, and Kaustav worked on. I learnt a lot from them reviewing my PRs and from reviewing theirs. I attended all of Wachiou’s weekly meetings to follow his progress and learn new things, and he attended all of mine too, which was awesome :)

Contributions to FURY apart from the ones directly part of GSoC:

Contributions to other repositories during this time, due to GSoC work:

Acknowledgement#

I am very thankful to my mentors Serge Koudoro and Mohamed Abouagour. They were awesome and provided me with a comfortable environment to work in. I also have to thank Beleswar Prasad Padhi, who gave me a very good introduction to open source. The good thing about open source is that I can still work on this (and other FURY projects) till I’m satisfied. I’m excited to continue contributing to the open source community.

Timeline#

GSoC 2024 Weekly Reports#

Week      Description                                         Blog Post Link

Week 0    Community Bonding!                                  Blog 0
Week 1    It officially begins…                               Blog 1
Week 2    The first iteration!                                Blog 2
Week 3    Data Data Data!                                     Blog 3
Week 4    Pipeline Improvements and Taking The Bot Public!    Blog 4
Week 5    LLM Benchmarking & Architecture Modifications       Blog 5
Week 6    UI Improvements and RAG performance evaluation      Blog 6
Week 7    Surviving final examinations                        Blog 7
Week 8    Gemini Finetuning                                   Blog 8
Week 9    Hosting FineTuned Models                            Blog 9
Week 10   Learning GraphQL                                    Blog 10
Week 11   Getting the App Live                                Blog 11
Week 12   Wrapping things up                                  Blog 12