Week 3: Data Data Data!
=======================

.. post:: June 16 2024
   :author: Robin Roy
   :tags: google
   :category: gsoc

Hi, I'm `Robin <https://github.com/robinroy03>`_ and this is my blog about week 3.

My goal for week 3 was to collect data more efficiently and improve the citations. I also had my mid-terms during this week so I had to get things done fast.

Things I did in week 3
----------------------

1) **A better data parsing technique**

My initial approach was naive: just regex and some common filters. Later, my mentors told me to use the ``inspect`` module. I studied that module and realized that I needed to parse the data using an AST. I didn't use ``inspect`` for the parsing itself, since I only needed the function/class signatures and docstrings. Instead, I used the ``ast`` module from the Python standard library. My mentors gave me the general direction, which was to use ASTs to parse the data effectively.

So now we have a script that you run like ``python extractor.py fury``, and it generates the appropriate JSON files.

``{"path": "../..", "function/class name": "name", "docstring": "..", "class_methods": ["method1", "..."]}``
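
To give a rough idea of what AST-based extraction can look like, here is a minimal sketch. It is not the actual ``extractor.py``: the function name ``extract_records`` and the ``SAMPLE`` source are hypothetical, it only visits top-level definitions, and it leaves out signatures to keep things short.

.. code-block:: python

   import ast
   import json


   def extract_records(source: str, path: str) -> list:
       """Collect name, docstring and (for classes) method names."""
       tree = ast.parse(source)
       records = []
       # Only top-level definitions are visited in this sketch.
       for node in tree.body:
           if not isinstance(
               node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)
           ):
               continue
           record = {
               "path": path,
               "function/class name": node.name,
               "docstring": ast.get_docstring(node) or "",
               "class_methods": [],
           }
           if isinstance(node, ast.ClassDef):
               # Methods are the function definitions directly in the class body.
               record["class_methods"] = [
                   child.name
                   for child in node.body
                   if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef))
               ]
           records.append(record)
       return records


   # Hypothetical input, used only to demonstrate the function.
   SAMPLE = '''
   def add(a, b):
       """Return a + b."""
       return a + b


   class Actor:
       """A toy class."""

       def show(self):
           """Show the actor."""
   '''

   records = extract_records(SAMPLE, "sample.py")
   print(json.dumps(records, indent=2))

Because ``ast`` only parses the source text, nothing is imported or executed, which is one reason it can be a better fit here than ``inspect``.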

I also changed the format of the chunks that get upserted. Earlier they were plain strings; now they are JSON (the same format as above). I do not have a scientific reason for this, but empirically it looks like it helped. Benchmarking is something I'm planning to do next week.

Metadata format:

``metadata: {"path": "../..", "function/class name": "name", "docstring": "..", "methods": [(method1, docstring), (method2, docstring), ...]}``
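
As an illustration of the JSON chunk plus metadata idea, here is a small sketch that turns one extracted record into something that could be upserted. The helper ``build_upsert_item``, the ``id`` scheme and the example record are all assumptions made for this example, not the code the bot actually runs.

.. code-block:: python

   import hashlib
   import json


   def build_upsert_item(record: dict) -> dict:
       """Turn one record into an id, a JSON text chunk and metadata."""
       # Hypothetical id scheme: a short hash of path + name.
       key = f'{record["path"]}::{record["function/class name"]}'
       return {
           "id": hashlib.sha256(key.encode("utf-8")).hexdigest()[:16],
           # The chunk that gets embedded is the record serialized as JSON.
           "text": json.dumps(record, ensure_ascii=False),
           "metadata": {
               "path": record["path"],
               "function/class name": record["function/class name"],
               "docstring": record["docstring"],
               "methods": record.get("methods", []),
           },
       }


   # Made-up record, only for demonstration.
   example_record = {
       "path": "pkg/module.py",
       "function/class name": "Actor",
       "docstring": "A toy class.",
       "methods": [["show", "Show the actor."]],
   }

   item = build_upsert_item(example_record)
   print(item["text"])

Keeping the path in the metadata is what makes it possible to point back to the source when citing.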

2) **Links for citation**

Now the bot shows links for citations. Because of the new parsing, I was able to do that pretty efficiently.

.. image:: /_static/images/gsoc-robin-3-fury-discord-bot-references-url.jpg
    :alt: Link based references for the LLM output.


3) **Faster Inference**

The Generative AI field moves so fast that it is easy to miss things. Groq is a company providing free APIs for Llama and other open-source models (free for now, at least). Its inference speed is also very high, so I decided to integrate it into our infrastructure.
Since everything is a microservice in our architecture, it is easy to add new things.

Our architecture:
   .. raw:: html

      <img src="https://github.com/fury-gl/fury-communication-assets/blob/main/gsoc_2024/7-6-2024-demo-architecture-gsoc-robin-week2.png?raw=true">

So now, along with Ollama, we also have Groq inference. I aim to make a ``router`` so that we can swap between providers as required. I'm also very interested in integrating Google Gemini 1.5 Flash and other models. Groq does not support fine-tuning, but Flash supports it and is `free of cost <https://developers.googleblog.com/en/gemini-15-pro-and-15-flash-now-available/#:~:text=To%20support%20that%2C%20we%20will%20also%20be%20rolling%20out%20tuning%20support%20for%20Gemini%201.5%20Flash%20on%20June%2017th.%20Tuning%20will%20be%20supported%20in%20both%20Google%20AI%20Studio%20and%20the%20Gemini%20API%20directly.%20Currently%2C%20tuning%20jobs%20are%20free%20of%20charge%2C%20and%20using%20a%20tuned%20model%20does%20not%20incur%20any%20additional%20per%2Dtoken%20costs.>`_ (for now). Our architecture is platform agnostic, so we can try out different things without being locked into any of them. We will also fine-tune our phi3 model, since we have the data with us.
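
The router does not exist yet, so the following is only a sketch of one possible shape for it. The ``Router`` class and the ``fake_ollama`` / ``fake_groq`` functions are hypothetical stand-ins; a real provider function would call the corresponding service instead of returning a tagged string.

.. code-block:: python

   from typing import Callable, Dict

   # A provider is any callable that maps a prompt to generated text.
   Provider = Callable[[str], str]


   class Router:
       """Dispatch a prompt to one of several registered providers."""

       def __init__(self) -> None:
           self._providers: Dict[str, Provider] = {}

       def register(self, name: str, provider: Provider) -> None:
           self._providers[name] = provider

       def generate(self, name: str, prompt: str) -> str:
           if name not in self._providers:
               raise KeyError(f"unknown provider: {name}")
           return self._providers[name](prompt)


   # Stand-ins for real backends; no network calls are made here.
   def fake_ollama(prompt: str) -> str:
       return f"[ollama] {prompt}"


   def fake_groq(prompt: str) -> str:
       return f"[groq] {prompt}"


   router = Router()
   router.register("ollama", fake_ollama)
   router.register("groq", fake_groq)

   print(router.generate("groq", "What is FURY?"))

With this shape, adding another provider would only mean registering one more callable, which fits the microservice setup described above.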

    .. raw:: html

        <iframe src="https://github.com/robinroy03/fury-discord-bot/assets/115863770/234fee85-9eb4-4fd5-a334-9e6d11e552a3" width="640" height="390" frameborder="0" allowfullscreen></iframe>

4) **Dockerizing Discord Bot**

Earlier, I used the Hugging Face implementation (I had copied their demo implementation). It was not a good setup. My mentors suggested dockerizing the bot, so I did that.


What is coming up next week?
----------------------------

- Benchmarking. Now we have the data, but we need to properly benchmark to see whether the modifications I make every day are making the bot dumber or smarter.
- Study different techniques to improve model answer accuracy such as `HyDE <https://arxiv.org/abs/2212.10496>`_.
- Study how to go forward with fine-tuning.
- Improved references.
- Collect more data.


Did you get stuck anywhere?
---------------------------

No, everything went well this week. Exam preparation was a pain though 😢.

LINKS:

- `Gemini Blog <https://developers.googleblog.com/en/gemini-15-pro-and-15-flash-now-available>`_

- `HyDE <https://arxiv.org/abs/2212.10496>`_

- `Robin :) <https://github.com/robinroy03>`_

Thank you for reading!