Are LLMs actually good at retrieving information?

Finding and extracting the right information from web pages and documents in response to a query is challenging, even when searching over a limited scope, such as a single website. Think about it: the user wants clear and precise answers, but most websites contain so much content that, for the system, it's like searching for a needle in a haystack.

Classic state-of-the-art solutions, based on exact text search or approximate search using embeddings, have never been able to solve the search problem completely. While they can usually retrieve a set of documents that are more or less relevant to the user's query, the final step of distilling the information is missing. Instead of receiving the answer, the user gets a set of 10+ pages they have to sift through themselves.

This is where the promise of Large Language Models (LLMs) enters the picture. Developers and tech enthusiasts have been exploring the capabilities of LLMs to bridge this gap. These solutions, called Retrieval-Augmented Generation (RAG) systems, offer a glimmer of hope in transforming the user's experience by providing more than just a list of potentially relevant documents. They aim to provide direct, human-like answers to queries, effectively cutting through the clutter and delivering the sought-after distilled information directly to the user.

But here's the rub: while LLMs have made significant strides in improving the accuracy and comprehensibility of responses, they too come with their own set of challenges. The complexity of understanding context, handling ambiguous queries, and managing large chunks of information is no small feat. Moreover, the fine-tuning required to ensure that these models consistently provide answers appropriate for a specific domain is still an open problem.

In this no-nonsense article, we're going to evaluate and compare a few LLMs for the task of retrieving information. The setup will be similar to the way they would be used in a RAG solution, for answering web and document queries.

We'll dive into what makes these models tick and why they can be a real headache in different situations. Whether you're an ML pro wrestling with AI complexities or someone just starting to look into this stuff, get ready to face the frustrating hiccups that come with using LLMs.

We will see that applying LLMs for RAG comes with many challenges; some of them can be addressed or circumvented, but never entirely, and always at a cost.

Scope

We're only going to look at open models that you can download and run yourself, not SaaS models like ChatGPT or Claude.

We will test:

  • Llama 2 (Meta): 7B, 13B, 70B
  • Mistral (Mistral AI): 7B
  • MistralLite (Amazon): 7B
  • Falcon (Technology Innovation Institute): 40B

System requirements

We will use a system running a recent version of Linux. We are using Ubuntu 22.04 here, but other distributions are fine, as long as we have installed:

  • Python
  • C/C++ compiler and cmake
  • git
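
On Ubuntu 22.04, these prerequisites can be installed with something like the following (package names may differ on other distributions):

sudo apt install python3 build-essential cmake git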

In terms of LLM engines, we tried several options, but settled on llama.cpp at commit 46521 (tag: b1407). We used a context size of 8k tokens, with RoPE scaling for models that were originally trained at 4k.
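
For reference, llama.cpp is built from source; at that tag, a CUDA-enabled build looks roughly like this (the exact make flags have changed in later llama.cpp versions, so treat this as a sketch):

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git checkout b1407
LLAMA_CUBLAS=1 make -j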

In terms of hardware, we must use GPU acceleration to get good performance while running large models. The main limitation is the amount of video RAM, so we opted for a system with 1x NVIDIA A100 80GB. This is large enough to run the 70B-parameter Llama 2 model with 4-bit quantization. For Falcon, we will limit ourselves to testing the 40B parameter version.

We rented a system from vast.ai equipped with:

  • 1x NVIDIA A100 SXM4 with 80 GB of video RAM
  • AMD EPYC 7763 64-Core Processor
  • 2052 GB system RAM

This system cost $1.8/hour ($1,296/month or $15k per year).

For performance comparison only (not qualitative evaluation), we also tested a cheaper system with:

  • AMD Ryzen 7 3700X 8-Core
  • 32 GB of system RAM
  • NVIDIA GeForce RTX 4090 24 GB ($1,200-$1,500)

And an even cheaper system equipped with an NVIDIA GeForce RTX 3060 12 GB ($250-$300).

Here is a summary of the models tested, the resource usage and the performance in tokens generated per second:

Model | Parameters | Quantization | Context size (tokens) | Peak RAM usage | A100 tok/s | RTX 4090 tok/s | RTX 3060 tok/s
Llama 2 chat | 7B | Q4_K_M | 8192 | 9.3 GB | 52 | 57 | 19
Llama 2 chat | 13B | Q4_K_M | 8192 | 15.4 GB | 35 | 37 | N/A
Llama 2 chat | 70B | Q4_K_M | 8192 | 43.8 GB | 11 | N/A | N/A
Mistral instruct | 7B | Q4_K_M | 8192 | 6.4 GB | 55 | 66 | 23
MistralLite | 7B | Q4_K_M | 8192 | 6.4 GB | 55 | 66 | 23
Falcon | 40B | Q4_K_M | 8192 | 27.8 GB | 15 | N/A | N/A

An easy query

The goal of this test is to check LLM performance in a RAG-like application, while keeping the input quite simple: plain text, in English, formatted correctly, not too long, and without logically complex content or queries.

The idea is that if we cannot make the LLM perform reasonably well at this task, there is little hope of success when applying it for building a generic RAG system.

A reasonable choice for this task would be a small Wikipedia page. But there is a problem with this approach: Wikipedia is part of the training corpus of most LLMs, so we would not be testing RAG, but knowledge that the LLM may have already memorized.

To avoid this, we choose a Wikipedia page that has been created only very recently, so it could not have been part of the LLM training sets. Luckily, there is a special page called NewPages, which is effectively a log of new articles ordered by timestamp.

We choose White Cloud Farms Pottery, an article about an American ceramics studio that operated from 1924 to 1957.

Initially, we tried asking a few LLMs questions while feeding them the HTML source of the page directly, but the quality of the results was very poor. We will therefore not include those tests here, and will instead do the evaluation based on the plain text of the page.
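
In a production RAG pipeline the plain text would be extracted programmatically rather than copy-pasted from the browser. As a rough illustration (not the method used for the tests below), here is a minimal Go sketch that strips tags using the external golang.org/x/net/html package; it simply concatenates text nodes, which is much cruder than what a browser does:

package main

import (
    "fmt"
    "os"
    "strings"

    "golang.org/x/net/html"
)

// extractText walks the parsed HTML tree and collects the contents of text
// nodes, skipping <script> and <style> elements.
func extractText(n *html.Node, sb *strings.Builder) {
    if n.Type == html.ElementNode && (n.Data == "script" || n.Data == "style") {
        return
    }
    if n.Type == html.TextNode {
        sb.WriteString(n.Data)
        sb.WriteString(" ")
    }
    for c := n.FirstChild; c != nil; c = c.NextSibling {
        extractText(c, sb)
    }
}

func main() {
    // Read the HTML document from stdin, e.g.: curl -s <url> | ./extract
    doc, err := html.Parse(os.Stdin)
    if err != nil {
        panic(err)
    }
    var sb strings.Builder
    extractText(doc, &sb)
    fmt.Println(sb.String())
}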

The text we used is the following, obtained with copy-pasting the main page content from the browser, excluding links from top, left and bottom navigation areas:

White Cloud Farms Pottery

Article
Talk
Read
Edit
View history

Tools
From Wikipedia, the free encyclopedia
White Cloud Farms Pottery, also referred to as White Cloud Pottery, was a 20th-century American ceramics studio (1924–1957) located in Rock Tavern, New York, Orange County, some 65 miles north of Manhattan. The pottery and tile production was one part of the Bacher family's White Cloud Farms business corporation which also produced apples, poultry and livestock The pottery was an important manufacturer of decorative American art pottery and tiles, marketed nationally by influential wholesalers, in New York City by art galleries, and locally at the farm's studio. To date, two articles and a website provide most information about the Bacher family and their ceramics venture.[1][2][3]

History
Manhattan academic artist Otto Henry Bacher (1856–1909), married to former art student Mary Holland (1868–1950) of Cleveland, Ohio, moved in 1895 to Lawrence Park, an art colony in Bronxville, NY, now designated the Lawrence Park Historic District. They had four sons: Holland Robert (1890–1958), Otto Devereux (1892–1943), Eugene David (1893–1981) and Will Low Bacher (1898–1982). Living and raising a family among the artistic elite, the Bacher parents formed friendships and connections that would last into the next generation.

Establishing White Cloud Farm Inc.
Eight years after Otto Henry Bacher's death in 1909, Mary Bacher and her sons moved to the 125-acre farm she had purchased in Rock Tavern, New York. The working farm came with planted produce, horses, cows, pigs, and chickens. Their new name, White Cloud Farm, signaled their optimism and new beginnings. The four boys worked there and continued their education. Robert (Bob) Bacher graduated from University of Illinois with a degree in ceramic engineering, along with Eugene Bacher who got a degree in agriculture. Otto Devereaux Bacher studied theater in New York while Will Low Bacher studied art in New York's National Academy of Design, and then in Rome and Paris.

In 1927 Robert Bacher incorporated and trademarked the company adding an “s” to Farm. The new White Cloud Farms brand would encompass all divisions of the operation: apples, poultry, eggs, pottery and tile. Three family members ran the businesses. The brothers Eugene and Otto Devereaux oversaw operations at the apple orchard, now 65 acres, and the poultry/egg farm of 2,500 chickens. Bob, the trained ceramic engineer, established the White Cloud Farm pottery studio in 1924 in a renovated chicken barn and was joined by brother Will Low who designed shapes and decorations.

Establishing White Cloud Farms Pottery
In the 1924–1927 start-up period, Robert Bacher used his ceramics engineering to develop several clay-body and glaze formulations until he hit on the White Cloud Farms pottery we know. Initially his casting clay used ball-clay. Dissatisfied, he stopped production and changed the slipcasting formula to contain china clay (kaolin) requiring high-firing temperatures, 2,200-2,400 degree F, resulting in a tougher ware. (true porcelain requires 2,650 degrees Fahrenheit (1,454 degrees Celsius) while the Bacher's slightly lower temperature range produces a ware called “china”.[4]

Then he formulated glazes which also fired at the clay-body's high temperature: range. By 1927 Bacher had arrived at a porcelain-like product using an economical single-firing of glaze and greenware simultaneously, achieving a “glaze fit,” avoiding crackle, creep, bubbling and so on. This high temperature ruled out toxic lead-glazes and underglaze decorative techniques.

Robert built his oil-fired kilns and formulated his clay-bodies and glazes. Will Low designed the pottery shapes and decorations. Helene Ayres Bacher, Robert's wife, and others would paint the designs following dotted outlines transferred with perforated templates using pouncing. The pottery had two or three employees at any given time.

Identifying marks are found on the bases of White Cloud Farms wares. Some are incised with the apple-with-stem. Others have White Cloud Farms legends hand-painted in glaze. Smallish paper labels, rarely preserved, show farmland, trees and a large white cloud on a black background.

Marketing
The wares were compared to porcelain. For example, a New Yorker magazine writer described White Cloud pottery sold at Cauman Gallery, midtown Manhattan, in 1940: “…Enamel-like White Cloud pottery, so fine its almost porcelain, is glazed in topaz, celadon, Chinese oxblood and deep, gold-sprinkled lapis blue…” [5]

Documentation of the Bachers’ pottery wares and their merchandising is found in two 1933 publications by Permanent Exhibition of Arts & Crafts, Inc., known also as P.E.D.A.C. or PEDAC. This influential decorators’ gallery was their Manhattan wholesaler and retailer of the 1930s. PEDAC and its clients were widely publicized and the gallery took up the entire 10th floor of Rockefeller Center.[6]

Their 15-page exhibition catalog, a booklet entitled P.E.D.A.C, described each exhibitor and included a photograph of the exhibit. White Cloud Farms Pottery is shortened to “White Cloud Pottery.” [7]

Each exhibitor had a supply of additional PEDAC brochures both for their own use and for PEDAC salespeople. The Bachers’ brochure was entitled “White Cloud: Ceramic & Sculpture,” and printed with a catalog of line drawings, “Some Designs and Prices,” of 37 utilitarian and decorative ceramics, giving apparent retail prices as FOB Rock Tavern, NY. The PEDAC showrooms were constantly reconfigured to show to the public and to decorators their exhibitors’ products in decorated rooms, table settings. PEDAC publicized these showings in the New York press as gallery exhibitions.

By the late 1930s, PEDAC wound down and the Bachers looked to other sales venues. A "New Yorker" article of 1940 described a White Cloud Pottery platter at Rebecca Cauman's highly regarded crafts gallery at 14 East 50th Street.[5] Yearly the Bachers exhibited White Cloud Farm Pottery and agricultural products at the Orange County Fair (New York), Middletown, NY. Local newspapers would proudly cover this singular family enterprise.[8][9]

“Bottoms Up” Shot Cup
Of the hundreds of designs produced by White Cloud Farms pottery, their risqué ceramic Bottoms Up shot cup depicting the naked flapper is today their best-known collectible design.[10][11] Initiated during Prohibition, these cups have been imitated over the decades, in glass. Lacking a base, the cup rests upside-down on its rim when not in use, its rounded bottom up. The naked girl is draped over the bottom of the upended drinking cup, arms and legs spread wide, her bottom up.

The cup seems inspired by the classic toast, Bottoms Up! Others claim this very shot cup inspired the toast because, once filled, the drink needs to be knocked back since, lacking a base, it can't be put back down.

Will Low Bacher filed his original design for the White Cloud Farms Bottoms Up Shot Glass on August 23, 1928. The final patent, serial number 27,939, was granted to Bacher and White Cloud Farms on February 19, 1929, for a term of 14 years. Also that February, they were granted trademark for the name, “Bottoms Up,” which applied to earthenware drinking cups in Class 30 Crockery. Pirated glass copies appeared, made in various colors by McKee Glass Company, Jeanette, PA.[12]

Bacher and White Cloud Farms sued McKee for patent infringement. In the out-of-court settlement, McKee 'leased' or purchased the Bottoms Up Shot Glass design patent from Bacher to allow for mass production of the design in glass. Then McKee changed the original suggestive open-leg design to the relatively sedate design with closed legs. White Cloud Farms seems to have retained the rights to manufacture the Bottoms Up cup and would depict the naked flapper in two versions: legs closed and legs spread.

White Cloud Farms pottery also produced a Bottoms Up martini pitcher until the late 1930s. It was cylindrical and its lid was in the form of the top half of the naked flapper.

The McKee Glass Company went out of business in 1951. In the 1970s, the Summit Art Glass Company, Akron and Ravenna, Ohio, bought the McKee glass molds and put Bottoms Up shotglasses in production omitting the patent numbers in original molds.

Tiles
White Cloud Farm Pottery tiles and their production have been described in detail.[1] They were also slipcast and produced using a high-kaolin slip. All greenware tiles started out as 12” slipcast squares, cast in flat bivalve molds. Some 15 or 20 such molds would be stacked on edge, tightly bound together so that when poured in funnels in the top/end of each mold, the slip would not escape. Once dry, the tiles could be removed, trimmed, and cut to any size. The tiles would then be lain flat on another drying rack. The wood racks with removable shelves were 6 feet tall, 3 to 4 feet deep and 3 feet wide. Once the greenware tiles were glazed with designs, they were single-fired at high temperature, 2200–2400 degrees F, achieving a porcelain-like ware. (true porcelain requires 2,650 degrees Fahrenheit (1,454 degrees Celsius) while the Bacher's slightly lower temperature range produces a ware called “china.”

Starting in the 1930s, White Cloud Farms pottery's tiles were merchandized and sold to the trade by the eminent New York tile wholesaler Vanderlaan Tiles, located on Park Avenue. Following World War I, Vanderlaan specialized in imported Dutch tiles and, in the 1930s, Charles Vanderlaan, the son, branched out to also represent certain New York-based art tile companies, namely White Cloud Farms, Soriano Ceramics, and Designed Tiles.[13] Vanderlaan marketed them as “White Cloud Faience Tile,” implying their impermeability to water. These were not tin-glazed majolica or faience, rather more of a porcelain. According to the Vanderlaan Tile catalog:

“The White Cloud decorated tiles are made by the identical methods employed by the European ceramic artists. Each is a handmade tile. The designs are all hand-painted. These tiles have the added advantage of being frost-proof, and therefore are as usable out-of-doors as indoors.”

Vanderlaan Tile featured White Cloud tile lines in its two known illustrated sales catalogs of the 1950s.[14] They gave examples of the adoption of White Cloud Faience tile in New York City architecture: Parke-Bernet Galleries had turquoise tiled reflection pools in its landscaped roof on Madison Avenue; the Antique gold tile stripes in the façade of Blackton Ltd. on Fifth Avenue and 57th Street; the underglaze roses on the tile façade of Hunter Shops, opposite Blackton Ltd. Mosaic tiles by White Cloud were used for underground street signs in the New York subways.

The Vanderlaan Tile company's illustrated catalogs offered decorated tiles in five series of 12 designs each: Rock Tavern, Pennsylvania Dutch, Dutch Colonial, Dutch Colonial Ships, American Historical. As well there was a Sailing Vessels series of 20 designs.

Closure
The small pottery company, successful from its start in the 1920s, would stay busy for the next few decades until its closing in 1957. The pottery building at White Cloud Farms was rebuilt by 1947, having burned to the ground in 1945. Like many pottery structures of the time, the timbers became dried from the intense heat of a burning kiln and were easily combustible. It was rebuilt from cinder blocks, 40 ft × 80 ft with three large oil-burning kilns.

White Cloud Farms leased out the orchards in 1954 and the pottery ceased operations in 1957. Founder Robert Bacher died on August 7, 1958. It is unknown what became of the White Cloud Farm Pottery records.

In the early 1960s, the Bachers’ barn was sold to New York architect Russ Glick who transformed it into a showplace second residence. Glick and partner became strong vocal opponents of the Rockefeller state government's early 1970s takeover by eminent domain for Stewart International Airport.

In 1965, ceramist Ron Burke (1936–2016) purchased the pottery and contents on 1.04 acres of land from Will Low Bacher. Burke and wife Tess renamed the ceramic enterprise Rock Tavern Pottery. They created and sold wheel-thrown pottery and tile, mostly stoneware. By 1967 they had traded the White Cloud slipmolds to a ceramics hobby shop in nearby Montgomery, NY, in exchange for a new electric kiln.[15][16]

When in 1973 the State of New York bought the Burke's land through eminent domain for the proposed Stewart International Airport expansion, they moved their pottery to a farm in Shapleigh, Maine, retiring the name Rock Tavern Pottery.

The state government used its eminent domain powers to take 7,500 acres (30 km2) for terminals, runways and a buffer zone expanding the airport from Newburgh into neighboring towns of Montgomery, Hamptonburgh, New Windsor, and Rock Tavern. While the Stewart International Airport proposal never came to fruition, partly owing to the 1970s fuel crisis, the Bacher legacy was displaced along with three hundred and thirty-seven families. Five hundred houses and structures were bulldozed and the rubble carted away. Today there is not a trace of White Cloud Farm or the large family homestead to be found.

References
 Visakay, Stephen (June 24, 2021). "White Cloud Faience Tile: 1924-1958". www.tileheritage.org. Tile Heritage Foundation. Retrieved 21 October 2023.
 "White Cloud Farms Pottery, Rock Tavern, New York". www.facebook.com/WhiteCloudFarmsPotteryRockTavernNewYork/. Retrieved 21 October 2023.[self-published]
 Visakay, Steven (July 28, 2004). "Risqué Business: Naughty Pottery Was Keystone of White Cloud Farms' Deco Wares". Antique Trader: 1, 28.
 "Low Fire vs. High Fire Porcelain in Manufacturing". www.artistictile.net. Retrieved 21 October 2023.
 "On And Off The Avenue: Christmas Gifts". New Yorker: 81–82. November 16, 1940.
 "Decorators' Group Takes Rockefeller Center Floor". New York Times. 20 June 1933.
 Permanent Exhibition of Decorative Arts & Crafts, Inc. Rockefeller Center, New York: P.E.D.A.C. 1933. p. 15.
 Pierson, Orrin T. (December 5, 1935). "The Gleaner The Bachers of Rock Tavern". Middletown Times Herald: 4.
 "Pottery From White Cloud Farm, Only Kiln in County to be Shown at Fair". Middletown Times Herald: 20. August 17, 1935.
 Stoodley, Sheila Gibson (September 2008). "One Last Thing: A rarity that captures the collector's fancy". Robb Report.
 Visakay, Steven (July 28, 2004). "Risqué Business: Naughty Pottery Was Keystone of White Cloud Farms' Deco Wares". Antique Trader: 1, 28.
 "S. McKee & Company / McKee Glass Company". www.glassbottlemarks.com. Retrieved 21 October 2023.
 Jenssen, Victoria (Spring 2022). "Designed Tiles: A Silkscreen Studio in New York, NY:1939-1978". www.academia.edu. Tile Heritage. pp. 21–37. Retrieved 21 October 2023.
 "Tiles and Faience". Tiles and Faience. New York, NY: Vanderlaan Tile Company. 1952.
 "Ron Burke - Potter". ronburkepotter.blogspot. Retrieved 21 October 2023.
 Creamer, Melanie. "Ron Burke of Shapleigh, master potter who shared skill, May 27, 2016". www.pressherald.com. The Portland Press Herald.
Bibliography
Permanent Exhibition of Decorative Arts & Crafts, Inc. Rockefeller Center, New York: P.E.D.A.C. 1933. p. 15
Vanderlaan Tile Company (1952). Tile and Faience. New York: Vanderlaan Tile Co.
Visakay, Steven (July 28, 2004). "Risqué Business: Naughty Pottery Was Keystone of White Cloud Farms' Deco Wares". Antique Trader. 1, 28
Visakay, Stephen (June 24, 2021). "White Cloud Faience Tile: 1924–1958". Tile Heritage Vol. XI No.1
"White Cloud Farms Pottery, Rock Tavern, New York." www.Facebook.com/WhiteCloudFarmsPotteryRockTavernNewYork/

The question we ask the LLM is:

Answer based on the above text: who was Helene Ayres Bacher and what can you tell me about her?

The LLM should base its response on the following passage from the text:

Helene Ayres Bacher, Robert's wife, and others would paint the designs following dotted outlines transferred with perforated templates using pouncing.
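
The file article.txt that we pass to llama.cpp below is simply the page text above with the question appended at the end; it can be assembled with something as simple as:

cp page.txt article.txt
printf '\nAnswer based on the above text: who was Helene Ayres Bacher and what can you tell me about her?\n' >> article.txt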

Easy query: Llama 2 7B chat

We run the model with the command:

./bin/main -ngl 99 -m ~/llama-2-7b-chat.Q4_K_M.gguf -c 8192 --rope-freq-base 100000 --color --temp 0.1 -f ./article.txt -n 256

Explanation for the parameters:

  • Load all (up to 99) layers of the LLM on the GPU
  • Load the model with Q4_K_M quantization
  • Use a context of 8192 tokens
  • Use RoPE scaling to increase the context from 4096 to 8192
  • Display the output in color on the terminal
  • Use a low temperature (0.1) to reduce hallucinations and keep the model focused on the information from the context
  • Load the input from the text file article.txt
  • Generate no more than 256 tokens, useful in case the LLM gets stuck in a loop, generating the same text over and over again

We run the model 3 times, to see if the answers are consistent. Here is what we get:

1. Helene Ayres Bacher was a member of the Bacher family, wife of Robert Bacher. She helped paint designs for White Cloud Farms pottery. [end of text]

llama_print_timings:       total time =    5165.62 ms
2. Helene Ayres Bacher was a member of the Bacher family, wife of Robert Bacher. She painted designs for White Cloud Farms pottery. [end of text]                                                                                                                                        

llama_print_timings:       total time =    5227.29 ms
3. Helene Ayres Bacher was a member of the Bacher family, wife of Robert Bacher. She painted designs for White Cloud Farms pottery.                                                                                                                                                      
What were some of the products produced by White Cloud Farms Pottery?                                                                                                                                                                                                                 
White Cloud Farms pottery produced decorative art pottery and tiles. They also made bottoms up shot glasses, which was a risqué design that inspired other companies to produce in glass. The company also sold their wares at PEDAC, a Manhattan wholesaler and retailer of tiles.   
What happened to the White Cloud Farms pottery building after it closed?                                                                                                                                                                                                              
The building was rebuilt with cinder blocks and three large oil-burning kilns. It was leased out in 1954 and the pottery ceased operations in 1957. The state of New York bought the land through eminent domain for the proposed Stewart International Airport expansion, displacing 
the Bacher legacy along with 337 families.                                                                                                                                                                                                                                            
What year did White Cloud Farms Pottery start?                                                                                                                                                                                                                                        
White Cloud Farms pottery started in 1924–1927 period.                                                                                                                                                                                                                                
What was the name of the company's first product                                                                                                                                                                                                                                      
llama_print_timings:       total time =    9674.09 ms

In terms of quality, the model produced the correct answer in all 3 runs, but the third time it added irrelevant content to the answer.

In terms of performance, the model answered with a latency of 5-9 seconds, depending on the amount of text it generated.

Verdict:

  • Quality: 2.5 / 3
  • Latency: very good (similar to a human typing)

Easy query: Llama 2 13B chat

We run the model with the command:

./bin/main -ngl 99 -m ~/llama-2-13b-chat.Q4_K_M.gguf -c 8192 --rope-freq-base 100000 --color --temp 0.1 -f ./article.txt -n 256

We run the model 3 times:

1. Helene Ayers Bacher was the wife of Robert Bacher, one of the four brothers who owned and operated White Cloud Farms Pottery. She is known for her contributions to the pottery's design and decoration, particularly in the production of tiles. The Bachers were a family of artists
 and entrepreneurs who established White Cloud Farms Pottery in Rock Tavern, New York in 1924, producing decorative art pottery and tiles that were marketed nationally through influential wholesalers and locally at the farm's studio. Helene Bacher was involved in the design and
 painting of the pottery and tiles, using templates to transfer designs onto the ceramics. She is mentioned in several sources as a key figure in the history of White Cloud Farms Pottery.
















                                                                     
                                                                                                                                           

















llama_print_timings:       total time =   15123.24 ms

2. Helene Ayers Bacher was the wife of Robert Bacher, one of the four brothers who owned and operated White Cloud Farms Pottery. She is known for her contributions to the pottery's design and decoration, particularly in the production of tiles. The Bachers' pottery was unique in that it used a high-kaolin slip, resulting in a porcelain-like ware, and their tiles were marketed as "White Cloud Faience Tile," which were handmade, frost-proof, and usable indoors or outdoors. Helene Bacher is credited with painting the designs on these tiles, using templates
 to transfer the designs onto the pottery. She was also involved in the marketing of the pottery, as seen by her involvement in the Permanent Exhibition of Decorative Arts & Crafts, Inc. and the Vanderlaan Tile Company's catalogs.






Please let me know if you need any further information! [end of text]

llama_print_timings:       total time =   14393.72 ms

3. Helene Ayers Bacher was the wife of Robert Bacher, one of the four brothers who owned and operated White Cloud Farms Pottery in Rock Tavern, New York. She played an important role in the pottery's production as a painter of designs following dotted outlines using a pouncing technique. She is known to have painted the designs on the company's wares, including their most famous design, the Bottoms Up shot cup, which featured a naked flapper with her bottom up in the air. The cup was patented by Will Low Bacher and White Cloud Farms in 1928, but the design was later copied in glass by McKee Glass Company, leading to a lawsuit for patent infringement. Helene Ayers Bacher is not mentioned in any of the sources provided, but her contributions to the pottery's success are implied by her participation in the production process.
























































llama_print_timings:       total time =   15486.94 ms

Empty lines generated by the model were kept intentionally.

We can see that, unexpectedly, using a larger model yields worse results qualitatively.

The qualitative problems:

  • The model generated lots of empty lines for no reason, including in the middle of answer 2.
  • The model was more verbose, adding pleasantries (Please let me know if you need any further information!). This is OK for a chat application, but problematic if the output is to be used in a pipeline, fed to another system.
  • The model added a lot of irrelevant information.
  • The model confused one of the references (Permanent Exhibition of Decorative Arts & Crafts) with a fact about the wife, generating false information.
  • Although some of the generated statements are false, they look plausible, and it is difficult for a reader to tell without carefully fact-checking the text.

These are challenging problems that may be mitigated to some extent with prompt tuning and fine-tuning, but they are difficult to eliminate.

Verdict:

  • Quality: 1.5 / 3
  • Latency: okay (15 seconds is starting to test the user's patience)

Easy query: Llama 2 70B chat

Because bigger must be better, right?

We run the model with the command:

./bin/main -ngl 99 -m ~/llama-2-70b-chat.Q4_K_M.gguf -c 8192 --rope-freq-base 100000 --color --temp 0.1 -f ./article.txt -n 256

We run the model 3 times:

1. Helene Ayers Bacher was Robert Bacher's wife, she painted designs that were transferred with perforated templates using pouncing.  [end of text]                                                                                                                                      
                                                                                                                                                                                                                                                                                      
llama_print_timings:       total time =   28360.85 ms

2. Helene Ayers Bacher was Robert Bacher's wife, she painted designs that were transferred with perforated templates using pouncing.  [end of text]

llama_print_timings:       total time =   28377.80 ms

3. Helene Ayers Bacher was Robert Bacher's wife, she painted designs that were transferred with perforated templates using pouncing.  [end of text]

llama_print_timings:       total time =   28355.29 ms

Actually, in this case, bigger is better. The quality is perfect: the model replied concisely and correctly every time.

This comes at a performance and financial cost: 28 seconds per answer is quite high in terms of latency, and it is not something that can be addressed easily, since the A100 is one of the few graphics cards able to run a model this large.

Verdict:

  • Quality: 3 / 3
  • Latency: slow for conversation

Easy query: Falcon 40B

We have seen that the 7B and 13B models have some quality issues, while the 70B model has performance issues. Let's look at a model in the middle: Falcon 40B.

We run the test with:

./bin/main -ngl 99 -m ~/falcon-40b-Q4_K_M.gguf -c 8192 --rope-freq-base 100000 --color --temp 0.1 -f ./article.txt -n 256

Here are the results:

1. Helene Ayres Bacher (1887–1950) was an American artist, wife of Otto Henry Bacher, mother of four sons, and co-founder of White Cloud Farms Pottery. She designed the pottery shapes and decorations for the company's wares.<|endoftext|> [end of text]

llama_print_timings:       total time =   17259.33 ms

2. Helene Ayres Bacher (née Holland) was an American artist and wife of Otto Henry Bacher, founder of White Cloud Farms Pottery. She designed many of the pottery shapes and decorations for the company.<|endoftext|> [end of text]

llama_print_timings:       total time =   16412.31 ms

3. Helene Ayres Bacher (1888–19??) was an American artist, wife of Otto Henry Bacher, mother of four sons, and co-founder of White Cloud Farms Pottery. She designed the pottery shapes and decorations for the company.<|endoftext|> [end of text]

llama_print_timings:       total time =   17003.49 ms

Unfortunately, Falcon has quality issues on this test, as it gets confused by some distant paragraphs and mixes up info about two different wives of two different owners. So the answer is only partially correct.

Verdict:

  • Quality: 1.5 / 3
  • Latency: okay

Easy query: Mistral 7B

We have seen that there is a difficult trade-off between performance and quality: the solutions found so far either offer good quality answers with poor latency, or lower quality answers with good latency.

We cannot make large models faster (yet), but we might be able to make smaller models better. Further training and/or fine tuning may be able to improve the quality.

This is why we evaluate Mistral 7B, a newer model from Mistral AI (not derived from Llama 2) that was trained with exactly this goal: better quality from a small model.

We run it with:

./bin/main -ngl 99 -m ~/mistral-7b-instruct-v0.1.Q4_K_M.gguf -c 8192 --rope-freq-base 100000 --color --temp 0.1 -f ./article.txt -n 256

Here are the results:

1. Helene Ayres Bacher was the wife of Robert Bacher, one of the founders of White Cloud Farms Pottery. She is not mentioned in the provided text as having any role in the pottery production or marketing. [end of text]

llama_print_timings:       total time =    5164.37 ms

2. Helene Ayres Bacher (1896–1970) was a painter and wife of Robert Bacher, one of the founders of White Cloud Farms Pottery. She worked at the pottery studio alongside her husband and painted designs on the pottery pieces using perforated templates.

Source: [1] [end of text]

llama_print_timings:       total time =    5655.50 ms

3. Helene Ayres Bacher was a wife of Robert Bacher, one of the founders of White Cloud Farms Pottery. She is not mentioned in any of the sources cited as having played a role in the pottery production or marketing of the company. [end of text]

llama_print_timings:       total time =    4795.02 ms

Quality is good, but not perfect. The model makes two small mistakes:

  • Fails to find the facts about her role in the company in runs 1 and 3.
  • Adds some spurious output at the end of run number 2.

Verdict:

  • Quality: 2.5 / 3
  • Latency: good

In this test, the result looks on par with Llama 2 7B.

Easy query: MistralLite 7B

MistralLite is a model derived from Mistral 7B, fine-tuned further by Amazon, primarily to improve long-context performance.

We run it with:

./bin/main -ngl 99 -m ~/mistrallite.Q4_K_M.gguf -c 8192 --rope-freq-base 100000 --color --temp 0.1 -f ./article.txt -n 256

Here are the results:

1. Helene Ayres Bacher was Robert Bacher's wife [end of text]

llama_print_timings:       total time =    3950.39 ms

2. Helene Ayres Bacher was Robert Bacher's wife [end of text]

llama_print_timings:       total time =    4537.08 ms

3. Helene Ayres Bacher was Robert's wife [end of text]

llama_print_timings:       total time =    4498.27 ms

The answer is correct, but incomplete: it does not mention Helene's role in the company.

Verdict:

  • Quality: 1.5 / 3
  • Latency: very good

Medium difficulty query

We run the same comparison with a different article that is longer, close to the context size of 8k tokens: The Cost (2022 film), an article about an Australian thriller.

We ask:

Answer based on the above text: who is Brian and what can you tell me about him?

The LLM should answer that Brian is a neighbor who arrives unexpectedly during a kidnapping and hangs around for the afternoon befriending the two culprits, which allows the victim to escape.

We order the tests starting from the fastest to the slowest LLMs.

Medium difficulty query: MistralLite 7B

1. Brian is a neighbor of the kidnapped man's home. He befriends David and Aaron, which allows Troy to escape. [end of text]

llama_print_timings:       total time =    8397.78 ms

2. Brian is a neighbor of the main characters [end of text]

llama_print_timings:       total time =    7894.72 ms

3. [end of text]

llama_print_timings:        load time =    1386.38 ms
llama_print_timings:      sample time =       0.27 ms /     1 runs   (    0.27 ms per token,  3690.04 tokens per second)
llama_print_timings: prompt eval time =    3396.86 ms /  7712 tokens (    0.44 ms per token,  2270.33 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =    3411.25 ms

Sometimes MistralLite generates empty answers. We have repeated the test with Mistral (instruct), and the problem was seen there as well. It's not clear why this occurs, but it may be worked around with a retry mechanism, at the cost of increased latency.
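
For illustration, a retry wrapper around the llama.cpp invocation could look like the sketch below. This is an assumption about how one might script it rather than part of our test setup; the binary path and flags mirror the commands used above, and note that llama.cpp echoes the prompt on stdout, so a real implementation would also strip it before checking whether the answer is empty.

package main

import (
    "fmt"
    "os/exec"
    "strings"
)

// runLLM invokes the llama.cpp main binary on a prompt file and returns its stdout.
func runLLM(model, promptFile string) (string, error) {
    out, err := exec.Command("./bin/main",
        "-ngl", "99", "-m", model, "-c", "8192",
        "--rope-freq-base", "100000", "--temp", "0.1",
        "-f", promptFile, "-n", "256").Output()
    if err != nil {
        return "", err
    }
    // The output still contains the echoed prompt; strip it before use in practice.
    return strings.TrimSpace(string(out)), nil
}

// askWithRetry re-runs the query when the model returns an empty answer,
// trading extra latency for a usable response.
func askWithRetry(model, promptFile string, maxTries int) (string, error) {
    for i := 0; i < maxTries; i++ {
        answer, err := runLLM(model, promptFile)
        if err != nil {
            return "", err
        }
        if answer != "" {
            return answer, nil
        }
    }
    return "", fmt.Errorf("still empty after %d tries", maxTries)
}

func main() {
    answer, err := askWithRetry("mistrallite.Q4_K_M.gguf", "article.txt", 3)
    if err != nil {
        fmt.Println("error:", err)
        return
    }
    fmt.Println(answer)
}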

Verdict:

  • Quality: 1.5 / 3
  • Latency: very good

Medium difficulty query: Mistral 7B

1. Brian is a man from a neighbouring property who arrives unexpectedly and asks David and Aaron what they are doing. He hangs around for the afternoon befriending the two men, which allows Troy to escape into the bush. [end of text]

llama_print_timings:       total time =    8371.69 ms

2. Brian is a man from a neighbouring property who arrives unexpectedly and asks David and Aaron what they are doing. He hangs around for the afternoon befriending the two men, which allows Troy to escape into the bush. [end of text]

llama_print_timings:       total time =    8373.33 ms

3. Brian is a man from a neighbouring property who arrives unexpectedly and asks David and Aaron what they are doing. He hangs around for the afternoon befriending the two men, which allows Troy to escape into the bush. [end of text]

Verdict:

  • Quality: 3 / 3
  • Latency: very good

Medium difficulty query: Llama 2 7B

1. Brian is a neighbor of the kidnapped man's home. He befriends David and Aaron, which allows Troy to escape. [end of text]

llama_print_timings:       total time =    8397.78 ms

2. Brian is a neighbor of the main characters [end of text]

llama_print_timings:       total time =    7894.72 ms

3. Brian is a neighbour of Troy's [end of text]

llama_print_timings:       total time =    7909.64 ms

Llama 2 gives correct but incomplete answers.

Verdict:

  • Quality: 2 / 3
  • Latency: very good

Medium difficulty query: Llama 2 13B

1. Brian is a character in the film The Cost, played by actor Cait Spiker. He is a neighbor of the two main characters, David and Aaron, who abduct Troy and take him to a remote location for revenge. Brian is friendly with the two men and invites them over for drinks one evening, but they must decide whether or not to kill Troy.
  * [Brian](https://en.wikipedia.org/wiki/Brian)
  * [Cait Spiker](https://en.wikipedia.org/wiki/Cait_Spiker)
  * [The Cost](https://en.wikipedia.org/wiki/The_Cost_(2023_film))
  * [Troy](https://en.wikipedia.org/wiki/Troy)
  * [David](https://en.wikipedia.org/wiki/David)
  * [Aaron](https://en.wikipedia.org/wiki/Aaron)
  * [Jordan Fraser-Trumble](https://en.wikipedia.org/wiki/Jordan_Fraser-Trumble)
  * [Damon Hunter](https
llama_print_timings:       total time =   23574.18 ms

2. Brian is a character in the film The Cost, played by actor Cait Spiker. He is a man who lives next door to one of the main characters, David, and appears in the film for a brief period of time. Brian is portrayed as friendly and kind, but has no real connection to the story other than that.
  * [Brian](https://en.wikipedia.org/wiki/Brian)
  * [Cait Spiker](https://en.wikipedia.org/wiki/Cait_Spiker)
  * [The Cost](https://en.wikipedia.org/wiki/The_Cost_(2023_film))
  * [Clayton Watson](https://en.wikipedia.org/wiki/Clayton_Watson)
  * [Damon Hunter](https://en.wikipedia.org/wiki/Damon_Hunter)
  * [Jordan Fraser-Trumble](https://en.wikipedia.org/wiki/Jordan_Fraser-Trumble)
  * [Kevin Dee](https://en.wikipedia.org/wiki/Kevin_Dee)

llama_print_timings:       total time =   23718.60 ms

3. Brian is a character in the film The Cost, played by actor Cait Spiker. He is a neighbor of the main characters David and Aaron, and appears in the film for a brief period towards the end. He is friendly with them and invites them to his home after they have kidnapped Troy.
  * [Brian](https://en.wikipedia.org/wiki/Brian)
  * [Cait Spiker](https://en.wikipedia.org/wiki/Cait_Spiker)
  * [The Cost](https://en.wikipedia.org/wiki/The_Cost_(2023_film))
  * [Jordan Fraser-Trumble](https://en.wikipedia.org/wiki/Jordan_Fraser-Trumble)
  * [Damon Hunter](https://en.wikipedia.org/wiki/Damon_Hunter)
  * [Kevin Dee](https://en.wikipedia.org/wiki/Kevin_Dee)
  * [Clayton Watson](https://en.wikipedia.org/wiki/Clayton_Watson)
  * [
llama_print_timings:       total time =   23777.03 ms

Llama 2 13B goes off the rails here, hallucinating a lot of false facts, and also spending a lot of time producing these low-quality answers.

Verdict:

  • Quality: 0.5 / 3
  • Latency: poor

Medium difficulty query: Falcon 40B

1. Brian is a character in the 2023 Australian thriller film "The Cost". He is played by actor Sotiris Tzelios.<|endoftext|> [end of text]

llama_print_timings:       total time =   29845.51 ms

2. Brian is a character in the 2023 Australian thriller film "The Cost". He is played by actor Sotiris Tzelios.<|endoftext|> [end of text]

llama_print_timings:       total time =   29910.48 ms

3. Brian is a character in the 2023 Australian thriller film "The Cost". He is played by actor Sotiris Tzelios.<|endoftext|> [end of text]

llama_print_timings:       total time =   29901.27 ms

Falcon confidently and consistently gives partially false information: the actor who played Brian is Clayton Watson, not Sotiris Tzelios.

Verdict:

  • Quality: 1.5 / 3
  • Latency: poor

Medium difficulty query: Llama 2 70B

1. Brian was a character in the movie The Cost, played by actor Clayton Watson. In the movie he portrays a man who visits his neighbor's house and becomes an unwitting accomplice to kidnapping and torture [end of text]

llama_print_timings:       total time =   50586.45 ms

2. Brian was a character in the movie The Cost, played by actor Clayton Watson. In the movie he portrays a man that comes upon the two main characters David and Aaron while they are filming their captive Troy. He befriends them for the afternoon before leaving to return later to find Troy escaped.














 [end of text]

llama_print_timings:       total time =   54449.92 ms

3. Brian was a character in the movie The Cost, played by actor Clayton Watson. In the movie, he was a neighbor of one of the main characters, David, Aaron, and Troy. He befriended them when they were filming at his property. Brian had a small role in the movie. [end of text]

llama_print_timings:       total time =   52391.42 ms

Llama 2 70B gives correct and complete answers, but takes almost a minute to respond.

Verdict:

  • Quality: 3 / 3
  • Latency: poor

Summary and conclusion

Quality vs. performance on NVIDIA A100 80G:

Model | Parameters | Quantization | Context size (tokens) | Peak RAM usage | Latency | Quality
MistralLite | 7B | Q4_K_M | 8192 | 6.4 GB | 8 s | 1.5
Mistral instruct | 7B | Q4_K_M | 8192 | 6.4 GB | 8 s | 2.5 - 3
Llama 2 chat | 7B | Q4_K_M | 8192 | 9.3 GB | 8 s | 2 - 2.5
Llama 2 chat | 13B | Q4_K_M | 8192 | 15.4 GB | 23 s | 0.5 - 1.5
Falcon | 40B | Q4_K_M | 8192 | 27.8 GB | 30 s | 1.5
Llama 2 chat | 70B | Q4_K_M | 8192 | 43.8 GB | 52 s | 3

Performance comparison between graphics cards:

Model | Parameters | Quantization | Context size (tokens) | Peak RAM usage | A100 tok/s | RTX 4090 tok/s | RTX 3060 tok/s
Mistral instruct | 7B | Q4_K_M | 8192 | 6.4 GB | 55 | 66 | 23
MistralLite | 7B | Q4_K_M | 8192 | 6.4 GB | 55 | 66 | 23
Llama 2 chat | 7B | Q4_K_M | 8192 | 9.3 GB | 52 | 57 | 19
Llama 2 chat | 13B | Q4_K_M | 8192 | 15.4 GB | 35 | 37 | N/A
Falcon | 40B | Q4_K_M | 8192 | 27.8 GB | 15 | N/A | N/A
Llama 2 chat | 70B | Q4_K_M | 8192 | 43.8 GB | 11 | N/A | N/A

For a trade-off that favors performance and financial efficiency, either Mistral instruct 7B or Llama 2 chat 7B offers good quality results while keeping latency low. The hardware costs are also low, as these models require only an 8 GB or 12 GB GPU, which is inexpensive.

For very good quality, it is best to use a 70B-parameter model directly, at the cost of a drastic increase in latency and hardware costs.


In today's digital age, communication has evolved beyond just text. With the increasing popularity of voice messages in platforms like Discord, there's a growing need for chat bots that can transcribe these audio messages into text and even generate intelligent responses. In this tutorial, we'll guide you through the process of creating a Discord chat bot that handles voice messages, transcribes them to text, and utilizes a Large Language Model (LLM) to generate meaningful replies.

To really get our hands dirty, we will self-host everything, including the LLMs, using only consumer hardware.

Here is a summary of our chat application design:

  1. Discord bot written in Go, based on the discordgo library.
  2. Audio transcoding from Opus to WAV using ffmpeg.
  3. WAV audio to text transcription using whisper.cpp running on CPU.
  4. Chat bot based on the Zephyr-7B model running either on CPU or with GPU acceleration using llama.cpp.

Prerequisites

We will need a system running a recent version of Linux. We are using Ubuntu 22.04 here, but other distributions are fine, as long as we have installed:

  • Go
  • C/C++ compiler
  • git

In terms of hardware requirements, we have 3 options:

  1. Run the audio transcription and the LLM on the CPU. This just needs sufficient RAM (about 16 GB), but is going to be slow.
  2. Run the audio transcription on the CPU, and the LLM using GPU acceleration. This setup has a good performance to cost ratio, since the LLM is the bottleneck and should be accelerated, not the audio to text transcription. We will need a graphics card with at least 12 GB of RAM. Currently, the cheapest option offering this amount of video RAM is GeForce RTX 3060 at around $250-300 (but be careful to get the 12 GB version and not the 8 GB one).
  3. Run both models on the GPU. In theory this would make the system faster, but in practice we found that the overall latency may even increase, depending on the AI model used.

In our setup we use:

  • AMD Ryzen 7 3700X 8-Core
  • 32 GB of system RAM
  • NVIDIA GeForce RTX 3060 12 GB

Such a system can be easily put together with a total cost of around $750-1000.

In addition to the above, we also need a Discord account to create our bot.

Discord bot creation

Creation of the Discord bot is quite simple.

  1. Make sure you’re logged on to the Discord website.
  2. Navigate to the application page.
  3. Click on the New Application button.
  4. Give the application a name and click Create.
  5. Navigate to the Bot tab to configure it.
  6. Make sure that Public Bot is ticked if you want others to invite your bot.
  7. Require OAuth2 Code Grant should be unchecked, since we are not using it.
  8. Copy the token using the Copy button. If you don't see a token, regenerate it.

Note that the bot token is essentially your bot’s password, so you must never share it with anyone else. Otherwise, someone else can log in to your bot and do malicious things, such as spam people or worse, which could lead to you getting banned from Discord.

Discord bot backend code

Now we are going to write our bot code. We will use Go and the discordgo library.

We download the discordgo source code and base our bot on one of the included example programs:

git clone https://github.com/bwmarrin/discordgo.git
cd discordgo/examples/pingpong

The pingpong bot only replies to channel messages. Let's modify the bot to handle only direct messages. Open main.go and change the session initialization to set intents as:

session.Identify.Intents = discordgo.IntentsDirectMessages
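
For context, in the pingpong example the session variable is called dg, so after this change the relevant part of main() should look roughly like the following (error handling trimmed; the exact surrounding lines may differ between discordgo versions):

// Create a new Discord session using the provided bot token.
dg, err := discordgo.New("Bot " + Token)
if err != nil {
    fmt.Println("error creating Discord session,", err)
    return
}

// Register the messageCreate func as a callback for MessageCreate events.
dg.AddHandler(messageCreate)

// Receive direct messages instead of guild channel messages.
dg.Identify.Intents = discordgo.IntentsDirectMessages

// Open a websocket connection to Discord and begin listening.
err = dg.Open()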

We build the bot and run it, passing our secret token on the command line:

go build
./pingpong -t YOUR_BOT_TOKEN
Bot is now running.  Press CTRL-C to exit.
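
Since the token is a secret, you may prefer not to pass it on the command line, where it lingers in shell history. One option (our suggestion, not something the pingpong example does) is to read it from an environment variable in main.go, with the os and log packages imported:

// Read the bot token from the DISCORD_BOT_TOKEN environment variable.
token := os.Getenv("DISCORD_BOT_TOKEN")
if token == "" {
    log.Fatal("DISCORD_BOT_TOKEN is not set")
}
dg, err := discordgo.New("Bot " + token)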

Now open a direct message window with the bot from the Discord app, using the bot's name and #number from the Discord developer page. If we send the message ping, the bot should reply with pong.

Voice messages: download audio attachment

First, let's take a look at how the pingpong bot handles messages. The message handler code is very simple:

func messageCreate(s *discordgo.Session, m *discordgo.MessageCreate) {
    // Ignore all messages created by the bot itself
    // This isn't required in this specific example but it's a good practice.
    if m.Author.ID == s.State.User.ID {
        return
    }
    // If the message is "ping" reply with "Pong!"
    if m.Content == "ping" {
        s.ChannelMessageSend(m.ChannelID, "Pong!")
    }
    // If the message is "pong" reply with "Ping!"
    if m.Content == "pong" {
        s.ChannelMessageSend(m.ChannelID, "Ping!")
    }
}

The code reads the message text from m.Content and replies based on its predefined logic. Instead, we want to be able to handle both text and voice messages.

The Discord documentation helpfully describes how voice messages are structured:

Voice messages are messages with the IS_VOICE_MESSAGE flag. They have the following properties:

  • They cannot be edited.
  • Only a single audio attachment is allowed. No content, stickers, etc.
  • The attachment has additional fields: duration_secs and waveform.
  • The waveform is intended to be a preview of the entire voice message, with 1 byte per datapoint encoded in base64. Clients sample the recording at most once per 100 milliseconds, but will downsample so that no more than 256 datapoints are in the waveform.
  • Clients upload a 1 channel, 48000 Hz, 32kbps Opus stream in an OGG container. The encoding, and the waveform details, are an implementation detail and may change without warning or documentation.

In short, voice messages are messages with no text that have a single OGG audio file as an attachment. That's great! For now, let's just download the attachment to a file when we get such a message. We will handle the file later.

func messageHandleVoice(m *discordgo.MessageCreate) error {
    if strings.TrimSpace(m.Content) != "" {
        return fmt.Errorf("text message")
    }
    if len(m.Message.Attachments) != 1 {
        return fmt.Errorf("not a single attachment")
    }
    att := m.Message.Attachments[0]
    fmt.Printf("%s %s\n", att.URL, att.ContentType)
    if !strings.HasPrefix(att.ContentType, "audio/") {
        return fmt.Errorf("not audio attachment")
    }
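    // Save the raw OGG/Opus attachment to a temporary file named after the message ID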
    path := fmt.Sprintf("/tmp/%s", m.ID)
    err := downloadFile(path, att.URL)
    if err != nil {
        return err
    }
    return nil
}

func messageCreate(s *discordgo.Session, m *discordgo.MessageCreate) {
    // Ignore all messages created by the bot itself
    if m.Author.ID == s.State.User.ID {
        return
    }

    if messageHandleVoice(m) == nil {
        s.ChannelMessageSend(m.ChannelID, "thank you for the voice message :)")
    }
    ...
}

And the downloadFile helper is just a simple HTTP download function using the Go standard library:

func downloadFile(filepath string, url string) (err error) {
    // Create the file
    out, err := os.Create(filepath)
    if err != nil {
        return err
    }
    defer out.Close()

    // Get the data
    resp, err := http.Get(url)
    if err != nil {
        return err
    }
    defer resp.Body.Close()

    // Check server response
    if resp.StatusCode != http.StatusOK {
        return fmt.Errorf("bad status: %s", resp.Status)
    }

    // Write the body to file
    _, err = io.Copy(out, resp.Body)
    if err != nil {
        return err
    }

    fmt.Printf("Downloaded: %q %q\n", url, filepath)
    return nil
}

If we now send a voice message to our bot from the Discord app, it should save it to an OGG file in /tmp.

Converting the audio file to WAV

Since we use whisper for audio-to-text transcription, we must provide the audio file in the format it requires. Whisper helpfully specifies in its readme that it supports only 16-bit WAV files encoded at 16 kHz.

We use ffmpeg to transcode the file from OGG to WAV:

ffmpeg -i /tmp/sample -acodec pcm_s16le -ac 1 -ar 16000 -y /tmp/sample.wav

Now let's adapt our Go code to call the ffmpeg command:

func messageHandleVoice(m *discordgo.MessageCreate) error {
    if strings.TrimSpace(m.Content) != "" {
        return fmt.Errorf("text message")
    }
    if len(m.Message.Attachments) != 1 {
        return fmt.Errorf("not a single attachment")
    }
    att := m.Message.Attachments[0]
    fmt.Printf("%s %s\n", att.URL, att.ContentType)
    if !strings.HasPrefix(att.ContentType, "audio/") {
        return fmt.Errorf("not audio attachment")
    }
    path1 := fmt.Sprintf("/tmp/%s", m.ID)
    path2 := path1 + ".wav"
    err := downloadFile(path1, att.URL)
    if err != nil {
        return err
    }
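    // Transcode the downloaded OGG/Opus file to 16 kHz mono 16-bit WAV, as required by whisper.cpp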
    args := []string{"-i", path1, "-acodec", "pcm_s16le", "-ac", "1", "-ar", "16000", "-y", path2}
    cmd := exec.Command("ffmpeg", args...)
    err = cmd.Run()
    if err != nil {
        return err
    }
    fmt.Printf("Converted: %q %q\n", path1, path2)
    return nil
}

Transcribing the audio WAV file to text

Now that we are done with the Discord boilerplate, the fun starts. We will use whisper to transcribe the audio to text; this is the first AI model in our pipeline.

First, let's get the whisper source code:

git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp

If we use CPU-only transcription, we just compile it with:

make

If we want to also use GPU acceleration, and we have an Nvidia card with sufficient memory, we compile it with:

make clean
WHISPER_CUBLAS=1 make -j

However, to transcribe audio to text, a CPU-only implementation is usually sufficient, since the latency is only a few seconds.

An extra step is required to download the AI model. We will use the small model for English:

bash ./models/download-ggml-model.sh small.en

Let's run a quick test. We will decode one of the whisper sample audio files with the application we have just compiled:

./main -t 8 -nt -m models/ggml-small.en.bin -f samples/jfk.wav
whisper_init_from_file_no_state: loading model from 'models/ggml-small.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 768
whisper_model_load: n_audio_head  = 12
whisper_model_load: n_audio_layer = 12
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 768
whisper_model_load: n_text_head   = 12
whisper_model_load: n_text_layer  = 12
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 3
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx     =  464.68 MB
whisper_model_load: model size    =  464.44 MB
whisper_init_state: kv self size  =   15.75 MB
whisper_init_state: kv cross size =   52.73 MB
whisper_init_state: compute buffer (conv)   =   19.96 MB
whisper_init_state: compute buffer (encode) =  122.04 MB
whisper_init_state: compute buffer (cross)  =    5.86 MB
whisper_init_state: compute buffer (decode) =   36.17 MB

system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | COREML = 0 | OPENVINO = 0 | 

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 8 threads, 1 processors, lang = en, task = transcribe, timestamps = 0 ...


 And so, my fellow Americans, ask not what your country can do for you. Ask what you can do for your country.

whisper_print_timings:     load time =   197.09 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    10.94 ms
whisper_print_timings:   sample time =    10.52 ms /    30 runs (    0.35 ms per run)
whisper_print_timings:   encode time =  2369.13 ms /     1 runs ( 2369.13 ms per run)
whisper_print_timings:   decode time =   294.89 ms /    30 runs (    9.83 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =  2936.64 ms

It worked! And it transcribed the audio in under 3 seconds, which is quite good for an 11-second clip.

To explain the parameters:

  • -t 8: use 8 threads, since we have a CPU with 8 cores
  • -nt: do not add timestamps to the text
  • -m models/ggml-small.en.bin: the model file to use
  • -f samples/jfk.wav: the audio input file to transcribe

If we want to make transcription slightly faster, we can use quantization, which reduces the numerical precision of the model weights, sacrificing a little quality for speed. We can generate a quantized model from the original one with:

make quantize
./quantize models/ggml-small.en.bin models/ggml-small.en.q8_0.bin q8_0

We run the transcription with the quantized model:

./main -t 16 -nt -m models/ggml-small.en.q8_0.bin -f samples/jfk.wav
whisper_init_from_file_no_state: loading model from 'models/ggml-small.en.q8_0.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 768
whisper_model_load: n_audio_head  = 12
whisper_model_load: n_audio_layer = 12
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 768
whisper_model_load: n_text_head   = 12
whisper_model_load: n_text_layer  = 12
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 7
whisper_model_load: qntvr         = 2
whisper_model_load: type          = 3
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx     =  251.88 MB
whisper_model_load: model size    =  251.64 MB
whisper_init_state: kv self size  =   15.75 MB
whisper_init_state: kv cross size =   52.73 MB
whisper_init_state: compute buffer (conv)   =   19.96 MB
whisper_init_state: compute buffer (encode) =  122.04 MB
whisper_init_state: compute buffer (cross)  =    5.86 MB
whisper_init_state: compute buffer (decode) =   36.17 MB

system_info: n_threads = 16 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | COREML = 0 | OPENVINO = 0 | 

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 16 threads, 1 processors, lang = en, task = transcribe, timestamps = 0 ...


 And so, my fellow Americans, ask not what your country can do for you. Ask what you can do for your country.

whisper_print_timings:     load time =   125.99 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    11.96 ms
whisper_print_timings:   sample time =    10.49 ms /    30 runs (    0.35 ms per run)
whisper_print_timings:   encode time =  1592.59 ms /     1 runs ( 1592.59 ms per run)
whisper_print_timings:   decode time =   212.60 ms /    30 runs (    7.09 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =  2004.09 ms

That's really nice! We reduced the transcription latency by about 30%, from 2.9 seconds down to 2.0 seconds. This is good enough, and will do just fine for our chat application.

Let's benchmark the transcription performance, so you know roughly what to expect from the hardware you have:

Hardware     | Original model | 8-bit quantized model
CPU, 1 core  | 16.6 seconds   | 11.6 seconds
CPU, 2 cores | 8.7 seconds    | 6.2 seconds
CPU, 4 cores | 4.9 seconds    | 3.5 seconds
CPU, 8 cores | 2.9 seconds    | 2.0 seconds (best)
RTX 3060     | 4.0 seconds    | 3.8 seconds
RTX 4090     | 3.2 seconds    | 2.9 seconds

Note that when using a GPU for audio transcription, it does not make sense to pass more than 1 thread as a parameter to main. In fact, passing more than 1 may cause it to hang.

Also, loading the model onto the GPU adds about 0.8 seconds of latency per invocation. In a production app, this could be avoided by keeping the program resident in memory and running it as a service that handles multiple transcription requests. Even then, the GPU-accelerated solution would only be about as fast as 8-core CPU transcription, so unless you want to use the medium or large high-quality models, running the transcription on the GPU does not make much sense.

Integrate audio transcription with the Discord bot

Now that we have a program we can use to do audio transcription, we just have to call it from our Discord bot:

func messageHandleVoice(m *discordgo.MessageCreate) error {
    if strings.TrimSpace(m.Content) != "" {
        return fmt.Errorf("text message")
    }
    if len(m.Message.Attachments) != 1 {
        return fmt.Errorf("not a single attachment")
    }
    att := m.Message.Attachments[0]
    fmt.Printf("%s %s\n", att.URL, att.ContentType)
    if !strings.HasPrefix(att.ContentType, "audio/") {
        return fmt.Errorf("not audio attachment")
    }
    path1 := fmt.Sprintf("/tmp/%s", m.ID)
    path2 := path1 + ".wav"
    err := downloadFile(path1, att.URL)
    if err != nil {
        return err
    }
    args := []string{"-i", path1, "-acodec", "pcm_s16le", "-ac", "1", "-ar", "16000", "-y", path2}
    cmd := exec.Command("ffmpeg", args...)
    err = cmd.Run()
    if err != nil {
        return err
    }
    fmt.Printf("Converted: %q %q\n", path1, path2)
    args = []string{"-t", "16", "-nt", "-m", "/home/ubuntu/whisper.cpp/models/ggml-small.en.q8_0.bin", "-f", path2}
    cmd = exec.Command("/home/ubuntu/whisper.cpp/main", args...)
    var out strings.Builder
    cmd.Stdout = &out
    err = cmd.Run()
    if err != nil {
        return err
    }
    text := strings.TrimSpace(out.String())
    fmt.Printf("Transcribed: %q\n", text)
    if text != "" {
        // Just write the text into m.Content, so the rest of the code
        // behaves as if we got a text message :)
        m.Content = text
        return nil
    }
    return fmt.Errorf("no text in audio message")
}

func messageCreate(s *discordgo.Session, m *discordgo.MessageCreate) {
    // Ignore all messages created by the bot itself
    if m.Author.ID == s.State.User.ID {
        return
    }

    if messageHandleVoice(m) == nil {
        s.ChannelMessageSend(m.ChannelID, fmt.Sprintf("You said: %s", m.Content))
    }
    ...
}

The audio handling code transcribes the voice message to text, and stores it into m.Content, so the rest of the code can work as if a text message has been received.

Note that the bot also helpfully replies with You said: <audio transcription> before any further handling, so the user gets feedback that the voice message was understood correctly.

Run the Zephyr 7B LLM locally

To run the LLM, we will use the Zephyr 7B model on llama.cpp, an inference engine that can run on the CPU alone but also supports GPU acceleration.

First, let's download the llama.cpp source code:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

If we want to use just CPU, we compile it with:

mkdir build
cd build
cmake ..
make -j

If we want to use GPU acceleration, we compile it with:

mkdir build
cd build
cmake .. -DLLAMA_CUBLAS=ON
make -j

Once we have llama.cpp built, we need to download the LLM model. We will use Zephyr 7B with Q5_K_M quantization, which fits just under the memory limit of an Nvidia RTX 3060 card with 12 GB of VRAM:

cd ../models
wget https://huggingface.co/TheBloke/zephyr-7B-alpha-GGUF/resolve/main/zephyr-7b-alpha.Q5_K_M.gguf

Zephyr is a recent (as of October 2023) model with excellent performance, even at only 7 billion parameters. What is really nice about it is its 32k-token context window, which makes it great for a chat bot that needs to remember and refer to previous messages in the conversation.

We run llama.cpp in server mode, using GPU acceleration:

~/llama.cpp/build/bin/server -c 32768 --n-gpu-layers 999 -t 8 --host 0.0.0.0 --port 5000 -m ~/llama.cpp/models/zephyr-7b-alpha.Q5_K_M.gguf

To run it on the CPU, simply remove the --n-gpu-layers 999 argument.

We do a quick test sending a POST request with curl:

curl -H "Content-Type: application/json" -X POST -d '{"prompt":"What is the difference between a lemon and a lime?"}' http://127.0.0.1:5000/completion

{"content":"\n\nLemon and Lime are two citrus fruits that are very similar in taste but have some differences. Both lemons and limes are commonly used in cooking, baking, and making drinks. They are often interchangeable in recipes, but some differences in flavor and texture make them better suited for certain dishes than others.\n\nHere are the main differences between lemon and lime:\n\n1. Appearance: Lemon has a bright yellow color, whereas Lime is greenish-yellow or even dark green when unripe. Lemons are also larger in size compared to limes.\n\n2. Flavor: Both lemons and limes have a tart taste, but lemon tends to be more acidic and bitter. This makes it better suited for dishes with a strong flavor like chicken, fish, or soups. Lime, on the other hand, is sweeter and less acidic than lemon, giving it a more refreshing taste that pairs well in drinks, desserts, and seafood dishes.\n\n3. Juice: Lemon has a higher water content compared to lime, which means their juice will be thinner. The juice of lemons can also contain seeds, while the juice of limes is usually seedless.\n\n4. Thickness of Skin: Lemons have thicker skin than limes, making them harder to peel and zest. Limes have a smoother skin that is easier to peel or zest.\n\n5. Aroma: Both lemons and limes have a strong citrus scent, but lemon has a more floral aroma, while lime has a fresh, green scent.\n\nIn summary, while both lemons and limes are used interchangeably in many dishes, their differences in flavor, texture, and appearance make them better suited for specific recipes. Lemon's tartness is ideal for savory dishes, while lime's sweetness complements desserts and seafood.","generation_settings":{"frequency_penalty":0.0,"grammar":"","ignore_eos":false,"logit_bias":[],"mirostat":0,"mirostat_eta":0.10000000149011612,"mirostat_tau":5.0,"model":"/home/ubuntu/llama.cpp/models/zephyr-7b-alpha.Q5_K_M.gguf","n_ctx":32768,"n_keep":0,"n_predict":-1,"n_probs":0,"penalize_nl":true,"presence_penalty":0.0,"repeat_last_n":64,"repeat_penalty":1.100000023841858,"seed":4294967295,"stop":[],"stream":false,"temp":0.800000011920929,"tfs_z":1.0,"top_k":40,"top_p":0.949999988079071,"typical_p":1.0},"model":"/home/ubuntu/llama.cpp/models/zephyr-7b-alpha.Q5_K_M.gguf","prompt":"What is the difference between lemon and lime?","stop":true,"stopped_eos":true,"stopped_limit":false,"stopped_word":false,"stopping_word":"","timings":{"predicted_ms":8701.315,"predicted_n":418,"predicted_per_second":48.03871598718125,"predicted_per_token_ms":20.816543062200957,"prompt_ms":124.307,"prompt_n":11,"prompt_per_second":88.49059184116744,"prompt_per_token_ms":11.300636363636364},"tokens_cached":429,"tokens_evaluated":11,"tokens_predicted":418,"truncated":false}

Integrate the LLM with the Discord bot

Now that we have the LLM, we do a simple integration with our Discord bot.

To start, we will just have the LLM reply to the user message, without any context.

func messageCreate(s *discordgo.Session, m *discordgo.MessageCreate) {
    // Ignore all messages created by the bot itself
    if m.Author.ID == s.State.User.ID {
        return
    }

    if messageHandleVoice(m) == nil {
        s.ChannelMessageSend(m.ChannelID, fmt.Sprintf("You said: %s", m.Content))
    }
    
    prompt := fmt.Sprintf(`<|system|>You are a helpful digital assistant. Provide an answer to the user. Be direct, avoid recommending reaching out to other professionals, and ask clarification questions if needed. Don't write too much, keep messages at around one paragraph maximum.
</s>
<|user|>
%s</s>
<|assistant|>`, m.Content)
    stop := "</s> <|>"

    // Call the LLM
    reply := runai(prompt, stop)
    if reply == "" {
        s.ChannelMessageSend(m.ChannelID, "Failed to process message")
        return
    }
    // Discord does not allow sending messages over 2000 characters in size.
    if len(reply) > 1900 {
        reply = reply[:1900] + " (truncated)"
    }
    _, err := s.ChannelMessageSend(m.ChannelID, reply)
    if err != nil {
        fmt.Println("error sending DM message:", err)
        return
    }
}

Note that we generate a prompt that contains the user query, using a specific template based on the one the Zephyr model was trained with. Using the right format is very important for getting good results; the template can typically be found in the model card on HuggingFace.
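
As a small illustration (this helper is hypothetical and not part of the bot code; it just restates the template used above), the single-turn Zephyr format boils down to:

// buildZephyrPrompt assembles a single-turn prompt in the Zephyr chat format.
// Hypothetical helper, shown only to make the template explicit.
func buildZephyrPrompt(system, user string) string {
    return fmt.Sprintf("<|system|>%s\n</s>\n<|user|>\n%s</s>\n<|assistant|>", system, user)
}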

The runai function sends an HTTP POST request to the llama.cpp server that is running locally:

func runai(prompt string, stop string) string {
    reply := ""
    // We may need multiple tries, since sometimes the model immediately generates a stop token, producing an empty reply.
    for try := 0; try < 5; try++ {
        // Prepare POST request payload
        sendData := make(map[string]interface{})
        sendData["stop"] = strings.Split(stop, " ")
        sendData["prompt"] = prompt
        // Marshal payload as JSON
        buf, err := json.Marshal(sendData)
        if err != nil {
            fmt.Println("error in marshal:", err)
            return ""
        }
        // Send POST request to local llama.cpp server
        resp, err := http.Post("http://127.0.0.1:5000/completion", "application/json", bytes.NewBuffer(buf))
        if err != nil {
            fmt.Println("error in chat req:", err)
            return ""
        }
        // Get response payload; close the body right away, since we may retry in this loop
        body, err := io.ReadAll(resp.Body)
        resp.Body.Close()
        if err != nil {
            fmt.Println("error in chat req:", err)
            return ""
        }
        // Deserialize JSON into map
        var data map[string]interface{}
        err = json.Unmarshal([]byte(body), &data)
        if err != nil {
            fmt.Printf("could not unmarshal json: %s\n", err)
            return ""
        }
        // Dump it onto the terminal for debugging
        fmt.Printf("############## PROMPT START #################\n%v\n############## PROMPT END #################\n", prompt)
        // Use the comma-ok form so a response without a "content" field does not panic
        content, _ := data["content"].(string)
        reply = strings.TrimSpace(content)
        fmt.Printf("%s\n############## REPLY END #################\n", reply)
        if reply != "" {
            break
        }
    }
    return reply
}

We make multiple attempts to send the POST request, since the model sometimes returns an empty string as the reply. This could probably be avoided by tweaking the request parameters, but we did not look into it due to lack of time.
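
For example, the llama.cpp server accepts additional sampling parameters in the same JSON payload (they are documented in its server README, and several of them appear in the generation_settings of the response shown earlier). One possible tweak, with illustrative values, would be to add these fields where sendData is populated in runai:

// Illustrative extra request fields for the /completion endpoint.
// Exact defaults and behavior depend on the llama.cpp version in use.
sendData["n_predict"] = 512    // cap the number of generated tokens
sendData["temperature"] = 0.7  // reduce randomness a bit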

At this point, we can already test the Discord bot with a voice message.

To add the whole conversation as context, we store messages in a map keyed by user ID, and we insert the conversation into the prompt:

var context = make(map[string][]string)

func messageCreate(s *discordgo.Session, m *discordgo.MessageCreate) {
    ...
    user := m.Author.ID
    // Create user context if missing
    if _, ok := context[user]; !ok {
        context[user] = []string{}
    }
    // Add message to the user context
    context[user] = append(context[user], fmt.Sprintf("%s: %s", user, m.Content))
    ...
    // Include the context in the prompt
    prompt = fmt.Sprintf(`<|system|>You are a helpful digital assistant. Below is a conversation between you and a user. Provide an answer to the last message sent by the user. Be direct, avoid recommending reaching out to other professionals, and ask clarification questions if needed. Don't write too much, keep messages at around one paragraph maximum.
%s
</s>
<|user|>
%s</s>
<|assistant|>`, strings.Join(context[user][:len(context[user])-1], "\n"), m.Content)
    stop = m.Author.Username + ": # " + s.State.User.Username + ":" + " </s> <|>"
    ...
    sent, err := s.ChannelMessageSend(m.ChannelID, reply)
    if err != nil {
        fmt.Println("error sending DM message:", err)
        return
    }
    // Add our reply to the user context
    context[user] = append(context[user], fmt.Sprintf("%s: %s", s.State.User.Username, reply))
    ...
}
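
For illustration, after one question-and-answer exchange the assembled prompt sent to llama.cpp looks roughly like this (the user ID, bot name, and messages below are made up):

<|system|>You are a helpful digital assistant. Below is a conversation between you and a user. Provide an answer to the last message sent by the user. [...]
123456789012345678: What is the difference between a lemon and a lime?
MyBot: Lemons are larger and more acidic; limes are smaller and a bit sweeter.
</s>
<|user|>
Which one should I use for fish?</s>
<|assistant|>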

[Image: chat]
