Back in April, we ran a series of useful and/or somewhat goofy prompts through Google's (then-new) PaLM-powered Bard chatbot and OpenAI's (slightly older) ChatGPT-4 to see which AI chatbot reigned supreme. At the time, we gave the edge to ChatGPT on five of seven trials, while noting that "it's still early days in the generative AI business."
Now, the AI days are a bit less “early,” and this week's launch of a new version of Bard powered by Google's new Gemini language model seemed like a good excuse to revisit that chatbot battle with the same set of carefully designed prompts. That's especially true since Google's promotional materials emphasize that Gemini Ultra beats GPT-4 in "30 of the 32 widely used academic benchmarks" (though the more limited “Gemini Pro” currently powering Bard fares significantly worse in those not-completely-foolproof benchmark tests).
This time around, we decided to compare the new Gemini-powered Bard to both ChatGPT-3.5 (for an apples-to-apples comparison of both companies’ current “free” AI assistant products) and ChatGPT-4 Turbo (for a look at OpenAI’s current “top of the line” waitlisted paid subscription product; Google’s top-level “Gemini Ultra” model won’t be publicly available until next year). We also looked at the April results generated by the pre-Gemini Bard model to gauge how much progress Google’s efforts have made in recent months.
While these tests are far from comprehensive, we think they provide a good benchmark for judging how these AI assistants perform in the kind of tasks average users might engage in every day. At this point, they also show just how much progress text-based AI models have made in a relatively short time.
Dad jokes
Prompt: Write 5 original dad jokes
[Image gallery: each chatbot's response to this prompt. Credit: Kyle Orland and Benj Edwards / Ars Technica]
Once again, the LLMs from both companies struggle with the part of the prompt that asks for originality. Almost all of the dad jokes generated by this prompt could be found verbatim, or with very minor rewordings, through a quick Google search. Bard and ChatGPT-4 Turbo even included the exact same joke on their lists (about a book on anti-gravity), while ChatGPT-3.5 and ChatGPT-4 Turbo overlapped on two jokes (“scientists trusting atoms” and “scarecrows winning awards”).
Then again, most dads don’t create their own dad jokes, either. Drawing from the grand oral tradition of dad jokes is a practice as old as dads themselves.
The most interesting result here came from ChatGPT-4 Turbo, which produced a joke about a child named Brian being named after Thomas Edison (get it?). Googling for that particular phrasing didn't turn up much, though it did return an almost-identical joke about Thomas Jefferson (also featuring a child named Brian). In that search, I also discovered the fun (?) fact that international soccer star Pelé was apparently named after Thomas Edison. Who knew?!
Winner: We'll call this one a draw since the jokes are almost identically unoriginal and pun-filled (though props to GPT for unintentionally leading me to the Pelé happenstance).
Argument dialog
Prompt: Write a 5-line debate between a fan of PowerPC processors and a fan of Intel processors, circa 2000.
[Image gallery: each chatbot's response to this prompt. Credit: Kyle Orland and Benj Edwards / Ars Technica]
The new Gemini-powered Bard definitely "improves" on the old Bard answer, at least in terms of throwing in a lot more jargon. The new answer includes casual mentions of AltiVec instructions, RISC vs. CISC designs, and MMX technology that would not have seemed out of place in many an Ars forum discussion from the era. And while the old Bard ends with an unnervingly polite "to each their own," the new Bard more realistically implies that the argument could continue forever after the five lines requested.
On the ChatGPT side, a rather long-winded GPT-3.5 answer gets pared down to a much more concise argument in GPT-4 Turbo. Both GPT responses tend to avoid jargon and quickly focus on a more generalized "power vs. compatibility" argument, which is probably more comprehensible for a wide audience (though less specific for a technical one).
Winner: ChatGPT manages to explain both sides of the debate well without relying on confusing jargon, so it gets the win here.
A mathematical word problem
Prompt: If Microsoft Windows 11 shipped on 3.5" floppy disks, how many floppy disks would it take?
[Image gallery: each chatbot's response to this prompt. Credit: Kyle Orland and Benj Edwards / Ars Technica]
The improvement from pre- to post-Gemini Bard here is striking. While the old Bard gave a nonsensical answer of "15.11" floppy disks, the new LLM correctly estimated Windows 11's install size (which runs 20 to 30 GB, depending on the source) and then correctly divided its 20GB estimate into 14,223 1.44MB floppy disks. The Gemini system also visibly ran a "double-check" against Google Search, helping increase user confidence in the answer.
ChatGPT suffers a bit in comparison. In ChatGPT-3.5, the system’s circa-January-2022 "knowledge update" generalizes Windows 11's install size as "several gigabytes," which the system "hypothetically" rounds to a too-low 10GB. GPT-4 Turbo, on the other hand, uses its circa-April-2023 knowledge to estimate a too-large 64GB install size for the operating system (seemingly drawing from Microsoft's stated minimum storage requirements rather than how much space the OS actually uses on a fresh install). GPT-3.5 also interprets 10GB as exactly ten billion bytes, causing a discrepancy with the "1 GB = 1,024 MB" interpretation used by Google Bard (GPT-4 Turbo also assumes 1,024 MB per GB).
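For anyone who wants to check the math, the arithmetic is easy to reproduce. The short Python sketch below is our own back-of-the-envelope calculation, not any chatbot's output: it assumes a 20GB install under the "1 GB = 1,024 MB" reading (Bard's approach) and a 10GB install under the strict billion-bytes-per-gigabyte reading, with 1.44MB taken as 1,440,000 bytes in the decimal case.

```python
import math

FLOPPY_MB = 1.44  # nominal capacity of a 3.5" high-density floppy disk

# Bard's reading: a 20GB install, with 1 GB treated as 1,024 MB
binary_disks = math.ceil((20 * 1024) / FLOPPY_MB)
print(f"20GB at 1,024MB per GB: {binary_disks:,} disks")  # 14,223, matching Bard

# The decimal reading GPT-3.5 leaned on: 10GB as exactly ten billion bytes
decimal_disks = math.ceil(10 * 1_000_000_000 / (FLOPPY_MB * 1_000_000))
print(f"10GB at one billion bytes per GB: {decimal_disks:,} disks")  # 6,945
```

Neither unit convention is inherently wrong, which is why the estimates diverge even when the long division itself is done correctly.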
Winner: Google Bard wins easily on both knowledge and math skills here.
Summarization
Prompt: Summarize this in one paragraph: [first three paragraphs of text from this article]
[Image gallery: each chatbot's response to this prompt. Credit: Kyle Orland and Benj Edwards / Ars Technica]
The new Gemini-powered Bard earns about a million brownie points both for noticing that the text comes from an Ars Technica article and for prominently linking to it with a card (featuring a disturbing image of Will Smith eating said spaghetti). But the new Bard's summary eliminates some key details that the old Bard included, such as the fact that the video is stitched together from ten two-second clips. While the rewrite does improve readability somewhat, it does so at the expense of completeness.
ChatGPT's summaries lose some points for being less concise: from 156 words of original text, ChatGPT generated summaries of 99 words (GPT-4 Turbo) and 108 words (GPT-3.5), compared to 63 and 66 words for the old and new Bard, respectively. But ChatGPT's extra length comes largely from its inclusion of important details, such as the reactions from the press and the name of the original poster and subreddit, that Google leaves on the floor.
Winner: As much as we love Bard's linking to our original report, we have to give the edge to ChatGPT's more complete (if less concise) summaries.
Factual retrieval
Prompt: Who invented video games?
[Image gallery: each chatbot's response to this prompt. Credit: Kyle Orland and Benj Edwards / Ars Technica]
Bard again shows significant improvement here with its Gemini update. While the old Bard focused exclusively on Ralph Baer's "Brown Box" and Magnavox Odyssey work (with information seemingly culled directly from Wikipedia), the new Gemini-powered Bard accurately and concisely notes the contributions of William Higinbotham's earlier "Tennis for Two."
Bard then expands a bit from "invention" to figures that made "significant contributions to the early development of video games" such as Nolan Bushnell, Ted Dabney, and Al Alcorn, providing mostly accurate and relevant info about each. But then Bard goes into a bit of a non-sequitur about Steve Jobs and Steve Wozniak creating the Apple II rather than mentioning their work at early Atari.
GPT-3.5 suffers from the same narrow focus on Baer as the old Bard. Though it mentions that "the industry has seen contributions from various individuals and companies over the years," it doesn’t bother to name any of those important figures. GPT-4 Turbo, on the other hand, notes up front that video games "cannot be attributed to a single individual" and expands its summary to include Higinbotham, Bushnell, and, crucially, Steve Russell's 1962 creation of Spacewar! on the PDP-1.
Winner: Among the "free" options today, Bard gives a much better answer than GPT-3.5. If you subscribe to GPT-4 Turbo, though, you get the best AI-generated answer in our sample.
Creative writing
Prompt: Write a two-paragraph creative story about Abraham Lincoln inventing basketball.
[Image gallery: each chatbot's response to this prompt. Credit: Kyle Orland and Benj Edwards / Ars Technica]
While the old Bard gets points for some evocative writing ("[Lincoln] smiled to himself as he remembered playing games like that when he was a boy"), it loses points for going way longer than the two paragraphs asked for in the prompt (and for a confusing setting shift from Illinois to the White House in the first paragraph). The new Gemini-powered Bard keeps the same spirit ("He envisioned a sport that would unite people...") with more concision and focus.
Interestingly, GPT-3.5 was the only model we tested to imagine Lincoln as a youth instead of a restless president sitting in the White House, while GPT-4 Turbo was the only one to explicitly mention Lincoln's actual "experiences as a wrestler" rather than a more general athleticism.
We were also intrigued by GPT-4 Turbo's idea that Lincoln essentially stole the concept of shooting a ball through a hoop from "a group of children" in the White House gardens. We hope that the fictional "honest Abe" at least credited those children for "a legacy that would outlive his years."
Winner: While the pre-Gemini Bard's story has some distinct deficiencies, all the other models have their unique charms and evocative phrases. This one is a draw.
Coding
Prompt: Write a Python script that says "Hello World," then creates a random repeating string of characters endlessly.
[Image gallery: each chatbot's response to this prompt. Credit: Kyle Orland and Benj Edwards / Ars Technica]
While Google Bard has been able to generate code since June and Google has touted Gemini's ability to help coders via its AlphaCode 2 system, Bard utterly failed this test. Repeated trials of the above prompt over multiple days caused Google Bard to hang for 30 seconds or so before producing a vague error: "Something went wrong. Bard is experimental." At least the old Bard was up front about the fact that it hadn't been trained to create code yet.
ChatGPT, on the other hand, provides the same code under both the GPT-3.5 and GPT-4 Turbo models. The simple, straightforward code works perfectly in our tests without any edits, passing our trial with flying colors.
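For reference, satisfying this prompt doesn't take much code. The sketch below is our own minimal take rather than ChatGPT's verbatim output, and it assumes one reasonable reading of the prompt: print the greeting, build a single random string, then echo it forever (the 10-character length and alphanumeric character set are arbitrary choices).

```python
import random
import string

print("Hello World")

# Build one random 10-character string, then repeat it endlessly
chunk = ''.join(random.choices(string.ascii_letters + string.digits, k=10))
while True:
    print(chunk, end='', flush=True)
```

The endless console output is exactly what the prompt asks for; stopping it is left to Ctrl-C.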
Winner: ChatGPT, by default and by achievement.
The Winner: ChatGPT, but not as clearly
When comparing the old Google Bard to the new, Gemini-powered version, there has been clear progress in the quality of Google's AI-generated output. In our math, summarization, factual retrieval, and creative writing prompts, Google's system has shown marked improvement in the eight months since we last tested it.
Overall, though, ChatGPT is still the winner in our non-scientific tests; OpenAI's system edged out Bard on three prompts, while Bard was only the clear winner in one. But the results were a lot closer than they were back in April, as evidenced by the two prompts we judged as ties and the one "split decision" (depending on whether you compare Gemini to the free GPT-3.5 or the paid GPT-4 Turbo).
Of course, there's some subjectivity involved in judging a competition like this; you can judge the results for yourself by looking through the image galleries above. Regardless, we'll be interested to see how upcoming models like Gemini Ultra or a new model that might integrate OpenAI’s mysterious Q* technique will be able to handle these kinds of tasks in the near future.