I'm sure you've all seen the recent chaos that DeepSeek has caused in the US market over the past few weeks. Despite being developed by a much smaller team with a fraction of the funding, DeepSeek showed the world that enormous resources aren't required to produce a market-leading product. Grok 3, on the other hand, went in the opposite direction: trained on around 200,000 Nvidia H100 GPUs at an unfathomable cost, it's all about pushing state-of-the-art technology as far as it can go.
The fallout wiped out nearly a trillion dollars of market value among the top US firms and sent shock waves through Nvidia in particular (DeepSeek uses their chips), dropping its value by 17% (close to $600 billion) as investors began to panic, realising these models may not need such significant investment to be created. DeepSeek claimed it took just two months and under $6 million to bring the app, as we see it now, to market.
DeepSeek is built on a system divided into smaller sub-models, where each sub-model is activated only when a particular piece of knowledge is relevant or brought up. Rather than the whole model being used to carry out every task, smaller parts of the system, each categorised into a specific area, do the work. It's like having specialist teachers for specific subjects rather than one person who teaches you everything. The sketch below gives a rough feel for how that routing idea works.
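To make the "specialist teachers" analogy concrete, here's a deliberately simplified Python sketch of how a Mixture of Experts (MoE) style router behaves. This is not DeepSeek's code; the expert names and keyword gate are made up for illustration, whereas a real MoE uses a learned gating network inside the model.

```python
# Toy illustration of Mixture of Experts (MoE) routing, NOT DeepSeek's implementation.
# A gate scores each "expert" for the incoming request and only the top-scoring
# experts run, so most of the system stays idle for any single query.

EXPERTS = {
    "maths":   lambda q: f"[maths expert] answering: {q}",
    "coding":  lambda q: f"[coding expert] answering: {q}",
    "history": lambda q: f"[history expert] answering: {q}",
}

def gate(question: str) -> dict:
    """Crude keyword-based gate standing in for a learned routing network."""
    keywords = {
        "maths":   ["sum", "integral", "equation"],
        "coding":  ["python", "bug", "function"],
        "history": ["war", "empire", "century"],
    }
    return {name: sum(word in question.lower() for word in words)
            for name, words in keywords.items()}

def answer(question: str, top_k: int = 1) -> list[str]:
    scores = gate(question)
    chosen = sorted(scores, key=scores.get, reverse=True)[:top_k]
    # Only the selected experts are "activated"; the rest do no work.
    return [EXPERTS[name](question) for name in chosen]

print(answer("Why does my Python function throw a bug?"))
```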
And then came Grok 3 in mid-February. This article was intended to be published within a week of DeepSeek launching; however, because DeepSeek restricted API access, we couldn't publish our integration, so we went with Grok 3 instead. This still demonstrates our flexibility and adaptability, and shows how quickly we can bring another app (Grok) to the marketplace.
DeepSeek and Grok (xAI plans to open-source the previous Grok version) are open source models, meaning they're publicly accessible and anyone can use them without paying for a license. DeepSeek's API (Application Programming Interface, which allows two different applications to talk to each other) is priced roughly 20 times cheaper than the ChatGPT API, making it accessible to pretty much everyone without sacrificing the features or capabilities found in paid models. Plus, the source code is available for everyone to see and use for themselves.
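As a rough illustration of what "talking to an API" looks like in practice, here's a minimal Python sketch using the openai client library against an OpenAI-compatible endpoint. The base URL, model name and key below are placeholders and assumptions, so check each provider's own documentation for the current values before using them.

```python
# Minimal sketch of calling an OpenAI-compatible chat API.
# The base_url and model name are illustrative assumptions --
# always confirm them against the provider's API docs.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://api.deepseek.com",  # assumed OpenAI-compatible endpoint
)

response = client.chat.completions.create(
    model="deepseek-chat",  # assumed model name
    messages=[{"role": "user",
               "content": "Summarise what an API is in one sentence."}],
)
print(response.choices[0].message.content)
```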
Open source software comes with several advantages, the biggest being cost effectiveness: it's typically free and doesn't require licensing fees. Since the code is open to the public, developers can modify and customise it to fit their specific needs. This openness also encourages community collaboration, where users contribute documentation, bug fixes, and overall improvements to the models.
That said, open source models have some drawbacks. One major downside is the lack of dedicated technical support; instead, users rely on the community for troubleshooting, which can be tricky for those without technical experience. Open source models also might not include the exact features users need, meaning customisation is often required, though at least it's possible.
On the other hand, closed source software, where the source code is private and only accessible to those who pay for a license, comes with its own advantages. It typically offers expert technical support, predefined features, and stronger security measures to prevent attacks. It's also designed to integrate easily with existing systems, making the user experience straightforward.
However, closed source software has its downsides too. It's expensive, users rely on the vendor for updates and long-term support, and there's no way to modify it to fit specific needs.
We then created three different blog posts, each written by a different model, using the same prompt and data to see which one would deliver the best response.
| Features | DeepSeek | ChatGPT | Gemini | Grok 3 | Claude 3.7 |
| --- | --- | --- | --- | --- | --- |
| Developer | DeepSeek AI | OpenAI | Google DeepMind | xAI | Anthropic |
| Model Type | Open source LLM | Proprietary LLM | Proprietary multimodal model | Family of LLMs and LRMs | LLM |
| Architecture | DeepSeek-V3 Mixture of Experts (MoE) | Transformer based (GPT-4) | Multimodal Transformer | - | "Hybrid reasoning" |
| Performance | Efficient, cost effective, deep think | Strong in simple tasks and text coherence; great writer | Advanced multimodal capabilities | Expert in nuanced analysis, deep domain expertise, and strategic insights | Coding prowess, reasoning and maths, speed |
| Strengths | Coding, technical tasks, showing the working and reasoning | Best for conversational AI, coding assistance and creative writing | Best for image, video, audio and text processing | "Real time" DeepSearch, complex high-level code, advanced reasoning | Hybrid reasoning, coding excellence, visible reasoning, context handling, practical focus |
| Limitations | Limited multimodal capabilities (taking text, image and audio and converting into any output type); training data only goes up to October 2023; no memory retention; no real-time learning | Limited real-time capabilities; dependent on training data; responses are based on learned behaviour (June 2024 knowledge cut-off) | Based on the dataset it was trained on, "may have bias issues in complex reasoning"; no real-time access (August 2024 knowledge cut-off) | Humour, low-quality images, over-sensitive to ethical dilemmas, limited real-world testing | Pricier than others, reasoning faithfulness, not real time, conservative edge |
| Pricing | Free and open source | GPT-4o is free; $20/m limited access; $200/m unlimited access | Gemini 2.0 Flash is free; Advanced $20/m | Grok 3 accessed through X Premium+ at $22/m or $229/year | Free tier with restrictions; Claude Pro $20/m |
We played around with each of the apps, and across a series of different use cases we put the same prompt into each model to see how the outcomes would differ.
Essentially, the integration we built lets you send a message to an AI model (ChatGPT, Gemini, DeepSeek or Grok) using data from your CRM record and write the response back to a specified property, e.g. a custom property on a company record. You can ask it anything you normally would when using these tools, and it will put its response into whichever property you choose. The idea is to use these models to help you make the most of your HubSpot experience. The sketch below shows the rough shape of that flow.
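For readers who want a feel for the mechanics, here's a rough Python sketch of that record-to-model-to-property flow using HubSpot's CRM v3 objects API and an OpenAI-compatible client. This is not our integration's actual code; the property names, prompt, model choice and record ID are illustrative assumptions.

```python
# Rough sketch: pull data from a HubSpot company record, send it to an AI model,
# and write the reply back into a custom property. Illustrative only -- property
# names, the prompt and the model call are assumptions, not our production code.
import requests
from openai import OpenAI

HUBSPOT_TOKEN = "YOUR_PRIVATE_APP_TOKEN"
HEADERS = {"Authorization": f"Bearer {HUBSPOT_TOKEN}"}
BASE = "https://api.hubapi.com/crm/v3/objects/companies"

def run_prompt_on_company(company_id: str, prompt: str, target_property: str) -> str:
    # 1. Read the record from HubSpot's CRM v3 API.
    record = requests.get(
        f"{BASE}/{company_id}", headers=HEADERS,
        params={"properties": "name,domain"},
    ).json()

    # 2. Ask the model, giving it the CRM data as context.
    client = OpenAI(api_key="YOUR_AI_API_KEY")
    reply = client.chat.completions.create(
        model="gpt-4o",  # swap for whichever model you're testing
        messages=[{"role": "user",
                   "content": f"{prompt}\n\nCompany data: {record.get('properties')}"}],
    ).choices[0].message.content

    # 3. Write the answer back to the chosen custom property.
    requests.patch(
        f"{BASE}/{company_id}", headers=HEADERS,
        json={"properties": {target_property: reply}},
    )
    return reply

# Hypothetical usage:
# run_prompt_on_company("12345", "Summarise recent news about this company", "ai_summary")
```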
We tested a handful of use cases across each of the models to see the difference and variation in outputs, and to work out which model would be most appropriate to integrate into your workflows, as well as which model would be the best fit for different scenarios.
We then had each model write its own blog post discussing the outputs of the prompts given to it, then fed all three posts back into each model and asked which one it thought was best.
Here's the prompt given to each of the models to create their own blog based on the use case inputs and outputs:
“Can you write me a blog about these prompts being put into our ChatGPT integration on HubSpot, can you discuss the outputs and answers and talk pros and cons about what you gave me, what your thoughts on the answers are, how well you did at answering them and if you think it adds value to the HubSpot experience”
With DeepSeek restricting their API, it raises the question of whether what they did was really something the western world should have worried so much about: all that hype, for what? As a result, we chose Grok to conduct the same tests, as it too has been turning heads in the AI world. (As this was being published, Claude 3.7 had just been released.)
We came up with 5 different use cases for these apps to carry out, which we believed would be most valuable for enhancing your HubSpot experience and data. We chose 'Apple' due to it being a very well known brand name, and we believed it would be the easiest for the models to talk about and find information on for some of the prompts.
ChatGPT Pros and Cons
The first thought on ChatGPT’s responses to each prompt is that they’re very basic and straight to the point. It does exactly what you ask and not much more.
Some positives are that it's clear, easy to read, and simple to understand. The outputs are direct, so you don’t have to interpret or piece together an answer based on the way it responds. When prompted with words like “summarise,” it does exactly that with no extra waffle. It’s great at handling straightforward tasks that don’t require much reasoning, like providing the company code “APL” as well as researching company information since it’s widely available, and for well-known brands like Apple, it’s hard to get wrong.
On the downside, the answers can feel very general. It doesn’t necessarily give you the exact answer you’re looking for, just one based on what it has access to and has been trained on. Since it doesn’t have real-time internet access, the information is only accurate up to the last point it was updated. For example, saying “the iPhone 15 was recently launched” might have been relevant in September 2023, but in February 2025, it’s outdated and pretty useless if you’re looking for “relevant news” now, which meant it failed that prompt. Compared to Gemini, ChatGPT’s response to the “relevant news” prompt focused on completely different things, mostly minor details that weren’t really “relevant or big,” and kept it very brief, sentence for sentence.
Another issue with real-time access is location specific queries. When asked for “company names” to help a client in New Zealand, it provided a list of mostly US-based companies and none from NZ. The responses also tend to be brief and lack depth when a bit more might be required. I also found that translations were very generic and done word-for-word rather than considering the context of a sentence. This could be misleading, especially for longer or more complex translations where meaning matters just as much as the words themselves.
One of the major disadvantages I found with ChatGPT's responses on the record card property is that they would sometimes change. I have screenshots of what I saw at the time, but when I came back an hour or so later the property would show a slightly altered version, which is concerning if you need reliable responses to assist with various things within HubSpot.
Gemini Pros and Cons
Gemini’s answers were a lot more in depth and had a genuine structure. It gave proper reasoning as well as different options to choose from. Just like ChatGPT, using certain words like “summarise” helped it get to the point. Without that, it sometimes went on a tangent.
Its reasoning and explanation for translations let you decide which output was most appropriate, considering different phrasing styles and context. Having that choice highlights the possibility that ChatGPT’s response could be incorrect, which is a win for Gemini. Providing different options allows the user to pick the best fit, making it more accurate in those cases.
Gemini’s responses showed its ‘thinking and reasoning.’ For the client specific recommendations, it gave local companies that would be relevant to the data given in the prompt (“the client”). The companies it provided were from NZ, and it highlighted the reasons why ones from another country could still be beneficial to use, which would achieve the purpose of having a “client-specific recommendation,” unlike ChatGPT’s response. Gemini easily completed the 3CODE, giving an output of “APP” for the company Apple, though that maybe isn't the most adequate label.
When looking at the “relevant and big news” output, Gemini gave five different headings, all on different aspects of Apple, even though it was still prompted to summarise the answer. Not only does it provide a wide range of news, but it also gives a statement on Apple's current state and what the future could potentially involve. This is significantly more than what ChatGPT gave, showing the differences in models and Gemini going above and beyond to provide information.
Another positive of Gemini is that it would give a summary of the answer even when it wasn't prompted to, which offsets the disadvantage of its answers being long and in-depth, sometimes unnecessarily so. The summary makes it easy to comprehend what has been outputted.
However, Gemini struggled to get to the point unless explicitly prompted. It sometimes included unnecessary details, like breaking down German words when that wasn’t asked for. This meant you had to sift through extra information instead of directly benefiting from AI’s efficiency.
Gemini also couldn't provide a direct answer to the company research prompt regarding the ‘number of iPhone models.’ Its response was “dozens and dozens” rather than a number. It listed all the model types up until 2023, which is of little use when that was 18 months ago and several more models have been released since then. The answer it gave felt like it was pretending to be right, but it failed to achieve the prompt when asked for the number of models, which was surprising.
Grok 3 Pros and Cons
Grok had a mix of both very basic as well as in-depth rich text style answers; for the most part, it did what it was asked. I felt its response to the app review translation was strong compared to the others. It kept it short and got straight to the point, even outputting the translation to make sense rather than a word-for-word translation.
The outputs were basic but had the necessary 'information' to achieve what was asked, even where the prompt ultimately failed. It still demonstrated to me that it was giving accurate and true information based on what it was trained on, which is a positive in itself, just not what you're looking for if you need something answered in real time.
*Side note: the time it took to see the output appear in my custom property felt a lot faster than the other 2 models, which I didn't expect beforehand.
People had claimed Grok wasn’t strong in writing; however, I felt this was a positive when looking at the 'Important and relevant news' output. It gave 5 very clear summarised points filled with accurate information (just not up to date). The generic, historical information on Apple was accurate and gave me what I wanted to see, which was on par with the other models too.
Grok did what it was told for the most part; it summarised when asked and answered the question the best it could. The only reason it's a fail is that, fundamentally, it wasn't achieving the desired request. This could be detrimental if it were actually used with real data that was going to influence decisions or, in this case, give client-specific recommendations that could potentially be incorrect.
Grok really was similar to ChatGPT in most of its outputs; however, it was the only model to get the 3 Code wrong, which was the easiest task given to the models. In fact, it gave a 4 code, which was not what was asked of it, meaning it failed on that, which was very surprising. As seen above, it gave 'APPL', which would have been fine if I'd asked for a 4 code, but I didn't.
It’s obvious that Grok doesn’t actually have real-time access as claimed. All the information, facts, and figures were 'up to the latest data available in 2023', and we are now in 2025, meaning it failed the prompts asking for current information: it didn’t give the statistics on Apple or the correct number of iPhones. However, unlike Gemini, it did at least give a number when asked for one.
One of the biggest negatives of Grok's outputs was the client-specific recommendations. It gave 3 companies, none of which were from NZ, with very little justification as to why they were selected. One of them was 'HubSpot', which I felt was inappropriate because, as stated in the prompt, that’s where our client data is already stored, so this proposed strategy wouldn’t work when I’m asking for different companies that could help. I felt this output was the worst out of the 5 and, in my eyes, failed entirely.
I then went to each respective model and asked them all the same prompt:
"Can you write me a blog about these prompts being put into our integration on HubSpot? Can you discuss the outputs and answers, talk about the pros and cons of what you gave me, share your thoughts on the answers, how well you did at answering them, and if you think it adds value to the HubSpot experience? Here are the prompts we gave and the answers. First prompt:"... Followed by the prompts and the outputs it gave.
Here are the blogs each respective model produced:
ChatGPT
Gemini
Grok
Our favourite responses and blog post came from Gemini, which proved superior thanks to its consistent accuracy, comprehensive detail, and practical application within the HubSpot context. It reliably delivered correct and current information, essential for informed business decisions. Beyond its lengthy answers, it provided valuable context and explanations, such as the nuanced distinctions between translation options. Critically, its practical application was evident in the lead generation recommendations, offering specific, justified company suggestions that would be highly valuable in a real-life context and extremely important to get right. It was the most consistent and came closest to meeting the prompts' requests compared to the other two.
After feeding all 3 blogs into each of the models, here's which ones the models thought were the best:
This was the prompt given
These are the verdicts each model gave on which output it thought was best.
"1st Place: Blog #2
✅ Most precise and context-aware—It provided multiple translations, a nuanced take on the iPhone model count, and tailored company recommendations.
✅ Best for business users—The lead gen suggestions were specific and named actual companies, adding real value for HubSpot users.
✅ Good balance of depth and conciseness—Responses were well-structured and contextualized but didn't overwhelm.
⚠️ Minor downside—It referenced a potentially outdated product name ("iPhone 16"), though this is a small issue.
2nd Place: Blog #3
✅ Strong factual accuracy—It got the Apple details right and stayed within the constraints for the client code prompt.
✅ Structured and comprehensive—Responses were clear, business-friendly, and covered multiple angles.
✅ Good translation and news summaries—Captured meaning well, though it could refine phrasing.
⚠️ Weaker personalization—The lead generation response was solid but not localized to New Zealand.
3rd Place: Blog #1
✅ Conversational and user-friendly—It had a natural tone, making it easy to read.
✅ Good with general facts and news—Nailed the Apple history and company updates.
⚠️ Precision issues—Missed constraints in the client code task and was vague on the iPhone model count.
⚠️ Lacked deeper insights—The lead generation recommendations were useful but generic, not tailored or innovative.
Final Thoughts
Gemini (#2) wins for its precision, contextual intelligence, and business relevance.
ChatGPT (#3) follows closely, delivering well-structured answers but lacking localized insights.
Grok (#1) has potential but stumbles on finer details, making it the weakest for HubSpot users."
"Alright, after analyzing each blog, here's my ranking and reasoning:
Ranking:
Reasoning:
In summary, Blog 2 (Gemini) stood out for its accuracy, detail, and practical application, making it the most valuable resource for understanding and leveraging AI within HubSpot. Blog 3 (ChatGPT) was a close second, providing a thorough and informative evaluation. Blog 1 (Grok) was the weakest of the three, with some inaccuracies and less detailed responses."
"Alright, I’ve carefully reviewed the three blogs—each testing an AI (ChatGPT, Gemini, and Grok) integrated into HubSpot with identical prompts—and I’m ready to rank them based on accuracy, reliability, and usefulness for getting the most out of a HubSpot experience. My evaluation focuses on how well each AI handles the prompts, the practical value of the outputs for HubSpot users (e.g., marketers, salespeople, or service teams), and the overall consistency and depth of the analysis provided in the blogs. Here’s my ranking, from best to worst, with reasoning:
Unanimously, each of the models said Gemini had the best response in terms of being the most accurate, the most reliable, and offering the most use for getting the most out of your HubSpot experience, followed by ChatGPT in second and Grok last.
Go and try this for yourself: have a play around with each of the apps, see what they're capable of, and find out how they can improve your HubSpot experience.
If you're not already signed up to MyGlue, you can do so here. This is where you can access all of our micro apps, keep track of your credit usage, change your billing info, and view all of our support and knowledge base articles.
We would love your thoughts and opinions on all of this. Go and check out the 3 different blog posts (linked just above) that were created by each of the different models and let us know if you agree or disagree with our thoughts.
If you have any use cases that you use these apps for, feel free to reach out to us; we're always curious to see how our customers are utilising our apps.