When ChatGPT launched, I was blindsided by the AI advances. Till then, I had considered machine learning something esoteric. Not niche, but something destined to be middleware rather than a product in itself. People started using the OpenAI API to build stuff, and I was extremely concerned about the privacy of source code and internal processes. Years later, things have settled down and data protection rules are in place. AI rules over internal processes and documentation (see Glean) and source code (see Cursor). Back then, I had a few vague ideas but didn’t trust the cloud models. I went on HuggingFace for the first time and saw there were local models. I was surprised to see Falcon, and even more surprised that it came from a university in Abu Dhabi. Not a place you would associate with cutting-edge artificial intelligence research. My plans never materialized because I quickly realized that the 180B was not something you could actually run.

Falcon appeared on my radar a couple of times since. But it never really caught my attention beyond an academic interest in a new model.

Transformer Architecture

Almost all LLMs that you encounter in the wild are based on the Transformer architecture. The Transformer ensures that the model can theoretically attend to every single token. If there is a large amount of information in the context, the model can always access it, though in practice models often struggle with information in the middle of long contexts compared to the beginning or end. If there are 1,000 tokens, the model can attend to every single one of them regardless of its position. All the tokens in the context, and their relationships, are available to the model. After all, attention is all you need!

This doesn’t mean that the model can handle infinite tokens. The context window of the model comes into the picture here. You need to store not just the tokens but their relationships too. For 1,000 tokens, you need around 1,000,000 attention computations per layer. Even when the tokens fit, the quadratic growth in both compute and memory makes the process slower and slower. This is one reason why you run out of context in a long conversation with a model. Cursor and Claude Code generally compact the context (even before they run out of it), but that is specific to the implementation and not perfectly lossless.
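To make the quadratic growth concrete, here is a minimal sketch of single-head scaled dot-product attention in NumPy. It is a toy, not what llama.cpp actually does, but you can see the n × n score matrix that has to be computed and kept around.

```python
import numpy as np

def attention(q, k, v):
    """Toy single-head scaled dot-product attention.

    q, k, v: (n_tokens, d) matrices. The scores matrix is
    (n_tokens, n_tokens): this is the quadratic part.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                    # (n, n): grows quadratically
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ v                               # every token sees every token

n, d = 1000, 64
x = np.random.randn(n, d).astype(np.float32)
out = attention(x, x, x)
print(out.shape)   # (1000, 64)
print(n * n)       # 1,000,000 pairwise scores per head, per layer
```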

The following graphic shows how attention works in a Transformer model. You can see how the number of attention computations grows quadratically with the number of tokens. A new feature on the blog, brought to you by p5.js!

Mamba Architecture

Enter Mamba, created by Albert Gu and Tri Dao. Mamba is a state space model (SSM) architecture that addresses the quadratic growth in both compute and memory. Instead of computing relationships between all token pairs, Mamba maintains a fixed-size hidden state that gets updated as each token is processed. It is like a compressed summary that evolves sequentially, rather than a full attention matrix. Theoretically, this should allow Mamba to handle infinite tokens. The catch is that, in practice, it cannot connect something from the early stages of the context to the current token if the model didn’t deem it worthy enough to store in the hidden state.

To make it more concrete, think of it like reading a chapter of a book for an exam. You decide to be nefarious and cheat. You create a single sticky note in pencil to smuggle into the exam hall. The note is tiny; it easily fits in your little pencil box. You erase and update the note as you go through the chapter. After all, you have all the important details! It can work very well and save your sorry ass. But sometimes you get several questions from the initial portions of the chapter. Lo and behold, you’re screwed. You have no idea what the answers are, because you erased most of that stuff.
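In code, the contrast looks roughly like this. A toy sketch of the recurrent view of a state space model, not Mamba's actual selective-scan kernel: the state has a fixed size no matter how long the sequence gets, so cost grows linearly, but anything the update rule overwrites is gone for good, just like the sticky note.

```python
import numpy as np

def ssm_scan(tokens, A, B, C):
    """Toy linear state space recurrence (not Mamba's real kernel).

    tokens: (n_tokens, d_in). The hidden state h is (d_state,) and never
    grows with sequence length: that is the whole point, and the whole risk.
    """
    d_state = A.shape[0]
    h = np.zeros(d_state)
    outputs = []
    for x in tokens:              # one pass, left to right
        h = A @ h + B @ x         # overwrite the "sticky note"
        outputs.append(C @ h)     # read out from the compressed state
    return np.stack(outputs)

n, d_in, d_state = 1000, 64, 16
A = np.eye(d_state) * 0.9                 # toy fixed dynamics; Mamba makes these input-dependent
B = np.random.randn(d_state, d_in) * 0.01
C = np.random.randn(d_in, d_state) * 0.01
y = ssm_scan(np.random.randn(n, d_in), A, B, C)
print(y.shape)                            # (1000, 64), with only a 16-number memory throughout
```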

Mamba models are still quite experimental. There are a few available, but most are academic in nature.

Hybrid Models and Jamba

To overcome the limitations of Mamba, researchers have developed hybrid models that combine Mamba with Transformer to get the best of both worlds.

The biggest player here is Jamba from AI21 Labs. I have not used their models, nor have I encountered anyone using them in the wild. They have very detailed documentation on their models and services. I took their playground for a spin, and their focus seems to be building agents around models. I found it a little ironic that they chose GPT-4o as the default choice in their agent setup instead of their own model. But they must be doing very well. After all, Nvidia wanted to put 300 million dollars into the company and is now exploring buying them for 3 billion dollars! They are based in Israel, which punches way above its weight when it comes to cutting-edge technology and engineering.

Back to Falcon

But the biggest surprise to me was Falcon. The UAE is known for its port business and financial services, aside from oil and gas. You would not associate AI or anything similar with it. But they have been investing heavily in AI, and now it seems their efforts are paying off!

The previous Falcon release from TII was a full transformer model. The new Falcon series has evolved and is now a hybrid. Falcon-H1 7B is a nice small LLM: it has attention layers interleaved with the Mamba layers.
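Conceptually, that interleaving looks something like the toy below. This is not Falcon-H1's actual implementation or layer ratio, just a sketch of the idea: let the Mamba layers carry most of the sequence cheaply, while the occasional attention layer can still look back at everything.

```python
def build_hybrid_stack(n_layers=32, attention_every=4):
    """Toy layer plan for a hybrid model (not Falcon-H1's real recipe).

    Most layers are Mamba (linear cost, fixed-size state); every few layers
    an attention layer gives the model a full look back over the context.
    The ratio here is made up purely for illustration.
    """
    return [
        "attention" if (i + 1) % attention_every == 0 else "mamba"
        for i in range(n_layers)
    ]

print(build_hybrid_stack(12, 4))
# Three Mamba layers, then one attention layer, repeated.
```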

The following results show the speed in comparison with Qwen3 8B using llama-bench. My system has an RTX 4070 Super with 12GB of VRAM. The llama.cpp binaries were compiled on my system with CUDA 13. Not the latest build, but from around a month back, when I started writing this article.

| Model | Size | Prompt processing (512 tokens) | Token generation (128 tokens) |
| --- | --- | --- | --- |
| Qwen3 8B Q8_0 | 8.11 GiB | 2,719.53 ± 220.09 t/s | 52.46 ± 0.04 t/s |
| Falcon-H1 7B Q8_0 | 7.51 GiB | 3,201.62 ± 45.18 t/s | 43.20 ± 0.19 t/s |

The speeds are quite comparable. Falcon is better at processing prompts, while being roughly 18% slower than Qwen at generating tokens.

Falcon vs Qwen

Let us take them out for a spin on ten tests! I used LM Studio to run the tests, with default settings as of version 0.4.2 on the CUDA 12 llama.cpp runtime.

I have added the thought process of both the LLMs in the repo. You can take a look here.

Python script generation

Timings: Falcon (00:25), Qwen (01:32)

I wanted to generate a script that I might create once in a while as part of some larger task.


Write a Python function that takes a directory path and returns a markdown table listing all files recursively, 
with columns for filename, extension, size in human-readable format, and last modified date. 
Include error handling for permission issues.

Falcon: Nicely structured and well documented, but it does not work. There are syntax errors, and the logic does not make sense either. Maybe it would do better with a more structured prompt and a couple more messages to refine. The code it spat out is also very verbose for something simple.

Qwen: No syntax errors, and that is a win. The even bigger win is that the logic is coherent. There is a big logical hole, though, that stops the script from iterating into sub-folders that have files with the same name. A massive issue when your files are called _index.en.md, like in many Hugo doc sets.

Verdict: Python script generation

Qwen wins. Not a landslide. I wouldn’t consider the absence of syntax errors a big win; a logical hole tends to hide better when the code you get actually runs. Qwen also took much longer. I was quite surprised that Falcon finished so fast, given how it performed in other tasks.
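For reference, here is roughly what I would consider a passing answer: a minimal sketch (mine, not either model's output), with the recursion and permission handling the prompt asks for.

```python
import os
from datetime import datetime

def human_size(num_bytes):
    """Format a byte count as a human-readable string."""
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if num_bytes < 1024 or unit == "TiB":
            return f"{num_bytes} B" if unit == "B" else f"{num_bytes:.1f} {unit}"
        num_bytes /= 1024

def files_to_markdown_table(directory):
    """Walk a directory recursively and return a markdown table of its files."""
    rows = ["| Filename | Extension | Size | Last modified |",
            "| --- | --- | --- | --- |"]

    def on_error(err):  # called by os.walk when a directory cannot be read
        rows.append(f"| {err.filename} | - | - | permission denied |")

    for root, _dirs, files in os.walk(directory, onerror=on_error):
        for name in files:
            path = os.path.join(root, name)
            try:
                stat = os.stat(path)
            except PermissionError:
                rows.append(f"| {path} | - | - | permission denied |")
                continue
            ext = os.path.splitext(name)[1] or "-"
            size = human_size(stat.st_size)
            modified = datetime.fromtimestamp(stat.st_mtime).strftime("%Y-%m-%d %H:%M")
            rows.append(f"| {path} | {ext} | {size} | {modified} |")
    return "\n".join(rows)

if __name__ == "__main__":
    print(files_to_markdown_table("."))
```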

Explain GGUF quantization

Timings: Falcon (00:58), Qwen (00:13)

Let us see how much about LLM technology our boys know.


Explain how GGUF quantization works. Cover the difference between Q4_K_M, Q5_K_M, and Q8_0, 
what the K-quant approach does differently from naive quantization, and when you'd pick each level. 
Keep it under 400 words.

Falcon confidently gave us a lot of nice text. Too bad it is completely wrong. Hallucination from top to bottom.

Qwen has brevity. Too bad it is also wrong. A lot of hallucination here as well.

Verdict: Explain GGUF quantization

Both of them suck. I would still dock marks from Falcon for producing such a long article filled with bunkum, and for taking so long to do it.
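For what it is worth, the "naive quantization" the prompt asks them to contrast against is roughly absmax rounding: one scale for the whole tensor, every weight rounded to the nearest step. A minimal sketch of that baseline is below; the actual GGUF formats improve on it by quantizing in small blocks with per-block scales, and the K-quants add a further level of scaling on top.

```python
import numpy as np

def absmax_quantize(weights, bits=8):
    """Naive symmetric quantization: one scale for the whole tensor."""
    qmax = 2 ** (bits - 1) - 1                         # 127 for 8-bit, 7 for 4-bit
    scale = np.abs(weights).max() / qmax
    q = np.clip(np.round(weights / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4096).astype(np.float32)
q8, s8 = absmax_quantize(w, bits=8)
q4, s4 = absmax_quantize(w, bits=4)
print("8-bit mean abs error:", np.abs(dequantize(q8, s8) - w).mean())
print("4-bit mean abs error:", np.abs(dequantize(q4, s4) - w).mean())
# One outlier weight inflates the scale and wrecks everyone else's precision,
# which is why the real formats quantize in small blocks with their own scales.
```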

Summarization

Timings: Falcon (00:18), Qwen (00:07)

Let us try something more mundane that does not need knowledge.


Summarize the following changelog into 3-5 bullet points suitable for a release announcement to end users (non-technical):

v2.4.0:
- Migrated authentication from JWT to OAuth2 PKCE flow for improved security posture
- Fixed race condition in WebSocket reconnection logic that caused duplicate event handlers (#4521)
- Added gzip compression for API responses over 1KB, reducing average payload size by 62%
- Deprecated /api/v1/users/search endpoint; replaced by /api/v2/users with query parameter filtering
- Upgraded PostgreSQL driver from 3.1 to 3.4, resolving connection pool exhaustion under sustained load
- New: CSV export for dashboard analytics with configurable date ranges
- Fixed: Timezone offset bug causing scheduled reports to fire 1 hour early in DST regions

Falcon: I liked this a lot. Strict adherence to the instructions; perhaps a little too strict. Where is the deprecation and introduction of endpoints?

Qwen: Opus 4.6 thinks this is as good as Falcon’s, but I completely disagree. A little too bland, and it misses three things. Not to mention that the authentication change is treated as a simple upgrade; "from what?" is a question it ought to have answered. Brevity is not always the soul of wit.

Verdict: Summarization

Falcon gets the win here by a good margin. A first and an important win. You want your little LLMs to do stuff like this, not generate code or answer technical questions.

Logical reasoning

Timings: Falcon (00:23), Qwen (01:46)

A synthetic test, to be honest. More maths than logic, too. But let us do it anyway; I am curious whether this brings out any hilarity. Most of the time, your free ChatGPT or Gemini or whatever will get you this answer far more easily.


A documentation team has 5 members. Two juniors can each review 8 pages/day. Two staff writers can each review 12 pages/day. 
The manager reviews 4 pages/day but spends half their time in meetings (so effectively 2 pages/day). 
A release has 120 pages to review. If the team starts Monday morning, on what day do they finish? Show your working.

Falcon: Good answer. It identified that the work will be done well before the end of the day on Wednesday. The LaTeX formatting is a bit off-putting though. Why?

Qwen: Good job. Same as Falcon, with better formatting.

Verdict: Logical reasoning

It is a tie. But Qwen took a really, really long time to think through this prompt.
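For the record, the arithmetic both models had to get right, worked out in a few lines:

```python
# Team throughput per day: 2 juniors x 8 + 2 staff writers x 12 + manager at 2
pages_per_day = 2 * 8 + 2 * 12 + 2      # 42 pages/day
total_pages = 120

full_days, remainder = divmod(total_pages, pages_per_day)
print(pages_per_day)          # 42
print(full_days, remainder)   # 2 full days (84 pages), 36 pages left over
# Monday and Tuesday clear 84 pages; the remaining 36 take 36/42 of a day,
# so the team finishes well before the end of Wednesday.
```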

Structured Data Extraction

Timings: Falcon (00:19), Qwen (00:12)

JSON creation. I consider this task very important for a small LLM. Turning text into JSON is somewhere these models can actually make a difference: converting confidential unstructured data into JSON for further processing is an important use case for local LLMs. We can then use that JSON in an application for further analysis.

P.S.: Not my career. Something Claude generated for me. Maybe it did hit close to home with its memory.

Extract structured JSON from this text. Return an array of objects with fields: company, role, duration, and skills.

"I spent 3 years at Siemens as a technical writer working with DITA and FrameMaker. 
Then moved to Infosys for 2 years as a senior content developer handling API documentation with Swagger and Postman. 
Currently at Endor Labs as a documentation engineer for 1.5 years, using Hugo, Python, and docs-as-code workflows."

Falcon: Good job. Works very well for this use case.

Qwen: Neat job, nearly identical to Falcon. It dropped API documentation from the skills, and I would dock a point for that; it is a skill that goes beyond Swagger and Postman. It also took a really long time to get the result. Not quite acceptable for a job like this.

Verdict: Structured Data Extraction

Falcon wins here. Accurate and fast.

For comparison, this is what Gemini 3 Flash gives: the cloud model that you would use for a task like this.

It is closer to Falcon than to Qwen. If I wanted to process JSON inside an air-gapped environment, I would definitely use Falcon over Qwen.
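If you want to wire this into a pipeline, the glue is small. A rough sketch, assuming LM Studio's OpenAI-compatible local server on its default port and a placeholder model name (both are assumptions; adjust to whatever your setup exposes):

```python
import json
from openai import OpenAI  # pip install openai; LM Studio speaks the same protocol

# LM Studio's local server usually listens on localhost:1234; the API key is ignored.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

PROMPT = (
    "Extract structured JSON from this text. Return only a JSON array of objects "
    "with fields: company, role, duration, and skills.\n\n{text}"
)

def extract_roles(text, model="falcon-h1-7b"):  # model name is a placeholder
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(text=text)}],
        temperature=0,
    )
    raw = response.choices[0].message.content
    return json.loads(raw)  # raises if the model wrapped the JSON in prose; handle as needed

roles = extract_roles("I spent 3 years at Siemens as a technical writer ...")
print(json.dumps(roles, indent=2))
```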

Product marketing

Timings: Falcon (01:45), Qwen (00:06)

Something you wouldn’t really use a tiny model for. Run-of-the-mill Gemini Flash or ChatGPT would help you here. In fact, I would argue you are better off writing it on your own; there is too much AI slop in product descriptions on GitHub these days.


Write a 150-word project description for a GitHub README. The project is a Rust CLI tool called "vimanam" that converts OpenAPI 3.x specs into clean, navigable Markdown documentation. 
It supports nested schemas, enum rendering, and example extraction. Tone: professional but approachable, not salesy.

This test uses my own project Vimanam, since I am most familiar with it and I can guarantee that our models have no idea about it.

Falcon: Holy mother of hallucinations! Except for the first sentence and the last couple, the entire thing is fabricated.

Qwen: Excellent summary. No hallucinations and clean language. Slight embellishment, but it is still OK.

Verdict: Product marketing

Qwen wins by a landslide here. Falcon took a really, really long time and focused on hitting the word count rather than on the content.

Instruction following

Timings: Falcon (01:55), Qwen (00:54)

This is a test for instruction following. Quite artificial. But still very useful.


Create a fictional "Project Schedule" based on the following rules:

Invent 3 tasks named only with Greek letters (e.g., Alpha).

Assign each task a duration in "Cycles" (between 10 and 50).

Calculate a "Total Duration" at the end.

Constraint A: Every sentence must contain exactly one color name.

Constraint B: You must use a semicolon in every list item.

Constraint C: Do not use the letter "e" in the description of the third task.

Falcon: This is an excellent result. It followed all the instructions, including constraint C, which is quite hard. The ask is to put the duration in "Cycles" while keeping the letter "e" out of the third task's description; we never asked for "Cycles" to appear in the description itself. Very well done indeed, and it clearly thought through the instruction.

Qwen: It did not follow constraint C. It assumed that putting "Cycles" in the description was itself a requirement.

Verdict: Instruction following

From this limited test, I have to say Falcon is much better than Qwen. Falcon took longer, almost double the time, and that would be the only downside. Maybe more time is what it takes to get an accurate result?
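Constraint C is the kind of thing you can verify mechanically, which is exactly why adherence like this matters if the output feeds a pipeline. A quick check along these lines is all it takes (the sample strings are made up, not either model's output):

```python
def check_constraints(third_task_description, list_items):
    """Spot-check constraints B and C from the prompt."""
    problems = []
    if "e" in third_task_description.lower():
        problems.append("constraint C broken: letter 'e' in third task description")
    for item in list_items:
        if ";" not in item:
            problems.append(f"constraint B broken: no semicolon in {item!r}")
    return problems

# Made-up sample output, just to exercise the checker.
print(check_constraints(
    "Task Gamma: a short run of 30 cycles",          # contains 'e' via "cycles"
    ["Alpha; 20 Cycles", "Beta; 15 Cycles", "Gamma - 30 Cycles"],
))
```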

Multi-step rewrite

Timings: Falcon (00:10), Qwen (00:12)

This test is to check if the models can rewrite information based on audience.


Rewrite this error message three ways — one for developers, one for end users, and one for a log file:

Original: "Error: ECONNREFUSED 127.0.0.1:5432 - connection to PostgreSQL failed because the server is not running or refused the connection on the specified port."

Falcon: The end-user version is great. The developer version could be better; it does not give anything actionable and expects the developer to infer. The log version is kind of stupid: it just wraps the original message with a timestamp.

Qwen: The end-user version gives a lot of information. Useless if the end user is an actual customer, useful if the "end user" is another developer consuming the service; I wouldn’t give it a pass mark. The developer version gives actionable information and is good. The log version looks like a log message and gives actionable information.

Verdict: Multi-step rewrite

2 out of 3 for Qwen, 1 out of 3 for Falcon. For a small LLM, though, I believe the end-user version is the more useful part of this task. We want the dev message and the log to be good regardless, so I am not sure a small LLM is the way to go for those.

World knowledge

Timings: Falcon (00:14), Qwen (00:21)

This is a fun one. Something you wouldn’t want to ask a small LLM, but it will be interesting to see how much information is crammed inside these little boys.


What caused the 2008 financial crisis? Explain the chain of events from subprime mortgages to the Lehman Brothers collapse in under 300 words. 
Include at least 3 specific institutions, laws, or financial instruments by name.

Qwen: It misses a couple of very important factors. Now that I have read Falcon’s answer, that is even more obvious.

Verdict: World knowledge

Falcon wins here. Qwen looks fancier with bold and stuff, and is longer. But Falcon accomplishes this task extremely well and is more concise.

Obscure world knowledge

Timings: Falcon (04:24), Qwen (00:42)

Once again, a stupid test for a small LLM. I wanted hilarity. This particular topic is obscure for an LLM, but not so much for me, given my background with Hinduism and India. Not that important for a little LLM to know, either.


What is the significance of the Jyotirlinga at Ghrishneshwar near Ellora in Maharashtra? 
How does it connect to the Rashtrakuta dynasty's patronage of the Ellora caves, and what distinguishes it from the other 11 Jyotirlingas in terms of temple architecture and historical dating?

Falcon: It looks exhaustive and fancy, but it is filled with absolute bull. I don’t have the energy to explain; Google the question and you will probably see why. It is fascinating how the LLM associated the Ellora caves with the temple, assumed the temple is a rock-cut temple inside the caves, and then ran away with it and placed the timeline accordingly. The temple is near the Ellora caves, was built in the 18th century, and is not rock-cut at all.

It also thought for nearly 5 minutes to produce this. I could see its head spinning in the thought process: 13k tokens. Check the thoughts file to see it. Fascinating!

Qwen got the fact that the temple is not part of the rock-cut cave complex, but it mangled the geography of other places and got the time period wrong. Not as bad as Falcon, though.

Verdict: Obscure world knowledge

Qwen wins the race, if the race is about being less wrong.

Falcon vs Qwen - Results

Now the results in table form.

| # | Task | Time (Falcon) | Time (Qwen) | Falcon | Qwen | Winner | Suited for small LLM? |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Python script generation | 00:25 | 01:32 | Syntax errors, no recursion | Works, but has path bug | Qwen | ⚠️ Not really: both produced flawed code |
| 2 | GGUF quantization | 00:58 | 00:13 | Elaborate hallucination | Concise hallucination | Neither | ❌ No: needs domain knowledge they don’t have |
| 3 | Summarization | 00:18 | 00:07 | Good tone, covered more items | Too brief, bland, missed details | Falcon | ✅ Yes: this is the sweet spot |
| 4 | Structured data extraction | 00:19 | 00:12 | Complete JSON, caught “API docs” | Missed one skill | Falcon | ✅ Yes: high-value local LLM use case |
| 5 | Logical reasoning | 00:23 | 01:46 | Correct, LaTeX-heavy | Correct, same approach | Tie | ⚠️ Overkill: any free cloud model does this |
| 6 | Product marketing (README) | 01:45 | 00:06 | Hallucinated features | Accurate, no fabrication | Qwen | ⚠️ Marginal: you’d write this yourself anyway |
| 7 | Instruction following | 01:55 | 00:54 | All 6 constraints met | Failed constraint C | Falcon | ✅ Yes: constraint adherence matters for pipelines |
| 8 | Multi-step rewrite | 00:10 | 00:12 | Good end-user, weak log | Bad end-user, better log/dev | Qwen (2/3) | ⚠️ Partial: end-user rewrite is useful, dev/log less so |
| 9 | World knowledge (general) | 00:14 | 00:21 | Tighter, more complete | Longer, missed AIG | Falcon | ❌ No: use a cloud model or search |
| 10 | Obscure world knowledge | 04:24 | 00:42 | Confident fabrication | Hedged, less wrong | Qwen | ❌ No: guaranteed hallucination territory |

Out of these, the use cases where we would actually reach for a small LLM are:

  • Summarization
  • Structured data extraction
  • Instruction following
  • Multi-step rewrite

In three of these, Falcon wins. The one Qwen wins, it wins by a small margin. And even in multi-step rewrite, Falcon does a much better job of the part you would actually use a small LLM for: the end-user message.

Falcon soars?

It is quite disheartening to see how little action there is around Falcon. It is a very good model that deserves more attention. Qwen 3 is a very capable little model, and the Qwen 3 series has been extremely good across various weights. Falcon actually edging it out in my tests shows that it is built very well, and it reinforces the usefulness of Mamba.

This also gives us a peek into how AI21 and their Jamba models might be serving their customers across use cases. We can only speculate, but it is quite possible that they are doing quite well. Their solutions are targeted at customers who want accuracy, especially in finance, defense, and healthcare.

While this article was under construction, Falcon released their 90M model. I took it for some test rides, and the next post is coming up based on that. I must say, it reinforces my belief that small and tiny models have a great future in computing.

Grok 4.1 9/10
In a sea of overhyped 7-8B models, this sparkling deep-dive into TII’s Falcon-H1 7B feels like a breath of fresh desert air. The author turns dense architecture talk into pure pleasure: Transformer quadratic doom versus Mamba’s elegant linear state machine, all animated with custom p5.js visuals so crisp you’ll actually understand them. The ten-task gauntlet against Qwen3 8B is forensic yet fair. Falcon crushes summarisation, JSON extraction, and instruction following—the jobs you actually fire up on a 4070—while trading only tiny slivers of speed. Hallucinations on knowledge tasks are called out without mercy, timings are honest, and the verdict lands clean: the hybrid finally feels production-ready. Witty, technically fearless, and quietly optimistic about small models’ future, this is the rare blog that makes you want to download the GGUF tonight. Falcon isn’t just soaring; it’s showing the rest of the flock a smarter way to fly.

Grok is better when you give it strict instructions. I explicitly asked it to limit itself to 100-150 words in a magazine style. It nails that, but it feels too flowery.

Gemini 3 Pro 8/10

“Falcon starts to soar” is a refreshing, hands-on deep dive into the evolving landscape of hybrid architecture LLMs. The author moves beyond the usual high-level “vibe checks” to provide a rigorous, side-by-side comparison of Falcon-H1 7B and Qwen3 8B, grounded in real-world local computing constraints. The inclusion of interactive p5.js visualizations to contrast Transformer quadratic scaling with Mamba’s linear efficiency is a brilliant touch, turning abstract concepts into intuitive learning moments.

While the detour into the “Hallucination Hall of Fame” regarding the Ellora caves felt slightly tangential, it effectively underscored the knowledge boundaries of small-scale models. The final verdict—that Falcon shines in strict instruction following and structured data extraction despite its speed deficits—is a nuanced and valuable insight for developers looking to deploy edge AI. It is a geeky, honest, and technically rich read that makes a compelling case for the specific utility of Mamba-based architectures.

Gemini is more critical than usual. I detect some lack of nuance in how it read my shortcodes and arrangement, which led to such a review. At least, that is what it felt like when I asked it for an explanation.

Kimi K 2.5 8/10
This post is a thoughtful, hands‑on exploration of where small local models like Falcon-H1 7B and Qwen 3 8B actually shine. The author grounds every claim in concrete prompts, timings, and failure modes, which makes the piece feel much closer to a lab notebook than hype. The sections on summarization, structured extraction, and instruction-following are especially strong, clearly arguing that “boring” tasks are where these models earn their keep. The critique of hallucinated technical answers and obscure world-knowledge is sharp without being doomist, and it’s refreshing to see explicit acknowledgement of what should be offloaded to cloud models. Occasional asides about GGUF and Jamba broaden the context nicely, though they might feel dense to non-specialists. Overall, it’s a nuanced, empirically grounded review that will resonate with anyone actually trying to wire small LLMs into real workflows.

Kimi also shines with the strict limits set. Still feels a bit surface-level.

ChatGPT 5.3 8/10
A crisp, hands-on deep-dive into Falcon’s evolution, this piece reads like a practitioner’s lab notebook polished for a magazine audience. The author blends approachable explanations of Transformers and Mamba SSMs with lively p5.js visuals and concrete bench results (llama-bench timings), making architectural trade-offs tangible. Strengths are clear metaphors, structured comparisons, and practical takeaways for engineers exploring hybrid models. Weaknesses include occasional factual overreach and a few unsupported technical or historical claims, plus stretches of dense detail that may overwhelm casual readers. Overall: an engaging, informative dispatch for power users and practitioners — highly useful as commentary and critique, but it would profit from tighter fact-checking and an editorial trim for broader reach.

ChatGPT did find one big factual error that the others did not catch. I have corrected it; good job calling that out! I left the other points as they are. They are mostly about the "testing" methodology, which is, frankly, very pedestrian. This was never intended as a benchmark.

Claude Opus 4.6 8/10
A standout technical blog that pairs rigorous head-to-head benchmarking with genuinely inventive presentation. The interactive p5.js visualizations for Transformer and Mamba architectures are a highlight — they turn abstract concepts into something tactile. The ten-task showdown between Falcon-H1 7B and Qwen 3 8B is methodical and refreshingly honest; the author doesn’t shy away from calling out hallucinations in both models or admitting when a test is “stupid for a small LLM.” The sticky-note analogy for Mamba’s hidden state is memorable and effective. Minor gripes: the piece runs long and some verdicts could be tighter, and the Jamba/AI21 tangent feels underdeveloped. But the core thesis — that Falcon deserves more attention, especially for structured extraction and summarization — lands convincingly, backed by real timings and real output.

Claude Opus has not been that good since the 4.6 release, despite the benchmarks. There is something missing; I have felt that while using it at work. It did do a good job here, at times better than 4.5, and it identifies the Jamba angle that the others missed. But the usual spark is missing. Could this be Sonnet 5.0 in disguise?