[오픈소스 AI] 1GB 미만의 초소형 비전-언어 모델, LFM2-VL을 소개합니다.

안녕하세요,

최근 인공지능 분야에서는 시각 정보를 이해하고 해석하는 비전 AI 모델의 발전이 두드러지고 있습니다. 이 가운데 미국의 Liquid AI는 효율성과 실용성을 모두 갖춘 새로운 비전-언어 모델 시리즈 LFM2-VL을 공개했습니다. 특히 가장 작은 버전은 모델 크기가 1GB도 채 되지 않을 정도로 경량화되어, 개발 환경이나 디바이스 제약이 있는 상황에서도 활용이 가능하다는 점에서 주목받고 있습니다.

이번 포스팅에서는 LFM2-VL 모델의 구조, 특징, 그리고 실제 테스트 결과에 대해 알아보겠습니다.

LFM2-VL 모델이란

LFM2-VL은 Liquid AI에서 공개한 시각-언어 모델로, 이미지와 텍스트를 동시에 이해하고 처리할 수 있는 멀티모달 구조를 갖춘 비전-언어 모델입니다. 본 모델은 기존 LFM2 언어 모델을 기반으로 확장된 버전으로, 저지연 추론과 디바이스 친화적 배포를 주요 목표로 개발되었습니다. 텍스트와 이미지 입력을 결합하여 멀티모달 추론을 수행하며, 리소스가 제한된 환경(예: 노트북, 모바일, 단일 GPU, 임베디드 디바이스)에서도 효율적으로 작동하도록 경량화 및 연산 최적화가 이루어졌습니다. 공개된 버전으로는 450M급과 1.6B급 모델이 우선 출시되었으며, 이후 확장형 버전인 3B 모델이 추가로 공개되었습니다.

Liquid AI 공식블로그 : https://www.liquid.ai/blog/lfm2-vl-efficient-vision-language-models

LFM2-VL: Efficient Vision-Language Models | Liquid AI

Today, we release LFM2-VL, our first series of vision-language foundation models. These multimodal models are designed for low-latency and device-aware deployment. LFM2-VL extends the LFM2 family of open-weight Liquid Foundation Models (LFMs) into the visi

www.liquid.ai

[모델 정보]

항목	내용
버전	LFM2-VL-450B	LFM2-VL-1.6B	LFM2-VL-3B
파라미터 수 (LM Only)	350 B	1.2 B	2.6 B
개발사	Liquid AI, Inc.
아키텍처	하이브리드 비전-언어 구조 (Transformer 기반 + SigLIP2 NaFlex 비전 인코더)
컨텍스트	32,768 tokens
라이선스	LFM Open License v1.0
특징	초거대 멀티모달 모델, 장문 시각 추론 및 복합 질의 대응	경량 모델로, 효율적 인퍼런스 및 소형 GPU 친화적	512×512 입력 지원, 패치 단위 이미지 처리, 대화형 인터페이스 지원
모델 경로 (허깅페이스)	https://huggingface.co/LiquidAI/LFM2-VL-450M/tree/main	https://huggingface.co/LiquidAI/LFM2-VL-1.6B/tree/main	https://huggingface.co/LiquidAI/LFM2-VL-3B/tree/main

주요 특징

LFM2-VL 시리즈가 갖춘 주요 특징은 다음과 같습니다.

효율성 및 저지연 설계
- GPU 기준으로 기존 유사 모델 대비 최대 2배 빠른 추론 속도
- 특히 모바일이나 엣지 디바이스 환경을 염두에 두고, 메모리·연산 자원을 덜 사용하는 구조로 구축
유연한 이미지 해상도 및 토큰 처리
- 기본적으로 512 × 512 픽셀까지의 이미지를 왜곡 없이(native) 직접 처리할 수 있는 설계
- 더 큰 이미지에 대해서는 512×512 크기의 패치로 분할 처리하며, 특히 고용량 이미지의 경우에는 ‘썸네일(축소 이미지) + 패치’ 전략을 사용하여 전체 맥락을 유지
- 이미지 토큰 수 및 패치 수 등을 추론 시점에서 사용자 지정할 수 있어, 속도와 품질 간의 균형을 조정 가능하도록 설계
모듈형 아키텍처
- 언어 모델 백본, 비전 인코더, 그리고 멀티모달 프로젝터로 구성
- 비전 인코더로는 SigLIP2 NaFlex라는 구조가 활용되었고, 언어모델 측면에서는 LFM2 시리즈의 언어 백본을 활용

벤치마크 성능

LFM2-VL 시리즈는 전반적으로 모델 크기 대비 우수한 멀티모달 이해 성능을 보여주는 비전-언어 모델군입니다. RealWorldQA, MM-IFEval, OCRBench 등 주요 벤치마크에서 안정적인 결과를 기록하며, 특히 시각 정보와 텍스트를 통합적으로 처리하는 능력에서 높은 효율성을 보였습니다. 또한 다양한 하드웨어 환경에서도 원활히 동작하도록 최적화되어 있어, 경량성과 성능의 균형이 잘 잡힌 실용적인 비전-언어 모델로 평가됩니다.

항목	3B 모델		1 ~ 2B 모델			1B 미만 모델
항목	LFM2-VL- 3B	Qwen2.5-VL- 3B	LFM2-VL- 1.6B	InternVL3- 2B	InternVL3- 1B	LFM2-VL- 450M	SmolVLM2- 500M
RealWorldQA	71.37	65.23	65.23	65.10	57.00	52.29	49.90
MM-IFEval	51.83	38.62	37.66	38.49	31.14	26.18	11.27
InfoVQA (Val)	N/A	N/A	58.68	66.10	54.94	46.51	24.64
OCRBench	822	824	742	831	798	655	609
MMStar	57.73	56.13	49.53	61.10	52.30	40.87	38.20
MathVista	N/A	N/A	51.10	57.60	46.90	44.70	37.50
SEEDBench_IMG	N/A	N/A	71.97	75.00	71.20	63.50	62.20

RealWorldQA : 실제 사진과 장면을 기반으로 한 시각적 질문에 대한 정확한 답변 능력을 평가하는 벤치마크입니다.
MM-IFEval : 텍스트와 이미지를 함께 이해하고 논리적으로 추론하는 멀티모달 통합 이해 능력을 측정합니다.
InfoVQA (Val) : 인포그래픽, 표, 다이어그램 등 시각적 정보가 포함된 이미지를 이해하고 해석하는 능력을 평가합니다.
OCRBench : 이미지 속 텍스트(문자 인식) 처리 및 문맥 이해 성능을 측정하는 OCR 관련 평가 지표입니다.
MMStar : 멀티모달 상황에서의 일반 상식, 논리적 추론, 복합 질의 응답 능력을 종합적으로 평가합니다.
MathVista : 수식, 도형, 그래프 등 시각적 수학 문제를 이해하고 해결하는 능력을 평가하는 벤치마크입니다.
SEEDBench_IMG : 다양한 이미지 기반 질문 응답을 통해 모델의 시각적 인식력과 추론 능력을 전반적으로 평가합니다.

라이선스

LFM2-VL 시리즈는 Liquid AI에서 배포한 LFM Open License v1.0을 따릅니다. 이 라이선스는 연구·비영리 목적 사용에는 제약이 거의 없지만, 상업적 사용에는 명확한 제한이 존재합니다.

[내용 요약]

비상업적 사용 (허용) : 연구, 개인 프로젝트, 교육, 비영리 기관에서의 실험 및 응용은 자유롭게 가능합니다.
소스 수정, 재배포, 파생 모델 제작 또한 허용됩니다.
상업적 사용 (제한적 허용) : LFM Open License v1.0에서는 연간 매출 1,000만 달러(약 140억 원) 미만의 개인 또는 기업에 한해서만 상업적 활용이 허용됩니다. 만약 해당 기준(Threshold)을 초과하는 법인 또는 조직일 경우, 별도의 라이선스 계약 없이 상업적 용도로 사용할 수 없습니다.
비영리 단체 예외 : 공익, 교육, 연구 목적의 비영리 기관(예: 대학, 재단)은 매출 기준과 관계없이 자유롭게 사용할 수 있습니다.

간단한 사용 예시

허깅페이스에서 공개된 LFM2-VL 모델을 로컬 환경에서 사용하는 방법에 대해 알아보겠습니다.

1. LFM2-VL 모델 다운로드

아래 Liquid AI의 허깅페이스 페이지에서, 사용자 환경에 맞는 모델을 선택하여 다운로드합니다.

Liquid AI 허깅페이스 : https://huggingface.co/collections/LiquidAI/lfm2-vl

2. 패키지 설치

다음 명령어를 통해 모델 실행에 필요한 패키지를 설치합니다.

pip install transformers pillow accelerate
pip install torch==2.6.0 torchvision==0.21.0 --index-url https://download.pytorch.org/whl/cu126

3. 코드 작성

아래 예제는 로컬 환경에서 LFM2-VL 모델과 이미지 파일을 불러와 추론을 수행하는 코드입니다.

# Python
from transformers import AutoProcessor, AutoModelForImageTextToText
from transformers.image_utils import load_image

# Load model and processor
model_id = "Path/to/LFM2-VL-450M"  # 사용자 환경에 맞게 수정
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    device_map="auto",
    dtype="bfloat16"
)
processor = AutoProcessor.from_pretrained(model_id)

image = load_image("Path/to/image.png")  # 사용자 환경에 맞게 수정

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "What is in this image?"},
        ],
    },
]

# Generate Answer
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    tokenize=True,
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=200)

response = processor.batch_decode(outputs, skip_special_tokens=True)[0]
print(response)

4. 실행 결과

이번에는 각 버전의 모델을 사용하여 여러 이미지를 대상으로 테스트를 진행하였습니다. LFM2-VL 시리즈는 기본적으로 영어를 지원하는 모델이지만, 한국어 인식 및 응답이 가능한지도 함께 확인해보았습니다.

[예시 이미지]

[실행 결과]

비고	LFM2-VL-450M	LFM2-VL-1.6B	LFM2-VL-3B
사용한 VRAM	3.4 GB	7.1GB	8.5 ~ 9.5GB
추론 시간	약 5초	약 8초	약 12초
추론 결과1	This image contains two main sections: 1. A bar chart comparing the compression on the Fox benchmark and Omnidocbench. It shows the percentage of text tokens and vision tokens in per page (Ground-truth) for different encoder series. The bars are color-coded, with purple representing 64 vts (voxels) and blue representing 100 vts. 2. A scatter plot graph showing the performance on the Omnidocbench. It includes various metrics like average vision tokens per image, vision tokens less than 1000, vision tokens greater than 1500, and other encoders. The plot uses different colors for each metric, with orange representing other encoders. Both sections provide a visual comparison of different image processing techniques, including their effectiveness in compressing text and vision tokens while maintaining image quality.	This image contains two graphs comparing the performance of different encoders on two benchmarks: Fox and Omnidoc. The first graph (a) shows the compression performance on the Fox benchmark. It compares the performance of 64 vs. 100 tokens (left and right) across different text token ranges. The graph uses purple and blue bars to represent these two scenarios, with the right side showing the 100 vs. 64 tokens comparison. The second graph (b) displays the performance on the Omnidoc benchmark. It compares the average vision tokens per image across different encoder series. The graph uses red, green, blue, and yellow dots to represent different encoder series. Both graphs have a grid and axes labeled with percentages and numbers, respectively. The graphs provide a visual comparison of how different encoder configurations perform on these two image processing benchmarks.	This image contains two graphs related to text compression and model performance. The first graph (a) is titled "Compression on Fox benchmark" and shows a bar chart comparing compression ratios for different text token ranges. It includes data for 64 tokens, 100 tokens, and 200 tokens on the left side, with corresponding compression percentages on the right. The second graph (b) is titled "Performance on Omnidocbench" and displays a scatter plot comparing model performance metrics. It includes data points for various models, with different colors representing different encoder series. Both graphs provide visual representations of how different models perform in terms of compression efficiency and overall performance on specific benchmarks.
추론 결과2	This image is a collage of various elements related to image processing and computer vision. It contains: 1. A math problem involving the calculation of 3x3 and 7x2 2. A green plastic container labeled for food storage 3. A cartoon teacher in a classroom 4. A humorous meme about a teacher's reaction to a student's last nerve 5. A fire hydrant 6. A coffee mug with a logo 7. A field with people flying kites 8. A watermark indicating the source of the image 9. A caption describing the image as "grounding" and "OCR the image" The collage appears to be a mix of educational content, humor, and visual elements related to image processing techniques.	This image is a collage of six different pictures, each with a specific focus. The top left picture shows children writing math equations on a chalkboard. The top middle picture features a green plastic container with a yellow lid, labeled in Chinese characters. The top right picture is a close-up of a mug with a colorful design and the word "Potential" written on it. The bottom left picture depicts a red fire hydrant with a face drawn on it. The bottom middle picture shows a group of people flying a kite in a park. The bottom right picture is a close-up of a mug with the word "Bountiful" written on it. Each picture has a caption in Chinese, providing additional context or information about the image.	This image is a collage of various elements, including a chalkboard with math equations, a green plastic container with a yellow lid, a cartoon drawing of a teacher, a group of people flying a kite, a red fire hydrant, and a white mug with a colorful design. There are also text boxes with captions and descriptions for each element.
추론 결과3	The image shows a green highway sign with white text and symbols. It's mounted on a metal pole and set against a clear blue sky. The sign provides directions to two locations: Hadan and Sasang, both 150 meters away. There's also a smaller sign above it indicating "5" with an arrow pointing to the right. The sign is bilingual, with text in both English and Korean.	The image shows a highway sign in South Korea. It's a green rectangular sign with white text and arrows, indicating directions for different routes. The sign displays "Hadan" and "Sasang" as destinations, with an arrow pointing to the right. There's also a number "5" visible, likely indicating a route number. The sign provides information about distances to these destinations, with "Hadan" being 150 meters away and "Sasang" being 30 kilometers away. The sign is mounted on a metal pole and is set against a clear blue sky, suggesting it's a sunny day. This type of signage is typical for major highways in South Korea, helping drivers navigate the country's extensive road network.	The image shows a green highway sign with white text and symbols. The sign is mounted on a metal pole and displays information in both Korean and English. The main part of the sign indicates directions to Hadan and Sasang, with an arrow pointing to the right. It also shows that these destinations are 150 meters away. At the top of the sign, there's a smaller rectangular sign with the number 5, likely indicating the highway number. The background is a clear blue sky, suggesting it's a sunny day. The sign appears to be in good condition, with no visible damage or wear. This type of sign is commonly found on highways or major roads in South Korea, providing drivers with important information about upcoming destinations and distances.

LFM2-VL 시리즈는 모델 규모가 커질수록 시각 정보 이해 능력이 단계적으로 향상되는 경향을 보였습니다. LFM2-VL-450M은 객체 중심의 단순 인식에 머물렀으나, LFM2-VL-1.6B은 이미지 내 구조와 관계를 일부 파악하기 시작하였고, LFM2-VL-3B은 시각적 요소와 언어적 표현을 통합적으로 해석해 의미 단위의 추론까지 수행했습니다. 결과적으로 LFM2-VL은 모델 크기 확장에 따라 단순 인식에서 의미적 추론으로 발전하는 멀티모달 이해 구조를 구현한 것으로 보입니다.

[정리 요약]

LFM2-VL-450M → “이미지 속 개체 중심”의 기초 시각 이해형 모델
LFM2-VL-1.6B → 관계와 구성을 일부 파악하는 중간 단계 모델
LFM2-VL-3B → 문맥적, 개념적 통합이 가능한 고수준 멀티모달 추론 모델

이번에 공개된 LFM2-VL 시리즈는 굉장히 작은 모델임에도 불구하고 굉장히 빠른 추론속도와 정확한 값을 제공합니다. 게다가 이 시리즈에서 공개된 모델 버전 규모에 따라 시각·언어 통합 이해 능력이 점진적으로 발전함을 확인할 수 있었습니다. 특히 LFM2-VL은 이미지 내 객체 인식에 그치지 않고, 관계·맥락·의미를 함께 해석할 수 있는 수준으로 확장되었다는 점에서 비전-언어 모델의 실질적 진화를 잘 보여줍니다. 앞으로 이 모델은 시각 이해, 콘텐츠 분석, 멀티모달 챗봇 등 다양한 응용 분야에서 활용될 가능성이 높을 것으로 기대됩니다.

감사합니다. 😊

저작자표시 비영리 변경금지 (새창열림)

'AI 소식 > 오픈소스 AI 모델' 카테고리의 다른 글

[오픈소스 AI] 최고 성능의 OCR 모델, Chandra-OCR를 소개합니다. (1)	2025.11.03
[오픈소스 AI] 중국 MiniMax의 초거대 언어모델, MiniMax-M2를 소개합니다. (1)	2025.10.31
[오픈소스 AI] DeepSeek에서 공개한 이미지를 인식하는 AI 모델, DeepSeek-OCR을 소개합니다. (1)	2025.10.24
Z.AI GLM-4.6 공개 \| Claude Sonnet 4를 능가한 차세대 오픈소스 LLM (0)	2025.10.14
[오픈소스 AI] 멀티모달 AI 끝판왕? Qwen3-Omni-30B-A3B 기능·성능 총정리 (0)	2025.09.30

Marcus' Stroy

[오픈소스 AI] 1GB 미만의 초소형 비전-언어 모델, LFM2-VL을 소개합니다.

LFM2-VL 모델이란

주요 특징

벤치마크 성능

라이선스

간단한 사용 예시

'AI 소식 > 오픈소스 AI 모델' 카테고리의 다른 글

티스토리툴바

[오픈소스 AI] 1GB 미만의 초소형 비전-언어 모델, LFM2-VL을 소개합니다.

LFM2-VL 모델이란

주요 특징

벤치마크 성능

라이선스

간단한 사용 예시

'AI 소식 > 오픈소스 AI 모델' 카테고리의 다른 글

'AI 소식/오픈소스 AI 모델' Related Articles

티스토리툴바