VINDEX

Exploring The Visual Feature Space for Multimodal Neural Decoding

Weihao Xia Cengiz Öztireli

University of Cambridge

Method

Our method builds upon UMBRAE, which aligns brain activity with image features for zero-shot captioning and grounding. The learned brain encoder maps brain activations into an image feature space, and the resulting features are then processed by an adapter and an LLM for downstream tasks. We further explore the impact of different image feature spaces on fine-grained multimodal decoding, drawing inspiration from prior MLLM research that leverages different feature representations.
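The alignment step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the linear encoder, the mean-squared-error objective, the learning rate, and the names `encode`, `mse`, and `sgd_step` are all assumptions made for the example, since the text does not specify the encoder architecture or the training loss.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def encode(voxels: Vector, weights: Matrix) -> Vector:
    """Hypothetical linear brain encoder: one output feature per weight row."""
    return [sum(w * v for w, v in zip(row, voxels)) for row in weights]


def mse(pred: Vector, target: Vector) -> float:
    """Mean squared error between predicted and target image features."""
    if len(pred) != len(target):
        raise ValueError("feature vectors must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def sgd_step(voxels: Vector, target: Vector, weights: Matrix, lr: float = 0.1) -> Matrix:
    """One gradient step on the (assumed) MSE alignment loss.

    d(mse)/d(w_ij) = (2 / n) * (pred_i - target_i) * voxel_j
    """
    pred = encode(voxels, weights)
    n = len(pred)
    return [
        [w - lr * (2.0 / n) * (pred[i] - target[i]) * v for w, v in zip(row, voxels)]
        for i, row in enumerate(weights)
    ]


# Toy data: 2 "voxels" mapped to a 3-dimensional "image feature".
voxels = [1.0, 0.5]
image_feature = [0.2, -0.4, 0.6]
weights = [[0.0, 0.0] for _ in image_feature]

loss_before = mse(encode(voxels, weights), image_feature)
for _ in range(200):
    weights = sgd_step(voxels, image_feature, weights)
loss_after = mse(encode(voxels, weights), image_feature)
```

Once the encoder output lies in the image feature space, it can in principle be passed to the same adapter and LLM that would normally consume image features, which is what makes zero-shot decoding possible in this design.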

We investigate four types of image feature spaces: (a) Single Encoder (SE): uses features from a single pre-trained vision encoder (e.g., CLIP) for brain alignment, the common choice in MLLMs; (b) Mixture of Encoders (ME): integrates features from multiple task-specific vision experts such as CLIP and DINO; (c) Aggregated Feature (AF): combines dense features from various layers (shallow, middle, and deep) to capture different image characteristics; (d) Nested Features (NF): uses a variable-length set of visual tokens, downscaled using multiple downsampling factors to produce a hierarchical, nested representation that encodes visual content from coarse to fine-grained detail.
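To make the nested feature space (d) concrete, the sketch below pools a square grid of visual tokens at several downsampling factors. It rests on assumptions not stated in the text: a 24 x 24 input grid (576 tokens), average pooling as the downscaling operation, and the hypothetical helper names `avg_pool_tokens` and `nested_features`. With those assumptions, factors 2, 4, 8, and 24 yield 144, 36, 9, and 1 tokens.

```python
from typing import Dict, List, Sequence

Token = List[float]
Grid = List[List[Token]]  # grid[row][col] is one visual token (a feature vector)


def avg_pool_tokens(grid: Grid, factor: int) -> Grid:
    """Average non-overlapping factor x factor blocks of tokens."""
    side = len(grid)
    if side % factor != 0:
        raise ValueError("grid side must be divisible by the pooling factor")
    dim = len(grid[0][0])
    out_side = side // factor
    pooled: Grid = []
    for r in range(out_side):
        row: List[Token] = []
        for c in range(out_side):
            acc = [0.0] * dim
            for dr in range(factor):
                for dc in range(factor):
                    token = grid[r * factor + dr][c * factor + dc]
                    for k in range(dim):
                        acc[k] += token[k]
            row.append([value / (factor * factor) for value in acc])
        pooled.append(row)
    return pooled


def nested_features(grid: Grid, factors: Sequence[int] = (2, 4, 8, 24)) -> Dict[int, List[Token]]:
    """Map each token count to its flattened list of pooled tokens."""
    nested: Dict[int, List[Token]] = {}
    for factor in factors:
        pooled = avg_pool_tokens(grid, factor)
        tokens = [token for row in pooled for token in row]
        nested[len(tokens)] = tokens
    return nested


# Toy 24 x 24 grid of 2-dimensional tokens.
grid = [[[float(r * 24 + c), 1.0] for c in range(24)] for r in range(24)]
nested = nested_features(grid)
token_counts = sorted(nested)  # [1, 9, 36, 144]
```

The single-token level summarizes the whole image, while the 144-token level preserves more spatial detail, which matches the coarse-to-fine ordering described above.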

Advantages:

Versatile and model-agnostic: VINDEX is a flexible brain decoding framework that can be easily adapted to various subjects, image sizes, visual encoders, MLLM settings, input prompts, and downstream tasks. The single trained encoder can be applied to all LLaVA-based variants due to architectural consistency.
Zero-shot multimodal decoding: VINDEX performs zero-shot multimodal brain decoding by leveraging diverse visual feature spaces from vision expert encoders, enabling brain decoding at multiple levels of detail.
New evaluation metric: We decompose decoded captions into objects, attributes, and relationships for evaluation. Semantic discrepancies are reflected in precision and recall: precision penalizes hallucinations and recall captures omissions, offering a clearer view of what is genuinely decoded from brain activity.

Benchmark

MG-BrainDUB (Multi-Granularity Brain Descriptive Understanding Benchmark) contains two tasks: (a) detailed captioning and (b) salient question answering, each with corresponding metrics.

Descriptive Caption Understanding

Motivation: (a) Existing studies are limited to coarse interpretations, lacking essential details such as object descriptions, locations, attributes, and their relationships. This results in imprecise and ambiguous reconstructions when these cues are used for visual decoding. (b) Evaluating brain-decoded detailed captions—often hundreds of words long—poses a challenge, as conventional metrics become unsuitable. Traditional captioning metrics, which mostly rely on n-gram matching between candidate and reference captions, are highly sensitive to stylistic variations, leading to inconsistent evaluations. Model-based metrics either depend on outdated text encoders or struggle with long texts due to token length limitations. (c) Decoding methods that rely on pretrained models benefit from powerful generative capabilities but are also vulnerable to systematic biases. They often introduce hallucinated details, such as prototypical co-occurrence patterns, raising concerns about what is genuinely decoded from brain activity versus what is hallucinated by the generative model.

Our strategies: (a) we investigate brain alignment across different image feature spaces to identify informative representations; (b) we decompose decoded captions into objects, attributes, and relations for evaluation. This enables us to assess alignment at a finer semantic granularity and evaluate which types of information are truly recoverable from brain activity. Such semantic discrepancies are reflected in precision and recall: precision penalizes hallucinations, while recall captures omissions. Together, they reduce the confounding effects of hallucination from pretrained models and provide a more accurate measure of true information decoding capacity.
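The precision and recall computation over decomposed captions can be sketched as below. This is an illustration under stated assumptions: elements are matched by exact string equality (the matching procedure is not described here), the extraction of objects, attributes, and relations is taken as already done, and the function names and the toy element sets are hypothetical rather than taken from the benchmark.

```python
from typing import Dict, Iterable, Tuple

KINDS = ("objects", "attributes", "relations")


def precision_recall(pred: Iterable[str], ref: Iterable[str]) -> Tuple[float, float]:
    """Precision drops with hallucinated elements, recall with omitted ones."""
    pred_set, ref_set = set(pred), set(ref)
    matched = len(pred_set & ref_set)
    precision = matched / len(pred_set) if pred_set else 0.0
    recall = matched / len(ref_set) if ref_set else 0.0
    return precision, recall


def score_caption(
    decoded: Dict[str, Iterable[str]],
    reference: Dict[str, Iterable[str]],
) -> Dict[str, Tuple[float, float]]:
    """Score each element type separately (exact matching is an assumption)."""
    return {
        kind: precision_recall(decoded.get(kind, ()), reference.get(kind, ()))
        for kind in KINDS
    }


# Hand-made toy elements, for illustration only.
reference = {
    "objects": {"toilet", "toilet paper"},
    "attributes": {"small stall"},
    "relations": {"toilet paper near toilet"},
}
decoded = {
    "objects": {"toilet", "sink", "bottle", "cup"},
    "attributes": {"white toilet"},
}

scores = score_caption(decoded, reference)
# scores["objects"] == (0.25, 0.5): three hallucinated objects, one omission
```

Reporting the two numbers per element type, rather than a single caption-level score, is what separates content invented by the generative model from content that was missed.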

Salient Question Answering

Motivation: (a) effective brain decoding methods should be capable of understanding the intricate information embedded in brain signals, but current approaches lack proper evaluation of this ability; (b) brain signal data are often noisy or error-prone due to distractions or fatigue during the recording process—especially when decoding fine-grained details such as small objects.

Our strategies: (a) we design a salient and complex reasoning task to evaluate the model’s ability to recognize objects, understand their interactions within the image, and answer in-depth reasoning questions that require step-by-step logical processing. Since COCO often overlooks salient categories beyond its 80 common classes, we use SAM and MLLMs to re-identify the main scene or key objects in each test image.

Results

Concise Captioning

Descriptive Captioning

(NF_n refers to the nested feature variant in which the number of visual tokens n after downsampling is selected from {1, 9, 36, 144}.)

Key Observations:

Direct alignment with image features enables zero-shot decoding and demonstrates improved performance in concise captioning, detailed captioning, and grounding tasks.
Previous captioning metrics fail to effectively evaluate open-vocabulary, detailed descriptions. Our proposed metric addresses both hallucination and omission, providing a more accurate assessment of the model’s fine-grained and genuinely decoded information.
While aligning LLMs with complex image feature spaces can improve performance in image-based tasks for MLLMs, brain decoding faces a trade-off: some information present in images may not be captured in brain data, making alignment with salient features more effective.

(14 words) A very small bathroom stall with a toilet and several rolls of toilet paper.

(57 words) The image features a bathroom with a white toilet situated in the middle of the room. The toilet is surrounded by a tiled floor, which gives the bathroom a clean and modern appearance. The walls of the bathroom are also tiled, adding to the overall cohesive design. The bathroom appears to be well-maintained and ready for use.

(40 words) The image features a white bathroom with a toilet and a sink. The toilet is located on the left side of the bathroom, while the sink is situated on the right side. The bathroom appears to be clean and well-maintained.

(97 words) The image features a bathroom with a white toilet situated in the corner of the room. The toilet is positioned under a window, which allows natural light to enter the space. The bathroom also has a sink, which is located towards the right side of the room. There are two bottles in the bathroom, one placed near the sink and the other closer to the toilet. Additionally, there is a cup on the left side of the room, and a bowl can be seen near the sink. The overall appearance of the bathroom is clean and well-organized.

(10 words) A very tall clock tower with two clocks on it.

(79 words) The image features a tall, old building with a clock tower on top. The clock is prominently displayed on the side of the building, making it a focal point. The building has a steeple-like structure, giving it a unique and historic appearance. The clock is positioned towards the center of the building, drawing attention to its presence. The overall architecture of the building suggests that it may be a church or a similar type of structure with historical significance.

(86 words) The image features a large building with a clock tower, which is situated in the middle of a city. The clock tower is visible on the left side of the building, and the building itself is quite tall. The scene is set against a backdrop of trees, creating a picturesque view. The trees are scattered throughout the scene, with some located near the building and others further away. The combination of the clock tower, the building, and the trees creates a visually appealing and urban landscape.

(79 words) The image features a large, old-fashioned clock tower with a steeple, situated on top of a building. The clock is prominently displayed on the tower, making it a focal point of the scene. The tower is surrounded by a group of trees, creating a picturesque setting. There are several people in the scene, with some standing closer to the clock tower and others further away. They appear to be enjoying the view of the tower and the surrounding environment.

(10 words) A large giraffe walking across a dirt lot near trees.

(75 words) The image features a large giraffe standing in a grassy field. The giraffe is the main focus of the scene, and it appears to be walking through the field. The field is lush and green, providing a natural habitat for the giraffe. The giraffe's long neck and legs are clearly visible, showcasing its unique features. The overall atmosphere of the image is serene and peaceful, capturing the beauty of the giraffe in its natural environment.

(52 words) The image features a herd of zebras grazing in a grassy field. There are at least 13 zebras visible in the scene, scattered throughout the field. Some zebras are closer to the foreground, while others are further in the background. The zebras are peacefully eating grass, creating a serene and natural atmosphere.

(46 words) The image features a large giraffe standing in a grassy field, surrounded by trees. The giraffe appears to be walking through the grass, possibly in search of food. The scene is set in a natural environment, with the giraffe being the main focus of the image.

(12 words) A red and yellow double decker bus on street next to trees.

(115 words) The image depicts a busy city street scene with several cars and buses driving down the road. There are multiple cars in various positions, some closer to the foreground and others further back. A bus is also visible in the middle of the scene, adding to the traffic. In addition to the vehicles, there are several people walking along the sidewalk, going about their daily activities. Some of them are closer to the foreground, while others are further back in the scene. The street is lined with trees, providing a touch of greenery to the urban environment. The combination of the bustling traffic and the presence of pedestrians creates a lively atmosphere in the city.

(54 words) The image shows a city street with several cars parked along the side of the road. The cars are of various sizes and are parked in a row. The street appears to be empty, with no people visible in the scene. The cars are parked in a line, creating an organized and orderly appearance.

(104 words) The image depicts a city street with a white car parked on the side of the road. The car is positioned near the center of the scene, and it appears to be a compact vehicle. There are several other cars parked along the street, with some closer to the foreground and others further in the background. In addition to the cars, there are two people visible in the scene. One person is standing near the left side of the image, while the other person is located closer to the center. The street is lined with trees, providing a pleasant atmosphere for the city setting.

(10 words) A baseball player holding a bat next to home plate.

(56 words) The image depicts a man wearing a baseball uniform, standing on a field with a baseball glove on his hand. He appears to be a baseball player, possibly waiting for a pitch or preparing to catch a ball. The scene takes place on a baseball field, with the man being the main focus of the image.

(69 words) The image features a tennis court with a tennis player in action. The player is holding a tennis racket and is in the middle of a swing, likely returning a volley. The tennis player is positioned towards the left side of the court. The court is surrounded by a fence, and there are several people in the background, possibly watching the game or waiting for their turn to play.

(101 words) The image features a baseball field with several baseball players standing on the field. There are at least nine people visible in the scene, with some of them closer to the foreground and others further in the background. A baseball glove can be seen on the ground, indicating that the players are either preparing for a game or have just finished one. The players are spread out across the field, with some standing closer to the center and others near the edges. The overall atmosphere of the scene suggests that the players are engaged in a casual or recreational baseball game.

Acknowledgements

This project was supported by a UKRI Future Leaders Fellowship. It was also undertaken with support from the Imminent Research Grants, an initiative established to foster innovative language research, run by Imminent, the research center of Translated, a pioneering company in AI-driven language technologies dedicated to enabling seamless global communication.

BibTeX

 
    @inproceedings{xia2025vindex,
      title={Exploring The Visual Feature Space for Multimodal Neural Decoding},
      author={Xia, Weihao and Öztireli, Cengiz},
      booktitle={International Conference on Computer Vision (ICCV)},
      year={2025}
    }