Can AI Describe Art as We Do? A Case Study on a Pottery Collection
Henna D. Bhramdat, Borui Zhang, Nicolas Gauthier

Current large language model (LLM)-based AI systems offer two capabilities that we evaluate for improving the discoverability of library and museum collections, which are typically searched through expert-defined keyword vocabularies organized in complex hierarchical categories: 1) vector search, which, unlike traditional keyword search, captures semantic relationships between words across a broader natural-language domain, and 2) multimodal large language models (MLLMs), which combine computer vision with LLMs so that images are understood both visually and textually. We explore how vision-language models (VLMs) and MLLMs can bridge the vocabulary gap in search between expert-generated descriptions and the public.
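The idea behind vector search can be sketched in a few lines: texts are embedded as vectors, and queries are matched by similarity rather than exact keywords. The toy 3-d vectors below stand in for real embedding-model output; the function names are ours, not from any particular search library.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def vector_search(query_vec, doc_vecs, top_k=3):
    # Rank documents by cosine similarity to the query embedding.
    scores = [(i, cosine_sim(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:top_k]

# Toy 3-d "embeddings" standing in for real model output.
docs = [np.array([1.0, 0.0, 0.0]),
        np.array([0.9, 0.1, 0.0]),
        np.array([0.0, 1.0, 0.0])]
query = np.array([1.0, 0.05, 0.0])
print(vector_search(query, docs, top_k=2))
```

In a real system, the vectors would come from an embedding model, so a novice query like “swirly sun design” can land near an expert record that never uses those words.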

In this study, we tested a variety of model configurations from the Language-Vision library (LAVIS), backed by two vision-language models, BLIP and CLIP, along with OpenAI’s GPT-4 Vision, to answer visual questions about a set of 3,127 digitized historical pottery sherds from the Digital Ceramic Type Collection at the Florida Museum of Natural History. Consider, for example, the piece below. Originating from Metro Santa Maria in Mexico City, it represents the polychrome coarse earthenware type. The expert description identifies a painted floral design with a ring at the center; colors include crude off-white, cream, buff yellow, orange, green, and black. We ask how a human subject, such as a freshman student, might describe an abstract piece like this, and whether a search system can retrieve related works from a novice user’s description.

We began by cleaning the images to remove visual artifacts and normalizing the metadata. Using Grounding DINO with the Segment Anything Model (SAM), we identified the unwanted elements (scale bar and label) and removed them, leaving only the pottery against a plain background. We then consolidated the image description and metadata text into a compact, embedding-ready format, dropping all non-semantic administrative and file-management fields.
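The metadata consolidation step can be sketched as below; the field names are illustrative stand-ins, not the museum’s actual catalog schema.

```python
# Collapse a catalog record into one embedding-ready string, keeping only
# fields that carry descriptive meaning. Field names are hypothetical.
SEMANTIC_FIELDS = ["type_name", "ware", "surface_decoration", "colors", "description"]

def to_embedding_text(record: dict) -> str:
    # Join the semantic fields in a fixed order; administrative fields
    # (file names, scan dates, accession IDs) are simply never included.
    parts = [str(record[f]) for f in SEMANTIC_FIELDS if record.get(f)]
    return ". ".join(parts)

record = {
    "type_name": "Polychrome coarse earthenware",
    "description": "Painted floral design with a ring at the center",
    "colors": "off-white, cream, buff yellow, orange, green, black",
    "file_name": "IMG_0042.tif",   # dropped: non-semantic
    "scan_date": "2019-03-11",     # dropped: non-semantic
}
print(to_embedding_text(record))
```

The resulting string is what gets embedded, so administrative noise never pollutes the vector space.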

We sampled 30 pottery images from the full set and developed a general prompt question for novice subjects, allowing us to compare outputs from the captioning mode and the VQA mode of the LAVIS models. Model outputs were assessed for clarity, accuracy, and relevance, leading us to identify the Large COCO captioning model combined with A-OKVQA as the most effective on cleaned images.

Building on these results, we refined and expanded our question prompts to elicit richer, more contextually appropriate responses from public users. This phase incorporated survey responses alongside outputs from the selected LAVIS models and GPT-4 Vision. Finally, we conducted a comparative analysis of model- and human-generated descriptions.

What we see, what models say

Word choices differed markedly across the human subject descriptions, the LAVIS-based models, and GPT-4 Vision under different settings. Human participants more frequently used adjectives describing shape and color. The LAVIS models showed the strongest bias toward object-specific terms, particularly “tree” and “decoration.” GPT-4 Vision at the default temperature disproportionately used the word “animal,” nearly double its occurrence at the reduced temperature, and far exceeding the human and LAVIS frequencies.
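Comparing word choices across groups amounts to a term-frequency count over each group’s free-text descriptions. A minimal sketch (the example descriptions here are invented for illustration, not drawn from our survey data):

```python
from collections import Counter
import re

def term_frequencies(descriptions):
    # Lowercase, tokenize on alphabetic runs, and count term occurrences
    # across a group of free-text descriptions.
    tokens = []
    for text in descriptions:
        tokens.extend(re.findall(r"[a-z]+", text.lower()))
    return Counter(tokens)

human = ["a swirly starburst design in orange and black",
         "radiating lines like a sun"]
model = ["a decoration resembling a tree",
         "a tree-like decoration"]

print(term_frequencies(human).most_common(3))
print(term_frequencies(model)["tree"])
```

Per-group counters like these make biases (e.g., a model’s fixation on “tree”) directly comparable across humans and models.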

GPT-4 Vision appears to sit between the human and LAVIS responses: it shares more in common with the survey responses and better mimics a human-like, open-ended answer. For the example sherd above, human subjects often described a “starburst” or “sun” design with “radiating” or “swirly” lines. LAVIS and GPT-4 Vision likewise describe the pottery as having a “radiating” or “sunburst” design, and many aspects of their responses resemble a human subject’s description. However, neither GPT-4 Vision nor the human respondents mentioned the “floral design” given in the expert metadata, even when asked specifically whether the design includes plant patterns. LAVIS was the only model to liken the design to a “sunflower” or “flower” when prompted to identify plant motifs. In this case, the survey responses do not align with the expert description; an LLM may therefore be effective in bridging the gap between expert descriptions and novice observations.

When the original images were used (before cleaning), the models inconsistently recognized the scale bar and image ID number. In such instances, the generated captions noted the presence of a number label and/or scale bar, for example, “a stone with a face on it next to a ruler.” These features produced inconsistent responses and offered no information beyond noting the image elements themselves. In several instances, the captioning models characterized the background in both original and edited images (e.g., “…black background,” “…on a table,” “…on a dark surface”). After multiple tests, it is clear that LAVIS is sensitive to image input quality.

Our observations led us to use the edited pottery images so that the focus falls on the details of the pottery fragment. Within LAVIS, Large COCO and Base COCO generated captions of similar quality: while word choice differed between the two, the overall context, message, and meaning were similar, and for some images Large COCO offered responses that seemed more relevant. When testing the answering models in LAVIS, A-OKVQA and VQAv2 produced very different results and rarely converged on the same answer. VQAv2 frequently produced non-answers such as “none,” “nothing,” or “don’t know.” A-OKVQA rarely returned such responses; it returned its best relevant answer, even when that answer did not fit the image or the question.
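Screening out non-answers like VQAv2’s can be done with a simple filter before any downstream comparison. The non-answer list below is illustrative; a real deployment would tune it against observed model output.

```python
# Flag VQA outputs that carry no information, as seen with VQAv2
# ("none", "nothing", "don't know"). The set of phrases is an assumption.
NON_ANSWERS = {"none", "nothing", "don't know", "unknown", "n/a"}

def is_non_answer(answer: str) -> bool:
    # Normalize whitespace and case, then test against the non-answer set.
    return answer.strip().lower() in NON_ANSWERS

answers = ["sunburst pattern", "none", "don't know", "a flower"]
informative = [a for a in answers if not is_non_answer(a)]
print(informative)  # ['sunburst pattern', 'a flower']
```

Filtering this way keeps uninformative responses from skewing word-frequency comparisons between models and human subjects.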

As practitioners working with MLLMs, we recognize our field is evolving. These models can build richer semantic relationships and help novice users engage with collections without being impeded by specialized vocabulary. Yet our work exposes something equally important: the interpretive work required to evaluate model outputs is itself a form of expertise. Future work with AI requires cross-disciplinary collaboration, where domain experts are not merely end users but active co-designers of the loop itself.

Cite this article in APA as: Bhramdat, H. D., Zhang, B., & Gauthier, N. (2026, May 1). Can AI describe art as we do? A case study on a pottery collection. Information Matters. https://informationmatters.org/2026/04/can-ai-describe-art-as-we-do-a-case-study-on-a-pottery-collection/
