French Researchers Release 14,636-Leaf Spectral Dataset to Advance Early Detection of Deadly Grapevine Diseases

2025-12-17

Open-access data from 2020–2024 aims to boost automated diagnosis in Chardonnay vineyards and support global precision agriculture efforts

Researchers in France have released a comprehensive, multi-year spectral dataset aimed at improving the detection of grapevine yellows diseases in Chardonnay vineyards. The dataset, which covers the years 2020 to 2024, was collected at the Comité Champagne experimental site in Plumecoq and a neighboring plot. It includes 14,636 spectra from grapevine leaves, representing five classes: healthy, yellows (which groups Flavescence dorée and Bois noir), leafroll, esca, and discoloration.

Grapevine yellows are a group of phytoplasma-induced diseases that pose a major threat to vineyards worldwide. In Europe, Flavescence dorée (FD) and Bois noir (BN) are the most significant. These diseases can cause severe symptoms such as leaf yellowing and rolling, poor shoot lignification, incomplete grape ripening, and eventual vine death. The economic impact is high due to mandatory quarantine measures and the need to replant affected vines. Early detection is critical because infected vines can spread the disease for at least a year before showing visible symptoms.

Chardonnay is particularly vulnerable to these diseases. On this variety, yellows symptoms often resemble those of other conditions such as leafroll (a viral disease), esca (a fungal wood disease), and various types of discoloration linked to nutrient deficiencies or environmental stress. This overlap makes visual diagnosis difficult and labor-intensive. Laboratory tests such as PCR are usually required to distinguish between FD and BN, but these are not practical for large-scale vineyard monitoring.

To address these challenges, the research team collected spectral data from leaves each September or October during the harvest period. The leaves were sampled from six zones within the experimental site and a neighboring vineyard with different management practices. For each plant, leaves were taken from both apical (top) and median (middle) positions. The samples were quickly transported to a laboratory where spectra were recorded using a LabSpec® 4i ASD spectrometer with a contact probe. This setup allowed for high-resolution measurements across wavelengths from 350 nm to 2500 nm.

The dataset captures variability across years with different weather conditions and phytosanitary treatments. For example, 2020 had very healthy foliage; 2021 saw heavy rainfall and visible residues from treatments; 2022 was hot and dry with brown spots on leaves; 2023 had powdery mildew; and 2024 was marked by downy mildew causing leaf scorch.

Initial analysis of the data included L2 normalization of the spectra, which corrects for differences in overall magnitude while preserving spectral shape. Average spectra showed that certain wavelengths, especially around 550 nm, 730 nm, 1400 nm, 1900 nm, and 2200 nm, may help distinguish between classes. However, the yellows, esca, and leafroll classes remain spectrally very similar.
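For readers unfamiliar with the technique, L2 normalization divides each spectrum by its Euclidean norm, so two spectra that differ only by a scale factor become identical. The following is a minimal sketch with NumPy on synthetic data, not the researchers' own code: the function name `l2_normalize`, the 1 nm wavelength grid, and the toy spectra are all illustrative assumptions.

```python
import numpy as np

def l2_normalize(spectra, eps=1e-12):
    """Scale each row (one spectrum) to unit Euclidean length.

    `eps` guards against division by zero for an all-zero spectrum.
    """
    spectra = np.asarray(spectra, dtype=float)
    norms = np.linalg.norm(spectra, axis=1, keepdims=True)
    return spectra / np.maximum(norms, eps)

# Assumed 1 nm sampling over the 350-2500 nm range mentioned in the article.
wavelengths = np.arange(350, 2501)

# Synthetic example: the same spectral shape recorded at two magnitudes.
base = 0.3 + 0.2 * np.sin(wavelengths / 300.0)
spectra = np.stack([base, 2.5 * base])

normalized = l2_normalize(spectra)

# After normalization the magnitude difference is gone; only shape remains.
print(np.allclose(normalized[0], normalized[1]))
```

Because only the overall scale is removed, differences in the relative height of features (for instance, near the wavelengths listed above) are preserved.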

Principal Component Analysis (PCA) was used to reduce the dimensionality of the data for visualization and further analysis. The first three principal components accounted for about 88% of total variance in the dataset. PC1 was most influenced by wavelengths associated with water content and cell structure; PC2 captured changes related to pigments like chlorophyll; PC3 highlighted more subtle variations that could be important for detecting early-stage symptoms or localized discoloration.
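As a rough illustration of this step, PCA can be computed by centering the spectra and taking a singular value decomposition. The sketch below uses NumPy on synthetic spectra; it is not the notebook published with the dataset, and the function name `pca_svd`, the 1 nm wavelength grid, and the generated data are assumptions made for the example. The variance fractions it produces say nothing about the real dataset.

```python
import numpy as np

def pca_svd(spectra, n_components=3):
    """Return PCA scores, loadings, and explained-variance ratios via SVD."""
    X = np.asarray(spectra, dtype=float)
    X_centered = X - X.mean(axis=0)           # remove the mean spectrum
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    variance = S ** 2 / (X.shape[0] - 1)      # variance along each component
    ratio = variance / variance.sum()         # fraction of total variance
    scores = U[:, :n_components] * S[:n_components]
    return scores, Vt[:n_components], ratio[:n_components]

# Synthetic stand-in for leaf spectra (assumed 1 nm grid, 350-2500 nm).
rng = np.random.default_rng(0)
wavelengths = np.arange(350, 2501)
n_samples = 200

# Three smooth, made-up sources of variation plus measurement noise.
shapes = np.stack([
    np.exp(-((wavelengths - 1450) / 150.0) ** 2),
    np.exp(-((wavelengths - 680) / 60.0) ** 2),
    np.exp(-((wavelengths - 550) / 40.0) ** 2),
])
latent = rng.normal(size=(n_samples, 3)) * np.array([3.0, 1.5, 0.7])
noise = rng.normal(scale=0.002, size=(n_samples, wavelengths.size))
spectra = 0.4 + 0.05 * (latent @ shapes) + noise

scores, loadings, ratio = pca_svd(spectra, n_components=3)
print("scores shape:", scores.shape)
print("variance explained by first three PCs:", ratio.sum())
```

Plotting each row of `loadings` against `wavelengths` is one way to see which spectral regions drive a component, which is the kind of reading described above for PC1 to PC3.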

Despite these findings, PCA projections showed considerable overlap between classes, indicating that the classes are hard to separate using linear combinations of spectral features alone. This highlights the need for more advanced algorithms or feature selection methods to improve classification accuracy.

The full dataset is publicly available through the Recherche Data Gouv repository (DOI: 10.57745/KPNOJL). It includes not only the spectral measurements but also metadata on year, zone, leaf position, and class labels. To support reproducibility and further research, Python code in a Jupyter Notebook is provided on GitHub (https://github.com/zsr1997/Scientific-Data). This code allows users to access, visualize, and analyze the data.

The release of this dataset is expected to accelerate research into automated detection methods for grapevine diseases using spectral analysis. By providing data collected under diverse environmental conditions over several years, it offers a robust foundation for developing models that are less sensitive to annual variability or local vineyard practices.

Looking ahead, the researchers plan to expand this resource by adding multispectral images acquired under controlled conditions from the same sites. Integrating spatial information with spectral data could further improve disease detection models by capturing both biochemical changes in leaves and their spatial distribution on plants.

This initiative reflects growing interest in precision agriculture tools that can help vineyard managers detect diseases earlier and more accurately than traditional visual inspections allow. With open access to both data and analysis tools, researchers worldwide can contribute to developing scalable solutions for protecting vineyards against devastating diseases like Flavescence dorée and Bois noir.