Leveraging Vision-language Models for Risks Assessment
2024
Rodríguez Juan, Javier | Garcia-Rodriguez, Jose | Tomás, David | Universidad de Alicante. Departamento de Tecnología Informática y Computación | Universidad de Alicante. Departamento de Lenguajes y Sistemas Informáticos
The growing number of dependent people worldwide is a problem reported by organisations such as the World Health Organisation, which estimates that the number of dependent people, such as the elderly, will reach 2.1 billion by 2050, doubling the current figure of 1 billion. Currently, these people require caregivers in their homes to help them with daily activities and to protect them from the various risks present in the home environment. To increase the independence and improve the quality of life of these individuals, this master’s thesis aims to develop a risk assessment architecture capable of identifying risky situations in households. Although households are the target setting, the framework has been built with other contexts in mind: it can be adapted to other environments (e.g. construction or industry) simply by fine-tuning the architecture with appropriate data. The framework extracts contextual and human information from a given scene using an ensemble of deep learning models. This ensemble extracts the objects in the scene, the location of the person, the person’s age group and the actions that person is performing. All this information is then combined to generate a description, which is classified by a fine-tuned sequence classifier that outputs a risk for the description. The framework consists mainly of multimodal models that combine textual prompts with the visual information contained in videos. Textual prompts are used to steer the models towards the scene information to be extracted. The Charades dataset was used during training alongside ETRI-Activity3D. While the former preserves the spontaneity of household and daily activities, the latter extends the age range of the participants.
From Charades’ annotations, verb-noun pairs are extracted and used to propose a novel risk assessment dataset comprising scene descriptions, each with an associated risk. Descriptions are generated from the previously extracted contextual and human information and are enhanced with GPT-3.5 to add variability.
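The flow described in the abstract (extract scene information, combine it into a description, classify the description) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `SceneInfo` fields, the sentence template, and the keyword-based `keyword_classifier`, which is only a toy stand-in for the fine-tuned sequence classifier and not the thesis model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SceneInfo:
    """Information the ensemble extracts from a video (field names are assumed)."""
    objects: List[str]
    location: str
    age_group: str
    actions: List[str]


def build_description(scene: SceneInfo) -> str:
    """Combine the extracted fields into one textual scene description."""
    actions = " and ".join(scene.actions) if scene.actions else "doing nothing"
    objects = ", ".join(scene.objects) if scene.objects else "no objects"
    return (
        f"A person in the {scene.age_group} age group is {actions} "
        f"in the {scene.location}. Objects in the scene: {objects}."
    )


def assess_risk(scene: SceneInfo, classifier: Callable[[str], str]) -> str:
    """Classify the generated description; the classifier is passed in."""
    return classifier(build_description(scene))


def keyword_classifier(description: str) -> str:
    """Toy stand-in for the fine-tuned sequence classifier (hypothetical)."""
    risky_terms = ("knife", "stove", "stairs")
    text = description.lower()
    return "risk" if any(term in text for term in risky_terms) else "no risk"
```

Passing the classifier as a parameter keeps the description step independent of the model, so a real fine-tuned classifier could replace the toy rule without changing the rest of the sketch.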