The insulated-gate bipolar transistor (IGBT) is an important power semiconductor device whose internal structural integrity directly affects its electrical performance and long-term reliability. However, accurate semantic segmentation of IGBT ultrasound tomography images is difficult, mainly because of high-density noise and the visual distortion caused by object deformation. To address these issues, this paper constructs a specialized IGBT ultrasound tomography (IUT) dataset using scanning acoustic microscopy (SAM) and proposes a lightweight multi-scale fusion network (LMFNet) to improve segmentation accuracy and processing efficiency in ultrasound image analysis. LMFNet adopts a deep U-shaped encoder-decoder architecture whose backbone is built from inverted residual blocks to optimize feature transfer while keeping the model compact. In addition, we introduce two flexible plug-in modules: a contextual feature fusion (CFF) module that efficiently integrates multi-scale context information into the skip connections, and a multi-scale perceptual aggregation (MPA) module that extracts and merges multi-scale features in the bottleneck layer. Experimental results show that LMFNet performs well on the IUT dataset and significantly outperforms existing methods in both segmentation accuracy and model compactness.
An insulated-gate bipolar transistor (IGBT) is a power semiconductor device that combines the high input resistance of a metal-oxide-semiconductor field-effect transistor (MOSFET) with the low on-state voltage drop of a bipolar junction transistor (BJT)1. Due to their excellent current handling and switching speed, IGBTs are widely used in inverters, motor drives, power management systems, and renewable energy systems2,3. However, as IGBT modules become more complex and their power density increases, the requirements for quality, reliability, and performance also increase4. Figure 1 shows a cross-section of an IGBT module, revealing its multilayer, non-uniform composite structure5.
During the manufacturing and operation of IGBT modules, defects can seriously degrade system performance and stability6,7. Effective quality control of IGBT modules is therefore essential. Because the structure of an IGBT module is complex and its key components are embedded in a copper matrix and silicone gel, it is difficult to evaluate the internal condition directly with traditional inspection methods8. In this context, non-destructive testing (NDT) has become a research focus, and ultrasonic imaging in particular has attracted attention for its high resolution, low cost, and wide applicability9,10.
Although ultrasonic imaging can provide general structural information, it has limitations in accurately identifying the internal components of an IGBT module. These limitations stem mainly from the reliance of traditional ultrasonic methods on signal processing combined with manual adjustment, which makes accurate identification difficult, especially for complex structures. These methods also tend to prolong the analysis process and reduce detection efficiency11,12. In recent years, semantic segmentation has been introduced into image processing and has attracted wide attention for its ability to classify and analyze images at the pixel level, offering a new avenue for improving detection accuracy13,14.
Semantic segmentation is a computer vision technique that understands and classifies image content at the pixel level. It assigns a specific semantic label to each pixel, partitioning the image into regions with different meanings and allowing a more detailed understanding and analysis of the image. Traditional segmentation methods can be broadly divided into rule-based and model-based approaches. Rule-based methods rely on manually defined features and rules, such as edge detection15, region growing16, and threshold segmentation17. Although simple and intuitive, these methods handle complex scenes poorly and are sensitive to noise and image variations. Model-based approaches, such as support vector machines18 and random forests19, can improve segmentation accuracy but often require extensive manual feature engineering and are computationally inefficient.
With the development of computer hardware and the rapid progress of deep learning, segmentation methods based on convolutional neural networks (CNNs) have gradually become mainstream20,21,22. Compared with traditional rule-based and hand-engineered approaches, CNNs offer significant advantages in segmentation accuracy and computational efficiency. The introduction of fully convolutional networks23 (FCNs) laid the foundation for end-to-end pixel-level segmentation. However, relying solely on upsampling deep features often yields insufficient output resolution and loss of detail. To overcome these limitations, several advanced network architectures have emerged. UNet24 recovers fine-grained detail by using skip connections to transfer multi-scale information between the encoder and decoder. Building on this, UNet++25 introduces dense connections and deep supervision to further promote feature reuse and multi-scale information fusion, greatly improving segmentation accuracy. PSPNet26 subsequently added a pyramid pooling module that captures multi-scale contextual information and significantly improves segmentation performance in complex scenes. Despite these advances, such methods are often computationally and memory intensive.
Through in-depth analysis of the semantic segmentation task, we found that the quality of feature extraction directly affects segmentation accuracy, and that reducing the computational burden without sacrificing accuracy is a key objective of lightweight design. Various strategies have been proposed in recent years to address this problem. Depthwise separable convolution27 decomposes a standard convolution into a depthwise convolution and a pointwise convolution, greatly reducing the number of parameters and the computation while maintaining model performance. Grouped convolution28 partitions the input channels into groups for convolution, likewise reducing parameters and computation. The bottleneck design29 reduces the number of feature-map channels by inserting a bottleneck layer between convolutional layers, lowering model complexity. MobileNetV230 introduced inverted residual blocks with linear bottlenecks, enabling efficient deployment on resource-constrained hardware. DeepLabV3+31 combines atrous spatial pyramid pooling with an encoder-decoder structure and depthwise separable convolutions to build an efficient semantic segmentation model. Modern architectures have pushed the task further. For example, Guo et al. proposed SegNeXt32, which combines the advantages of Transformers and CNNs and significantly improves model performance and computational efficiency. TransUNet33 integrates a Transformer with U-Net to improve the segmentation accuracy of medical images, and SwinUNet34 combines the Swin Transformer with the U-Net architecture to improve the accuracy and generalization of medical image segmentation. Although these lightweight models have achieved notable gains in computational efficiency, existing architectures still struggle with noise suppression and detailed feature extraction when processing high-resolution ultrasound images, especially in complex industrial applications35,36,37.
Existing studies have also applied segmentation algorithms to IGBT images. Harn et al. combined a Sobel edge filter with a convolutional neural network to segment the copper layer38; Wenjie et al. used DeepLabv3+ to segment the bond wires and implemented a multi-layer hybrid bond-wire defect detection method39; Yang et al. proposed DeepLabv3+-Lite to segment the ceramic substrate region and background of IGBTs40. However, semantic segmentation of IGBT ultrasonic images still faces challenges such as high-density noise and visual distortion caused by target deformation. For example, when the segmentation target is a rectangle, micro-curvature effects may produce an uneven gray-level distribution, making some regions fade out entirely and complicating segmentation. In addition, small defects often appear at the edges of the target, especially at the boundary between the edge and the background, where the network struggles to decide whether these regions belong to the target, reducing segmentation accuracy. In summary, existing network architectures have limitations in lightweight design, multi-scale information extraction, and the handling of complex noise patterns and shape distortions41. This paper therefore proposes a lightweight network that integrates a multi-scale perceptual aggregation (MPA) module and a contextual feature fusion (CFF) module, effectively improving feature extraction accuracy while significantly reducing the model parameters and computational complexity, and ultimately providing an accurate and efficient solution for semantic segmentation of IGBT module ultrasound images.
We propose a CFF module that combines group convolution with an attention mechanism. This design improves segmentation accuracy and reduces computational costs while preserving edge and texture details, which enables accurate localization of object boundaries at the decoding stage.
The MPA module is designed to capture multi-scale context information through parallel paths. In addition, it improves channel attention through frequency domain analysis, preserving details and improving the model’s ability to focus on significant features.
For semantic segmentation of IGBT ultrasound tomography images, LMFNet is proposed, which can effectively segment IGBT module components even in the presence of complex noise patterns and object distortions.
The SAM technology is used to generate the IGBT module tomography dataset, and CVAT is used to create pixel-level annotations to form a dedicated IGBT ultrasound tomography (IUT) dataset. The LMFNet model is then tested and compared with other state-of-the-art models. Experimental results show that LMFNet outperforms existing methods on the IUT dataset, showing significant improvements in segmentation accuracy and efficiency.
The overall architecture of the proposed LMFNet is shown in Figure 2. The model uses a deep U-shaped encoder-decoder structure for feature extraction and fusion.
In the encoder, the backbone network adopts the inverted residual structure, built from stacked "pointwise convolution - depthwise separable convolution - pointwise convolution" units, which reduces the model size while maintaining high segmentation performance. The input single-channel ultrasound image of the IGBT module is first expanded by pointwise convolution and then progressively downsampled by the inverted residual blocks to generate feature maps E1, E2, E3, and E4, where E4 is the bottleneck layer containing the most abstract semantic information.
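As a rough illustration, the following PyTorch sketch shows an inverted residual block of the kind described (expand with a pointwise convolution, filter with a depthwise convolution, project back with a pointwise convolution); the expansion factor and strides are assumptions, as the text does not specify them.

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Sketch of an inverted residual block: expand -> depthwise -> project.
    Expansion factor and stride values are illustrative assumptions."""
    def __init__(self, in_ch, out_ch, stride=1, expand=4):
        super().__init__()
        hidden = in_ch * expand
        self.use_skip = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            # 1x1 pointwise convolution expands the channel dimension
            nn.Conv2d(in_ch, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            # 3x3 depthwise convolution filters each channel separately
            nn.Conv2d(hidden, hidden, 3, stride, 1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            # 1x1 pointwise convolution projects back down (no activation)
            nn.Conv2d(hidden, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_skip else out
```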
In the decoder, the bottleneck feature map E4 is first refined by the MPA module and then passed through the CFF module to produce the combined feature map D4. This step improves segmentation accuracy, particularly the restoration of fine detail. The combined features are first convolved to extract local features, then a pointwise convolution adjusts the number of channels, and finally upsampling restores the spatial resolution corresponding to feature map E3. This process is repeated to produce feature maps D3, D2, and D1. Through layer-by-layer upsampling and fusion, the decoder not only restores high-resolution feature maps but also effectively fuses features from different layers, finally outputting a segmentation result at the original resolution and achieving accurate segmentation of the ultrasound image while fully preserving the detailed information of the target region.
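A minimal sketch of one such decoder step, under the assumption that upsampling is bilinear with a factor of 2 (the text specifies neither the interpolation mode nor the channel widths):

```python
import torch.nn as nn

class DecoderStage(nn.Module):
    """One decoder step as described: a 3x3 conv extracts local features,
    a 1x1 conv adjusts channels, and 2x upsampling restores the resolution
    of the next encoder level. Bilinear mode and widths are assumptions."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.local = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True))
        self.project = nn.Conv2d(in_ch, out_ch, 1)   # pointwise channel adjust
        self.up = nn.Upsample(scale_factor=2, mode='bilinear',
                              align_corners=False)

    def forward(self, x):
        return self.up(self.project(self.local(x)))
```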
Skip connections are a key feature of deep U-shaped networks. In this architecture, the feature map produced after each downsampling step is copied, saved, and later concatenated with the output of the corresponding upsampling step, forming a U-shaped structure42,43,44,45. Skip connections effectively merge low-resolution, semantically rich features with high-resolution, semantically weak features, improving segmentation accuracy. In practice, however, although low-level feature maps contain rich spatial information, they often lack sufficient semantic understanding, while high-level feature maps, though semantically rich, have low spatial resolution and lack detail. Direct concatenation can therefore lead to inconsistency and information loss. To address this issue, we propose the contextual feature fusion (CFF) module, which replaces the original skip-connection structure and fuses low-level and high-level semantics.
The structure of the CFF module is shown in Figure 3. The module first combines low-level semantic features F_low and high-level semantic features F_high of the same size into a combined feature map F ∈ R^(C×H×W) to achieve feature fusion.
Next, the feature map F is divided into four groups along the channel dimension for parallel processing, reducing the computational complexity of each operation and improving overall efficiency. Within each group F_G, the two branches F_G1 and F_G2 apply a channel attention mechanism and a spatial attention mechanism in parallel, so that each mechanism focuses on capturing global and local information respectively, avoiding the capture of similar or interfering features across dimensions. The channels are then shuffled46 to obtain a new feature map T, which improves the interaction between channels and further strengthens feature fusion.
The channel attention mechanism extracts features through global average pooling, generates attention weights with a simple linear layer and a sigmoid activation function, and applies the resulting weights to the input feature map to emphasize informative channels.
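A minimal PyTorch sketch of this channel attention branch, assuming a single linear layer as described:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention as described: global average pooling, a single
    linear layer, and a sigmoid gate over the channels."""
    def __init__(self, channels):
        super().__init__()
        self.fc = nn.Linear(channels, channels)

    def forward(self, x):                       # x: (B, C, H, W)
        w = x.mean(dim=(2, 3))                  # global average pool -> (B, C)
        w = torch.sigmoid(self.fc(w))           # per-channel weights in (0, 1)
        return x * w[:, :, None, None]          # emphasize informative channels
```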
The spatial attention mechanism extracts features through parallel average pooling and max pooling along the channel dimension, concatenates the two pooled maps, generates spatial attention weights through a convolution and a sigmoid activation function, and applies the resulting attention map to the input feature map to emphasize informative spatial locations.
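A corresponding sketch of the spatial attention branch; the 7×7 kernel is an assumption, since the text does not give the kernel size:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention as described: channel-wise average and max maps
    are concatenated, then a conv + sigmoid produce a spatial gate."""
    def __init__(self, kernel_size=7):          # kernel size is an assumption
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                        # x: (B, C, H, W)
        avg = x.mean(dim=1, keepdim=True)        # (B, 1, H, W)
        mx = x.max(dim=1, keepdim=True).values   # (B, 1, H, W)
        gate = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * gate                          # emphasize key locations
```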
Finally, attention weight matrices W_low and W_high are generated from the feature map T and multiplied by the low-level and high-level features, respectively, to obtain weighted feature maps. All the weighted feature maps are concatenated with the original feature map to obtain the final output F_out, realizing a comprehensive fusion of detailed and semantic information.
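Putting the pieces together, a very rough sketch of the CFF data flow is shown below. It reuses the ChannelAttention and SpatialAttention classes sketched above, assumes F_low and F_high have equal channel counts, and assumes W_low and W_high are produced by 1×1 convolutions with sigmoid gates (the text does not specify how they are generated):

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups):
    """Interleave channels across groups (as in ShuffleNet) so that
    information mixes between the parallel attention branches."""
    b, c, h, w = x.shape
    x = x.view(b, groups, c // groups, h, w)
    return x.transpose(1, 2).contiguous().view(b, c, h, w)

class CFF(nn.Module):
    """Rough sketch of the CFF flow described in the text; the 1x1-conv
    weight generators for W_low and W_high are assumptions."""
    def __init__(self, channels, groups=4):
        super().__init__()
        self.groups = groups
        half = channels // groups // 2          # channels per sub-branch
        self.ca = nn.ModuleList(ChannelAttention(half) for _ in range(groups))
        self.sa = nn.ModuleList(SpatialAttention() for _ in range(groups))
        # channels // 2 assumes F_low and F_high have equal channel counts
        self.w_low = nn.Sequential(nn.Conv2d(channels, channels // 2, 1),
                                   nn.Sigmoid())
        self.w_high = nn.Sequential(nn.Conv2d(channels, channels // 2, 1),
                                    nn.Sigmoid())

    def forward(self, f_low, f_high):
        f = torch.cat([f_low, f_high], dim=1)   # combined feature map F
        outs = []
        for g, ca, sa in zip(f.chunk(self.groups, dim=1), self.ca, self.sa):
            g1, g2 = g.chunk(2, dim=1)          # two branches per group
            outs.append(torch.cat([ca(g1), sa(g2)], dim=1))
        t = channel_shuffle(torch.cat(outs, dim=1), self.groups)
        # Weighted low/high features concatenated with the original map F
        return torch.cat([self.w_low(t) * f_low,
                          self.w_high(t) * f_high, f], dim=1)
```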
The bottleneck layer is an important bridge between the encoder and the decoder and plays a key role in feature dimensionality reduction and information extraction. However, the traditional bottleneck layer is limited by its receptive field in collecting global context information and struggles to exploit the complementarity of features at different scales: larger-scale features capture broader context, while smaller-scale features retain higher spatial resolution for finer details. In addition, a large amount of redundant information is often introduced in the bottleneck layer during information transfer, degrading feature quality. To solve these problems, this paper proposes a multi-scale perceptual aggregation (MPA) module that effectively aggregates multi-scale features. The module collects multi-scale context information through parallel branches and enhances channel attention through frequency-domain analysis, improving the model's attention to key features while preserving necessary detail.
As shown in Figure 4, the MPA module consists of five parallel branches. Three branches use dilated convolutions with different dilation rates to capture multi-scale contextual information, allowing the network to expand its receptive field while maintaining the resolution of the feature map. Another branch introduces a frequency channel attention (FCA) mechanism that enhances channel attention by weighting the channels in the frequency domain, capturing both low-frequency global information and high-frequency local details and thereby improving the model's perception of fine-grained features and global structure. Compared with traditional channel attention mechanisms (e.g., SE and ECA47,48), FCA offers significant advantages in fine-grained feature extraction, small-object detection, and background-noise suppression, all without additional computational cost49. The output of FCA is fused with the feature maps produced by the three dilated-convolution branches, improving the diversity and expressiveness of the features. The final branch passes through the original input feature map, ensuring that key detail is preserved during fusion. These outputs are concatenated along the channel dimension and passed through a pointwise convolution to adjust the number of channels and restore the original input size.
The first branch applies the FCA mechanism to enhance channel attention through frequency-domain analysis. The input feature map X is divided into k blocks along the channel dimension, denoted [X^0, X^1, …, X^(k−1)]. Each block is assigned a 2D DCT frequency component to obtain a processed frequency vector Freq^k, and concatenating these vectors yields the overall frequency vector Freq.
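The equation itself does not survive in this version of the text; a plausible reconstruction, following the standard 2D DCT formulation used in frequency channel attention49, is:

$$\mathrm{Freq}^{k} = \mathrm{2DDCT}^{u,v}\!\left(X^{k}\right) = \sum_{h=0}^{H-1}\sum_{w=0}^{W-1} X^{k}_{h,w}\, B^{u,v}_{h,w}, \qquad \mathrm{Freq} = \mathrm{cat}\!\left(\mathrm{Freq}^{0}, \mathrm{Freq}^{1}, \ldots, \mathrm{Freq}^{k-1}\right),$$

with the DCT basis

$$B^{u,v}_{h,w} = \cos\!\left(\frac{\pi u}{H}\left(h + \tfrac{1}{2}\right)\right)\cos\!\left(\frac{\pi v}{W}\left(w + \tfrac{1}{2}\right)\right).$$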
Here, u and v are the two-dimensional frequency indices assigned to the k-th block, and B^(u,v) is the discrete cosine transform basis.
A fully connected layer and a sigmoid function then generate attention weights that dynamically reweight the channels of the input feature map X, producing the weighted feature map X′.
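A minimal PyTorch sketch of this FCA branch under stated assumptions: the (u, v) frequency pairs assigned to the blocks are illustrative (the text does not list them), and the DCT filters are precomputed for a fixed input size:

```python
import math
import torch
import torch.nn as nn

def dct_basis(H, W, u, v):
    """Precompute the 2D DCT basis B^{u,v} as an (H, W) tensor."""
    h = torch.arange(H).float()
    w = torch.arange(W).float()
    bh = torch.cos(math.pi * u / H * (h + 0.5))       # (H,)
    bw = torch.cos(math.pi * v / W * (w + 0.5))       # (W,)
    return bh[:, None] * bw[None, :]                  # (H, W)

class FCA(nn.Module):
    """Sketch of frequency channel attention: each channel block is
    reduced to scalars by a fixed DCT component, then a linear layer
    and sigmoid produce per-channel weights. The frequency pairs
    below are illustrative assumptions."""
    def __init__(self, channels, H, W, freqs=((0, 0), (0, 1), (1, 0), (1, 1))):
        super().__init__()
        assert channels % len(freqs) == 0
        filters = torch.stack([dct_basis(H, W, u, v) for u, v in freqs])
        self.register_buffer('filters', filters)      # (k, H, W)
        self.fc = nn.Linear(channels, channels)

    def forward(self, x):                              # x: (B, C, H, W)
        b, c, H, W = x.shape
        k = self.filters.shape[0]
        blocks = x.view(b, k, c // k, H, W)
        freq = (blocks * self.filters[None, :, None]).sum(dim=(3, 4))
        w = torch.sigmoid(self.fc(freq.reshape(b, c))) # attention weights
        return x * w[:, :, None, None]                 # weighted map X'
```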
Branches two to four apply 3×3 dilated convolutions with different dilation rates r; each result is multiplied element-wise by the output of the first branch before the branch outputs are combined.
Finally, the outputs of all five branches are concatenated along the channel dimension, and a pointwise convolution adjusts the number of channels and restores the original input size, yielding the final output M.
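Assembling the branches, a rough sketch of the MPA module follows (reusing the FCA class above; since the text's description of how the modulated dilated branches are combined is slightly ambiguous, concatenation of all five branch outputs is assumed here):

```python
import torch
import torch.nn as nn

class MPA(nn.Module):
    """Sketch of the MPA module: an FCA branch, three 3x3 dilated-conv
    branches (rates 3, 12, 24) modulated by the FCA output, and an
    identity branch, concatenated and reduced by a pointwise conv.
    Assumes a fixed input size for the FCA class sketched above."""
    def __init__(self, channels, H, W, rates=(3, 12, 24)):
        super().__init__()
        self.fca = FCA(channels, H, W)
        self.dilated = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=r, dilation=r)
            for r in rates)
        # Five branch outputs concatenated -> back to the input width
        self.reduce = nn.Conv2d(5 * channels, channels, 1)

    def forward(self, x):
        a = self.fca(x)                                      # branch 1: FCA
        outs = [a] + [conv(x) * a for conv in self.dilated]  # branches 2-4
        outs.append(x)                                       # branch 5: identity
        return self.reduce(torch.cat(outs, dim=1))           # final output M
```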
This design allows the MPA module to integrate multi-scale, multi-view feature information effectively, enriching the completeness and accuracy of the feature representation and improving the model's overall performance and robustness in semantic segmentation.
The overall structure of the high-frequency ultrasonic scanning microscope used for data collection in the experiment is shown in Figure 5, where the signal generator is JSR DPR500 and the ultrasonic transducer is PT75-3-15 from Toray (frequency 75 MHz, vibrator diameter 3 mm, focal length 15 mm).
The ultrasonic transducer operates in a single-transmit, single-receive (pulse-echo) mode and scans the baseplate of the IGBT module vertically. The resulting one-dimensional scan signal is shown in Figure 6, which captures the ultrasonic echoes from the interfaces of each layer inside the IGBT module: a and b are the upper and lower interface signals of the baseplate solder layer, c is the interface signal of the ceramic substrate, d is the interface signal of the copper layer, and e is the interface signal of the chip.
As shown in Figure 7, the ultrasonic transducer performs a full scan of the IGBT module, and each one-dimensional scan signal is mapped to its position on the two-dimensional plane according to the scan point. The amplitude at each signal point represents the ultrasonic reflection intensity at that position; tomographic reconstruction then generates a grayscale ultrasonic image of the module.
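A schematic NumPy sketch of this step, assuming the A-scans are stored as a (rows, cols, samples) array and the time gate selecting one interface is known (both assumptions, since the acquisition format is not specified):

```python
import numpy as np

def cscan_image(ascans, gate_start, gate_end):
    """Assemble a grayscale C-scan from a grid of A-scans.

    ascans: hypothetical (rows, cols, samples) array of 1-D ultrasonic
    signals. The peak amplitude inside the time gate
    [gate_start, gate_end) (the depth window of one interface)
    becomes the pixel value at each scan point."""
    gated = np.abs(ascans[:, :, gate_start:gate_end])
    img = gated.max(axis=2)                          # reflection intensity
    img = (img - img.min()) / (np.ptp(img) + 1e-12)  # normalize to [0, 1]
    return (img * 255).astype(np.uint8)              # 8-bit grayscale image
```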
The experiments were implemented in Python, and all models were trained on Ubuntu. The GPU version of PyTorch was used with cuDNN 8.5 and CUDA 11.6. The hardware configuration includes an Intel® Core™ i7-12850HX processor, 64 GB of RAM, and an NVIDIA RTX A3000 graphics card with 6144 CUDA cores and 12 GB of GDDR6 memory.
Ten IGBT modules from Infineon and Starpower were selected, with module sizes ranging from 38 mm to 162 mm and a ceramic substrate thickness of 0.8 mm. The modules were divided into three groups according to the number of through holes, as shown in Figure 8. High-frequency scanning acoustic microscopy (C-scan) was used to obtain high-resolution tomographic images of the modules; 58 grayscale ultrasound images were acquired from each module, yielding 580 grayscale images in total. The Computer Vision Annotation Tool (CVAT) was used to perform pixel-level annotation of the chip, copper, ceramic substrate, and other regions in the grayscale tomographic images, producing 580 annotated images.
Data augmentation generates new samples through various transformations, expanding the training dataset and thereby improving the generalization ability and robustness of the model. As shown in Figure 9, the images were augmented using flipping, rotation, adaptive histogram equalization, salt-and-pepper noise addition, logarithmic grayscale transformation, and median filtering, resulting in 37,120 augmented images. Of these, 33,408 images serve as the training set and 3,712 as the test set, together constituting the IGBT Ultrasound Tomography (IUT) dataset.
Examples of data augmentation for tomographic images: (a) actual image, (b) original image, (c) horizontally flipped image, (d) vertically flipped image, (e) rotated image, (f) adaptive histogram equalization image, (g) salt-and-pepper noise image, (h) log-transformed grayscale image, and (i) median filtered image.
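A compact OpenCV/NumPy sketch of the augmentations listed above; parameter values such as the noise density, CLAHE settings, and median kernel size are assumptions:

```python
import cv2
import numpy as np

def augment(img):
    """Produce the augmented variants listed above for one 8-bit
    grayscale tomogram. All parameter values are assumptions."""
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    noisy = img.copy()
    mask = np.random.rand(*img.shape)
    noisy[mask < 0.01] = 0                          # pepper noise
    noisy[mask > 0.99] = 255                        # salt noise
    logt = (255 * np.log1p(img.astype(np.float32))
            / np.log(256.0)).astype(np.uint8)       # log grayscale transform
    return {
        'hflip': cv2.flip(img, 1),                  # horizontal flip
        'vflip': cv2.flip(img, 0),                  # vertical flip
        'rot90': cv2.rotate(img, cv2.ROTATE_90_CLOCKWISE),
        'clahe': clahe.apply(img),                  # adaptive hist. equalization
        'saltpepper': noisy,
        'log': logt,
        'median': cv2.medianBlur(img, 3),           # median filtering
    }
```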
In this paper, a unified set of evaluation metrics is adopted to assess the performance of multi-class semantic segmentation models. The specific formulas are given below.
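The formulas themselves are missing from this version of the text; the standard definitions consistent with the descriptions that follow are:

$$\mathrm{Rec} = \frac{TP}{TP + FN}, \qquad \mathrm{Prec} = \frac{TP}{TP + FP},$$

$$\mathrm{mIoU} = \frac{1}{N}\sum_{i=1}^{N} \frac{TP_i}{TP_i + FP_i + FN_i}, \qquad \mathrm{MPA} = \frac{1}{N}\sum_{i=1}^{N} \frac{TP_i}{TP_i + FN_i},$$

where N is the number of classes and the subscript i denotes the counts for class i.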
TP denotes the number of correctly predicted in-class pixels, TN the number of correctly predicted out-of-class pixels, FP the number of out-of-class pixels incorrectly predicted as belonging to the class, and FN the number of in-class pixels incorrectly predicted as not belonging to it. On this basis, four key metrics are adopted: recall (Rec), precision (Prec), mean intersection over union (mIoU), and mean pixel accuracy (MPA). Recall measures the model's ability to identify all positive cases, while precision reflects the accuracy of its predictions. mIoU evaluates class boundaries by averaging the intersection-over-union between the predicted segmentation and the ground truth across classes. MPA averages the per-class pixel accuracy, comprehensively evaluating model performance across all categories. Together these four metrics provide a comprehensive and accurate evaluation of image segmentation models; mIoU and MPA complement each other and reveal performance differences between categories.
To comprehensively evaluate the effectiveness of the proposed module, we conducted ablation experiments on the IUT dataset using LMFNet. As shown in Table 1, the classical deep U-shaped model UNet is selected as the baseline model and achieves 88.04% mIoU.
We first introduce an inverted residual (IR) structure with depthwise separable convolution into the backbone network, reducing the number of parameters from 24.89 million to 1.57 million. Notably, the segmentation performance decreases only slightly, with mIoU dropping by just 0.25%. This indicates that the IR structure effectively maintains segmentation performance while significantly reducing the model size.
Next, we further improved the model by adding MPA and CFF modules. Experimental results show that these additions improve the mIoU to 89.83% and 88.68%, respectively. This shows that MPA and CFF modules can effectively improve segmentation accuracy with minimal impact on the number of parameters.
Finally, we tested the full LMFNet model, including all proposed modules. The mIoU score reaches 91.12%, which outperforms all previous improved models. Compared to the baseline model, this means an improvement in segmentation accuracy of 3.08% while reducing the number of parameters by about 90%.
To comprehensively evaluate the performance of LMFNet and its individual modules, the normalized confusion matrix is presented in Figure 10. This matrix not only shows the overall classification accuracy across categories, but also highlights potential confusions that the model may encounter when distinguishing between categories. Analysis of the confusion matrix shows that although LMFNet achieves the highest segmentation accuracy across all categories, it has difficulty accurately segmenting long and thin labels such as “ceramic substrate”. This insight is critical to purposefully adjusting the network architecture and training strategy to further improve the overall segmentation accuracy of the model.
Figure 11 shows the depth ablation experiment conducted on the full LMFNet network, where panels (a)-(d) correspond to network depths of 3 to 6 layers. The purpose of this experiment is to study the impact of network depth on segmentation performance, in order to reduce the complexity and computational load of the network without sacrificing performance, thereby improving efficiency.
The results in Table 2 show that among the network structures with 3–6 layers, the best results are achieved using a 5-layer network. Interestingly, although the 6-layer network has more parameters, the segmentation accuracy decreases. This shows that a deeper network does not necessarily mean better performance, and maintaining an optimal number of layers is the key to achieving high segmentation accuracy. The reasons for the decrease may include overfitting, gradient vanishing or exploding, interference from redundant information, and other factors.
The MPA module uses dilated convolutions with different dilation rates to capture multi-scale context information. To evaluate the impact of the number of dilated convolution branches and the dilation rate on segmentation performance, ablation experiments were conducted using different dilation rate combinations. The results are presented in Table 3, which shows that the combination (3, 12, 24) achieves the best results in all metrics. Moderate dilation rates optimally balance local details and global context, thereby improving segmentation accuracy, especially in edge regions. In particular, a dilation rate of 3 allows for smaller-scale features to be captured, a dilation rate of 12 effectively captures mid-scale features, and a dilation rate of 24 captures larger-scale context. This fusion of multi-scale features improves the robustness of the model when handling objects of different sizes. In addition, using multiples of 3 (e.g., 3, 12, 24) ensures uniformity and symmetry in the expansion of the receptive field, avoids the problem of non-uniform expansion that occurs when using numbers that are not multiples of 3 (e.g., 5 or 7), and stabilizes the learning of multi-scale features. Compared with other combinations, (3, 12, 24) minimizes the computational complexity while maintaining high segmentation accuracy. In contrast, combinations with additional branches such as (3, 6, 12, 18, 24) may lead to information redundancy and waste of computational resources without providing significant performance improvement.
Channel attention enhances the representation of important channels by assigning them different weights, while spatial attention emphasizes key spatial locations by applying different weights. To evaluate how different combinations of these two mechanisms in the CFF module affect model performance, an ablation study compared three combination schemes: sequential channel-spatial, sequential spatial-channel, and split-parallel. As shown in Table 4, the split-parallel combination outperforms the sequential schemes in accuracy, and it also achieves higher computational efficiency and lower memory usage. This is likely because channel and spatial attention operate independently on different feature dimensions, avoiding redundant computation on the same dimension and thus greatly reducing computation and memory consumption. Moreover, their independent operation makes information fusion more concise and efficient, without the sequential processing required by the cascaded schemes. This makes the split-parallel approach better suited to lightweight design requirements.
To further verify the effectiveness of LMFNet on the IGBT module segmentation task, we conducted experiments on the IUT dataset and compared LMFNet with other state-of-the-art models. According to the experimental results presented in Table 5, LMFNet shows significant advantages in both segmentation accuracy and efficiency. Notably, LMFNet achieves an impressive mIoU of 91.12% using only 2.26 million parameters.
To clearly demonstrate each model's segmentation of the different IGBT module categories, three actual segmentation results are analyzed in Figure 12. The results show that FCN, with its simple structure and lack of sufficient context information and detailed features, segments the complex shapes and thin structures of IGBT modules poorly. Although PSPNet integrates global context information, it still lacks a robust deep feature fusion mechanism. In contrast, models using skip connections or depthwise separable convolutions capture more context and thus improve the segmentation results. Closer inspection, however, reveals shortcomings. For the eight-hole module, U-Net and U-Net++ misclassify the thin "copper layer" between the chips as "ceramic substrate", while DeepLabV3+ and SegNeXt misidentify it as "chip". For the four-hole module, DeepLabV3+ segments the thin "ceramic substrate" poorly. For the two-hole module, several models fail to identify the small "ceramic substrate" and mistake it for a defect or noise. In contrast, Transformer-based models such as SwinUNet and TransUNet, as well as LMFNet, which integrates the multi-scale perceptual aggregation (MPA) and contextual feature fusion (CFF) modules, segment the thin and complex structures of IGBT modules well (Figure 12).