How to show a progress bar while unzipping tons of files
The original article is here.

In the machine learning field, there are plenty of public datasets for model training. Such a dataset is usually provided as a zip archive, so we can just download it and unarchive it with our good old friend unzip:
$ unzip /home/data/large_dataset.zip
But if you are working in a Jupyter notebook, a plain unzip command floods the screen with output, and you’ll soon notice the frontend UI slowing down as the output grows.
!unzip /home/data/large_dataset.zip -d /home/data/
Archive: /home/data/large_dataset.zip
inflating: /home/data/large_dataset/1/1900_753325_0060.png
inflating: /home/data/large_dataset/1/1900_754949_0023.png
inflating: /home/data/large_dataset/1/1900_758495_0075.png
inflating: /home/data/large_dataset/1/1900_761460_0029.png
inflating: /home/data/large_dataset/1/1900_766994_0030.png
inflating: /home/data/large_dataset/1/1900_776319_0015.png
...
(80K more lines)
So people often silence it by redirecting the output:
!unzip /home/data/large_dataset.zip -d /home/data/ > /dev/null
or with the shorter -q option:
!unzip -q /home/data/large_dataset.zip -d /home/data/
Got it! The unzip command now unarchives files silently, with no performance penalty. Everything is all right, just waiting … (a few minutes pass) … How soon can I expect unzip to finish?
pv command to the rescue!
To keep the unzip command from making us anxious or sleepy, we want a progress bar that periodically reports how far the extraction has got. The pv command solves this problem: it can display a progress bar for any command’s output. After a few tries, I finally got the result I wanted:
n_files = !unzip -l /home/data/large_dataset.zip | grep .png | wc -l
!unzip -o /home/data/large_dataset.zip -d /home/data/ | pv -l -s {n_files[0]} > /dev/null
...
80.1k 0:00:09 [8.54k/s] [====================================>] 100%
Nice! Very helpful. What’s going on here is:
- Run unzip -l to get the number of files to process. We have to filter lines with grep and count them with wc -l because the unzip -l output includes directories.
- Pipe the output of unzip into pv, and pass the number of files to the -s (size) option so that it shows a meaningful indicator. The -l option makes pv count lines rather than bytes, so each line unzip prints advances the bar.
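If pv is not installed, the same two steps (count the files first, then report progress against that count) can be sketched in plain Python. This is not the pv approach above but a stand-in that uses only the standard library's zipfile module. The function name extract_with_progress and its parameters are made up for this illustration.

```python
import sys
import zipfile


def extract_with_progress(zip_path, dest_dir, bar_width=40, out=None):
    """Extract all files from zip_path into dest_dir, redrawing one progress line.

    Hypothetical helper for illustration. Returns the number of files extracted;
    directory entries are skipped, as the grep filter does in the shell version.
    """
    if out is None:
        out = sys.stderr
    with zipfile.ZipFile(zip_path) as archive:
        # Step 1: count the real files (the `unzip -l | grep | wc -l` part).
        members = [m for m in archive.infolist() if not m.is_dir()]
        total = len(members)
        last_percent = -1
        # Step 2: extract one file at a time and report progress against the total.
        for done, member in enumerate(members, start=1):
            archive.extract(member, dest_dir)
            percent = 100 * done // total
            # Redraw only when the percentage changes, so the notebook
            # is not flooded with one update per file.
            if percent != last_percent:
                filled = bar_width * done // total
                bar = "=" * filled + " " * (bar_width - filled)
                # "\r" moves back to the start of the line so the bar overwrites itself.
                out.write(f"\r{done}/{total} [{bar}] {percent:3d}%")
                out.flush()
                last_percent = percent
        out.write("\n")
    return total
```

In a notebook cell, extract_with_progress("/home/data/large_dataset.zip", "/home/data/") would then play the role of the unzip and pv pipeline. Extraction through zipfile may well be slower than the unzip binary, so treat this as a fallback rather than a replacement.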
That’s all! It’s about time to return to my notebook and try some experiments 😋