How to show a progress bar while unzipping tons of files
The original article is here.
In the machine learning field, there are plenty of public datasets for model training. Such a dataset is usually provided as a zip archive, so we can just download it and extract it with our good old friend unzip:
$ unzip /home/data/large_dataset.zip
But if you are working in a Jupyter notebook, a plain unzip command floods your screen with output, and you'll soon notice the frontend UI slowing down as the output grows.
!unzip /home/data/large_dataset.zip -d /home/data/
Archive: /home/data/large_dataset.zip
inflating: /home/data/large_dataset/1/1900_753325_0060.png
inflating: /home/data/large_dataset/1/1900_754949_0023.png
inflating: /home/data/large_dataset/1/1900_758495_0075.png
inflating: /home/data/large_dataset/1/1900_761460_0029.png
inflating: /home/data/large_dataset/1/1900_766994_0030.png
inflating: /home/data/large_dataset/1/1900_776319_0015.png
...
(80K more lines)
So people often silence it by redirecting the output:
!unzip /home/data/large_dataset.zip -d /home/data/ > /dev/null
or with the short -q option:
!unzip -q /home/data/large_dataset.zip -d /home/data/
Got it! The unzip command now extracts files silently, with no performance penalty. Everything is all right, just waiting … (a few minutes pass). How soon can I expect unzip to finish?
pv command to the rescue!
To keep unzip from making us anxious or sleepy, we want a progress bar that periodically reports how far the extraction has come. The pv command solves this problem: it monitors the data flowing through a pipe, so it can display a progress bar for any command's output. After a few tries, I finally got the result I wanted:
n_files = !unzip -l /home/data/large_dataset.zip | grep .png | wc -l
!unzip -o /home/data/large_dataset.zip -d /home/data/ | pv -l -s {n_files[0]} > /dev/null
...
80.1k 0:00:09 [8.54k/s] [====================================>] 100%
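Nice! Very helpful. Here's what's going on:
- Run unzip -l to get the number of files to process. We have to filter the lines with grep and count them with wc -l because the unzip -l output includes directories, as well as header and summary lines.
- Pipe the output of unzip into pv -l, which counts lines instead of bytes, and pass the number of files to the -s (size) option so that pv can show a meaningful percentage. Since unzip prints one line per extracted file, the line count tracks the number of files processed.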
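If pv is not installed, the same two steps (count the files first, then report progress while extracting) can be sketched in plain Python with the standard zipfile module. This is only a minimal sketch, not a replacement for the pipeline above: the function name extract_with_progress and the report_every parameter are hypothetical, and unlike the grep filter it counts every file in the archive, not only .png files.

```python
import sys
import zipfile


def extract_with_progress(zip_path, dest_dir, report_every=1000, out=sys.stderr):
    """Extract zip_path into dest_dir, reporting progress by file count.

    Hypothetical helper for illustration. Returns the number of files
    extracted; directory entries are not counted.
    """
    with zipfile.ZipFile(zip_path) as archive:
        # Step 1: count the files to process, skipping directory entries
        # (the role played by `unzip -l | grep | wc -l`).
        members = [info for info in archive.infolist() if not info.is_dir()]
        total = len(members)

        # Step 2: extract one file at a time and report progress against
        # the known total (the role played by `pv -l -s`).
        for done, info in enumerate(members, start=1):
            archive.extract(info, dest_dir)
            if done % report_every == 0 or done == total:
                percent = 100 * done // total
                print(f"\r{done}/{total} files ({percent}%)",
                      end="", file=out, flush=True)
        if total:
            print(file=out)  # finish the progress line
    return total
```

In a notebook cell, this could be called as extract_with_progress("/home/data/large_dataset.zip", "/home/data/"). Reporting only every report_every files keeps the cell output small, which is the same concern that motivated silencing unzip in the first place.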
That’s all! It’s about time to return to my notebook and try some experiments 😋