How to Process Data in Batches in Python
Sometimes you have a huge list of items to process, but you cannot handle them all in one go because of limits in the system that processes them.
Some examples:
- An API accepts at most 100 items per request, so you need to split your original list into lists of 100 items and combine the results.
- You have a long list of items that you want to process in parallel. You can split it into as many sub lists as you have subprocesses and process them independently.
long_list = list(range(100))
sub_list_length = 10
sub_lists = [
    long_list[i : i + sub_list_length]
    for i in range(0, len(long_list), sub_list_length)
]
Let us break down the code. long_list is a list of 100 numbers. We split it into sub lists of sub_list_length (10) items each. The list comprehension takes one slice of the original list for each start index produced by range(0, len(long_list), sub_list_length).
print(long_list)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]
print(sub_lists)
[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [10, 11, 12, 13, 14, 15, 16, 17, 18, 19], [20, 21, 22, 23, 24, 25, 26, 27, 28, 29], [30, 31, 32, 33, 34, 35, 36, 37, 38, 39], [40, 41, 42, 43, 44, 45, 46, 47, 48, 49], [50, 51, 52, 53, 54, 55, 56, 57, 58, 59], [60, 61, 62, 63, 64, 65, 66, 67, 68, 69], [70, 71, 72, 73, 74, 75, 76, 77, 78, 79], [80, 81, 82, 83, 84, 85, 86, 87, 88, 89], [90, 91, 92, 93, 94, 95, 96, 97, 98, 99]]
print(list(range(0, len(long_list), sub_list_length)))
[0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
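The slicing approach also copes with a list whose length is not a multiple of the batch size: a slice that runs past the end of a list simply stops at the end, so the last sub list comes out shorter instead of raising an error. As a minimal sketch, the same idea can be wrapped in a generator that yields one batch at a time and accepts any iterable, not just a list. The name iter_batches is hypothetical and chosen only for illustration. (On Python 3.12 and later, itertools.batched offers similar behaviour, yielding tuples.)

```python
from itertools import islice


def iter_batches(items, batch_size):
    """Yield successive lists of at most batch_size items from any iterable.

    iter_batches is a hypothetical helper name, not a standard library function.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        # islice pulls up to batch_size items without building the whole input.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# 25 items in batches of 10: the last batch holds the 5 leftover items.
batches = list(iter_batches(range(25), 10))
print([len(batch) for batch in batches])  # [10, 10, 5]
```

Because batches are produced lazily, this variant avoids holding every sub list in memory at once, which may matter when the input is very large.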
Processing the Sub Lists
results = []
for sub_list in sub_lists:
    partial_result = process_function(sub_list)
    results.append(partial_result)
Here process_function can be any function that processes a list. In our examples, this would be the API call or the work done by each subprocess.
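To make the loop above concrete, here is a minimal sketch with a stand-in process_function. The function body is purely an assumption for illustration: it squares each number in place of a real API call. Because each call returns a list, results ends up as a list of lists, and flattening it with itertools.chain.from_iterable is one way to combine the partial results. The last part shows one possible way to run the same batches concurrently with concurrent.futures.ThreadPoolExecutor, which tends to suit I/O-bound work such as API requests. The worker count of 4 is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import chain


def process_function(sub_list):
    # Hypothetical stand-in for a real API call or worker: squares each item.
    return [item * item for item in sub_list]


long_list = list(range(100))
sub_list_length = 10
sub_lists = [
    long_list[i : i + sub_list_length]
    for i in range(0, len(long_list), sub_list_length)
]

# Sequential processing: one partial result per sub list.
results = []
for sub_list in sub_lists:
    results.append(process_function(sub_list))

# Combine the partial results into a single flat list.
combined = list(chain.from_iterable(results))

# Concurrent alternative: executor.map returns results in input order.
with ThreadPoolExecutor(max_workers=4) as executor:
    parallel_results = list(executor.map(process_function, sub_lists))

parallel_combined = list(chain.from_iterable(parallel_results))
print(combined == parallel_combined)  # True
```

For CPU-bound work, ProcessPoolExecutor exposes the same map interface, though the function and its arguments must then be picklable.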
Bonus Tip: Keeping track of Progress
You can also keep track of progress by wrapping the for loop's iterable in tqdm, an open source library that displays a progress bar and shows how long each iteration takes. It works with any iterable.
from tqdm import tqdm
results = []
for sub_list in tqdm(sub_lists):
    partial_result = process_function(sub_list)
    results.append(partial_result)
This has come in handy for me in quite a few cases.
Cover Photo from CHUTTERSNAP on Unsplash
Besides presenting a nice technique for data processing (my, my! Back in the '70s I used to work with batch jobs using punched card stacks), you also made me discover a very nice open source library, tqdm, that is (was, really) an undiscovered treasure for my toolbox. Keep up the nice work, Nithish!
Thanks for your kind words. I can imagine your pain working with the batch jobs using punched cards. tqdm is definitely one of the nice libraries that you just cannot stop using once you learn about it :)