Nithish Raghunandanan
Nithish's Blog

How to Process Data in Batches in Python

Nithish Raghunandanan
·Nov 15, 2020·

Sometimes you have a huge list of items to process, but you cannot process them in one go because of limitations in the system that handles the list.

Some examples:

  • When you need to call an API that accepts at most 100 items per request, you need to split your original list into lists of 100 items and combine the results.
  • When you have a long list of items that you want to process in parallel, you can split it into as many sub lists as you want sub processes and process them independently.

The list can be split into sub lists with a list comprehension:

long_list = list(range(100))
sub_list_length = 10
sub_lists = [
    long_list[i : i + sub_list_length]
    for i in range(0, len(long_list), sub_list_length)
]

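Let us break down the code.

long_list is a list of 100 numbers. We split it into sub lists of sub_list_length items each, 10 in this example. The list comprehension relies on slicing the original list: range(0, len(long_list), sub_list_length) yields the start index of each sub list, and long_list[i : i + sub_list_length] takes the items from that index up to, but not including, the next start index. If the length of the list is not a multiple of sub_list_length, the last sub list is simply shorter, because slicing past the end of a list does not raise an error.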

print(long_list)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]
print(sub_lists)
[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [10, 11, 12, 13, 14, 15, 16, 17, 18, 19], [20, 21, 22, 23, 24, 25, 26, 27, 28, 29], [30, 31, 32, 33, 34, 35, 36, 37, 38, 39], [40, 41, 42, 43, 44, 45, 46, 47, 48, 49], [50, 51, 52, 53, 54, 55, 56, 57, 58, 59], [60, 61, 62, 63, 64, 65, 66, 67, 68, 69], [70, 71, 72, 73, 74, 75, 76, 77, 78, 79], [80, 81, 82, 83, 84, 85, 86, 87, 88, 89], [90, 91, 92, 93, 94, 95, 96, 97, 98, 99]]
print(list(range(0, len(long_list), sub_list_length)))
[0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
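
If you need this split in more than one place, the same slicing logic can be wrapped in a small helper. The following is a minimal sketch, not part of the original snippet: the function name split_into_sub_lists and the ValueError check are assumptions added for illustration. It yields the sub lists one at a time instead of building them all up front.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def split_into_sub_lists(items: Sequence[T], sub_list_length: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of items, each with at most sub_list_length elements.

    Hypothetical helper name; it uses the same slicing as the list comprehension.
    """
    if sub_list_length <= 0:
        # range() would raise on a step of 0 and yield nothing on a negative step,
        # so fail early with a clearer message (an added assumption).
        raise ValueError("sub_list_length must be a positive integer")
    for i in range(0, len(items), sub_list_length):
        yield items[i : i + sub_list_length]


# A list whose length is not a multiple of the sub list length:
# the last sub list is shorter.
print(list(split_into_sub_lists(list(range(7)), 3)))
# [[0, 1, 2], [3, 4, 5], [6]]
```

Because the helper is a generator, wrap it in list() when you need the number of sub lists or want to iterate over them more than once.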

Processing the Sub Lists

results = []
for sub_list in sub_lists:
    partial_result = process_function(sub_list)
    results.append(partial_result)

Here, process_function can be any function that processes a sub list. In our example, it would be the call to the API or the work handed to a sub process. After the loop, results holds one partial result per sub list, in the same order as sub_lists.
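
To make the "combine the results" step concrete, here is a small sketch under a few assumptions: process_function is a hypothetical stand-in that doubles each item and returns a list, so the partial results can be flattened with itertools.chain. The second half shows one possible way to run the sub lists concurrently; it uses threads from concurrent.futures rather than sub processes, which is a swap made to keep the example short.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import chain


def process_function(sub_list):
    """Hypothetical stand-in for the real work, e.g. one API request per sub list."""
    return [item * 2 for item in sub_list]


long_list = list(range(10))
sub_list_length = 4
sub_lists = [
    long_list[i : i + sub_list_length]
    for i in range(0, len(long_list), sub_list_length)
]

# Sequential processing: one partial result per sub list, in order.
results = []
for sub_list in sub_lists:
    partial_result = process_function(sub_list)
    results.append(partial_result)

# Combine the partial results back into a single flat list.
combined = list(chain.from_iterable(results))
print(combined)
# [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

# Concurrent variant: Executor.map returns results in the order of the inputs,
# so the combined list matches the sequential one.
with ThreadPoolExecutor(max_workers=3) as executor:
    parallel_results = list(executor.map(process_function, sub_lists))

parallel_combined = list(chain.from_iterable(parallel_results))
print(parallel_combined == combined)
# True
```

Threads suit work that mostly waits on the network, such as API calls. For CPU-bound work, ProcessPoolExecutor from the same module offers the same map interface.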

Bonus Tip: Keeping track of Progress

Additionally, you can keep track of progress by wrapping the iterable of the for loop in tqdm, an open source library that displays a progress bar along with the elapsed time, an estimate of the remaining time and the iteration rate. It works for any iterable. tqdm is not part of the standard library; you can install it with pip install tqdm.

from tqdm import tqdm
results = []
for sub_list in tqdm(sub_lists):
    partial_result = process_function(sub_list)
    results.append(partial_result)

[Image: Progress Bar]

This has come in handy for me on several occasions.

Cover Photo from CHUTTERSNAP on Unsplash

 