Preface to this Chapter: You’ll learn about using NumPy in this Chapter. NumPy is a vast module with plenty of functionality. The aim of this Chapter is not to cover the entire module in detail. Whole books have been written about this topic. The purpose of this Chapter is to introduce what NumPy is, why it’s useful for numerical applications, and how to start using NumPy. As with all topics in this book, there’s more to learn if you choose to dive deeper into programming and NumPy.
Programming is a subject that spans many diverse applications, from building websites to sending rockets into space. In recent years, more professions have come to rely on programming. Artists, lawyers, historians, and journalists are just a few of the professions that were not historically associated with computer programming but have recently started exploring it.
Quantitative applications are among the earliest fields that started using programming, and in many parts of these fields, programming has become an essential tool. And Python has become the de facto standard programming language for most of these applications.
The term quantitative applications is a broad category that refers to any field that relies on numerical data or data that can be quantified using numbers. Science and finance are two heavy users of quantitative programming, but there are other fields, too.
NumPy
NumPy is a module in Python designed for these applications. The name NumPy stands for Numerical Python. You’ll hear NumPy pronounced in different ways, either ending with an -eye sound since Py stands for Python, or ending with an -ee sound to rhyme with happy.
NumPy is a third-party package, which means it’s not part of Python’s standard library. However, it’s a very well-established package and has been used extensively for many years. In this Chapter, you’ll learn how to start using NumPy, and you’ll also learn how to install third-party modules.
Using Lists For Numerical Programming
Before introducing NumPy, let’s look at the problem with using lists for numerical applications. You can explore this in a Console session using a list containing the exam marks for six children:
>>> marks = [25, 42, 33, 23, 14, 22]
You’ve noticed that you made a mistake when marking, and you’ve only assigned half the marks for each question. Therefore, you’d like to multiply all the marks by two:
>>> marks * 2
[25, 42, 33, 23, 14, 22, 25, 42, 33, 23, 14, 22]
However, the result is not what you expect. The output is a list with twelve numbers. The seventh number is the same as the first, the eighth is the same as the second, and so on. The multiplication sign for an object of type list extends the list by repeating the items in the list.
Indeed, multiplication of lists only works if you multiply by an integer. If you try to multiply by a non-integer, an error is raised:
>>> marks * 1.5
Traceback (most recent call last):
  File "<input>", line 1, in <module>
TypeError: can't multiply sequence by non-int of type 'float'
You cannot multiply a sequence by any number that’s not an integer. You can solve either problem by using a different algorithm:
>>> new_marks = []
>>> for mark in marks:
...     new_marks.append(mark * 2)
...
>>> new_marks
[50, 84, 66, 46, 28, 44]
This method will be familiar to you. You create a new empty list and then use a for loop to iterate through the numbers and append double each number to the new list.
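The same loop pattern also covers the non-integer case that raised the TypeError. Here is a minimal sketch, assuming you want to scale each mark by 1.5 (the name scaled_marks is only for illustration):

```python
marks = [25, 42, 33, 23, 14, 22]

# marks * 1.5 raises a TypeError for a list,
# so multiply each item individually inside a loop instead.
scaled_marks = []
for mark in marks:
    scaled_marks.append(mark * 1.5)

print(scaled_marks)
```

Each item in the new list is a float, even when the result is a whole number.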
Making more modifications
You now decide to add ten marks to each student as you feel you’ve been too strict with your marking. However, if you try adding 10 to the list which contains the new marks, you’ll get an error:
>>> new_marks + 10
Traceback (most recent call last):
  File "<input>", line 1, in <module>
TypeError: can only concatenate list (not "int") to list
The TypeError states that a list cannot be concatenated with an integer but only with another list. This error lets you know that, when used with lists, the + sign adds the elements of one list to the end of the other list:
>>> [1, 2, 3] + [10, 12]
[1, 2, 3, 10, 12]
Therefore, to add ten marks to each student, you’ll have to write a loop again:
>>> final_marks = []
>>> for mark in new_marks:
...     final_marks.append(mark + 10)
...
>>> final_marks
[60, 94, 76, 56, 38, 54]
You can merge the doubling and the addition of ten marks into one for loop. However, you’ll still need at least one loop when using this method.
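As a sketch of the merged version, the single loop below doubles each mark and adds ten in one step, starting again from the original list of marks:

```python
marks = [25, 42, 33, 23, 14, 22]

# Double each mark and add ten in a single pass through the list.
final_marks = []
for mark in marks:
    final_marks.append(mark * 2 + 10)

print(final_marks)
```

The intermediate list new_marks is no longer needed, but the loop itself is still there.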
Using List Comprehensions
In the Snippets section in Chapter 4, you learned about list comprehensions. List comprehensions are ideal when you’d like to avoid for loops such as the ones above. You can perform the same operations using list comprehensions:
>>> new_marks = [mark * 2 for mark in marks]
>>> new_marks
[50, 84, 66, 46, 28, 44]
>>> total_marks = [mark + 10 for mark in new_marks]
>>> total_marks
[60, 94, 76, 56, 38, 54]
List comprehensions allow you to replace the three lines needed in the for loop with a single line of code. The append() method is not required, either.
If you prefer, you can perform both operations in one list comprehension:
>>> total_marks = [mark * 2 + 10 for mark in marks]
>>> total_marks
[60, 94, 76, 56, 38, 54]
You created total_marks directly from the original list of marks without using the intermediate list new_marks. If you have many mathematical operations you need to perform on several data structures, using list comprehensions is quicker and produces more compact code. However, there is an easier way to perform these operations using NumPy.
Installing Third-Party Modules
You’ve already used the import keyword several times to import modules such as random. However, if you try to import numpy, you’re likely to get a ModuleNotFoundError since numpy is not part of the standard library.
The Python standard library includes modules which the Python Steering Council decides should be part of the standard Python distribution. When you installed Python from python.org before you started learning, you installed the Python language and all the modules that are part of the standard library on your computer.
However, there are many more modules in Python. Indeed, one of the features that make Python such a powerful and popular language is the vast range of available modules covering many different applications. You wouldn’t want over a hundred thousand modules to be installed on your computer when you’ll never use the majority of them.
For this reason, you need to install any third-party modules you need. Installing modules should be straightforward. However, that’s not always the case. And there’s more than one way to install modules, which can make it more confusing. Let’s look at some of the options.
Using your IDE to install modules
The easiest way to install third-party modules is to use your IDE. As I’ve done elsewhere in this book, I’ll use PyCharm as an example to demonstrate this feature.
You can open the Preferences or Settings window, depending on what operating system you’re using. One of the options in the sidebar on the left is the Project menu item which PyCharm displays as Project: <Your Project Name>. You can then choose the Python Interpreter option. The interpreter is the version of Python you’re using. If you have several Python versions installed, this is where you can change between versions.
You’ll also see a panel showing Packages that are installed. You probably won’t have too many listed here for now. This panel only shows third-party modules, and therefore, modules that are part of the standard library are not shown.
You’ll also find a small + icon that will show Install when you hover over it. When you click on the + icon, a new window opens up, offering you a list of all third-party modules available for you to download. You can start scrolling down, but you’ll see the scrollbar is progressing very slowly! There are many modules available to download. Therefore, it’s best to use the search bar in the Available Packages window.
Type in numpy in the search bar, and you’ll see the numpy module appear at the top. Note that the module name is written in all lowercase letters. Make sure it’s selected and click on the Install Package button. After a few seconds, you’ll see a success message, and you’re done. You can now close these windows, and you’re all set. You now have access to the numpy package.
Using pip
It’s helpful to know what’s happening underneath the hood when installing a package using your IDE. Python has a package management system that allows you to install and manage packages. The default package installer for Python is pip, which was installed along with Python when you first installed the language.
Your IDE has a built-in Terminal. The Terminal is the command-line tool that allows you to access the operating system. The Terminal is not specific to Python, and even without PyCharm, you’ll still have access to a Terminal or a similar tool on your Windows, Mac, or Linux operating system.
You should be able to open your Terminal in PyCharm by clicking on the Terminal tab that’s usually at the bottom of your PyCharm window, or you can find it in the View/Tool Windows menu bar.
You can type the following command after the prompt symbol in the Terminal:
pip install numpy
or
python -m pip install numpy
Either command will install numpy in the same way as the IDE method you used earlier. You can try this out even if you’ve already installed numpy.
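If you’d like to check whether the installation worked without triggering a ModuleNotFoundError, one option is to ask Python whether it can locate the module. The sketch below uses importlib.util.find_spec() from the standard library. The helper name is_installed() is made up for this example:

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


# random is part of the standard library; numpy is only found once installed.
for name in ("random", "numpy"):
    status = "available" if is_installed(name) else "not installed"
    print(f"{name}: {status}")
```

If numpy is reported as not installed, the package was probably installed for a different interpreter from the one running your code.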
Using Anaconda
Although pip is the default package manager, it’s not the only one. Another popular one often used in quantitative applications is conda, which is the default package manager when installing a programming environment using Anaconda. This is not relevant to you if you’ve followed the instructions in Chapter 0 of this book. However, if you installed Python using Anaconda, NumPy is already present as it’s included by default in the Anaconda distribution.
I won’t discuss Anaconda and the conda package manager further here, although you may read and hear about it as an alternative to pip.
Introducing NumPy
NumPy introduces a new data type that forms the basis of all numerical programming in Python. It also includes functionality that allows you to manipulate numerical data by performing many mathematical operations.
The primary data type introduced by NumPy is the ndarray. Let’s return to the example with the student marks above. You can convert the list into a numpy.ndarray:
>>> import numpy
>>> marks = [25, 42, 33, 23, 14, 22]
>>> type(marks)
<class 'list'>
>>> marks = numpy.array(marks)
>>> type(marks)
<class 'numpy.ndarray'>
>>> marks
array([25, 42, 33, 23, 14, 22])
You start with creating marks as a list, as you’ve done earlier. Then, you use the array() function in the numpy module to convert this to a numpy.ndarray. An ndarray behaves like a sequence, and therefore you can use indexing and slicing in the same way as for other sequences:
>>> marks[2]
33
>>> marks[1:4]
array([42, 33, 23])
An ndarray is also iterable, and therefore you can use it in a for loop:
>>> for mark in marks:
...     print(mark)
...
25
42
33
23
14
22
However, you won’t need to iterate through an ndarray in this fashion in many instances, as you’ll see shortly.
Using NumPy’s ndarray
You can now transform the marks in the same way you did earlier, by doubling all the marks and then adding 10 to each one:
>>> new_marks = marks * 2
>>> new_marks
array([50, 84, 66, 46, 28, 44])
>>> final_marks = new_marks + 10
>>> final_marks
array([60, 94, 76, 56, 38, 54])
You’ll recall that you couldn’t simply multiply and add when using a list, but you can do so with an ndarray. Operations are performed on an element-by-element basis. You can even apply arithmetic operators to pairs of arrays:
>>> numbers_a = numpy.array([2, 3, 4, 5])
>>> numbers_b = numpy.array([8, 9, 10, 11])
>>> numbers_a + numbers_b
array([10, 12, 14, 16])
>>> numbers_a * numbers_b
array([16, 27, 40, 55])
Adding and multiplying the two arrays returns another array in which each item is the result of the operators applied to the corresponding pair of elements from the two arrays. Note that the array() function converts another sequence into an ndarray, and this is why the arguments in the first two lines above are lists.
Lists aren’t designed for mathematical operations. A list is a flexible container for any type of data. However, the ndarray data type is designed specifically for numerical data, and therefore, it makes mathematical operations a lot more straightforward. As you’ll see later, the ndarray data type is not there just to make writing code a bit quicker and less error-prone. It provides other advantages, too.
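Multiplying by a non-integer, which raised a TypeError with a list, also works element by element with an ndarray. A short sketch, assuming numpy is installed (the name scaled is only for illustration):

```python
import numpy

marks = numpy.array([25, 42, 33, 23, 14, 22])

# Each element is multiplied by 1.5; no loop or list comprehension is needed.
scaled = marks * 1.5

print(scaled)
print(scaled.dtype)
```

The result is a new array of floats, and the original array of integers is left unchanged.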
Importing Using An Alias
Time for a short detour. You’ll recall from The White Room analogy that when you import a module, it’s as though you’re bringing a book from the library with lots of additional commands within it. When you import random, for example, Monty leaves the White Room and heads to the library where he’ll fetch the book labelled random. He’ll bring it back to the White Room, which represents the computer program environment. Finally, Monty places this book on the shelf, ready to be used whenever he encounters the name random.
When you import a module, you can choose to rename it using an alias. Adding as alias is equivalent to sticking a new label on the book, covering the old name. Let’s look at an example:
>>> import random as feeling_lucky
>>> feeling_lucky.randint(0, 5)
4
>>> random.randint(0, 5)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
NameError: name 'random' is not defined
You import the random module using the as keyword, and you provide an alternative name you wish to use for the module. Once Monty brings in the random book from the library, he sticks a label covering the name random on the book, and he writes feeling_lucky on the label before placing the book on the shelf. You can use any alias name you wish.
When you write feeling_lucky, you’re fetching the book that used to be called random. Its contents are still the same, as you can see when using the randint() function. Notice that you can no longer use the name random in this program. The alias only applies to the specific program you’re writing. You’re not changing the name of the module permanently!
By convention, numpy is imported using the alias np:
>>> import numpy as np
This convention comes from the older days when IDEs with auto-completion features weren’t available. Using the alias np saved a lot of typing in those days. It also makes lines of code shorter and more concise. In any case, this is now the convention, and you’ll see that almost all Python code that includes numpy uses this alias.
Exploring NumPy’s ndarray
There are some useful properties of an ndarray that you’ll need when dealing with this data type. The first of these is the data type of the items in the array. You can find what this is using the dtype property:
>>> import numpy as np
>>> numbers = np.array([3, 4, 5, 6])
>>> numbers
array([3, 4, 5, 6])
>>> numbers.dtype
dtype('int64')
The variable numbers is of type np.ndarray. However, NumPy identified its contents as integers. Therefore, numbers.dtype shows the data type of the array’s contents. The data type of the array’s items is int64. This is NumPy’s own version of an integer. The suffix 64 indicates this is a 64-bit integer, showing how much memory is used to store each number. I would recommend you don’t worry too much about the difference between Python’s native int and NumPy’s int64 or other NumPy integer types for the time being.
You can convert the items in the ndarray from one data type to another using the astype() method:
>>> numbers = numbers.astype('float')
>>> numbers.dtype
dtype('float64')
>>> numbers
array([3., 4., 5., 6.])
You convert the items in numbers into floats on the first line, as shown when displaying the dtype property. Note that when you display the array, the numbers are followed by a dot to indicate their data type is a float and not an integer type, even though they’re whole numbers.
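The astype() method also works in the other direction. Here is a minimal sketch, with made-up values, showing that converting floats to an integer type discards the fractional part rather than rounding:

```python
import numpy as np

values = np.array([1.7, 2.2, -0.9, 3.0])

# Converting to an integer type drops everything after the decimal point.
as_integers = values.astype(int)

print(as_integers)
print(as_integers.dtype)
```

Since astype() returns a new array, values still holds the original floats afterwards.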
The shape of an array
Another useful property of an array is its shape:
>>> numbers.shape
(4,)
The property shape will make more sense with higher-dimension arrays. You can create a two-dimensional array by converting a list of lists into an ndarray:
>>> numbers = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
>>> numbers
array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12]])
>>> numbers.shape
(3, 4)
The first line creates a two-dimensional array with three rows and four columns. The shape of the ndarray shows the size of each dimension. ndarray stands for n-dimensional array, and therefore, you can create arrays of any dimension. You can get the number of dimensions of an array using the ndim property:
>>> numbers.ndim
2
This confirms that numbers is a two-dimensional array. When you index an n-dimensional array, you can provide indices for each dimension:
>>> numbers
array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12]])
>>> numbers[1, 3]
8
You reference the item on the second row (index 1) and the fourth column (index 3). You can also mix and match between indexing and slicing:
>>> numbers[1:3, 3]
array([ 8, 12])
>>> numbers[1:3, :2]
array([[ 5,  6],
       [ 9, 10]])
>>> numbers[1:3, :]
array([[ 5,  6,  7,  8],
       [ 9, 10, 11, 12]])
In the first example above, you reference the items in the second and third rows (slice 1:3) and the fourth column (index 3). You’ll recall that the endpoint of a slice is excluded. The second example shows the items in the second and third rows (slice 1:3) and in the first and second columns (slice :2). The slice :2 is the same as 0:2. In the final example, you reference all the columns using the slice : representing the full range.
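The same notation lets you pull out whole rows or whole columns. The sketch below reuses the three-by-four array from above, and the variable names are only for illustration:

```python
import numpy as np

numbers = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

first_row = numbers[0]              # a single index selects a whole row
first_column = numbers[:, 0]        # all rows, first column
last_two_columns = numbers[:, -2:]  # all rows, last two columns

print(first_row)
print(first_column)
print(last_two_columns.shape)
```

Selecting a single row or column gives a one-dimensional array, while slicing along both dimensions keeps the result two-dimensional.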
Comparing Loops, List Comprehensions, and Arrays
Let’s get back to performing mathematical operations on number sequences. You have a large sequence of temperatures in Celsius (ºC), and you want to convert them to Fahrenheit (ºF). For this exercise, you can create a random list of temperatures to work with:
import random

temperatures = [random.randint(-100, 350) / 10 for _ in range(1_000_000)]
You create a list called temperatures using a list comprehension, which you learned about in the Snippets section in Chapter 4. Let’s break down the list comprehension:
- random.randint(-100, 350) creates a random integer in that range. This integer is then divided by 10 so that the result is a number between -10.0 and 35.0, with one decimal place. randint() only generates integers, so this is one way of creating random floats.
- The for statement uses an underscore as the variable name. This technique is common practice when you don’t need to use the variable in your algorithm.
- range(1_000_000) is the same as range(1000000). The underscore within numbers makes them easier to read, just as commas (or full stops in some languages) are used when writing numbers, such as 1,000,000.
- The list comprehension, therefore, creates a million temperatures between -10ºC and +35ºC.
You can print out the first few items of the list if you’d like to see its contents.
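As a quick check of the points above, this sketch builds a much smaller sample with the same expression and confirms the range of values. The name sample and the size of 1,000 are arbitrary choices for the example:

```python
import random

# Same expression as before, but with only 1,000 items to keep it quick.
sample = [random.randint(-100, 350) / 10 for _ in range(1_000)]

print(sample[:5])
print(min(sample) >= -10.0 and max(sample) <= 35.0)

# Underscores in numeric literals don't change the value.
print(1_000_000 == 1000000)
```

The values differ each time you run the code, but they always stay within the range from -10.0 to 35.0.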
Creating a function to convert the temperatures
You can now create a function that takes a list as an input argument and returns another list with temperatures that have been converted from ºC to ºF:
import random

temperatures = [random.randint(-100, 350) / 10 for _ in range(1_000_000)]

def convert_loop(data):
    result = []
    for temperature in data:
        result.append(temperature * 1.8 + 32)
    return result

print(temperatures[:10])
print(convert_loop(temperatures)[:10])
The function convert_loop converts temperatures using the "classic" for loop method. Although you’re using a list as an argument for the function in this case, you can use any iterable.
You initialise the list result as an empty list, and then you append the temperature in ºF by multiplying the temperature in ºC by 1.8 and adding 32. The function returns the list with the results. You can check that this works by printing the first ten temperatures in each list. In the last line, you’re using a shortcut. You’re calling the function, which is immediately followed by the slicing notation [:10]. This notation is possible since the function returns a list:
[-0.9, -5.2, 4.5, 29.1, 20.1, 26.6, -8.6, 17.3, 3.8, 25.1]
[30.38, 22.64, 40.1, 84.38, 68.18, 79.88, 16.52, 63.14, 38.84, 77.18]
The conversion has worked. The second list has temperatures in ºF which correspond to the temperatures in ºC in the first list.
Using list comprehensions
You can now write a second version of this function which uses list comprehensions:
import random

temperatures = [random.randint(-100, 350) / 10 for _ in range(1_000_000)]

def convert_loop(data):
    result = []
    for temperature in data:
        result.append(temperature * 1.8 + 32)
    return result

def convert_comp(data):
    result = [temperature * 1.8 + 32 for temperature in data]
    return result

print(convert_loop(temperatures) == convert_comp(temperatures))
The second function definition also takes an iterable as an argument. This function follows the same logic as the first one but uses a list comprehension instead. As a sanity check, in the final line of the code above, you compare the lists returned by the two functions to make sure they’re identical. This gives the following output:
True
The result of the equality operator == shows that the two functions perform the same task.
Timing function execution
Apart from saving a few lines of code, is there any difference between these two functions? Before answering this question, I’ll introduce a new module that’s part of the Python standard library called timeit. This module has tools that allow you to time how long code takes to run.
>>> import timeit
>>> timeit.timeit("a=5", number=1_000_000)
0.011481411999994862
You’re using a function called timeit() from the module that’s also called timeit. The statement a=5 is executed a million times, and the result of the timeit() function is the number of seconds this takes. The process of assigning an integer to a variable name doesn’t take up too much time.
If you have predefined variables, you’ll need to pass these to the timeit() function by using the globals() built-in function, which returns a dictionary of the names defined in the current global environment:
>>> a = 5
>>> b = 10
>>> timeit.timeit("a+b", number=1_000_000, globals=globals())
0.05935179500011145
The result of the globals() function is used as a keyword argument associated with the timeit() parameter globals.
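As an alternative to passing a string and globals(), timeit() also accepts a callable with no arguments. The sketch below wraps the function call in a lambda. The short list of temperatures and the number of repetitions are arbitrary values chosen to keep the example quick:

```python
import timeit


def convert_comp(data):
    return [temperature * 1.8 + 32 for temperature in data]


temperatures = [-10.0, 0.0, 21.5, 35.0]

# The lambda can already see convert_comp and temperatures,
# so the globals argument isn't needed here.
elapsed = timeit.timeit(lambda: convert_comp(temperatures), number=10_000)

print(f"{elapsed:.4f} seconds for 10,000 calls")
```

The timing includes the small overhead of calling the lambda itself, so this form is best suited to code that takes noticeably longer than a single function call.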
Comparing loops and list comprehensions
You can time the execution of the two functions you wrote to convert temperatures:
import random
import timeit

temperatures = [random.randint(-100, 350) / 10 for _ in range(1_000_000)]

def convert_loop(data):
    result = []
    for temperature in data:
        result.append(temperature * 1.8 + 32)
    return result

def convert_comp(data):
    result = [temperature * 1.8 + 32 for temperature in data]
    return result

print(convert_loop(temperatures) == convert_comp(temperatures))

print('\nUsing the "classic" method with a for loop and a list:')
print(timeit.timeit("convert_loop(temperatures)", number=100, globals=globals()))

print("\nUsing a list comprehension:")
print(timeit.timeit("convert_comp(temperatures)", number=100, globals=globals()))
In the final few lines you’re timing how long the two functions take to run. You’re running each function a hundred times. The exact time this will take to run depends on what computer you’re using and what other processes may be occurring in the background in your operating system. The results I obtained on my laptop are the following:
True
Using the "classic" method with a for loop and a list:
13.20451985
Using a list comprehension:
9.715821733
You can see that list comprehensions win the race in this instance. In general, list comprehensions are more efficient than the for loop method.
Comparing with NumPy’s ndarray
You can write a third version of the function using NumPy’s ndarray data type. This function is only a first attempt, as you’ll write a different version further on:
import random
import numpy as np
import timeit

temperatures = [random.randint(-100, 350) / 10 for _ in range(1_000_000)]

def convert_loop(data):
    result = []
    for temperature in data:
        result.append(temperature * 1.8 + 32)
    return result

def convert_comp(data):
    result = [temperature * 1.8 + 32 for temperature in data]
    return result

def convert_numpy(data):
    return np.array(data) * 1.8 + 32

print(
    convert_loop(temperatures)
    == convert_comp(temperatures)
    == list(convert_numpy(temperatures))
)

print('\nUsing the "classic" method with a for loop and a list:')
print(timeit.timeit("convert_loop(temperatures)", number=100, globals=globals()))

print("\nUsing a list comprehension:")
print(timeit.timeit("convert_comp(temperatures)", number=100, globals=globals()))

print("\nUsing numpy:")
print(timeit.timeit("convert_numpy(temperatures)", number=100, globals=globals()))
The function convert_numpy() also accepts any iterable as an input argument. The function converts this into a NumPy ndarray and then performs the required arithmetic operations. It’s easier to write and read the code as it doesn’t have an explicit for statement since operations on an ndarray are performed on an element-by-element basis.
The sanity check is confirming that all three functions return the same result. As the function convert_numpy returns an ndarray and not a list, you’re converting its output to a list when you compare it with the previous two functions.
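Converting to a list is one way to run this check. As an alternative sketch, NumPy’s array_equal() and allclose() functions compare a list and an array directly. The short list of temperatures below is made up for the example:

```python
import numpy as np

celsius = [-10.0, 0.0, 21.5, 35.0]

from_list = [temperature * 1.8 + 32 for temperature in celsius]
from_array = np.array(celsius) * 1.8 + 32

# array_equal() checks for exactly matching shapes and values.
print(np.array_equal(from_list, from_array))

# allclose() allows for tiny floating-point differences.
print(np.allclose(from_list, from_array))
```

When results come from different sequences of floating-point operations, allclose() is usually the safer of the two, since exact equality can fail because of rounding.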
Finally, you’re timing how long this new function takes to run. The output obtained is the following:
True
Using the "classic" method with a for loop and a list:
13.463441471
Using a list comprehension:
9.827219727
Using numpy:
6.586534716999999
The value True output by the equality checks shows that the functions are performing the same task. The list comprehension version has now slipped into silver medal position as the NumPy version is quicker.
Second NumPy version
However, this is not a fair comparison. The function convert_numpy() has an additional task to perform compared to the others. It needs to convert the original list into an array. The other functions didn’t need to do any type conversions.
You can convert the list into an array before you call the function, and you can write another version of the function which requires an ndarray as an input argument rather than any iterable:
import random
import numpy as np
import timeit

temperatures = [random.randint(-100, 350) / 10 for _ in range(1_000_000)]
temperatures_array = np.array(temperatures)

def convert_loop(data):
    result = []
    for temperature in data:
        result.append(temperature * 1.8 + 32)
    return result

def convert_comp(data):
    result = [temperature * 1.8 + 32 for temperature in data]
    return result

def convert_numpy(data):
    return np.array(data) * 1.8 + 32

def convert_numpy_2(data: np.ndarray):
    return data * 1.8 + 32

print(
    convert_loop(temperatures)
    == convert_comp(temperatures)
    == list(convert_numpy(temperatures))
    == list(convert_numpy_2(temperatures_array))
)

print('\nUsing the "classic" method with a for loop and a list:')
print(timeit.timeit("convert_loop(temperatures)", number=100, globals=globals()))

print("\nUsing a list comprehension:")
print(timeit.timeit("convert_comp(temperatures)", number=100, globals=globals()))

print("\nUsing numpy (converting to array inside function):")
print(timeit.timeit("convert_numpy(temperatures)", number=100, globals=globals()))

print("\nUsing numpy (converting to array before function):")
print(
    timeit.timeit("convert_numpy_2(temperatures_array)", number=100, globals=globals())
)
You have now written the function convert_numpy_2() to accept an ndarray. The type hint in the function signature reinforces this, although this is not required. The output from this code shows that this final version is significantly faster than any of the other versions:
True
Using the "classic" method with a for loop and a list:
13.665791161
Using a list comprehension:
9.722372186999998
Using numpy (converting to array inside function):
6.823992463999996
Using numpy (converting to array before function):
0.10436863599999668
Transforming an ndarray into another ndarray through arithmetic operations is significantly quicker than performing the same operation using lists. The increase in efficiency will depend on the operations you’re performing. It may not always be as significant an improvement as in the example above. However, this example illustrates how, in many situations, using NumPy arrays is more efficient than using Python’s built-in data types.
When dealing with very large datasets and intensive computations, you can save a lot of time by using NumPy.
Why is NumPy so quick?
Why is NumPy so much quicker? There’s no magic involved, as you’ll learn in this section. This Chapter doesn’t aim to cover every detail of the NumPy package, so I won’t go through everything that’s happening underneath the hood. However, it’s useful to gain a general understanding of how NumPy works and why it’s so quick compared with other tools in Python.
Python is a language that prioritises speed of development. Writing a program in Python is significantly quicker than writing it in some lower-level languages, such as C. This comes at a cost, and the price to pay is execution speed. Once the code is written, it will generally run faster in C than in Python.
Part of the reason for this difference is that when you run a Python program, the code needs to be translated into more machine-readable forms while the program is running. Python is a high-level language. You can think of this as the closest to a human-readable language you can get for a programming language. Low-level languages are meant for the machine to read and are not easy to read for a human. There are several steps required to convert the Python code you write to the 0s and 1s which then become voltages in the computer’s chips.
Some languages, such as C, use a different approach. When you write your code, you will first compile the program, which means you’re making some of those "translations" in advance. You’ll then run the compiled code when you need to run the program, which is quicker.
NumPy’s pre-compiled C Code
NumPy uses pre-compiled C code. This means that underneath the hood, when you call a NumPy function, the heavy lifting is done using C code that has already been "translated" to machine-readable language.
Therefore, NumPy gives you the best of both worlds. You have the relative ease and speed of developing code that Python offers and the speed of execution from pre-compiled C code.
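You can see a similar effect without NumPy. In the standard CPython interpreter, built-in functions such as sum() are themselves implemented in C. The sketch below times a hand-written loop against sum(). The function name total_with_loop() and the sizes used are only for illustration, and the timings you get will depend on your computer:

```python
import timeit

values = list(range(100_000))


def total_with_loop(data):
    # The addition happens one item at a time in Python code.
    total = 0
    for value in data:
        total += value
    return total


loop_time = timeit.timeit(lambda: total_with_loop(values), number=20)
builtin_time = timeit.timeit(lambda: sum(values), number=20)

print(total_with_loop(values) == sum(values))
print(f"for loop: {loop_time:.4f} seconds")
print(f"sum():    {builtin_time:.4f} seconds")
```

Both versions give the same total, but you’ll typically find that the built-in function finishes sooner since its loop runs in pre-compiled code.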
Don’t worry too much about what’s happening underneath the hood. You can use NumPy without knowing any of this detail. Indeed, you can ignore the "Why is NumPy so quick?" section entirely if you wish. Maybe, I should have said so at the top of this section!
Using Documentation
There’s a lot more to NumPy than what I’m introducing in this Chapter. When dealing with large packages such as NumPy, you’ll need to use the package’s documentation.
In the Snippets section at the end of this Chapter, I’ll discuss how to look for and get help in Python. I’ll discuss using Python’s help() function, using Python’s online official documentation, looking for information elsewhere online, and using the documentation for third-party packages such as NumPy.
You can get a sense of how large NumPy is by looking at the table of contents of NumPy’s documentation. If you click on the items on the sidebar on the left of the documentation page, you’ll be able to drill through to see the package’s various functions and other contents. The Routines section, in particular, contains most of the functionality within NumPy.
In the rest of this Chapter, I’ll look at some of the operations you can perform on data using NumPy.
Using NumPy
The ndarray
data type introduced by NumPy is not the only feature of this package. NumPy is designed for numerical programming and has a vast range of built-in functionality to deal with numerical data.
Let’s explore a few of these:
>>> import numpy as np
>>> numbers = np.array([5, 6, 3.2, 2, 6, 3, 2.3])
>>> numbers
array([5. , 6. , 3.2, 2. , 6. , 3. , 2.3])
>>> numbers.dtype
dtype('float64')
>>> numbers.max()
6.0
>>> numbers.mean()
3.9285714285714284
>>> numbers.min()
2.0
>>> numbers.sum()
27.5
You start by converting a list containing a mixture of integers and floats into a NumPy ndarray. Note that the items in the array have all been converted to floats so that they’re all the same data type. You confirm this by displaying the value of numbers.dtype.
The rest of the examples show you some of the basic operations you may need to perform on your data. In these examples, you’re finding:
- the maximum value in the array
- the mean of the numbers in the array
- the minimum value in the array
- the sum of all numbers in the array
These are basic mathematical operations, but NumPy also contains many other functions that perform more advanced tasks.
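As a brief illustration of a few of these other tools, here’s a sketch using the same numbers. The variable names are just examples, and these functions are a small selection rather than a complete list:

```python
import numpy as np

numbers = np.array([5, 6, 3.2, 2, 6, 3, 2.3])

# The middle value once the data is sorted
median = np.median(numbers)

# The standard deviation, a measure of how spread out the data is
spread = numbers.std()

# The index of the largest value (the first one if there are ties)
position = numbers.argmax()

# A sorted copy of the data; the original array is unchanged
ordered = np.sort(numbers)

print(median, spread, position)
print(ordered)
```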
Looking at two-dimensional arrays
You can try to use the same methods on a two-dimensional array:
>>> numbers = np.array([[1, 5, 9, 13], [5, 2, 8, 6], [19, 10, 1, 9]])
>>> numbers
array([[ 1,  5,  9, 13],
       [ 5,  2,  8,  6],
       [19, 10,  1,  9]])
>>> numbers.max()
19
>>> numbers.max(0)
array([19, 10,  9, 13])
>>> numbers.max(1)
array([13,  8, 19])
You’re converting a list of lists into a two-dimensional ndarray consisting of three rows and four columns. As was the case with the one-dimensional array, the max() method returns the maximum value in the array. However, you can also find the maximum along either of the two axes of the array.
numbers.max(0) finds the maximum values along the first axis, which is axis=0. The result is a single "row" with the maximum value from each column. You can also get the maximum values along the second axis using numbers.max(1), which returns a single "column" with the maximum values from each row.
There are other optional arguments you can use with max(). You can see the full signature below, although my advice is not to try to understand every parameter of every function at first. It’s best to learn about each parameter as and when you need it:
ndarray.max(axis=None, out=None, keepdims=False, initial=<no value>, where=True)
You can read more about the max() method in NumPy’s documentation.
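To give a flavour of one of these optional parameters, here’s a short sketch using keepdims with the same two-dimensional array. The variable names row_max and row_max_2d are only for illustration:

```python
import numpy as np

numbers = np.array([[1, 5, 9, 13], [5, 2, 8, 6], [19, 10, 1, 9]])

# By default, the result along an axis is a one-dimensional array
row_max = numbers.max(axis=1)

# With keepdims=True, the result keeps two dimensions:
# three rows and a single column
row_max_2d = numbers.max(axis=1, keepdims=True)

print(row_max.shape)
print(row_max_2d.shape)
```

The values are the same in both cases. Only the shape of the result differs.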
Using NumPy with or without the OOP approach
In the example above, you may have recognised patterns you learned about in the previous Chapter on object-oriented programming. The object numbers is of type ndarray. You’re not creating the object directly. Instead, the function array() creates this instance of the class ndarray as it converts a sequence into an array.
The functions max(), min(), mean(), and sum() are methods of the class ndarray. They’re functions that are accessible to objects of the class and that act on those objects.
In the previous Chapter, I described Python as a multi-paradigm language. This means you have a choice on which style of programming to use. Packages such as NumPy perfectly represent this because they offer different ways of achieving the same functionality.
If you’re not using the OOP approach, you can use max() directly as a function instead of a method:
>>> np.max(numbers, 1)
array([13,  8, 19])
You call the max() function directly from the NumPy module, which you import using the alias np. However, as you’re not using this as a method, you need to pass the array as the first argument in the function. The axis is the second argument. The array returned is the same as when you used numbers.max(1).
Incidentally, with functions and methods in packages such as NumPy, which often can take several optional arguments, it’s common to use keyword arguments:
>>> np.max(numbers, axis=1)
array([13,  8, 19])
>>> numbers.max(axis=1)
array([13,  8, 19])
The OOP approach can be easier and simpler to use and I would recommend using this approach. However, there are instances when the OOP approach is not the preferred style. Later in this book, you’ll read about another programming paradigm called Functional Programming, which takes a very different approach compared with OOP. NumPy allows you to use either style.
Using NumPy’s ndarray With Boolean Indices
You’ll recall that a number of operators in Python, such as the greater-than operator >, return a Boolean value. You can try using > with a list:
>>> more_numbers = [3, 6, 2, 9, 10, 1, 7]
>>> more_numbers > 4
Traceback (most recent call last):
  File "<input>", line 1, in <module>
TypeError: '>' not supported between instances of 'list' and 'int'
This operator is not supported when you try to compare a list with an integer. How would you create a new list that only contains the items from the original list which are greater than 4? Before reading further, you should try to write the few lines of code you’ll need to do this, and if you feel comfortable using list comprehensions, you could even write two versions of this algorithm, one using loops and another using list comprehensions.
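Once you’ve had a go, you can compare your attempt with the sketch below. This is one possible solution rather than the only one, and the names filtered_loop and filtered_comprehension are just examples:

```python
more_numbers = [3, 6, 2, 9, 10, 1, 7]

# Version 1: using a for loop
filtered_loop = []
for number in more_numbers:
    if number > 4:
        filtered_loop.append(number)

# Version 2: using a list comprehension
filtered_comprehension = [number for number in more_numbers if number > 4]

print(filtered_loop)  # [6, 9, 10, 7]
print(filtered_comprehension)  # [6, 9, 10, 7]
```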
You’ve probably guessed by now that a NumPy ndarray may behave differently to a list in this case:
>>> import numpy as np
>>> more_numbers = [3, 6, 2, 9, 10, 1, 7]
>>> numbers = np.array(more_numbers)
>>> numbers > 4
array([False,  True, False,  True,  True, False,  True])
Once you convert the list to an ndarray, you can use the > operator which returns another ndarray containing Boolean values. The greater-than operator acts on the array on an element-by-element basis.
Further indexing with NumPy arrays
Earlier in this Chapter, you learned that you can use indexing with an ndarray in a similar fashion as you would with lists. However, you can go even further with NumPy’s array. Using the same ndarray you created in the previous example:
>>> numbers
array([ 3,  6,  2,  9, 10,  1,  7])
>>> numbers[2]
2
>>> numbers[2:5]
array([ 2,  9, 10])
>>> numbers[[True, False, True, False, True, False, True]]
array([ 3,  2, 10,  7])
The first two examples show indexing ([2]) and slicing ([2:5]). In the final example, you’re using a list of Booleans as the index. Notice that there are two sets of square brackets; the outer one indicates you’re indexing the array, the inner one indicates the object you’re using as an index is a list. This list has the same number of items as the ndarray.
Using a list of Booleans to index an array results in another array that’s a subset of the original one. The array items which correspond to a True value are retained, whereas the items which correspond to False are discarded. You can also use an ndarray of Booleans instead of a list of Booleans as an index.
You can now combine the operations you’ve learned in this section. The greater-than operator returns an array of Boolean values when used with an ndarray, and an array of Booleans can be used as an index for another ndarray. You can therefore filter an array directly through indexing:
>>> numbers
array([ 3,  6,  2,  9, 10,  1,  7])
>>> numbers[numbers > 4]
array([ 6,  9, 10,  7])
The statement inside the square brackets returns an array of Booleans which is then used to filter numbers.
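You can take this idea a step further by combining conditions. The sketch below goes slightly beyond what’s covered in this section. It assumes you want the values between 4 and 10, and it uses NumPy’s element-by-element operators & ("and") and ~ ("not"). The name mask is just a label chosen for this example:

```python
import numpy as np

numbers = np.array([3, 6, 2, 9, 10, 1, 7])

# Each comparison needs its own parentheses when combined with &
mask = (numbers > 4) & (numbers < 10)

# Items for which both conditions are True: 6, 9 and 7
in_range = numbers[mask]

# ~ flips every Boolean, so this keeps the remaining items: 3, 2, 10 and 1
out_of_range = numbers[~mask]

print(mask)
print(in_range)
print(out_of_range)
```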
Reading Data From A Spreadsheet Using NumPy
Many quantitative applications rely on data from external sources which are not created directly in the computer program. Often, this source is a spreadsheet.
You can read data directly from an Excel file. However, in this section, you’ll read data from a CSV file. You can always export any Excel spreadsheet as a CSV file.
The CSV file format is the most basic of formats for a spreadsheet. It’s a text file with items separated by commas and a newline character at the end of each line. The commas separate items that belong to different cells in a spreadsheet. Every spreadsheet program can open CSV files.
The Met Office temperature dataset
In this example, you’ll use real-world data provided by the Met Office, the UK’s national meteorological service. The Hadley Centre Central England Temperature (HadCET) dataset is the longest temperature record in the world. You’ll use the mean daily temperature data set from January 1772 until July 2021. For ease of use in this project, you can download the CSV file with this data from The Python Coding Book File Repository.
Download The Python Coding Book File Repository
Through the link above, you can download the folder you need directly to your computer. I would recommend this option which is the most straightforward. But if you prefer, you can also access the repository through Github.
NOTE: As the content of The Python Coding Book is currently being gradually released, this repository is not final, so you may need to download it again in the future when there are more files that you’ll need in later chapters.
There are two files you’ll need for this project from the repository:
- mean_daily_temperatures_from_1772.csv is the CSV file containing the data
- mean_daily_temperatures_format.html is an HTML file with an explanation of the data’s format
The HTML file includes the following content:
Column 1: year
Column 2: day
Columns 3-14: daily CET values expressed in tenths of a degree. There are 12 columns; one for each of the 12 months.
1772 1 32 -15 18 25 87 128 187 177 105 111 78 112
1772 2 20 7 28 38 77 138 154 158 143 150 85 62
1772 3 27 15 36 33 84 170 139 153 113 124 83 60
1772 4 27 -25 61 58 96 90 151 160 173 114 60 47
1772 5 15 -5 68 69 133 146 179 170 173 116 83 50
1772 6 22 -45 51 77 113 105 175 198 160 134 134 42
The first column contains the year and the second column represents the day of the month. The remaining columns represent the twelve months. Therefore, the value in the third column of the first line is the mean temperature on the 1st of January 1772, and the value in the fourth column of the first line is the temperature on the 1st of February 1772, and so on.
The dataset gives the temperatures in tenths of a degree, which means that all numbers are integers. You can convert the temperatures to degrees by dividing by ten.
Reading the CSV file using NumPy
Once you’ve transferred the CSV file to your project folder, you’re ready to read in the data from this file:
>>> import numpy as np
>>> data = np.loadtxt("mean_daily_temperatures_from_1772.csv", delimiter=",", dtype=int)
>>> data
array([[1772,    1,   32, ...,  111,   78,  112],
       [1772,    2,   20, ...,  150,   85,   62],
       [1772,    3,   27, ...,  124,   83,   60],
       ...,
       [2021,   29,   79, ..., -999, -999, -999],
       [2021,   30,   29, ..., -999, -999, -999],
       [2021,   31,    8, ..., -999, -999, -999]])
>>> data.shape
(7750, 14)
Notice that when the arrays are large, the output doesn’t show all the values and uses ellipsis ... instead. You’re using the loadtxt() function in the NumPy module to load data from a text file such as a CSV file. The first argument in loadtxt() is the filename as a string. There are also two further keyword arguments. One defines the delimiter in the text file so that the function knows how to separate the data into items. In this case, you’re using the comma as the delimiter since the data is in a CSV file. The final parameter allows you to define the data type. In this case, all the values in the CSV file are integers.
The loadtxt() function returns a two-dimensional array with 7750 rows and 14 columns, the same as the number of rows and columns in the CSV file.
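If you’d like to experiment with loadtxt() before downloading the file, you can try the sketch below. It uses a made-up, three-line extract written directly in the code as a string, and io.StringIO from the standard library makes that string behave like a file. The extract only has four columns, so it’s not the real dataset:

```python
import io

import numpy as np

# A hypothetical extract in CSV format: year, day, and two temperature columns
csv_text = "1772,1,32,-15\n1772,2,20,7\n1772,3,27,15\n"

# io.StringIO lets loadtxt() read from the string as if it were a file
sample = np.loadtxt(io.StringIO(csv_text), delimiter=",", dtype=int)

print(sample)
print(sample.shape)  # (3, 4)
```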
In the Snippets section at the end of this Chapter, you’ll learn how to read data from a CSV file from first principles without using NumPy.
Filtering and analysing the data
For the time being, you’re not interested in the association between the temperatures and the dates. You’ll come back to this dataset in a later Chapter when you learn about another package called Pandas. All you need for now is the set of temperatures.
You can remove the first two columns of this two-dimensional array:
>>> temperatures = data[:, 2:]
>>> temperatures
array([[  32,  -15,   18, ...,  111,   78,  112],
       [  20,    7,   28, ...,  150,   85,   62],
       [  27,   15,   36, ...,  124,   83,   60],
       ...,
       [  79, -999,  130, ..., -999, -999, -999],
       [  29, -999,  121, ..., -999, -999, -999],
       [   8, -999,  143, ..., -999, -999, -999]])
>>> temperatures.shape
(7750, 12)
In the first line, you create an array called temperatures from data by including all the rows but only a subset of the columns. The first colon in data[:, 2:] is the range of indices you’re choosing for the rows. The colon by itself indicates you want to use all the rows available. You use the slice 2: for the columns, which means you’d like to start from the column with index 2 and go all the way to the end. The columns with indices 0 and 1 are excluded.
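Two-dimensional slicing is easier to follow on a small array where you can see every item. Here’s a sketch using a made-up array of three rows and four columns. The name grid and the particular slices are only examples:

```python
import numpy as np

# A small array with three rows and four columns:
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]
grid = np.arange(12).reshape(3, 4)

# All rows, but only the columns from index 2 onwards
last_columns = grid[:, 2:]

# The row with index 1, and all of its columns
second_row = grid[1, :]

# The first two rows, and the columns with indices 1 and 2
block = grid[:2, 1:3]

print(last_columns)
print(second_row)
print(block)
```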
You may have noticed some entries in the original data have the value -999
. These represent dates for which there are no temperature readings, for example, cells representing the 30th and 31st of February or cells at the end of the spreadsheet representing dates that were in the future when this dataset was downloaded.
You can filter out these values using the methods you learned earlier:
>>> temperatures = temperatures[temperatures > -999]
>>> temperatures
array([ 32, -15,  18, ..., 143, 139, 161])
>>> temperatures.shape
(91158,)
The filtering step has now removed all occurrences of -999. This operation has also flattened the array. It’s no longer a two-dimensional array, but it’s now a one-dimensional array with 91,158 entries.
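You can see the flattening more clearly with a small, made-up array. In this sketch, the values and the names small and valid are only for illustration:

```python
import numpy as np

# A hypothetical 2 x 3 array with two missing readings marked as -999
small = np.array([[32, -15, -999], [20, -999, 28]])

# Filtering with a Boolean array returns a one-dimensional array
valid = small[small > -999]

print(small.ndim)  # 2
print(valid.ndim)  # 1
print(valid)
```

The items that remain are read row by row, so the result contains 32, -15, 20, and 28 in that order.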
Basic statistical analysis of data
You’ll recall that the original data shows the temperatures in tenths of a degree, so you can divide by ten to obtain temperatures in degrees. You’re now set to find out a bit more about the set of temperatures in England from 1772:
>>> temperatures = temperatures / 10
>>> temperatures.max()
25.2
>>> temperatures.min()
-11.9
>>> temperatures.mean()
9.393401566510894
>>> temperatures.std()
5.347949829419111
>>> counts, bins = np.histogram(temperatures)
>>> counts
array([ 15, 217, 2062, 9187, 18496, 20653, 19445, 17040, 3793, 250])
>>> bins
array([-11.9 , -8.19, -4.48, -0.77, 2.94, 6.65, 10.36, 14.07, 17.78, 21.49, 25.2 ])
The dataset represents mean daily temperatures. The highest and lowest mean temperatures over the entire period were 25.2°C and -11.9°C, respectively. You’re also calculating the mean and standard deviation of the data using the ndarray methods mean() and std().
In the final analysis, you use the NumPy function histogram(), which groups the data into equally spaced ranges or bins. The bins array shown contains the boundaries of the ranges, and the counts array contains the number of temperatures that fall in each range. For example, the first bin includes the temperatures between -11.9°C and -8.19°C (the first two items in bins), and 15 temperatures in the whole dataset fall in this range, as shown by the first item in counts.
The histogram() function created ten bins, which is the default number. Therefore, there are ten items in the array counts. However, the array bins contains eleven items as it also contains the endpoint of the last bin. The first and last items in bins are the minimum and maximum values you obtained earlier.
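To check your understanding of counts and bins, you can try histogram() on a handful of made-up values that you can count by hand. In this sketch, the data and the choice of four bins are arbitrary:

```python
import numpy as np

# Ten hypothetical values, small enough to count by hand
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 9])

# Ask for four bins instead of the default ten
counts, bins = np.histogram(values, bins=4)

print(bins)  # the boundaries are 1, 3, 5, 7 and 9
print(counts)  # 3, 5, 1 and 1 values fall in the four bins
```

Each bin includes its lower boundary but not its upper one, except for the last bin, which includes both. This is why the value 3 is counted in the second bin and the value 9 is counted in the last one.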
In a later Chapter, when you learn about the basics of data visualisation using Matplotlib, you’ll return to this result and plot the histogram rather than just reading the numbers from the console output.
Views and Copies: A Very Abridged Discussion
As mentioned earlier, this Chapter is not a detailed study of NumPy but a good introduction. The topic of views and copies is one that can mostly be left for a more advanced text dealing with NumPy. However, it’s worth a brief mention here, too. It will also serve as a helpful insight into the behind-the-scenes of a computer program, as discussed in The White Room analogy.
Let’s return briefly to the White Room and create a box with a label:
>>> numbers = [5, 7, 9, 11]
>>> id(numbers)
140445518556544
The box has the label numbers, and its content is a list. The built-in id() function returns the "true identity" of the object. You can think of this as the box’s serial number (yes, I know, cardboard boxes don’t have unique serial numbers, but just pretend they do). Each object in a program will have a unique id number. Usually, this is a reference to the memory location where the object is stored. Note that the id number you’ll get will be different to the one shown here.
With the following assignment, all you’re doing is adding a second label to the same box:
>>> more_numbers = numbers
>>> id(more_numbers)
140445518556544
You can see that the id of numbers and more_numbers is the same as these names are two labels for a single box. When using the above assignment, you didn’t create a second list. You can also confirm this by changing one of the values of more_numbers and then looking at the values of numbers:
>>> more_numbers[2] = 1000
>>> more_numbers
[5, 7, 1000, 11]
>>> numbers
[5, 7, 1000, 11]
The third item in more_numbers is the same object as the third item in numbers.
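Another way to check whether two names are labels on the same box is the is operator. The sketch below also shows the list method copy(), which creates a second box with the same contents. The name separate_numbers is just an example:

```python
numbers = [5, 7, 9, 11]

# A second label on the same box
more_numbers = numbers

# A new box containing the same values
separate_numbers = numbers.copy()

print(more_numbers is numbers)  # True
print(separate_numbers is numbers)  # False

# Changing the copy leaves the original list untouched
separate_numbers[2] = 1000
print(numbers)  # [5, 7, 9, 11]
```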
NumPy views and copies
What’s this got to do with NumPy? When you slice a NumPy ndarray, something similar happens:
>>> import numpy as np
>>> original_array = np.array([2, 3, 4, 5, 6])
>>> a_view = original_array[3:]
>>> a_view
array([5, 6])
>>> a_view[0] = 1000
>>> original_array
array([   2,    3,    4, 1000,    6])
The result from slicing an ndarray is called a view. The object original_array is different from the object a_view. You can confirm this using the id() function if you wish. However, the two arrays share the same underlying data in memory. Therefore, the fourth item in original_array and the first item in a_view are stored in exactly the same place. Both original_array[3] and a_view[0] refer to the same stored value. Note that this applies to slices. When you use a list or an array of Booleans as an index, as you did when filtering data earlier in this Chapter, NumPy returns a copy instead.
You can confirm this by reassigning a different number to a_view[0] and then looking at the values of original_array, as you’ve done in the example above.
When dealing with large arrays, creating views in this fashion means that your program uses memory efficiently by not having lots of duplicate data.
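If you’re ever unsure whether two arrays share their data, NumPy can tell you. The following sketch uses np.shares_memory() and the base attribute. It compares a slice with the result of filtering using Booleans, which gives you a separate copy. The name filtered is only for illustration:

```python
import numpy as np

original_array = np.array([2, 3, 4, 5, 6])

# Slicing gives a view that shares data with the original array
a_view = original_array[3:]

# Indexing with an array of Booleans gives a separate copy
filtered = original_array[original_array > 3]

print(np.shares_memory(original_array, a_view))  # True
print(a_view.base is original_array)  # True
print(np.shares_memory(original_array, filtered))  # False
```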
If you do want to create a copy of all items in the array, you can use the copy() method:
>>> original_array = np.array([2, 3, 4, 5, 6])
>>> a_copy = original_array[3:].copy()
>>> a_copy
array([5, 6])
>>> a_copy[0] = 1000
>>> original_array
array([2, 3, 4, 5, 6])
This code is similar to the previous version except that you’re using the copy() method on the second line. The array a_copy contains the same values as the slice of original_array, but these values are stored separately in memory. Changing a_copy therefore has no effect on original_array.
Representing Equations in Python Using NumPy
I’ll conclude this Chapter with a different use case for NumPy, which is helpful in several areas of science, mathematics, economics, and more. In this section, you’ll translate mathematical equations into Python code using NumPy.
Each field has its own equations that describe various processes and theories. These range from very simple to relatively complex equations. Anything which can be represented mathematically can also be represented computationally. In this section, you’ll explore a basic example to illustrate the general technique of vectorising an equation. However, you can apply the same principles to more complex equations, too.
Consider the following equation:
$$y = \frac{\sin(x - a)}{x}$$
You’re curious to know what this relationship looks like and how it changes for different values of a. You can explore this with a computer program using NumPy.
It’s tempting to try and write this mathematical equation in a Python script:
import math

y = math.sin(x-a)/x
The built-in module math contains useful mathematical functions, as the name implies. Here, you’re using the sin() Python function, which works out the value of the sine function. You’ll read about the difference between Python functions and maths functions in a later Chapter, so for the time being, I’ll just say that they’re not the same thing.
Working towards writing equations in Python
If you’re using an IDE which highlights errors before you run the program, you’ll see quite a bit of red underlining in this code. If you run this code, the following error is raised:
File "<path>/<filename>.py", line 3, in <module>
y = math.sin(x-a)/x
NameError: name 'x' is not defined
The name x is not defined, and neither is the name a. The error only complains about x as this name appears first. You understand very well what x is when you have your maths hat on. However, when you put your Python hat on, you’ll realise that Monty, your computer program, has no idea what the name x refers to. He knows that y will be the label of a new box since it’s on the left of the assignment operator, but x and a are not names he can find anywhere in the White Room.
You first need to create two new boxes labelled x and a so that the computer program can access the relevant numbers. The name a refers to a constant in the equation. You can therefore assign a value to a. Let’s say that a will be equal to 5.
In the mathematical equation shown above, x represents the real number line. Therefore, x represents every possible number. Even if you limit x to a specific range, let’s say between 1 and 10, x still represents an infinite number of values in that range. So how can you deal with this in a Python program?
The answer is that you can’t. But you can approximate this by creating a discrete version of the real number line. I’ll get back to this shortly, but first, you need to learn about two useful functions in NumPy.
NumPy’s arange() and linspace()
One of the built-in Python functions you’ve used several times already is range(). So far, you’ve used this function in its simplest form with only one argument. However, you can also use it with two or three arguments. To view the numbers within the ranges you create, you can convert them into a list:
>>> list(range(10))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> list(range(3, 10))
[3, 4, 5, 6, 7, 8, 9]
>>> list(range(3, 10, 2))
[3, 5, 7, 9]
When you pass two arguments into range(), you define both the start and the end of the range of numbers. You’ll recall that in Python, the start value of the range is included when you define a range, but the end value is not. When you use the third argument, it specifies the step size. In the final example, the numbers range from 3 up to but excluding 10, in steps of 2.
The built-in range() function only works with integers. This is where NumPy comes in. NumPy has a function called arange() which is similar to the built-in range(), but deals with floats as well as integers:
>>> import numpy as np
>>> np.arange(10)
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> np.arange(3, 10)
array([3, 4, 5, 6, 7, 8, 9])
>>> np.arange(3, 10, 1.5)
array([3. , 4.5, 6. , 7.5, 9. ])
>>> np.arange(3.2, 10, 0.8)
array([3.2, 4. , 4.8, 5.6, 6.4, 7.2, 8. , 8.8, 9.6])
The start, stop, and step values can now all be floats.
A related function in NumPy is called linspace(). When you use linspace(), you still define the start and stop values of the range, but the third argument represents the number of points you’d like to have in your sequence of numbers:
>>> np.linspace(0, 10, 20)
array([ 0.        ,  0.52631579,  1.05263158,  1.57894737,  2.10526316,
        2.63157895,  3.15789474,  3.68421053,  4.21052632,  4.73684211,
        5.26315789,  5.78947368,  6.31578947,  6.84210526,  7.36842105,
        7.89473684,  8.42105263,  8.94736842,  9.47368421, 10.        ])
You create an array to represent the range from 0 to 10, and you also set the number of points to be 20. Note that linspace() goes against the convention in Python, and the endpoint is included in the array. The linspace() function calculates the step size needed from the start and stop values and the number of points in the array.
NumPy’s arange() and linspace() are similar. When you need to set the step size of the range, you’ll use arange(). When the number of points in the array is the number you want to control, you can use linspace().
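If you’re curious about the step size linspace() works out for you, you can ask for it using the optional retstep parameter. The sketch below also uses endpoint=False, which makes linspace() follow the usual Python convention of excluding the end value. The variable names are just examples:

```python
import numpy as np

# retstep=True returns the array and the step size as a pair
points, step = np.linspace(0, 10, 20, retstep=True)
print(step)  # 10 / 19, since 20 points have 19 gaps between them

# endpoint=False excludes the stop value, so 20 points are 0.5 apart
without_end = np.linspace(0, 10, 20, endpoint=False)
print(without_end)
```

With endpoint=False, the result contains the same values as np.arange(0, 10, 0.5).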
Translating equations into Python
It’s time to get back to the equation you were studying earlier. You’re now ready to create a discrete version of the term x:
import numpy as np

a = 5
x = np.linspace(1, 10, 1000)
y = np.sin(x-a)/x

print(y)
You define a to be the integer 5 and x as the array created by linspace() containing the range of numbers between 1 and 10. There are 1000 points in the array x.
The line defining the variable y no longer has any errors as both a and x are defined. Note that NumPy has its own version of sin(), so you no longer need to import math, but instead, you can use np.sin() to calculate sine.
Since x is an ndarray, y will also be an ndarray. Each item in y will represent the equivalent item in x transformed through the equation shown above.
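To convince yourself that the single line using np.sin() works on every item of x, you can compare it with a version that deals with one number at a time using math.sin(). This is only a sketch, and the names y_vectorised and y_loop are chosen for this example:

```python
import math

import numpy as np

a = 5
x = np.linspace(1, 10, 1000)

# NumPy version: one expression acts on the whole array
y_vectorised = np.sin(x - a) / x

# Item-by-item version using the math module
y_loop = [math.sin(value - a) / value for value in x]

# allclose() checks that all pairs of values match to within a tiny tolerance
print(np.allclose(y_vectorised, y_loop))  # True
```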
I’m not showing the output from this code, but you can see the array stored in y displayed when you run this code. However, looking at a long sequence of numbers is often not very informative. You can show y in a different form to help you understand the relationship this equation represents.
You’ll read about visualising data using Matplotlib in more detail in a later Chapter. In this section, I’ll present the code needed from Matplotlib without dwelling too much on it. As Matplotlib is a third-party library, you’ll need to install it in the same way you installed NumPy earlier in this Chapter. Note that, as with most modules, the name of the module, matplotlib, is all lowercase:
import numpy as np
import matplotlib.pyplot as plt

a = 5
x = np.linspace(1, 10, 1000)
y = np.sin(x-a)/x

plt.plot(x, y)
plt.show()
You use two functions from matplotlib.pyplot, which you import using the alias plt. These functions plot the relationship between x and y and show the plot on the screen. The output from this code is the following figure:

How does the value of a change the appearance of this plot?
Creating a video animation
Instead of defining a as a single value, you can explore a range of values of a:
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(1, 10, 1000)

for a in np.arange(-10, 10.1, 0.1):
    y = np.sin(x-a)/x
    plt.clf()
    plt.plot(x, y)
    plt.ylim([-1, 1])
    plt.title(f"a = {a:.1f}")
    plt.pause(0.01)

plt.show()
You iterate using a for loop, but instead of using the built-in range() function in the for statement, you’re using NumPy’s arange(). The variable a will go through all the values between -10 and 10 in steps of 0.1. Note that the argument you use to indicate the end of the range is 10.1 when you call arange(). This ensures that 10.0 is included in the range.
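As an aside, NumPy’s documentation suggests that linspace() is often the better choice when the step is not a whole number, since small rounding errors in floats can affect whether arange() includes the last value. Here’s a sketch of the same set of values created using linspace(). The name a_values and the number of points, 201, are choices made for this example:

```python
import numpy as np

# 201 points from -10 to 10 inclusive are spaced 0.1 apart
a_values = np.linspace(-10, 10, 201)

print(len(a_values))  # 201
print(a_values[0], a_values[-1])  # -10.0 10.0
```

You could then write for a in a_values: as the for statement in the loop above.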
The plotting functions from Matplotlib you use in the for loop now include:
- clf() clears the last figure before you plot the next one. Without this function, all plots will be shown on top of each other in the figure. You can try commenting this line out to see what happens.
- ylim() sets the range for the y-axis of the plot. Without this function, the program automatically determines the range for the y-axis, but it will be different for each plot.
- title() adds a title above the plot. You use this function to show the value of a as it changes.
- pause() updates the plot and introduces a delay of 0.01 seconds to allow you to view the plot before the next one is shown. You can think of this as setting the frame rate for the animation you’re displaying with the for loop.
The Matplotlib function show() is outside the for loop. Depending on the exact coding environment you’re using, if you don’t include the call to show(), the figure may close immediately once the for loop finishes all iterations.
This code will create the following output:
You now have a better understanding of the equation you’re exploring and how changing its parameters affects the output.
Conclusion
NumPy is one of the most widely used packages in Python. You’re likely to need NumPy for most quantitative applications you work on. Other packages build on NumPy and use the ndarray data type.
In this Chapter, you’ve learned about:
- Installing third-party modules in Python
- Using NumPy’s ndarray data type
- Performing basic statistical operations on data
- Filtering data using NumPy
- Representing and exploring equations using NumPy
There’s a lot more you can do with NumPy. This Chapter set out the main principles of this module and how to get started. You’ve worked on some examples that show how you can use NumPy to start analysing datasets or to represent and explore mathematical equations. You’ll use NumPy again later in this book.
Additional Reading
- You can read the article on creating images from sine functions which uses NumPy: Understanding Images using the 2D Fourier transform in Python. No prior knowledge of Fourier transforms is needed.
Snippets
1 | Looking For Help When Coding
This book will not tell you everything you need to know about coding. No book can. The adage "the more you know, the more you realise how much you don’t know" is very appropriate for coding. Even when you become an expert, you’ll still come across new modules you haven’t used before, which you’ll need to learn how to use.
An essential skill when coding is to know how and where to ask for help. Of course, there’s the Codetoday Forum you can join, where you can ask questions. However, in this section, I’ll focus on the more general ways to get help.
Using Python’s help() function
The first port of call, which is also the easiest to use, is Python’s built-in help() function. You can use this function with any name you would normally use in Python. However, this doesn’t mean that there’s always help available. Let’s look at a few examples:
>>> help(print)
Help on built-in function print in module builtins:

print(...)
    print(value, ..., sep=' ', end='\n', file=sys.stdout, flush=False)

    Prints the values to a stream, or to sys.stdout by default.
    Optional keyword arguments:
    file:  a file-like object (stream); defaults to the current sys.stdout.
    sep:   string inserted between values, default a space.
    end:   string appended after the last value, default a newline.
    flush: whether to forcibly flush the stream.
You use help() with a built-in function in this case. When looking for help with functions, you’ll be presented with the function signature and often a description of what the function does and what arguments you need. If a function returns some data, the help text will generally describe this too. From the signature and the help text, you’ll be able to identify which arguments are required and which are optional and what data types are required for each argument.
The amount of detail you get can vary. For example, here’s the help for a function you’ve used several times already:
>>> import random
>>> help(random.randint)
Help on method randint in module random:

randint(a, b) method of random.Random instance
    Return random integer in range [a, b], including both end points.
The function description is a single line. However, it still contains all the information you need to use this function. Note that the description states that the function returns a random integer. Therefore, you also know the data type of the object returned.
You can even get help on classes and modules, which will normally include plenty of detail on what’s included in them. You can try help(random) or help(str), for example.
If you want help with methods of a class, you can either create an instance of that class or reference the method directly from the class name:
>>> my_name = "Stephen"
>>> help(my_name.upper)
Help on built-in function upper:

upper() method of builtins.str instance
    Return a copy of the string converted to uppercase.

>>> help(str.upper)
Help on method_descriptor:

upper(self, /)
    Return a copy of the string converted to uppercase.
The arguments my_name.upper and str.upper both ultimately refer to the same method. Note that the help displayed is slightly different as technically, these are not exactly the same object.
Using function names as arguments for help()
You’ve been using NumPy, and therefore, you can read the help for functions you’ve learned about in this Chapter. Recall that you’ll need to import the module first, and if you use the alias np, you’ll need to use help(np.linspace), for example.
When you use a function as the argument for help(), you’re not calling the function but just using the function name. There are no parentheses following the function name you’re looking for help on. What would happen if you did include the parentheses?
>>> help(random.randint())
Traceback (most recent call last):
  File "<input>", line 1, in <module>
TypeError: randint() missing 2 required positional arguments: 'a' and 'b'

>>> help(print())

Help on NoneType object:

class NoneType(object)
 |  Methods defined here:
 |
 |  __bool__(self, /)
 |      self != 0
 |
 |  __repr__(self, /)
 |      Return repr(self).
 |
 |  ----------------------------------------------------------------------
 |  Static methods defined here:
 |
 |  __new__(*args, **kwargs) from
In the first example, an error was raised. You’re calling random.randint() with no arguments, but randint() needs two required arguments, as the TypeError shows you.
However, when you used print() as the argument for help(), you did get some help. When you start reading the help, you realise this is not the help you may have expected. The help shown is that for the class NoneType because the print() function returns None. This is therefore identical to help(None).
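You can check this for yourself with a minimal sketch. The variable name returned_value is made up for this illustration:

```python
# print() displays its argument, but the value it returns is None
returned_value = print("Hello")

print(returned_value is None)
print(type(returned_value).__name__)
```

Since the value passed on to help() is None, the help you see describes the type of None and says nothing about print() itself.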
The help you get is whatever the person who wrote the function or class decided to put in the docstring. For example, if you create a function without a docstring, help() will only show you the function signature:
>>> def my_func(param_1, param_2, param_3=None):
...     print("This function doesn't do anything!")
...
>>> help(my_func)
Help on function my_func in module __main__:

my_func(param_1, param_2, param_3=None)
However, if you include a docstring in your function definition, help() will display the text in the docstring:
>>> def my_func(param_1, param_2, param_3=None):
...     """
...     This is a function that may not be very useful, to be honest
...     The input parameters are param_1, param_2, and param_3 (optional)
...     However, the parameters don't actually do anything!
...
...     Good-Bye
...     """
...     print("This function doesn't do anything!")
...
>>> help(my_func)
Help on function my_func in module __main__:

my_func(param_1, param_2, param_3=None)
    This is a function that may not be very useful, to be honest
    The input parameters are param_1, param_2, and param_3 (optional)
    However, the parameters don't actually do anything!

    Good-Bye
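The docstring is also stored on the function object itself, in an attribute called __doc__. The sketch below uses a shortened version of my_func and a second, made-up function called no_docstring to compare the two cases:

```python
def my_func(param_1, param_2, param_3=None):
    """
    This is a function that may not be very useful, to be honest
    """
    print("This function doesn't do anything!")


def no_docstring(param_1):
    print("This function has no docstring")


# The text of the docstring is stored in the __doc__ attribute
print(my_func.__doc__)

# A function defined without a docstring has None in its place
print(no_docstring.__doc__)
```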
One of the problems with using help() is that you need to know the name of the function or other object you need to use.
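One way of getting around this, at least partly, is the built-in dir() function, which lists the names defined in a module or object. The following is only a sketch. The filtering on "rand" is an arbitrary choice for this illustration, and the exact names you see may vary between Python versions:

```python
import random

# dir() returns a list of the names defined in a module or object
all_names = dir(random)

# Names starting with an underscore are meant for internal use, so skip them
public_names = [name for name in all_names if not name.startswith("_")]
print(public_names)

# Narrow the list down to names that contain "rand"
rand_names = [name for name in public_names if "rand" in name]
print(rand_names)
```

Once you spot a name that looks promising, you can pass it to help() as before.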
Using Python’s documentation
Sometimes, displaying help in your IDE is not the most convenient way of reading about a function or a class. The official Python documentation is the next place to go. Here, you can find details about every aspect of Python and its standard library.
For example, if you type in "upper" in the search bar at the top of the page, you’ll be presented with a number of functions that are called upper or have the word "upper" within them. One of them is str.upper, which you can click on.
The description you get for str.upper() in the online documentation is a bit more detailed than the output you got from the help() function. This may not always be the case, though. If you search for the help for randint, you’ll again get a one-line description.
Reading the documentation is not ideal bedtime reading, but when you’re learning about new functions and new modules, browsing through the documentation can give you an idea of what things are available. You can then come back to the documentation for specific functions as and when you need them.
Using third-party package documentation
You could find the documentation for randint in Python’s online documentation because the random module is part of the standard library. If you’re using NumPy, for example, and you’d like to find documentation for one of its functions, you won’t have any success in Python’s docs.
Python has a vast number of third-party modules. Some of the major ones have excellent online documentation. If you’re using NumPy, for example, you’ll want to bookmark the NumPy documentation site. You can now search for linspace here, and you’ll find there are several occurrences of functions called linspace in various sub-modules in NumPy. But you’re looking for numpy.linspace, which takes you to a very comprehensive documentation page. Many pages also include examples at the bottom of the page.
You can even scroll through the sidebar of the reference pages to get a sense of what other functions are available. However, packages such as NumPy are quite large, so I wouldn’t advise reading through the whole documentation from beginning to end!
Later in this book, you’ll use two further third-party modules that are very well-established, and that also have excellent online documentation. These are Matplotlib and Pandas.
Using online forums
Documentation is very useful. Whether you use the documentation provided by help() or the online documentation, you won’t be able to write code without referring to these resources. However, you’ll need to know what you’re looking for to get help from any of these sources, or you’ll need to scroll through pages of documentation to find function names that may be relevant for the task you’re trying to solve.
Another useful resource when coding is Google (other search engines are also available!). As with any other searches on search engines, how you write your query is important and affects your results. But you already know how to use search engines! You should include the term "Python" in your searches to avoid getting solutions relevant to other programming languages.
When you submit a query such as "How do I create a random number in Python?", the top results will often include the same handful of sites. One of the top hits may very well be Python’s own documentation. But you’ll also find links to forums where programmers ask other programmers questions. The best known of these is probably Stack Overflow, which already has a vast number of questions that have been answered. Someone has almost certainly already asked most questions you’ll have in the early and intermediate days of learning to code. All you have to do is find the relevant question, which your search engine will help you with!
Newbie beware!
A word of warning: some of these sites are renowned for not being very beginner-friendly, with some responses being rather curt and dismissive. The good news is that you won’t need to ask too many questions initially but simply read questions and answers already available. If you do choose to ask a question, don’t be put off by the tone of some answers!
It’s also worth pointing out something you know already. The internet is full of the good, the bad, and the ugly. When it comes to coding, there’s a lot of bad and ugly, but there’s also a lot of good buried there, too. Don’t assume that the top response in these forums is necessarily the best one. In most instances, scrolling through a few responses can be more useful than simply stopping at the first one.
The same applies to articles and tutorials you may find on any topic. Just because something is on the internet doesn’t mean it’s any good or that it follows best practices! But there are many good resources online, too.
Another site you’ll find listed often in searches is Real Python. [Disclaimer: I’m a regular guest author on Real Python, but I don’t get anything out of recommending the site to you! The reason I’m recommending it is the same reason I’m an author on the site, and that’s because it has high-quality articles and tutorials on most topics in Python.] When you see a Real Python article listed in your search results, don’t look any further; just click on it and start reading! Indeed, if you’re looking for help on linspace, one of the NumPy functions you’ve read about in this Chapter, you’ll find my very first contribution to Real Python listed in your search results!
2 | Different Ways of Importing Modules
You’ve used the import keyword from the very early days of learning Python. In the White Room analogy, you compare this to bringing a book with lots of instructions from a library into your program and putting it on the shelf for when you need it.
You’ve already seen two slightly different versions of using import. The first is also the most straightforward:
import random
You import the whole module called random. Your program will look for a file called random.py. There are a few places Python will look for this file, including your project folder and the location of your Python installation. Note that if you name a file random.py in your project folder, when you write import random in another file in the same project, Python will try to import your own file and not the one in the standard library. Beware of naming files using names of existing modules.
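If you ever suspect that Python has picked up the wrong file, one quick check is to look at the module’s __file__ attribute. This is a minimal sketch, and the variable name module_path is made up for this illustration:

```python
import random

# __file__ holds the path of the file Python loaded for this module
module_path = random.__file__
print(module_path)
```

If the path displayed points to your project folder and not to your Python installation, one of your own files is hiding the standard library module.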
Once you import a module, you can refer to it using its name and then use the dot to get an object from within this module.
You’ve also used another version of this method when you imported modules using the as keyword to give them an alias:
import random as feeling_lucky
You’ve chosen to give the book you fetched from the library a different name. You will now need to use the new name in the script with this import statement.
Using the from keyword
There is yet another way of importing from modules:
from random import randint
You’re no longer importing the whole module. Rather than fetching the whole book, you’re now only tearing out the page with the instructions for randint on it. The name random is no longer defined in your script since you didn’t import the entire module. However, the name randint is part of the main namespace, and you can use it directly without the need for the module name to precede it.
You can even import several objects from the same module:
from random import randint, choice
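Here’s a short sketch of what this looks like in a fresh script. The variable names roll, pick, and module_name_defined are made up for this illustration, and the final check assumes that random hasn’t been imported anywhere else in the same script:

```python
from random import randint, choice

# Both names can be used directly, with no module name in front
roll = randint(1, 6)
pick = choice(["heads", "tails"])
print(roll, pick)

# The module name itself was not imported, so in a fresh script
# referring to it raises a NameError
try:
    random
except NameError:
    module_name_defined = False
else:
    module_name_defined = True

print(module_name_defined)
```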
There is another option that should, however, be used very, very rarely, if ever:
from random import *
You’re now importing every individual name defined in random separately. Rather than bringing one book called random and then using the name random whenever you need something from the book, you’ve now torn off every single page from the book and put each page separately on the shelves in your White Room. This creates a mess on your shelves and in your program. Don’t do it.
When you import the module, you’re keeping its contents within a separate namespace. Therefore the functions randint and choice, say, are still names within the random namespace and separate from other names that exist in your program. When you import from a module, you’re adding those names to your main namespace, mixing them up with other names you already have.
Dangers of using the from option
There is a danger when importing from a module rather than importing the entire module. Here’s an example of what can go wrong.
A colleague shares the following code with you (I’m keeping this code simple for demonstration purposes; it’s unlikely you would actually need a colleague to share this code with you in the first place!):
for _ in range(20):
    print(randint(1, 3))
You run this code and get a NameError. Then you realise that your colleague is using randint in the code. The function randint is not preceded by a module name which tells you that your colleague imported from a module. Therefore, you add the import:
from random import randint

for _ in range(20):
    print(randint(1, 3))
This gives the following (or similar) output:
1
2
3
2
1
3
1
3
3
1
2
3
1
1
2
1
2
3
1
3
However, that’s not the code your colleague wrote. This is the full version of the colleague’s code:
from numpy.random import randint

for _ in range(20):
    print(randint(1, 3))
NumPy has its own random sub-package which has a randint() function. The randint() in random is a different function to the randint() in numpy.random, even though they share the same name. The output from this code is the following:
2
1
1
2
2
1
1
2
1
2
2
2
2
1
2
2
2
1
1
2
Do you spot the difference? The randint() function in numpy.random follows Python’s convention of excluding the endpoint of a range. Therefore, the number 3 is not included in the output. However, the randint() function in the standard library’s random module includes the endpoint.
This can lead to bugs that are very hard to find and fix. To avoid name clashes, you can import entire modules rather than using the from keyword.
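The same kind of clash can be reproduced using only the standard library. The sketch below is a separate illustration that uses the math and cmath modules, which both define a function called sqrt(). These modules are not part of the colleague’s example above:

```python
from math import sqrt
from cmath import sqrt  # Silently replaces the sqrt imported from math

# The name sqrt now refers to cmath.sqrt(), which returns a complex
# number rather than the float that math.sqrt(4) would give
clashing_result = sqrt(4)
print(clashing_result)

# Importing the modules themselves keeps the two functions apart
import math
import cmath

print(math.sqrt(4))
print(cmath.sqrt(4))
```

With the whole modules imported, each call states which sqrt() you mean, and neither import can overwrite the other.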
Reading Data from a CSV File Without Using NumPy
Earlier in this Chapter, you used the HadCET dataset from the Met Office. As you were using NumPy, you used one of NumPy’s functions to read the data from the CSV file and create a two-dimensional numpy.ndarray. In this Snippet, you’ll learn about two other options for reading data from the CSV file that do not rely on NumPy or other third-party modules.
You’ll use the same CSV file you used earlier, mean_daily_temperatures_from_1772.csv, which you can download from The Python Coding Book File Repository. You should make sure the CSV file is in your project folder.
Version 1: From first principles
You learned how to open a file from within a Python program in Chapter 4 when you worked on the Pride & Prejudice project. You’ll use the with statement to open the file:
>>> with open("mean_daily_temperatures_from_1772.csv") as file:
...     raw_data = file.readlines()
...
>>> type(raw_data)
<class 'list'>
>>> len(raw_data)
7750
>>> raw_data[0]
'1772,1,32,-15,18,25,87,128,187,177,105,111,78,112\n'
>>> type(raw_data[0])
<class 'str'>
You assign the variable name file to the open file, then you use the method readlines(). You can read the documentation that’s returned when you call help(file.readlines):
>>> help(file.readlines)
Help on built-in function readlines:

readlines(hint=-1, /) method of _io.TextIOWrapper instance
    Return a list of lines from the stream.

    hint can be specified to control the number of lines read: no more
    lines will be read if the total size (in bytes/characters) of all
    lines so far exceeds hint.
The key line you’ll need is the description that says that the method returns a list of lines. You can ignore the rest for now. In the code above, you can see that raw_data is a list that contains 7750 items. This number is equal to the number of lines in the CSV file.
The first item in the list, raw_data[0], contains the first line of the CSV file. However, the data is in the form of one long string. Although you, as a human, can identify each element of this line as a unique data point, the computer program only sees this as one long string of characters. Note that the final character of this string is the newline character "\n".
You can use the string method split() to deal with this:
>>> raw_data[0].split(",")
['1772', '1', '32', '-15', '18', '25', '87', '128', '187', '177', '105', '111', '78', '112\n']
You call the split() method with the argument "," so that the method returns a list, splitting the string wherever there’s a comma. Each item of this new list is still a string, but now, each cell of the original CSV file is shown as a separate item.
Extracting all the data
Now, you have all the tools you need to extract the data from each cell. You can store the final data in a list of lists:
>>> final_data = []
>>> for row in raw_data:
...     final_data.append(row.split(","))
...
>>> final_data
[['1772', '1', '32', '-15', '18', '25', '87', '128', '187', '177', '105', '111', '78', '112\n'],
 ['1772', '2', '20', '7', '28', '38', '77', '138', '154', '158', '143', '150', '85', '62\n'],
 ['1772', '3', '27', '15', '36', '33', '84', '170', '139', '153', '113', '124', '83', '60\n'],
 ['1772', '4', '27', '-25', '61', '58', '96', '90', '151', '160', '173', '114', '60', '47\n'],
 ...
I have truncated the output shown above, but you’ll be able to see the full list of lists displayed when you run this code. This output is a list containing lists, one for each row in the CSV spreadsheet. You can achieve the same result using a list comprehension. You can then explore some items from the list of lists:
>>> final_data = [row.split(",") for row in raw_data]
>>> final_data[2][6]
'84'
>>> int(final_data[2][6])
84
>>> final_data[0][13]
'112\n'
>>> int(final_data[0][13])
112
In the first example, you display the item in the third row (index 2) and the seventh column (index 6). The output is still a string. However, you can convert this into an integer using the int() built-in function. In the second example, you display the last item in the first row. The string includes the newline character "\n". However, the int() function can deal with this. Therefore, you don’t need to worry about the newline character in this case.
You have brought in the data in the cells of the CSV spreadsheet into a list of lists using first principles, without the need to import any module.
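If you’d like all the values as integers and not strings, you could convert each cell as you build the list of lists. The following is a sketch and not part of the original walkthrough. It uses two lines copied from the start of the CSV file in place of the full file so that it runs on its own, and the name numeric_data is made up for this illustration:

```python
# Two lines copied from the start of the CSV file, standing in for
# the list returned by file.readlines()
raw_data = [
    "1772,1,32,-15,18,25,87,128,187,177,105,111,78,112\n",
    "1772,2,20,7,28,38,77,138,154,158,143,150,85,62\n",
]

# Split each line at the commas and convert every cell to an integer.
# int() ignores the trailing newline on the last cell of each row.
numeric_data = [[int(cell) for cell in row.split(",")] for row in raw_data]

print(numeric_data[0])

# The items are now numbers, so you can use them in calculations
print(numeric_data[0][3] + numeric_data[1][3])
```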
Version 2: Using the csv module
In this version, you’ll use Python’s csv module, which is part of the standard library:
>>> import csv
>>> with open("mean_daily_temperatures_from_1772.csv") as file:
...     raw_data = csv.reader(file)
...
>>> raw_data
<_csv.reader object at 0x7f8bba76beb0>
You’ve now used the function reader() from the csv module with the open file as its argument. However, raw_data is not a list as in the previous version. The reader() function returns a reader object. This object is an iterator, which means that you can loop through it:
>>> with open("mean_daily_temperatures_from_1772.csv") as file:
...     raw_data = csv.reader(file)
...     final_data = [row for row in raw_data]
...
>>> type(final_data)
<class 'list'>
>>> type(final_data[0])
<class 'list'>
>>> final_data[0]
['1772', '1', '32', '-15', '18', '25', '87', '128', '187', '177', '105', '111', '78', '112']
Note that as the iterator returned by reader() only reads the rows from the file when you loop through it and doesn’t contain the data yet, you need to create final_data in the with block while the file is still open. Outside of the with block, the file will be closed, and the iterator won’t be able to retrieve the data. When using the csv module, you no longer need to use split() as the items from the iterator are already lists. The newline character has also been removed.
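To see what happens when you leave it too late, here’s a minimal sketch. It uses io.StringIO as an in-memory stand-in for the CSV file so that it runs without the file on disk. The sample text and the variable names are made up for this illustration:

```python
import csv
import io

# A small in-memory stand-in for the CSV file
csv_text = "1772,1,32,-15\n1772,2,20,7\n"

# Reading the rows while the file is still open works as expected
with io.StringIO(csv_text) as file:
    rows_read_inside = list(csv.reader(file))

print(rows_read_inside)

# Here, only the reader is created inside the with block
with io.StringIO(csv_text) as file:
    late_reader = csv.reader(file)

# The file is closed by now, so the reader can no longer fetch any rows
try:
    rows_read_outside = list(late_reader)
except ValueError as error:
    rows_read_outside = None
    print(f"ValueError: {error}")
```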
There are other ways you can read data from a CSV file. You’ve learned about three methods in this Chapter. You’ll use a fourth method when you learn about the Pandas library in a later Chapter.