4 | Data, Data Types and Data Structures

You’ve learnt how to do quite a few things in Python in the first three chapters. You’ve seen how most things that happen in a computer program can be categorised as store, repeat, decide, or reuse. In this Chapter, you’ll focus on the store category. You’ll find out more about data types and why things aren’t always as simple as with the examples you’ve seen so far.

The simplest description of a computer program is the following one:

A computer program stores data, and it does stuff with that data.

Granted, this is not the most detailed and technical definition you’ll find, but it summarises a program perfectly. You need information in any computer program. And then you need to do something with that information.

No computer program can exist without data. The data could be a simple name or number as in the Angry Goblin game, or you could have a very complex and highly structured data set. And if you’re using a programming language, you’ll want to perform some actions with the data and transform it in one way or another.

Deciding how you want your program to store the data is an important decision you’ll need to make as a programmer. Languages such as Python provide many alternatives for you to consider. In this Chapter, you’ll read about how data types are handled in Python and about the different categories of data types.

The first part of this Chapter will cover some of the theory related to data types. In the second part, you’ll work on a new project in which you’ll work out all the words that Jane Austen used when writing Pride and Prejudice and how often she used each word. Can you guess the five most common words in Pride and Prejudice? You’ll find out the answer when you complete the project later on in this Chapter.

Some Categories of Data Types

You’ve already come across several of the most important data types in Python. You’ve used integers and floats, strings, Booleans, and lists. There are many more data types in Python. Many, many more!

As you learn about more data types, it is helpful to learn about the categories of data types. This is because some data types may have similar properties, and when you know how to deal with a category of data type, you’ll be better equipped to deal with a new data type you’ll encounter from the same category.

In this section, you’ll learn about sequences, iterables, mutable and immutable data types. Don’t be put off by the obscure wording. Like in every other profession, programmers like to use complex-sounding words to make the subject look difficult and elusive. Cannot-be-changed doesn’t sound as grand as immutable! My aim in this book is the opposite, but we cannot escape using the terms you’ll find in documentation, error messages and other texts.

Iterable Data Types

You’ve already used iteration in Python. When you repeat a block of code several times using a for loop, you are iterating the block of code. You’ll often hear of the first iteration or the second iteration of a loop, say, to refer to one of the repetitions.

A data type is iterable if Python can go through the items in the data type one after the other, such as in a for loop. You can experiment with the data types you’ve encountered so far, for example:

a_number = 5
another_number = 5.3
my_name = "Stephen"
many_numbers = [2, 5, 7, 23, 1, 4, 10]
more_numbers = range(10)

You’ll recall that you can always check the data type by using the type() function:

print(type(a_number))
print(type(another_number))
print(type(my_name))
print(type(many_numbers))
print(type(more_numbers))

The output you’ll get from printing the data types for these five variables is:

<class 'int'>
<class 'float'>
<class 'str'>
<class 'list'>
<class 'range'>

You’re already familiar with the first four of these. You’ll recall that you used the range() function when you first learned about the for loop. To repeat some code ten times, for example, you wrote:

for item in range(10):
    print(item)

This code will simply print out the numbers 0 to 9. Now that you know more about functions, you’ll recognise range() as a function. All functions return some data when they finish executing. The variable more_numbers collects this data. The range() function doesn’t return a list, as you may perhaps expect. Instead, it returns another data type called a range object. We won’t worry too much about this data type for now, but I’ve included it here as it’s one you’ve come across already.

The following code is equivalent to the for loop you’ve written above, as more_numbers contains the data returned from range(10):

for item in more_numbers:  # Same as for item in range(10):
    print(item)
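If you’re curious about what a range object holds, one way to peek inside is to convert it to a list. The following is only a minimal sketch, and the variable name as_a_list is one I’ve made up for this illustration:

```python
more_numbers = range(10)

# Printing a range object shows a description, not the numbers themselves
print(more_numbers)

# Converting the range object to a list shows the values it produces
as_a_list = list(more_numbers)
print(as_a_list)
```

You rarely need this conversion in a for loop, since the loop can go through the range object directly.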

You can explore which data types are iterable by trying to use each variable in a for loop:

for item in a_number:  # a_number is an int
    print(item)

The variable a_number is an int, and when you try to use it in a for loop, you’ll get the following error message:

Traceback (most recent call last):
  File "<path>/<filename>.py", line 13, in <module>
    for item in a_number:
TypeError: 'int' object is not iterable

The last line of the error message has all the information that you need. The error is a TypeError. Something is wrong with the data type. The error message then goes further by clarifying that an int object is not iterable. Therefore, you cannot loop through an integer in a for loop.

You’ll get the same error message when you try the float data type. A float is not iterable either. However, lists and strings are both iterable, as the following examples show:

for item in my_name:  # my_name is a str
    print(item)

for item in many_numbers:  # many_numbers is a list
    print(item)

This time, you won’t get any errors saying that these data types are not iterable. Instead, you’ll see the following output:

S
t
e
p
h
e
n
2
5
7
23
1
4
10

When you iterate through a string, each character is considered one at a time. So the variable item will contain the letter "S" in the first iteration of the for loop, then "t" in the second iteration, and so on.

You may have noticed that you used the variable item in both of the for loops. This is fine as long as you only need to use this variable within the loop. When the computer program reaches the second loop, it will overwrite whatever is already in the variable item with the new values. You’re reusing the same box to save having to get a new box!

In the for loop statement, you don’t necessarily need to use a variable name at the end of the line. You’ve already seen how you can use a function such as range(). As long as the function returns an iterable data type, it can be used directly in the for loop statement. You can even use the data structure directly, for example:

for item in "Stephen":
    print(item)

for item in [2, 5, 7, 23, 1, 4, 10]:
    print(item)

Another category of data types is a sequence. A sequence is a data type that has ordered items within it. You’ve already seen, for example, how you can use indexing on both lists and strings, which are both sequences, to extract an item from a specific position:

>>> my_name = "Stephen"
>>> my_name[2]
'e'

>>> many_numbers = [2, 5, 7, 23, 1, 4, 10]
>>> many_numbers[0]
2

There is a lot of overlap between data types that are sequences and those that are iterables. For the time being, it’s fine if you want to think of these two categories as the same, but for completeness, I’ll clarify that they’re not the same. There are some iterable data types that are not sequences.
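If you’d like to see the difference for yourself, here’s a rough sketch. It relies on two things you haven’t met in this book so far, so treat them as assumptions for now: a set, which is another data type that uses curly brackets, and the Iterable and Sequence categories from Python’s collections.abc module. The function name describe() is made up for this example:

```python
from collections.abc import Iterable, Sequence

def describe(value):
    """Return two Booleans: (is it iterable?, is it a sequence?)."""
    return isinstance(value, Iterable), isinstance(value, Sequence)

print(describe("Stephen"))   # a string is iterable and a sequence
print(describe([2, 5, 7]))   # so is a list
print(describe({2, 5, 7}))   # a set is iterable but not a sequence
print(describe(5))           # an int is neither
```

You can loop through a set in a for loop, but you can’t ask for the item at position 2, as a set has no positions.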

Mutable and Immutable Data Types

Let’s keep using lists and strings in this section. You’ve seen that both these data types are iterable, and they are sequences. In the last code example in the previous section, you’ve seen how you can extract an item from either a list or a string based on the position within the data structure.

You can now try to reassign a new value to a certain position. Let’s start with lists:

>>> many_numbers = [2, 5, 7, 23, 1, 4, 10]
>>> many_numbers[2] = 2000
>>> many_numbers
[2, 5, 2000, 23, 1, 4, 10]

In the second line, you’ve reassigned what’s being stored in the third place within the list, since the index 2 refers to the third spot in the list. Whatever value was stored in the third position of the list has been discarded and replaced with 2000. When you display the list again, you can see that the value 7, which was in the third place in the list, is no longer there. The value 2000 is in its place.

Now you can try to do the same with a string:

>>> my_name = "Stephen"
>>> my_name[2] = "y"
Traceback (most recent call last):
  File "<input>", line 1, in <module>
TypeError: 'str' object does not support item assignment

The trick that worked for lists does not work with strings. You get a TypeError again. Strings are immutable data types. This means that once they’ve been created, you cannot make changes within the string. Immutable data types are meant for information that is unlikely to need to change once created.

Lists are mutable, which means that they are flexible containers that can change. You can add new items to a list or replace existing ones. Mutable data types are ideal for data that is likely to change while a program is running.

Note that even with immutable data types, you’re still allowed to overwrite the entire variable. So if I did want to change the third letter of my name to "y", I would need to do the following:

>>> my_name = "Stephen"
>>> my_name = "Styphen"
>>> my_name
'Styphen'

In this case, you have reassigned new data to the variable my_name. You’re not modifying an existing string, but you’re discarding the old string and replacing it with a new one that you assign to the same variable name. In the next section, you’ll see how understanding the difference between mutable and immutable data types can help explain other behaviour that may seem odd at first glance.
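You don’t have to retype the whole string to do this. One possible approach, shown here as a sketch, is to use slicing to build a new string from the parts of the old one that you want to keep. The variable name new_name is just for illustration:

```python
my_name = "Stephen"

# Take everything before index 2, add the new letter,
# then add everything after index 2
new_name = my_name[:2] + "y" + my_name[3:]

print(new_name)  # Styphen
print(my_name)   # Stephen (the original string is untouched)
```

This still creates a brand new string. The original string is never modified.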

Methods Associated With Data Types

There is a lot of terminology in coding. Here’s a new term you haven’t encountered so far: method. If you know what a function is, and you do from the previous Chapter, you also know what a method is. A method is a function that is associated with a specific data type and acts on that data type. This description will make a lot more sense with some examples.

Let’s start with standard functions. You’ve seen print() and range() are functions. They are written in lowercase letters and have parentheses at the end. Functions can have arguments in the parentheses, and they return data when they complete their actions.

You’ve also already seen one method being used in the previous Chapter. When you created a list and then you wanted to add items to the list you used append(). Let’s remember how you can use append():

>>> many_numbers = [2, 5, 7, 23, 1, 4, 10]
>>> many_numbers.append(5000)
>>> many_numbers
[2, 5, 7, 23, 1, 4, 10, 5000]

The value 5000 was added to the end of the list. Note that you didn’t write append() as a standalone function, as you would use print(), say. Instead, you attached it to the variable name with a full stop or period. The name append() is all lowercase and it’s followed by parentheses as is the case with all functions. The value 5000 is the argument that is passed into this function. When a function is associated with a data type and attached to it with the full stop, we call it a method.

Methods behave in the same way as functions. The only difference is that in addition to any data included in the parentheses as arguments, methods also have direct access to the data stored in the object they are attached to. In the example above, the method append() took the list stored in many_numbers, found the next available spot in the list and added the value 5000 to that list.
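One way to convince yourself that a method is a function tied to a data type is to call it through the data type’s name and pass the list in as the first argument. This is a sketch for illustration only, and it’s not how you’d normally write your code. The name same_numbers is made up for this example:

```python
many_numbers = [2, 5, 7, 23, 1, 4, 10]
same_numbers = [2, 5, 7, 23, 1, 4, 10]

# The usual way: the method is attached to the variable with a full stop
many_numbers.append(5000)

# The long way: the list is passed explicitly as the first argument
list.append(same_numbers, 5000)

print(many_numbers == same_numbers)  # True
```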

If you’re using an IDE such as PyCharm, you’ll have noticed by now that these tools have autocompletion to make writing your code easier. If you start typing pri the IDE will show you print(), and you can simply press Enter/Return to accept the autocompletion. You may have noticed that when you type a variable name followed by a full stop or period, your IDE will show you a list of names that are available for you to use. You’ll find all the methods you can use listed here.

As many_numbers is a list, when you type the full stop, you’ll see all the methods associated with lists. Whenever you have a list, you’ll always have access to the list methods.

Let’s look at a few more list methods:

>>> many_numbers.remove(23)
>>> many_numbers
[2, 5, 7, 1, 4, 10, 5000]

The list many_numbers had the value 23 stored in the fourth position. The remove() method deleted this value. If you want to delete a value by position within the list, instead of by value, you can use the pop() method:

>>> many_numbers.pop(0)
2
>>> many_numbers
[5, 7, 1, 4, 10, 5000]

You have removed the first item from the list by using the index 0 as an argument in pop(). This method not only alters the original list but also returns the value that has been removed. The method is popping a value out of the list, which you can store separately in another variable if you wish:

>>> the_number = many_numbers.pop(0)
>>> the_number
5
>>> many_numbers
[7, 1, 4, 10, 5000]

Note that as you’re using the same list repeatedly, the list is now shrinking as you’ve used remove() once and pop() twice already. Let’s look at one final list method for now. First, you can add an extra value to the list:

>>> many_numbers.append(4)
>>> many_numbers
[7, 1, 4, 10, 5000, 4]
>>> many_numbers.count(4)
2

The count() method counts how many times the value in the parentheses is in the list. As there are two occurrences of the value 4, this method returns the value 2.
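There are a couple of details about remove() and pop() that are worth trying out. The sketch below uses a short made-up list to show that remove() only deletes the first matching value, and that pop() removes the last item if you leave the parentheses empty:

```python
numbers = [4, 1, 4, 10]

numbers.remove(4)          # removes only the first 4 it finds
print(numbers)             # [1, 4, 10]

last_item = numbers.pop()  # no index given, so the last item is removed
print(last_item)           # 10
print(numbers)             # [1, 4]
```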

These methods are list methods. They are defined to work on lists. Let’s have a look at some string methods now:

>>> my_name = "Stephen"
>>> my_name.upper()
'STEPHEN'
>>> my_name.lower()
'stephen'
>>> my_name.replace("e", "-")
'St-ph-n'

As before with lists, if you’re using an IDE and you type the name of a string followed by a full stop or period, you’ll see all the methods that are associated with strings.

Revisiting Mutable and Immutable Data Types

Let’s go back to the last examples you’ve been trying out:

>>> my_name = "Stephen"
>>> my_name.upper()
'STEPHEN'
>>> my_name
'Stephen'

The variable my_name did not change when you used the upper() method. You can try this with the other string methods you’ve used above, and you’ll notice the same pattern.

The upper() method, and the other string methods, return a copy of the string but they don’t change the original variable. However, this wasn’t the case with the list methods. Let’s look at one example you’ve used above:

>>> many_numbers = [2, 5, 7, 23, 1, 4, 10]
>>> many_numbers.append(5000)
>>> many_numbers
[2, 5, 7, 23, 1, 4, 10, 5000]

The list method append() does change the original data that’s stored in the variable. At first sight, this different behaviour between string methods and list methods may seem an anomaly. In one of the sets of methods, the methods modify the original data, while in the other set, the methods don’t make any changes but instead return a copy.

However, there’s a reason why string methods and list methods behave differently. Strings are immutable. They are not meant to change once they’ve been created. For this reason, the string methods do not modify the original string but instead return a copy.

If you wanted to replace the old string with the new one returned by the method, you’ll have to do so explicitly:

>>> my_name = "Stephen"
>>> my_name = my_name.upper()
>>> my_name
'STEPHEN'

You assign the copy of the string returned by upper() to the same variable name my_name. You are therefore overwriting the original variable. This extra step is forcing you, the programmer, to take responsibility for this action. You are confirming that you want to replace the string with the new one.

As lists are mutable, their methods act directly on the original data. A common bug occurs in programs when a programmer forgets about this distinction and either uses reassignment on list methods or doesn’t use reassignment when using string methods but is then expecting the variable to contain the new data.
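Here’s a short sketch of both mistakes, so you can recognise them when they happen. The variable name result is only used for this illustration:

```python
many_numbers = [2, 5, 7]

# Mistake 1: reassigning the result of a list method.
# append() changes the list directly and returns None
result = many_numbers.append(23)
print(result)        # None
print(many_numbers)  # [2, 5, 7, 23]

# Mistake 2: forgetting to reassign when using a string method.
# upper() returns a new string, which is thrown away here
my_name = "Stephen"
my_name.upper()
print(my_name)       # Stephen
```

If you had written many_numbers = many_numbers.append(23), you would have replaced your list with None.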

Data Structures

You’ve encountered several data types already. Some data types, called data structures, store a collection of values. Python has three basic data structures. You’ve already used one of these, the list. A list is a sequence and an iterable, and it’s mutable. You’ll shortly learn about the other two basic data structures in Python: tuples and dictionaries.

Python has many other data types, including data structures. In the second part of this book, you’ll learn about more advanced data structures used for dealing with quantitative datasets in science, finance, and other data-driven fields.

Tuples

The second of Python’s basic data structures is the tuple. You have a choice on how to pronounce this! Some pronounce the term as tup-el, rhyming with couple. Others pronounce it as two pill. Although tuple is not a common English word (it’s a term that appears in mathematics), it’s the same root that appears at the end of words such as triple, quadruple, or multiple. And you’ll see the link between tuple and these words soon.

You can create a tuple in a similar way as you would a list. However, the type of bracket associated with a tuple is the parentheses instead of the square brackets:

>>> some_numbers = (3, 5, 67, 12, 3, 5)
>>> type(some_numbers)
<class 'tuple'>

You can create a list with the same numbers so that you can explore the differences and similarities between lists and tuples:

>>> same_numbers_in_list = [3, 5, 67, 12, 3, 5]
>>> type(same_numbers_in_list)
<class 'list'>

Tuples are also sequences and you can use indexing and slicing on tuples in the same way as you do with lists:

>>> some_numbers[1]
5
>>> some_numbers[2:4]
(67, 12)
>>> some_numbers[-1]
5

Notice that when you use indexing or slicing on any sequence, you’ll always use the square brackets immediately after the variable name, even for sequences that are not lists.

You can check whether tuples are iterable by using the tuple in a for loop and see whether you get an error:

>>> for item in some_numbers:  # some_numbers is a tuple
...     print(item)
...     
3
5
67
12
3
5

You’ve been able to iterate through the tuple some_numbers successfully. This means tuples are iterable.

You can now check for mutability. You can compare reassigning the value for one of the items in a tuple with the case when you’re using a list:

>>> same_numbers_in_list[2] = 1000
>>> same_numbers_in_list
[3, 5, 1000, 12, 3, 5]

When using a list such as same_numbers_in_list, you can reassign one of the list items. However, if you try to do the same thing with a tuple, you’ll get an error:

>>> some_numbers[2] = 1000
Traceback (most recent call last):
  File "<input>", line 1, in <module>
TypeError: 'tuple' object does not support item assignment

This is the same error you got when you tried to reassign a new letter into a string. Tuples, like strings, are immutable. A tuple is an immutable sequence.

At this point, you may be wondering why you need tuples at all. They seem to be like lists but with fewer features! However, when you want to create a sequence of items and you know that this sequence will not change in your code, creating a tuple is the safer option. It makes it less likely that your code will accidentally change a value in the tuple, because you’ll get an error if your code tries to do so. Using a tuple when you know the data should not change means you’re less likely to introduce bugs in your code.

Tuples are also more memory efficient than lists, but you don’t need to worry about memory issues for the time being.
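If you’re curious, you can get a rough idea of the difference with the getsizeof() function from Python’s sys module. This is only a sketch: the numbers you’ll see are in bytes, they depend on your version of Python, and they don’t include the memory used by the items themselves:

```python
import sys

some_numbers = (3, 5, 67, 12, 3, 5)
same_numbers_in_list = [3, 5, 67, 12, 3, 5]

tuple_size = sys.getsizeof(some_numbers)
list_size = sys.getsizeof(same_numbers_in_list)

# The exact values vary, but the tuple should be the smaller of the two
print(tuple_size, list_size)
```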

Although the round brackets, or parentheses, are the type of brackets associated with tuples, you can also omit the parentheses altogether when creating tuples:

>>> some_numbers = 3, 5, 67, 12, 3, 5
>>> type(some_numbers)
<class 'tuple'>

As with lists, you can store any data type in a tuple, including other data structures:

>>> another_tuple = (3, True, [4, 5], "hello", (0, 1, 2), 5.5)

This tuple contains six items:

an int
a bool
a list
a str
a tuple
a float

For the time being, you don’t need to worry about whether to use a list or a tuple. However, you’ll come across tuples as you code as Python uses this data structure often.
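As one example of Python using tuples behind the scenes, the built-in function divmod() returns two numbers packaged as a tuple. The sketch below also shows that you can unpack a tuple into separate variables, which is something not covered so far in this Chapter. The variable names are made up for illustration:

```python
# divmod() returns the whole-number result of a division and the remainder
result = divmod(17, 5)
print(result)        # (3, 2)
print(type(result))  # <class 'tuple'>

# Each item in the tuple can be assigned to its own variable
quotient, remainder = result
print(quotient)      # 3
print(remainder)     # 2
```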

Dictionaries

The third basic data structure is the dictionary. Imagine you need to store the test marks for students in a class. You could do this with two lists:

>>> student_names = ["John", "Kate", "Trevor", "Jane", "Mark", "Anne"]
>>> test_results = [67, 86, 92, 55, 59, 79]

As long as the names and test results are stored in the same order, you can then extract the values in the same positions in each list. For example, if you extract the second item in each list, you’ll have the name "Kate" and her test result, 86.

Although you can do this, it’s not ideal. If you want to find Mark’s score, you’ll first need to find out which position in the list Mark occupies and then get the value that’s in the same position in the second list. And what if Trevor leaves the school and needs to be removed from the lists? You’ll need to make sure that his name and his mark are both removed from the two separate lists. Things will get even more complicated if you need to store the test marks for several subjects for each student instead of storing just the test mark for one subject.
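To see how clumsy this can get, here’s a sketch of the steps described above. It uses the list method index(), which you haven’t needed so far and which returns the position of a value in a list. The variable name position is made up for this example:

```python
student_names = ["John", "Kate", "Trevor", "Jane", "Mark", "Anne"]
test_results = [67, 86, 92, 55, 59, 79]

# Finding Mark's score takes two steps
position = student_names.index("Mark")
print(test_results[position])  # 59

# Removing Trevor means changing both lists, using the same position
position = student_names.index("Trevor")
student_names.pop(position)
test_results.pop(position)

print(student_names)  # ['John', 'Kate', 'Jane', 'Mark', 'Anne']
print(test_results)   # [67, 86, 55, 59, 79]
```

If you forget one of the two pop() calls, the names and marks will no longer line up, and Python won’t warn you.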

In the White Room analogy, you’re creating two separate boxes, one labelled student_names and the other test_results. Although you are aware that these two boxes have data related to each other, the computer program does not know these boxes are linked. There’s a better way to store this kind of linked information: dictionaries.

You’ve already seen how the square brackets are associated with lists and the round brackets with tuples. Luckily, you still have more types of brackets left on your keyboard! It’s time to use the curly brackets now. You can create a dictionary with all the student names and marks:

>>> student_marks = {"John": 67, "Kate": 86, "Trevor": 92, "Jane": 55, "Mark": 59, "Anne": 79}
>>> type(student_marks)
<class 'dict'>

In addition to using curly brackets, you’ll have noticed another difference when creating a dictionary. Each item in a dictionary contains two parts separated by a colon. The dictionary student_marks contains six items, not twelve. The first item is the pair "John" and 67, and so on.

The first part of each item is called the key. The second part is called the value. So in the first item of the dictionary above, the key is "John", and the value is 67.

The order of the items is not what defines a dictionary. This is different to lists, tuples, and strings, in which the order of items is a defining feature of the data structure. Python 3.7 and later versions do remember the order in which you add items to a dictionary, but you don’t access items by their position. What matters in a dictionary is the association between the key and its value. If you try to access an item from a dictionary using the same indexing you’ve used for lists, tuples, and strings, you’ll get an error:

>>> student_marks[2]
Traceback (most recent call last):
  File "<input>", line 1, in <module>
KeyError: 2

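Trying to access the third item in the dictionary doesn’t make sense as a dictionary is not a sequence. Python treats the 2 in the square brackets as a key, and since there’s no key 2 in this dictionary, you get a KeyError. Instead, you can use the key to access values from within the dictionary: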

>>> student_marks["Trevor"]
92

Let’s see whether dictionaries are mutable:

>>> student_marks["Kate"] = 99
>>> student_marks
{'John': 67, 'Kate': 99, 'Trevor': 92, 'Jane': 55, 'Mark': 59, 'Anne': 79}

The answer is ‘yes’. Dictionaries are a mutable data type. You’ve changed the value associated with "Kate" to 99. You can even increment the value in a dictionary:

>>> student_marks["Anne"] = student_marks["Anne"] + 1
>>> student_marks
{'John': 67, 'Kate': 99, 'Trevor': 92, 'Jane': 55, 'Mark': 59, 'Anne': 80}

Python will look at what’s on the right of the assignment operator = where it will read the value associated with the key "Anne". The program will then add 1 to the integer 79. The result is then reassigned to the key "Anne".

What happens if you try to access the value of a key that doesn’t exist?

>>> student_marks["Matthew"]
Traceback (most recent call last):
  File "<input>", line 1, in <module>
KeyError: 'Matthew'

The key "Matthew" doesn’t exist, so you get a KeyError. However, you can create a new key if you assign a value to it:

>>> student_marks["Matthew"] = 50
>>> student_marks
{'John': 67, 'Kate': 99, 'Trevor': 92, 'Jane': 55, 'Mark': 59, 'Anne': 80, 'Matthew': 50}
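Two more things you may find handy at this point are checking whether a key exists and removing an item. The following is a sketch using a shorter version of the dictionary. It uses the in keyword and the dictionary method pop(), which works with a key rather than a position. The variable name removed_mark is made up for this example:

```python
student_marks = {"John": 67, "Kate": 99, "Trevor": 92}

# Check whether a key exists before trying to use it
print("Trevor" in student_marks)   # True
print("Matthew" in student_marks)  # False

# pop() removes the item with this key and returns its value
removed_mark = student_marks.pop("Trevor")
print(removed_mark)                # 92
print(student_marks)               # {'John': 67, 'Kate': 99}
```

With a dictionary, removing Trevor takes a single step, and his name and mark can’t get out of sync.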

You can now have a look at some of the methods that you can use with dictionaries. You can start with a couple of methods that are not very exciting but can be very useful:

>>> student_marks.keys()
dict_keys(['John', 'Kate', 'Trevor', 'Jane', 'Mark', 'Anne', 'Matthew'])

>>> student_marks.values()
dict_values([67, 99, 92, 55, 59, 80, 50])

Another useful method is get() which allows you to get the value associated with a key. You can already do that using the square brackets notation, student_marks["John"], as you’ve seen above. However, get() doesn’t give an error if you try to access the value of a key that doesn’t exist:

>>> student_marks.get("Anne")
80
>>> student_marks.get("Zahra")

>>>

When you use "Anne" as the argument for get(), the method returns the value 80. However, when the argument is a key that doesn’t exist, get() returns None, which is Python’s way of representing nothing, and the Console doesn’t display anything. You don’t get a KeyError either, though. There are times when you may want your code to be robust so that if you try to access a key that doesn’t exist, the code just carries on running without throwing an error. The get() method also allows you to put in a default value so that you can control what the method returns if the key doesn’t exist:

>>> student_marks.get("Anne", "There is no student with this name")
80
>>> student_marks.get("Zahra", "There is no student with this name")
'There is no student with this name'
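A default value is particularly useful when you’re keeping a tally. The sketch below counts how often each word appears in a short, made-up list of words. It’s only one possible way of counting, and the variable names are my own choice for this illustration:

```python
words = ["to", "be", "or", "not", "to", "be"]
word_counts = {}

for word in words:
    # If the word is not in the dictionary yet, get() returns 0,
    # so the word's first count becomes 1
    word_counts[word] = word_counts.get(word, 0) + 1

print(word_counts)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

Using square brackets on the right-hand side instead of get() would give a KeyError the first time each new word comes up.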

How about for loops? Can you iterate through a dictionary?

>>> for stuff in student_marks:
...     print(stuff)
...     
John
Kate
Trevor
Jane
Mark
Anne
Matthew

You don’t get an error, but you can see that only some of the information has been stored in the variable stuff. When you iterate through a dictionary in this way, you’re iterating through the keys of the dictionary. There may be times you may wish to do this, but in most instances, when you want to iterate through a dictionary, you’ll want to do something different. You’ll learn about another way of iterating through dictionaries in the project you’ll start in the next section.
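In the meantime, since the loop gives you each key, you can use that key inside the loop to fetch its value. This is a minimal sketch using a shorter version of the dictionary, and the list marks_seen is made up so you can check the result after the loop:

```python
student_marks = {"John": 67, "Kate": 99, "Trevor": 92}
marks_seen = []

for name in student_marks:
    # name holds the key, and the square brackets fetch its value
    print(name, student_marks[name])
    marks_seen.append(student_marks[name])

print(marks_seen)  # [67, 99, 92]
```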

The Pride & Prejudice Project: Analysing Word Frequencies

It’s time to start working on a new project to consolidate lots of what you’ve learned so far. Your task is to read and analyse Jane Austen’s Pride and Prejudice. Except, you won’t be reading the novel! In this project, you’ll find all the words that Jane Austen used in the book and how often each one was used.

I mentioned earlier that you wouldn’t be reading the book. However, your computer program will. So, before you start working on the Pride & Prejudice project, you’ll find out how to read data from an external source.

You’ll need the file named pride_and_prejudice.txt for this project which you can get from The Python Coding Book File Repository. You’ll need to place the file in your Project folder.

Download The Python Coding Book File Repository

Through the link above, you can download the folder you need directly to your computer. I would recommend this option, which is the most straightforward. But if you prefer, you can also access the repository through GitHub.

NOTE: As the content of The Python Coding Book is currently being gradually released, this repository is not final, so you may need to download it again in the future when there are more files that you’ll need in later chapters.

Making files accessible to your project

The simplest way to make sure you can access a file from your Python project is to place it in the project folder—the same folder where your Python scripts are located. If you’re using an IDE such as PyCharm, you can drag a file from your computer into the Project sidebar to move the file.

Alternatively, you can locate the folder containing your Python scripts on your computer and simply move the files you need into that folder as you would move any other file on your computer.

Tip: In PyCharm, if the Project sidebar is open you can click on the project name (top line) and then show the contextual menu with a control-click (Mac) or right-click (Windows/Linux). One of the options will be Reveal in Finder or Show in Explorer depending on what operating system you’re using.

Reading Data From a Text File

In all the programs you’ve written so far, all the data you used was data you typed into your program directly. This is rarely the case in real life. In many coding applications, your program will need access to data available in some other form. In the second half of this book, you’ll spend a lot of time looking at various ways of importing and accessing data from external sources. Here, you’ll look at one of the most basic yet very useful ways of importing data: reading from a text file.

The text of the book is in the text file pride_and_prejudice.txt. This is a normal text file that you can double-click on any computer and open and read with standard software on your computer. You’ll need to bring the contents of this file into your Python environment before you can start working with it.

Just as you can open a file in your computer’s operating system, you can open the file within your Python program. It’s time to open a new Python script and get started:

open("pride_and_prejudice.txt")

The function open() is a built-in function that does just what it says. However, you won’t see a new window open up in the same way as when you double-click the file on your computer. You’re used to the concept that many functions return something to the scope in which they’re called. So let’s store whatever is returned from open() in a variable and see what it is:

file = open("pride_and_prejudice.txt")

print(file)

The output from this code is not quite what you might expect:

<_io.TextIOWrapper name='pride_and_prejudice.txt' mode='r' encoding='UTF-8'>

Let’s ignore this for now. You can think of the object returned from open() as a handle to the file that you’ve just opened. The data type of this object is _io.TextIOWrapper. It’s best to ignore anything that looks obscure and unreadable for the time being! Instead, you can see what things you can do with the variable file by typing it in your script followed by a dot. If you’re using an IDE such as PyCharm, you’ll see a list of methods appear for you to choose from. One of these methods is read():

file = open("pride_and_prejudice.txt")

text = file.read()
print(text)

The method read() will, you’ve guessed it, read the contents of the file. You’ve stored the data returned by read() in another variable text, and when you print text, you’ll see the entire contents of the file output by your code. You can probably guess what data type text is:

file = open("pride_and_prejudice.txt")

text = file.read()
print(type(text))

You’ll find that read() returns a string. From this point onwards, the content of the text file is stored in your program as one of Python’s most basic data types. You have brought in data from the outside world into your Python program.

Note that the first and last lines of the text file are not part of the book but are the credit for this open-source ebook that we’re using. You can remove these lines from your version if you wish.

Before we proceed, there’s some housekeeping you need to do. You should never leave a file open longer than you need to. The simplest way to take care of this is by closing the file once you’ve brought the data into a Python variable:

file = open("pride_and_prejudice.txt")

text = file.read()
file.close()
print(text)

Although opening and closing the file using the open() built-in function and the close() method is the most straightforward way, Python offers a better way to ensure that no files stay open. This uses the with keyword:

with open("pride_and_prejudice.txt") as file:
    text = file.read()

print(text)

The code above achieves the same goal as the previous version. The file is automatically closed once the code within the with block is executed. Feel free to use whichever option you want for the time being. Moving forward, you’ll want to shift to using the latter option, which is preferable.
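If you’d like to convince yourself that the with block really does close the file, you can check the file object’s closed attribute. The sketch below is only an illustration: it creates a small throwaway file first (the name demo_with.txt and the temporary folder are placeholders, not part of the P&P project) so that it runs without the novel.

```python
import os
import tempfile

# Create a small throwaway file so the example runs on its own.
# The folder and file name are placeholders for this illustration.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "demo_with.txt")

with open(path, "w") as file:
    file.write("It is a truth universally acknowledged")

with open(path) as file:
    text = file.read()
    inside = file.closed   # False: you're still inside the with block

outside = file.closed      # True: leaving the with block closed the file

print(inside, outside)
print(text)
```

The variable file still exists after the with block, but the file it refers to is closed, so you can no longer read from it.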

The P&P Project: Reading and Cleaning the Data

From the point of view of your program, all you’ve got so far is a single string. A very long, single string. What you ideally need is the individual words. One of the Python string methods will come to your rescue for this task. You can explore this method in the Console:

>>> some_text = "This is a sentence which is stored as one single string"
>>> some_text.split()
['This', 'is', 'a', 'sentence', 'which', 'is', 'stored', 'as', 'one', 'single', 'string']

The split() method, which acts on a string, returns a list of strings in which each word is an individual item within the list. By default, split() separates the string wherever there is whitespace, which includes spaces, tabs, and newline characters. You’ll see how to use split() in a more flexible way later in this book.
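A text file such as a novel contains line breaks as well as spaces, so it’s worth seeing how split() copes with both. This short sketch uses a made-up string containing a double space, a tab, and a newline:

```python
messy_text = "one  two\tthree\nfour"

# With no argument, split() treats any run of whitespace as one separator
print(messy_text.split())
# ['one', 'two', 'three', 'four']

# With an explicit separator, only that exact string is used
print(messy_text.split(" "))
# ['one', '', 'two\tthree\nfour']
```

Calling split() with no argument is the convenient choice when all you want is the words.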

You can now use this method on the Pride and Prejudice text:

with open("pride_and_prejudice.txt") as file:
    text = file.read()

words = text.split()
print(words)

Have a look at the words in the list that’s printed out. I’m not showing the output here as Jane Austen wrote many words! Your task is to find all the words that have been used in the book. Do you mind whether a word is capitalised in some instances and not in others? No, not for the problem you’re trying to solve. How about whether a word comes at the end of a sentence and is followed by a full stop or period? Or perhaps a comma or other punctuation mark?

In any project in which you bring in data from an external source, you’ll need to clean the data before you can start using it. What’s required to clean the data will depend on what data you have and what you want to do with the data. In this case, cleaning the data means removing capitalisation and removing punctuation marks so that you’re left with just the lowercase words.

You can start by removing all capitalisation in the text. Earlier in this Chapter, you used the lower() string method, which returns an all-lowercase copy of a string. You have a list in which each item is a string. You can use the lower() method on each of these strings using a for loop.

However, there’s a simpler way. When you write a computer program, you’ll never write your code from line 1 to the last line in order. You’ll jump back and forth as you add code and make changes throughout your script. In this case, you’re better off returning to where you had a single, long string and applying the lower() method to that string:

with open("pride_and_prejudice.txt") as file:
    text = file.read()

text.lower()

words = text.split()
print(words)

Did this work? Are all the words in the list you print out lowercase?

No, they’re not. You’ll recall when we talked about mutable and immutable data types and compared the difference between how list methods and string methods behave. String methods do not change the string they’re acting on. Instead, they return a copy. You can override this by reassigning the string returned to the same variable:

with open("pride_and_prejudice.txt") as file:
    text = file.read()

text = text.lower()

words = text.split()
print(words)

The output from this code is the following—the list is truncated in the output shown here for display purposes:

['the', 'project', 'gutenberg', 'ebook', 'of', 'pride', 'and', 'prejudice,', 'by', 'jane', 'austen', 'pride', 'and', 'prejudice', 'by', 'jane', 'austen', 'chapter', '1', 'it', 'is', 'a', 'truth', 'universally', 'acknowledged,', 'that', 'a', 'single', 'man', 'in', 'possession', 'of', 'a', 'good', 'fortune,', 'must', 'be', 'in', 'want', 'of', 'a', 'wife.', 'however', 'little', 'known', 'the', 'feelings', 'or', 'views', 'of', 'such', 'a', 'man', 'may', 'be', 'on', 'his', 'first', 'entering', ...

The first few words in the list highlight a few problematic entries. If you look at the 8th and 14th entries, you’ll see "prejudice," and "prejudice". They’re the same word, but not for Python:

>>> "prejudice," == "prejudice"
False

The extra comma in the first case makes the strings different. Since you’ll need to identify which words are repeated, you’ll need to remove this comma and all other punctuation marks. Earlier in this Chapter, you came across another string method that will come in useful here, replace(). You can start by removing all the commas first:

with open("pride_and_prejudice.txt") as file:
    text = file.read()

text = text.lower()
text = text.replace(",", " ")

words = text.split()
print(words)

You’re replacing all commas with a space in the whole text before splitting the string into a list of words:

['the', 'project', 'gutenberg', 'ebook', 'of', 'pride', 'and', 'prejudice', 'by', 'jane', 'austen', 'pride', 'and', 'prejudice', 'by', 'jane', 'austen', 'chapter', '1', 'it', 'is', 'a', 'truth', 'universally', 'acknowledged', 'that', 'a', 'single', 'man', 'in', 'possession', 'of', 'a', 'good', 'fortune', 'must', 'be', 'in', 'want', 'of', 'a', 'wife.', 'however', 'little', 'known', 'the', 'feelings', 'or', 'views', 'of', 'such', 'a', 'man', 'may', 'be', 'on', 'his', 'first', 'entering', ...

The commas have all gone. The commas you still see are the ones separating the words in the list—these are Python commas and not commas in the text. You’ll now need to repeat this for all the punctuation marks. You’ve seen that you have two options for repeating code, the for and while loops. Since you need to loop over all punctuation marks, the for loop is the best route to take. You can start by selecting a few punctuation marks:

with open("pride_and_prejudice.txt") as file:
    text = file.read()

text = text.lower()
for punctuation_mark in ".,?!;:-":
    text = text.replace(punctuation_mark, " ")

words = text.split()
print(words)

You can loop over any iterable data type, and a string is an iterable. Therefore, you can write a string with several punctuation marks in the for statement. The variable punctuation_mark will be equal to "." in the first iteration, and therefore the replace() method will replace all full stops with a space. In the second iteration of the for loop, punctuation_mark will be equal to ",", and so on.

You can scan through the output of your code to make sure that none of the punctuation marks you listed in the for loop statement are there.
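Scanning a whole novel by eye is tedious, so you could also let Python do the checking. The sketch below works on a short sample sentence rather than the full book, and the variable names marks and leftovers are just suggestions:

```python
sample = "It is a truth universally acknowledged, that a single man... must be in want of a wife!"
marks = ".,?!;:-"

text = sample.lower()
for punctuation_mark in marks:
    text = text.replace(punctuation_mark, " ")

# Collect any of the listed marks that are still present in the text
leftovers = []
for punctuation_mark in marks:
    if punctuation_mark in text:
        leftovers.append(punctuation_mark)

print(leftovers)  # [] means none of the listed marks are left
print(text.split())
```

An empty list tells you that every mark you listed has been removed. It doesn’t tell you about marks you forgot to list.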

Now you need to try to think of all possible punctuation marks. Or just rely on the ready-to-use string that Python has in one of its built-in modules called string. You can explore this in the Console first:

>>> import string
>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

This module contains a string named punctuation that contains the standard ASCII punctuation characters. You can now remove these punctuation marks from Pride and Prejudice:

import string

with open("pride_and_prejudice.txt") as file:
    text = file.read()

text = text.lower()
for punctuation_mark in string.punctuation:
    text = text.replace(punctuation_mark, " ")

words = text.split()
print(words)

You’ve completed the first part of this project. You’ve read in the text from a text file and cleaned the data. You can now remove the final print() as you no longer need to see the list being displayed every time you run your script.
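One thing worth checking in your own output: string.punctuation only lists ASCII characters, so curly quotation marks, which many ebook files use, are not in it. Whether your copy of the text contains them depends on the version you downloaded, so treat the sample sentence below as a made-up illustration rather than a line from your file. If you do spot leftovers, one possible fix is to add the extra characters to the string you loop through:

```python
import string

# A made-up sample that uses curly quotes
sample = "“My dear Mr. Bennet,” said his lady"

text = sample.lower()
for punctuation_mark in string.punctuation:
    text = text.replace(punctuation_mark, " ")

words = text.split()
print(words)
# ['“my', 'dear', 'mr', 'bennet', '”', 'said', 'his', 'lady']

# Add the characters that string.punctuation doesn't know about
extra_marks = "“”‘’"
for punctuation_mark in string.punctuation + extra_marks:
    text = text.replace(punctuation_mark, " ")

cleaned_words = text.split()
print(cleaned_words)
# ['my', 'dear', 'mr', 'bennet', 'said', 'his', 'lady']
```

Which extra characters you need, if any, depends on what you find in your own data.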

You learned about commenting as one way of making your code more readable. You can add a few concise and well-placed comments in your code:

import string

with open("pride_and_prejudice.txt") as file:
    text = file.read()

# Clean data by removing capitalisation and punctuation marks
text = text.lower()
for punctuation_mark in string.punctuation:
    text = text.replace(punctuation_mark, " ")

# Split string into a list of strings containing all the words
words = text.split()

There isn’t an ideal number of comments you’ll need to put in. This depends on your preference and style. I chose not to comment the first section where the file is being opened as the code seems clear enough. The comment in the final block is there to remind the reader that words is a list of strings. As programs grow, you may lose track of what data type different variables are storing, so a comment such as this can be useful.

Comments are not the only way to make the code more readable. Choosing descriptive variable names is just as important. Naming the variable defined in the for statement punctuation_mark makes the for loop more readable without the need for a comment.

The P&P Project: Analysing the Word Frequencies

You’ve converted the text into a list of words that have been cleaned. How do you proceed from this point? Let’s time travel back to the pre-computer age, and you can ask yourself: “How would I perform this task using pen and paper if I had lots of time to spare?”

Try writing down the steps you’ll need before reading on.

Here are the steps you’ll need to do to solve this problem:

Look at the first word and write it down on a sheet of paper. Add the number 1 next to it to show it’s the first time you found this word.
Move to the next word. Have you already encountered this word before?
- If this isn’t the first time you’ve found this word, find it on your sheet of paper and add 1 to the number next to it.
- If it’s the first time you came across this word, write it down at the bottom of the paper and add 1 next to it.
Repeat these steps until you’ve gone through the whole book.
Stop often to make yourself strong coffees!

You’ve now created the algorithm you’ll need to follow. Your next task is to translate these steps from English into Python, except for the last step that won’t translate well!

Before you can start writing the code for this next section, you have another important decision to make. What’s the best way to store your data as you go along? What data structure should you use? Although you can store the data in two lists, one containing the words and the other containing the number of times each word appears in the book, you’ve learned earlier in this Chapter that there’s a better way. You can use a dictionary.
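To see why the two-list option is the clumsier one, here’s a rough sketch of what it might look like on a handful of words. This is not the approach used in the project. The list names unique_words and counts are made up for this comparison:

```python
words = ["the", "cat", "saw", "the", "other", "cat"]

unique_words = []
counts = []

for word in words:
    if word in unique_words:
        # Find the word's position, then update the matching count
        position = unique_words.index(word)
        counts[position] = counts[position] + 1
    else:
        unique_words.append(word)
        counts.append(1)

print(unique_words)  # ['the', 'cat', 'saw', 'other']
print(counts)        # [2, 2, 1, 1]
```

The two lists are only linked by position, so you need to keep them in step yourself. A dictionary keeps each word and its count together.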

It’s useful to create a dummy version of the dictionary you want in the Console to visualise it and so that you can experiment with it as needed:

>>> some_words = {"hello": 3, "python": 8, "bye": 1}

>>> some_words["computer"] = 2  # Add new word
>>> some_words
{'hello': 3, 'python': 8, 'bye': 1, 'computer': 2}

>>> some_words["python"] = some_words["python"] + 1  # Increment count for existing word
>>> some_words
{'hello': 3, 'python': 9, 'bye': 1, 'computer': 2}

Now you know what form your dictionary will take and how to add new words and increment the count for words already in the dictionary, you can go back to the P&P script. You’ll first need to create a variable with an empty dictionary stored in it:

# follows on from code you've already written above
# which reads from file and cleans the data

word_frequencies = {}

Next, you need to repeat the steps you listed earlier, going through each word in the text. You’ll need a for loop for this in which you iterate through the list of words you created earlier:

# follows on from code you've already written above
# which reads from file and cleans the data

word_frequencies = {}

for word in words:
    if word not in word_frequencies.keys():
        word_frequencies[word] = 1

It is good practice to use the singular version of the name you used for your list as the variable you define in the for loop statement, as in for word in words. This makes the code more readable.

For each word in the list of words, you’re checking if the word is already in the dictionary word_frequencies. You can deconstruct what’s happening in the if statement by going back to the dummy dictionary you created in the Console:

>>> # using the same variable some_words you created and modified earlier
>>> some_words
{'hello': 3, 'python': 9, 'bye': 1, 'computer': 2}

>>> some_words.keys()
dict_keys(['hello', 'python', 'bye', 'computer'])

>>> 'python' in some_words
True

>>> 'monday' in some_words
False

>>> 'monday' not in some_words
True

The dictionary method keys() returns an object containing all the keys in the dictionary. The keyword in returns True or False based on whether the string you use is one of those keys. As the examples above show, you can also use in directly on the dictionary, since Python looks through the keys by default, so using some_words.keys() or some_words gives the same result here. You haven’t come across the keyword not so far, but you can guess what it does from the example above!
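A short experiment can make the behaviour of in with dictionaries clearer. Using in on the dictionary itself checks the keys, which is the same as checking keys() explicitly. Values are not searched unless you ask for them with the values() method:

```python
some_words = {"hello": 3, "python": 9, "bye": 1, "computer": 2}

print("python" in some_words)         # True
print("python" in some_words.keys())  # True, same check written out in full
print(9 in some_words)                # False, 9 is a value and not a key
print(9 in some_words.values())       # True
print("monday" not in some_words)     # True
```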

There’s one last step left. You’ll need to add an else statement to follow the if to account for words that are already in the dictionary. This will happen when the for loop comes across a word for the second time, and third, and so on:

# follows on from code you've already written above
# which reads from file and cleans the data

word_frequencies = {}

# Loop through list of words to populate dictionary
for word in words:
    if word not in word_frequencies.keys():
        word_frequencies[word] = 1
    else:
        word_frequencies[word] = word_frequencies[word] + 1

Time to reveal what the variable word_frequencies contains by printing it:

print(word_frequencies)

The output is a very long dictionary—only a truncated version is displayed here, but you’ll be able to see the whole dictionary in your version:

{'the': 4333, 'project': 3, 'gutenberg': 2, 'ebook': 2, 'of': 3614, 'pride': 50, 'and': 3587, 'prejudice': 8, 'by': 638, 'jane': 294, 'austen': 3, 'chapter': 61, '1': 1, 'it': 1535, 'is': 860, 'a': 1954, 'truth': 27, 'universally': 3, 'acknowledged': 20, 'that': 1579, 'single': 12, 'man': 151, 'in': 1880, 'possession': 9, 'good': 201, 'fortune': 39, 'must': 308, 'be': 1241, 'want': 44, 'wife': 47, 'however': 134, 'little': 189, 'known': 58, 'feelings': 86, 'or': 299, ...

You have a dictionary with all the words in the book and the number of times each word is used. You can find out how many unique words there are in the book:

print(len(word_frequencies))

The length of the dictionary word_frequencies is the number of key-value pairs. This shows us that there are 6324 unique words in the book. I’m including the first and last lines of the text file which include the credit to the data source. These credits include a few words that are not in the actual book. If you’ve removed these lines, you’ll get a slightly lower number.
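With thousands of words, it’s hard to tell whether the counts are right. One way to build confidence is to run the same logic on a sentence short enough to count by hand. The sentence below is made up for this purpose:

```python
words = "the cat saw the other cat and the dog".split()

word_frequencies = {}
for word in words:
    if word not in word_frequencies:
        word_frequencies[word] = 1
    else:
        word_frequencies[word] = word_frequencies[word] + 1

print(word_frequencies)
# {'the': 3, 'cat': 2, 'saw': 1, 'other': 1, 'and': 1, 'dog': 1}

print(len(words))             # 9 words in total
print(len(word_frequencies))  # 6 unique words
```

If the small example gives the answer you expect, you can be more confident about the numbers for the whole book.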

“Can I sort the dictionary based on word frequencies?” is a common question I’m asked. The answer is yes. Although it is possible to do this directly within Python, you’ll take a different approach in the next section, which will then allow you to sort using a tool you’re probably already familiar with.
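If you’re curious about the Python route, here’s one possible sketch using the built-in sorted() function on a small made-up dictionary. It relies on the key argument of sorted() and the dictionary method get(), neither of which this Chapter has covered, so treat it as a look ahead and not as part of the project:

```python
word_frequencies = {"the": 3, "cat": 2, "saw": 1, "other": 1, "and": 1, "dog": 1}

# sorted() loops through the dictionary's keys. key=word_frequencies.get
# tells it to order those keys by their values instead of alphabetically.
most_common_first = sorted(word_frequencies, key=word_frequencies.get, reverse=True)

# Show the three most frequent words with their counts
for word in most_common_first[:3]:
    print(word, word_frequencies[word])
```

The result is a list of words ordered from most to least frequent. The dictionary itself is left unchanged.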

Looping Through a Dictionary

Earlier in this Chapter, when you first learned about dictionaries, you saw that you could loop through a dictionary. However, you could only iterate through the keys of the dictionary. Let’s look at a better way of looping through a dictionary with a for loop.

A very brief detour first. Create a tuple containing two items:

>>> numbers = (5, 2)

You can unpack this tuple into two separate variables:

>>> first, second = numbers

>>> first
5
>>> second
2

Unpacking works with other sequences as well, not just tuples. Now, let’s look at another dictionary method called items(). You can use the same dummy dictionary you used earlier when experimenting in the Console:

>>> some_words = {'hello': 3, 'python': 9, 'bye': 1, 'computer': 2}
>>> some_words.items()
dict_items([('hello', 3), ('python', 9), ('bye', 1), ('computer', 2)])

This method returns an object of type dict_items. It doesn’t matter too much what this data type is. What matters is that it’s a sequence in which each key-value pair is grouped in a tuple. You can see that the first item in this sequence is ('hello', 3), and so on. You can therefore use the sequence returned from items() in a for loop:

>>> for something in some_words.items():
...     print(something)
...     
('hello', 3)
('python', 9)
('bye', 1)
('computer', 2)

You can now loop through a dictionary and get access to all the information in the dictionary, not just the keys. The variable something contains a tuple with two items. So you can unpack this tuple. Indeed, you can do the unpacking directly in the for statement:

>>> for word, frequency in some_words.items():
...     print(word)
...     print(frequency)
...     
hello
3
python
9
bye
1
computer
2

You’re now defining two variables in the for statement, word and frequency, and assigning values to them by unpacking the tuples in the sequence returned by some_words.items(). Therefore, you have access to both the key and the value in each iteration of the for loop.
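Having both the key and the value available in each iteration lets you answer questions about the dictionary as you loop. As a small sketch, here’s one way you might find the most frequent word in the dummy dictionary. The variable names most_common_word and highest_frequency are just suggestions:

```python
some_words = {"hello": 3, "python": 9, "bye": 1, "computer": 2}

most_common_word = ""
highest_frequency = 0

for word, frequency in some_words.items():
    # Keep track of the largest frequency seen so far
    if frequency > highest_frequency:
        most_common_word = word
        highest_frequency = frequency

print(most_common_word, highest_frequency)  # python 9
```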

Writing Data To a Spreadsheet

You almost have all the tools you need to go through the dictionary containing words and word frequencies and write its contents into a spreadsheet. There’s only one thing missing. You’ve seen how to open and read a file. You can also open and write to a file:

>>> file = open("test_file.txt", "w")
>>> file.write("Good Morning.\nI'm writing this to a file. Hurray!")
49
>>> file.close()

You’ve opened the file in a similar way as before, with one difference. There are now two arguments in open():

the name of the file as a string, including the file extension
a second string that shows the mode you want to open the file in

The string "w" stands for write. You’re opening the file in write-mode. The file test_file.txt did not exist in your folder, but opening a file in write-mode creates the file if it doesn’t exist.

A word of warning: if you open a file that already exists using the "w" argument, your file will be overwritten. Always make sure you have backups of any files you want to use to avoid accidentally deleting the contents of your file. Another mode available with open() is "a", which stands for append. A file opened in append-mode will allow you to write at the end of an existing file without deleting its existing contents.

Earlier in this Chapter when you opened pride_and_prejudice.txt, you only used the file name as an argument when calling open(). In this case, the mode defaults to "r", which stands for read—the file was opened in a read-only mode that keeps the original file safe from accidental modification in your Python code.
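The three modes are easier to compare side by side. The sketch below writes to a throwaway file in a temporary folder (the file name modes_demo.txt is a placeholder), so you can experiment without risking any of your own files:

```python
import os
import tempfile

# Work in a temporary folder so no real files are affected
folder = tempfile.mkdtemp()
path = os.path.join(folder, "modes_demo.txt")

with open(path, "w") as file:   # "w" creates the file
    file.write("first line\n")

with open(path, "a") as file:   # "a" adds to the end
    file.write("second line\n")

with open(path) as file:        # no mode given, so "r" is used
    after_append = file.read()

with open(path, "w") as file:   # "w" again wipes the existing contents
    file.write("replaced\n")

with open(path) as file:
    after_overwrite = file.read()

print(after_append)
print(after_overwrite)
```

After the second use of "w", the two original lines are gone and only the new text remains.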

If you locate your project folder on your computer, you’ll now be able to find a new file called test_file.txt. Opening this file will show you a text file containing the following:

Good Morning.
I'm writing this to a file. Hurray!

If you look back at the string you used as an input argument in write(), you’ll notice the "\n" after the first full stop. This is called an escape sequence. Escape sequences in strings start with a backslash. This one represents the newline character. You’ll notice that the "\n" is not printed out in the text file but a new line starts at that point.

You may also have noticed that the write() method returned the integer 49. This is the number of characters written to file. You won’t need this value, so you can ignore it.
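You can check where the number 49 comes from without writing to a file at all. Although "\n" takes two keystrokes to type, Python stores it as a single character, as this short sketch shows:

```python
message = "Good Morning.\nI'm writing this to a file. Hurray!"

print(len(message))        # 49, the newline counts as one character
print(len("\n"))           # 1
print("Word\tFrequency")   # \t is another escape sequence, a tab
print(repr(message))       # repr() shows the escape sequence instead of acting on it
```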

You can now return to the P&P project, and you’re ready to export the data to a spreadsheet. You’ll use the CSV file format, which is the most straightforward spreadsheet format. CSV stands for comma-separated values. A CSV file is a standard text file which has the .csv file extension in which each value is separated by a comma, as the name implies. These values are the contents of each cell in a spreadsheet. Your computer will open a CSV file with your default spreadsheet software.

The steps you’ll need to create a spreadsheet are:

open a write-enabled file
write the header line of the spreadsheet
loop through the dictionary word_frequencies and write each key-value pair to the file
close the file (unless you’re using a with statement)

You can now translate these steps into Python:

# follows on from existing code written earlier in P&P project

# Export words and frequencies to a CSV spreadsheet
file = open("words in Pride and Prejudice.csv", "w")
# Write header line
file.write("Word,Frequency\n")

# Loop through dictionary and write key-value pairs to csv
for word, frequency in word_frequencies.items():
    file.write(f"{word},{frequency}\n")
file.close()

Before the loop, you’re writing the top line of your spreadsheet which is the header row. You’ll need to include the newline character "\n" to show that this is the end of the line.

Within the for loop, you’re using an f-string to write the contents of the variables word and frequency which you’ve defined in the for statement. These represent the keys and values in the dictionary. There is also a comma separating the values and a newline character at the end.

You can now find a file called words in Pride and Prejudice.csv in your Project folder. If you double-click this CSV as you would any other file you want to open on your computer, it will open in your standard spreadsheet software. If you want to sort the words based on their frequencies, you can do so in your spreadsheet software.
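Writing the commas yourself works here because the cleaned words can’t contain commas. If your values could contain commas, Python’s built-in csv module is one alternative, as it adds quotation marks where they’re needed. The sketch below is not the approach used in this project. It writes to a temporary folder, and the entry "hello, world" is made up to show the quoting:

```python
import csv
import os
import tempfile

# Made-up data, including one key that contains a comma
word_frequencies = {"the": 3, "hello, world": 1}

folder = tempfile.mkdtemp()
path = os.path.join(folder, "words_demo.csv")

# newline="" is what the csv module's documentation recommends
with open(path, "w", newline="") as file:
    writer = csv.writer(file)
    writer.writerow(["Word", "Frequency"])
    for word, frequency in word_frequencies.items():
        writer.writerow([word, frequency])

# Read the file back to check the comma survived in one piece
with open(path, newline="") as file:
    rows = list(csv.reader(file))

print(rows)
# [['Word', 'Frequency'], ['the', '3'], ['hello, world', '1']]
```

Note that the numbers come back as strings when you read a CSV file, since a CSV file is plain text.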

Here’s the full code for the P&P project. In the version below, the with keyword is used to open the CSV file for writing, in line with what you’ve used earlier for reading the text file and modern best practices in Python. However, if you prefer to use the version with explicit open() and close() for now, you may do so:

import string

####
# PART 1: read and clean data
with open("pride_and_prejudice.txt") as file:
    text = file.read()

# Clean data by removing capitalisation and punctuation marks
text = text.lower()
for punctuation_mark in string.punctuation:
    text = text.replace(punctuation_mark, " ")

# Split string into a list of strings containing all the words
words = text.split()

####
# PART 2: find words and their frequencies
word_frequencies = {}

# Loop through list of words to populate dictionary
for word in words:
    if word not in word_frequencies.keys():
        word_frequencies[word] = 1
    else:
        word_frequencies[word] = word_frequencies[word] + 1

####
# PART 3: Export words and frequencies to a CSV spreadsheet

with open("words in Pride and Prejudice.csv", "w") as file:
    # Write header line
    file.write("Word,Frequency\n")

    # Loop through dictionary and write key-value pairs to csv
    for key, value in word_frequencies.items():
        file.write(f"{key},{value}\n")

In the Snippets section at the end of this Chapter, you’ll see a modified version of this code that packages the various parts into functions. You’ll also add an extra section that creates a simple quiz which will present you with a random word from the book and you’ll need to guess how often it appears in the book.

Conclusion

Data is a key part of every computer program. Programming languages like Python have a large range of data types and data structures to deal with all requirements. Learning the differences and similarities between different data types, how to convert between data types and how to manipulate data stored in variables is a key part of learning to code.

In this Chapter, you’ve covered:

The difference between iterable and non-iterable data types
The difference between mutable and immutable data types
How to use methods associated with data types
How to use tuples and dictionaries
How to read data from a file
How to write data to a file

Your next stop on this journey will be full of errors and bugs! The next Chapter focuses on the differences between errors and bugs and learning how to deal with errors and how to find and fix bugs.

Snippets

List Comprehensions and Other Comprehensions

In this Chapter and in the previous one, you have learned about a common and useful algorithm in programming. You need to populate a list or another data structure. You first create an empty list and then you use a for loop to write the code that will add items to that list. You can also perform any operations that may be needed within the for loop.

For example, let’s assume you have a list of names, and you’d like to create a new list containing the same names in uppercase letters:

names = ["John", "Mary", "Ishan", "Sue", "Gaby"]

new_names = []
for name in names:
    new_names.append(name.upper())

print(new_names)

The list names has five items in it. The output from this code shows a list containing the uppercase versions of these names:

['JOHN', 'MARY', 'ISHAN', 'SUE', 'GABY']

The steps you use in this algorithm are:

Create a new empty list
Iterate through the original list using a for loop
Perform the required operation on each item of the original list and append the result to the new list

These three steps are so common in programming that Python has a shorter way of achieving the same result. You can use list comprehensions:

names = ["John", "Mary", "Ishan", "Sue", "Gaby"]

new_names = [name.upper() for name in names]

print(new_names)

You’ve now replaced the three lines in the original code which create the empty list and then populate it using a for loop with a single line of code. This line is a list comprehension. Let’s look at the various elements on this line:

You’re creating a new variable called new_names in the usual manner with the assignment operator.
On the right-hand side of the = sign, you have the square brackets which indicate a list.
The first term inside the square brackets references a variable name that doesn’t exist yet. You’ll get to this in the next bullet point. The upper() string method is applied to whatever is stored in the variable name and placed in the list.
Following the first term inside the list comprehension, you have a for statement. This for statement is identical to the for statement you used in the original for loop.

Another way to look at what’s happening in this list comprehension is to translate this line into plain English. The translation would read like this: Create a list called new_names and fill it with the output from name.upper() for every name in the list of names.

Efficiency of list comprehensions

List comprehensions make code more compact and quicker to write. However, there are also some efficiency advantages when using list comprehensions compared to the longer version.

When you create an empty list, the computer program doesn’t know how large the list should be and allocates a certain amount of memory for this list. Remember that code is executed one line at a time, and the for loop is only executed when the program runs. Therefore, the program would need to be able to see into the future to know what size this list should be, which is not possible. As you append more items to the list in the for loop, the program may need to make more space available in memory to add the new item. This takes up time.

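In the standard version of Python, a list comprehension still builds the list one item at a time, so it doesn’t know the final size in advance either. Its advantage is that Python uses a dedicated, optimised instruction to add each item, rather than looking up and calling the append() method on every iteration of the loop. This usually makes the comprehension a bit faster than the longer version.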

You’ll read more about the difference in efficiency between these two versions of code in a later Chapter when you’ll also compare this to other ways of solving the same problem.
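In the meantime, you can measure the difference yourself. The sketch below uses the timeit module from Python’s standard library. The function names with_loop and with_comprehension and the repeat counts are arbitrary choices for this illustration. The times you get will depend on your computer and your version of Python:

```python
import timeit

# Repeat the list so there's enough work to measure
names = ["John", "Mary", "Ishan", "Sue", "Gaby"] * 1000

def with_loop():
    new_names = []
    for name in names:
        new_names.append(name.upper())
    return new_names

def with_comprehension():
    return [name.upper() for name in names]

# Run each version 100 times and record the total time in seconds
loop_time = timeit.timeit(with_loop, number=100)
comprehension_time = timeit.timeit(with_comprehension, number=100)

print(f"for loop:      {loop_time:.3f} s")
print(f"comprehension: {comprehension_time:.3f} s")
```

Both functions return the same list, so any difference in the times comes from how the list is built.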

List comprehensions with conditional statements

Let’s assume that in your new list, you’d only like to have the uppercase versions of the names that are four letters long. In the classic version of the algorithm, you can add an if statement to your for loop:

names = ["John", "Mary", "Ishan", "Sue", "Gaby"]

new_names = []
for name in names:
    if len(name) == 4:
        new_names.append(name.upper())

print(new_names)

Only when the length of name is equal to 4 will you add the uppercase version of the name to the new list. This gives the following output:

['JOHN', 'MARY', 'GABY']

You can achieve the same result with a list comprehension:

names = ["John", "Mary", "Ishan", "Sue", "Gaby"]

new_names = [name.upper() for name in names if len(name) == 4]

print(new_names)

This code gives the same output as the longer version before it. You’ve managed to squeeze four lines of code into one.

The translation now reads: Create a list called new_names and fill it with the output from name.upper() for every name in the list of names if the length of the name is 4. The additional part is the final clause, "if the length of the name is 4". You can see that the Python code and its English translation are not that different!

Let’s make one final addition. If the name is not four letters long, then you’d like to place the string "xxxx" in the list instead of the name. In the original version of the code, you can add an else to the if statement:

names = ["John", "Mary", "Ishan", "Sue", "Gaby"]

new_names = []
for name in names:
    if len(name) == 4:
        new_names.append(name.upper())
    else:
        new_names.append("xxxx")

print(new_names)

This code gives the following result:

['JOHN', 'MARY', 'xxxx', 'xxxx', 'GABY']

A list comprehension can also be used to achieve this:

names = ["John", "Mary", "Ishan", "Sue", "Gaby"]

new_names = [name.upper() if len(name) == 4 else "xxxx" for name in names]

print(new_names)

The if is now followed by an else in the list comprehension. You may have noticed that the if/else statement precedes the for statement in this list comprehension. However, when there was no else, as in the previous example, the if statement came after the for statement.

If this were the main text, I’d refer you to a snippet to explain why this is the case, as it’s not key to understanding list comprehensions. However, you’re already reading a snippet! The short answer is that the if/else combination is itself a Python operator called the ternary operator, also known as a conditional expression. Therefore, the term preceding the for statement is name.upper() if len(name) == 4 else "xxxx" which is a valid Python expression on its own. However, I won’t discuss the ternary operator further.

Other comprehensions

List comprehensions are the most commonly used type of comprehension. However, you can use comprehensions with other data structures, too.

If you want to create a dictionary where the keys consist of each name in the list of names, and the value of each key is the length of the name, you could write the following code:

names = ["John", "Mary", "Ishan", "Sue", "Gaby"]

name_lengths = {}
for name in names:
    name_lengths[name] = len(name)

print(name_lengths)

The algorithm is similar to the earlier version with lists. You create an empty dictionary and then iterate through the original list using a for loop to add items to the dictionary. The output from this code is the following:

{'John': 4, 'Mary': 4, 'Ishan': 5, 'Sue': 3, 'Gaby': 4}

You can achieve the same output using a dictionary comprehension:

names = ["John", "Mary", "Ishan", "Sue", "Gaby"]

name_lengths = {name: len(name) for name in names}

print(name_lengths)

The statement before the for keyword now includes the key and the value separated by the colon, as is the case in all dictionaries.

Dictionary comprehensions can get a bit more complex if the keys and values come from different data structures.
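As one example of this, suppose the keys are in one list and the values in another. A possible approach is to pair the items up with the built-in zip() function and unpack each pair in the for part of the comprehension. The list ages below contains made-up values for illustration:

```python
names = ["John", "Mary", "Ishan", "Sue", "Gaby"]
ages = [34, 27, 45, 19, 52]   # made-up values for illustration

# zip() pairs the first name with the first age, and so on
name_ages = {name: age for name, age in zip(names, ages)}

print(name_ages)
# {'John': 34, 'Mary': 27, 'Ishan': 45, 'Sue': 19, 'Gaby': 52}
```

The unpacking in for name, age works in the same way as when you loop through the items of a dictionary.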

You may be tempted to try the same thing with a tuple:

names = ["John", "Mary", "Ishan", "Sue", "Gaby"]

new_names = (name.upper() for name in names)

print(new_names)

The output from this code gives the following result:

<generator object <genexpr> at 0x7fe5ee5824a0>

You’ll read about generators later on in this book. Therefore, I won’t dwell on them at this stage. If you’d like to create a tuple using a comprehension, you can do so as follows:

names = ["John", "Mary", "Ishan", "Sue", "Gaby"]

new_names = tuple(name.upper() for name in names)

print(new_names)

And this does indeed give a tuple:

('JOHN', 'MARY', 'ISHAN', 'SUE', 'GABY')

Comprehensions can look a bit weird initially. However, once you get used to them, you’ll see them as a convenient and useful way of writing neater and more efficient code. And they’ll save you a few lines of code, too!
