How to read CSV file in Python

How to read CSV file in Python

December 24, 2017
Python, Tools

TLDR;

Using pandas (pip install pandas)

import pandas as pd
data = pd.read_csv("actors.csv").to_dict(orient="row")

Using native csv.DictReader

import csv
with open("actors.csv") as f:
    data = [r for r in csv.DictReader(f)]

Reading as a List

Let’s say we have the following CSV file, named actors.csv. You can download this file here.

The first method demonstrated in Python docs would read the file as follows:

  1. Open the file
  2. Create a CSV reader
  3. Skip first line (header)
  4. For every line (row) in the file, do something
import csv
with open("actors.csv") as f:
    reader = csv.reader(f)
    next(reader) # skip header
    data = []
    for row in reader:
        data.append(row)

This can be simplified a bit with list comprehension, replacing the for loop with [r for r in reader].

import csv
with open("actors.csv") as f:
    reader = csv.reader(f)
    next(reader) # skip header
    data = [r for r in reader]

I am calling next(reader) to skip the header row (First Name, Last Name etc). Header MUST be removed, otherwise it will show up as one of the lists in your data. Header can also be removed via pop command, as follows:

with open("actors.csv") as f:
    reader = csv.reader(f)
    data = [r for r in reader]
    data.pop(0) # remove header

Either of those will load the following data:

[
    [
        "Tom",
        "Cruise",
        "July 3, 1962"
    ],
    [
        "Bruce",
        "Willis",
        "March 19, 1955"
    ],
    [
        "Morgan",
        "Freeman",
        "June 1, 1937"
    ],
    [
        "John",
        "Wayne",
        "May 26, 1907"
    ]
]

The result will be a list of lists. We could get Tom’s name, for example, via data[0][0] command.

In [5]: data[0][0]
Out[5]: 'Tom'

Reading as a Dictionary

Python has another method for reading csv files – DictReader. As the name suggest, the result will be read as a dictionary, using the header row as keys and other rows as a values.

For example this:

import csv
with open("actors.csv") as f:
    reader = csv.DictReader(f)
    data = [r for r in reader]

Will result in a data dict looking as follows:

[
    {
    "First Name": "Tom",
    "Last Name": "Cruise",
    "Date of Birth": "July 3, 1962"
    },
    {
    "First Name": "Bruce",
    "Last Name": "Willis",
    "Date of Birth": "March 19, 1955"
    },
    {
    "First Name": "Morgan",
    "Last Name": "Freeman",
    "Date of Birth": "June 1, 1937"
    },
    {
    "First Name": "John",
    "Last Name": "Wayne",
    "Date of Birth": "May 26, 1907"
    }
]

With this approach, there is no need to worry about the header row. Most importantly now data can be accessed as follows:

In: data[0]['First Name']
Out: 'Tom'

Which is much more descriptive then just data[0][0].

Reading as a Named Tuple

The problem with dictionaries in Python, is that they cannot be access (unlike JavaScript) via the dot notation. For example in our earlier example we could not do data[0].first even if the row name was first. We have to do the full key in square brackets.

Often I prefer to load my CSV file as a list of named tuples.

Named tuple is a special construct that allows accessing values via the dot notation. For example:

from collections import namedtuple

# First param  - name of "Class"
# Second param - all fields (keys) that will be supported, space separated
Person = namedtuple("Person", "name age")

person = Person("Tom", 55) # New Person

print(person.name) # Will print Tom

I can read most CSV files that I work with, as such:

with open("actors.csv") as f:
    reader = csv.reader(f)
    Data = namedtuple("Data", next(reader))
    data = [Data(*r) for r in reader]

The one bit of magic here is Data(*r). The *r simply tells Python interpreter to pass every value in the list i.e. ["Tom", "Cruise", "July 3, 1962"] as a parameter. Equivalent to calling Data with Data("Tom", "Cruise", "July 3, 1962").

Unfortunately, in our example the code will fail, since our header labels cannot serve as valid keys for a named tuple.

Rules for valid keys can be found, here. Most notably every identifier (i.e. named tuple key or a variable) must follow this rule.

Identifiers can be a combination of letters in lowercase (a to z) or uppercase (A to Z) or digits (0 to 9) or an underscore (_)

I added a bit of logic to make keys in this example valid by replacing spaces with underscores. I also converted everything to lower case to make it a bit easier to type.

with open("actors.csv") as f:
    reader = csv.reader(f)
    top_row = next(reader)
    top_row = [t.lower().split()) for t in top_row]
    top_row = ["_".join(top_row)]
    Data = namedtuple("Data", top_row)
    data = [Data(*r) for r in reader]

This may seem like a lot of extra work, but for me it’s worth it. I use iPython as my REPL (available via pip install ipython). Dot notation saves me a lot of time by removing the need to type [" "] for every key. In addition, iPython provides a helpful sugestion list after typing . and hitting the tab key.

Reading Using Pandas

Pandas is a very popular Data Analysis library for Python. It can be installed via pip install pandas. For a brief introduction to Pandas check out Crunching Honeypot IP Data with Pandas and Python.

Pandas makes it really easy to open CSV file and convert it to Dictionary, via:

import pandas as pd
data = pd.read_csv("actors.csv").to_dict(orient="row")

Results in:

[
  {
    "First Name": "Tom",
    "Last Name": "Cruise",
    "Date of Birth": "July 3, 1962"
  },
  {
    "First Name": "Bruce",
    "Last Name": "Willis",
    "Date of Birth": "March 19, 1955"
  },
  {
    "First Name": "Morgan",
    "Last Name": "Freeman",
    "Date of Birth": "June 1, 1937"
  },
  {
    "First Name": "John",
    "Last Name": "Wayne",
    "Date of Birth": "May 26, 1907"
  }
]