Johan Nilssons Lifestream

Are columns in pandas indexed for searching? [on hold]

Update: This is a re-asked quesiton with a Minimal, Complete, and Verifiable example.

Querying pandas dataframe (via masking) is very slow compared to explicitly building a dictionary and searching in a for loop, which is counter-intuitive.


I observed this on my research data. I want to update values for dataframe y (˜100k rows) from dataframe c (˜1m rows).

Basically c contains the values of y, but in different format.

Column 1 of y is the ID. Columns 2-6 are values.

Column 1 of c is also the ID in y. Column 2 of c is the column position in y, column 3 is the values to update to y.

e.g., c.iloc[0, :] = [100, 2, 50] and if id 100 is at row 20 in y, then we should update y.iloc[20, 2] = 50

Minimal, Complete, and Verifiable example:

import pandas as pd
import numpy as np
from copy import deepcopy
import time

ylength = 20000
clength = 25000
y = pd.DataFrame(np.zeros((ylength, 6)))
y.iloc[:, 0] = np.arange(ylength)
c = pd.DataFrame(np.zeros((clength, 3)))
c.iloc[:, 0] = np.random.randint(0, ylength, clength)
c.iloc[:, 1] = np.random.randint(1, 6, clength)
c.iloc[:, 2] = np.random.randint(1, 100, clength)

def using_pandas(y,c):
    y = deepcopy(y)
    c = deepcopy(c)
    c = c.values
    for row in c:
        y.iloc[[y.iloc[:, 0] == row[0]], row[1]] = row[2]
    return y

def manually(y,c):
    y = deepcopy(y)
    c = deepcopy(c)
    y = y.values
    c = c.values
    # Build dict.
    data2rownum = {}
    for i, rowi in enumerate(y):
        data2rownum[rowi[0]] = i
    # Then, update.
    for row in c:
        if row[0] in data2rownum:
            y[data2rownum[row[0]], row[1]] = row[2]
    y = pd.DataFrame(y)
    return y

startp = time.time()
p = using_pandas(y,c)
stopp = time.time()

startm = time.time()
m = manually(y,c)
stopm = time.time()

print('Check consistency: ' + str((p==m).all().all()))
print('Using pandas time: ' + str(stopp-startp))
print('Manually indexing time: ' + str(stopm-startm))


Check consistency: True

Using pandas time: 69.5297718048

Manually indexing time: 0.0667359828949

Environment: Python 2.7.13, Pandas 0.19.2

As you can see, pandas is much slower than manually indexing using a dictionary.

The main question is: Are columns in pandas indexed for searching by default? If not, how to explicitly create index?

Thank you.

Note: This example is not working on Python 3.5, as reported in @Sebastian Wozny's answer.

via Stack Overflow

blog comments powered by Disqus
Get the source for phplifestream at Github