
NOTE: This question covers why the script is so slow. However, if you are more interested in improving the code itself, you can take a look at my post on Code Review which aims to improve the performance.

I am working on a project which crunches plain text files (.lst).

The file names (fileName) are important because I extract node (e.g. abessijn) and component (e.g. WR-P-E-A) from them into a dataframe. Examples (a small parsing sketch follows the list):

abessijn.WR-P-E-A.lst
A-bom.WR-P-E-A.lst
acroniem.WR-P-E-C.lst
acroniem.WR-P-E-G.lst
adapter.WR-P-E-A.lst
adapter.WR-P-E-C.lst
adapter.WR-P-E-G.lst
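
For illustration, a minimal Python sketch of that filename split (the helper name parse_filename is made up for this example):

import os

def parse_filename(file_path):
    # e.g. "abessijn.WR-P-E-A.lst" -> node "abessijn", component "WR-P-E-A"
    base = os.path.basename(file_path)
    node, component, _ = base.split(".", 2)
    return node.lower(), component

node, component = parse_filename("abessijn.WR-P-E-A.lst")
# node == "abessijn", component == "WR-P-E-A"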

Each file consists of one or more lines. Each line contains a sentence (inside <sentence> tags). Example (abessijn.WR-P-E-A.lst):

/home/nobackup/SONAR/COMPACT/WR-P-E-A/WR-P-E-A0000364.data.ids.xml:  <sentence>Vooral mijn abessijn ruikt heerlijk kruidig .. : ) )</sentence>
/home/nobackup/SONAR/COMPACT/WR-P-E-A/WR-P-E-A0000364.data.ids.xml:  <sentence>Mijn abessijn denkt daar heel anders over .. : ) ) Maar mijn kinderen richt ik ook niet af , zit niet in mijn bloed .</sentence>

From each line I extract the sentence, apply some small modifications to it, and call it sentence. Next comes an element called leftContext, which is the part of sentence that precedes node (e.g. abessijn), obtained by splitting the sentence on the node. Finally, from leftContext I get precedingWord, which is the word preceding node in sentence, i.e. the rightmost word in leftContext (with some restrictions, such as allowing a compound formed with a hyphen). Example (a small extraction sketch in Python follows the table):

ID | filename             | node | component | precedingWord      | leftContext                               |  sentence
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1    adapter.WR-P-P-F.lst  adapter  WR-P-P-F   aanpassingseenheid  Een aanpassingseenheid (                      Een aanpassingseenheid ( adapter ) , 
2    adapter.WR-P-P-F.lst  adapter  WR-P-P-F   toestel             Het toestel (                                 Het toestel ( adapter ) draagt zorg voor de overbrenging van gegevens
3    adapter.WR-P-P-F.lst  adapter  WR-P-P-F   de                  de aansluiting tussen de sensor en de         de aansluiting tussen de sensor en de adapter , 
4    airbag.WS-U-E-A.lst   airbag   WS-U-E-A   den                 ja voor den                                   ja voor den airbag op te pompen eh :p
5    airbag.WS-U-E-A.lst   airbag   WS-U-E-A   ne                  Dobby , als ze valt heeft ze dan wel al ne    Dobby , als ze valt heeft ze dan wel al ne airbag hee
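
Purely as an illustration, here is a minimal Python sketch of the leftContext/precedingWord extraction for a single sentence; the regexes mirror (in simplified form) the ones used in the scripts further down:

import re

node = "adapter"
sentence = "een aanpassingseenheid ( adapter ) , "

# Split on the node (preceded by start-of-string or a space, followed by a
# space or punctuation) and keep the part to its left
left_context = re.split(r"(^| )" + node + r"( |[!\",.:;?})\]])", sentence)[0]

# The preceding word is the rightmost word in the left context,
# optionally a hyphenated compound
preceding_word = re.sub(r"^.*\b(?<!-)(\w+(?:-\w+)*)[^\w]*$", r"\1", left_context)

print(left_context)    # een aanpassingseenheid (
print(preceding_word)  # aanpassingseenheid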

That dataframe is exported as dataset.csv.

After that, the main goal of my project comes into play: I create a frequency table that takes node and precedingWord into account. In a variable I define neuter and non_neuter, e.g. (in Python)

neuter = ["het", "Het"]
non_neuter = ["de","De"]

and a rest category unspecified. When precedingWord is an item from one of these lists, it is assigned to that category. Example of a frequency table output (a small pandas sketch follows the table):

node    |   neuter   | nonNeuter   | unspecified
-------------------------------------------------
A-bom       0          4             2
acroniem    3          0             2
act         3          2             1
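
Just to illustrate the idea, a minimal pandas sketch that builds such a table from a toy dataframe (the data are invented; column and category names follow the ones above):

import pandas as pd

df = pd.DataFrame({
    "node":          ["A-bom", "A-bom", "acroniem", "acroniem", "act"],
    "precedingWord": ["de",    "een",   "het",      "het",      "de"],
})

neuter = ["het", "Het"]
non_neuter = ["de", "De"]

# Map each precedingWord onto a gender category
df["gender"] = "unspecified"
df.loc[df.precedingWord.isin(neuter), "gender"] = "neuter"
df.loc[df.precedingWord.isin(non_neuter), "gender"] = "nonNeuter"

# Cross-tabulate node against gender
freq = pd.crosstab(df.node, df.gender)
print(freq)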

The frequency list is exported as frequencies.csv.


I started out with R, considering that later on I'd do some statistical analyses on the frequencies. My current R script (also available as a paste):

# ---
# STEP 0: Preparations
  start_time <- Sys.time()
  ## 1. Set working directory in R
    setwd("")

  ## 2. Load required library/libraries
    library(dplyr)
    library(mclm)
    library(stringi)

  ## 3. Create directory where we'll save our dataset(s)
    dir.create("../R/dataset", showWarnings = FALSE)


# ---
# STEP 1: Loop through files, get data from the filename

    ## 1. Create first dataframe, based on filename of all files
    files <- list.files(pattern="*.lst", full.names=T, recursive=FALSE)
    d <- data.frame(fileName = unname(sapply(files, basename)), stringsAsFactors = FALSE)

    ## 2. Create additional columns (word & component) based on filename
    d$node <- sub("\\..+", "", d$fileName, perl=TRUE)
    d$node <- tolower(d$node)
    d$component <- gsub("^[^\\.]+\\.|\\.lst$", "", d$fileName, perl=TRUE)


# ---
# STEP 2: Loop through files again, but now also through its contents
# In other words: get the sentences

    ## 1. Create second set which is an rbind of multiple frames
    ## One two-column data.frame per file
    ## First column is fileName, second column is data from each file
    e <- do.call(rbind, lapply(files, function(x) {
        data.frame(fileName = x, sentence = readLines(x, encoding="UTF-8"), stringsAsFactors = FALSE)
    }))

    ## 2. Clean fileName
     e$fileName <- sub("^\\.\\/", "", e$fileName, perl=TRUE)

    ## 3. Get the sentence and clean
    e$sentence <- gsub(".*?<sentence>(.*?)</sentence>", "\\1", e$sentence, perl=TRUE)
    e$sentence <- tolower(e$sentence)
        # Remove floating space before/after punctuation
        e$sentence <- gsub("\\s(?:(?=[.,:;?!) ])|(?<=\\( ))", "\\1", e$sentence, perl=TRUE)
    # Add space after triple dots ...
      e$sentence <- gsub("\\.{3}(?=[^\\s])", "... ", e$sentence, perl=TRUE)

    # Transform HTML entities into characters
    # It is unfortunate that there's no easier way to do this
    # E.g. Python provides the HTML package which can unescape (decode) HTML
    # characters
        e$sentence <- gsub("&apos;", "'", e$sentence, perl=TRUE)
        e$sentence <- gsub("&amp;", "&", e$sentence, perl=TRUE)
      # Prevent R from wrongly interpreting ", so replace it with single quotes
        e$sentence <- gsub("&quot;|\"", "'", e$sentence, perl=TRUE)

      # Get rid of some characters we can't use such as ³ and ¾
      e$sentence <- gsub("[^[:graph:]\\s]", "", e$sentence, perl=TRUE)


# ---
# STEP 3:
# Create final dataframe

  ## 1. Merge d and e by common column name fileName
    df <- merge(d, e, by="fileName", all=TRUE)

  ## 2. Make sure that only those sentences in which df$node is present in df$sentence are taken into account
    matchFunction <- function(x, y) any(x == y)
    matchedFrame <- with(df, mapply(matchFunction, node, stri_split_regex(sentence, "[ :?.,]")))
    df <- df[matchedFrame, ]

  ## 3. Create leftContext based on the split of the word and the sentence
    # Use paste0 to make sure we are looking for the node, not a compound
    # node can only be preceded by a space, but can be followed by punctuation as well
    contexts <- strsplit(df$sentence, paste0("(^| )", df$node, "( |[!\",.:;?})\\]])"), perl=TRUE)
    df$leftContext <- sapply(contexts, `[`, 1)

  ## 4. Get the word preceding the node
    df$precedingWord <- gsub("^.*\\b(?<!-)(\\w+(?:-\\w+)*)[^\\w]*$","\\1", df$leftContext, perl=TRUE)

  ## 5. Improve readability by sorting columns
    df <- df[c("fileName", "component", "precedingWord", "node", "leftContext", "sentence")]

  ## 6. Write dataset to dataset dir
    write.dataset(df,"../R/dataset/r-dataset.csv")


# ---
# STEP 4:
# Create dataset with frequencies

  ## 1. Define neuter and nonNeuter classes
    neuter <- c("het")
    non.neuter<- c("de")

  ## 2. Mutate df to fit into usable frame
    freq <- mutate(df, gender = ifelse(!df$precedingWord %in% c(neuter, non.neuter), "unspecified",
      ifelse(df$precedingWord %in% neuter, "neuter", "non_neuter")))

  ## 3. Transform into table, but still usable as data frame (i.e. matrix)
  ## Also add column name "node"
    freqTable <- table(freq$node, freq$gender) %>%
      as.data.frame.matrix %>%
      mutate(node = row.names(.))

  ## 4. Small adjustments
    freqTable <- freqTable[,c(4,1:3)]

  ## 5. Write dataset to dataset dir
    write.dataset(freqTable,"../R/dataset/r-frequencies.csv")


    diff <- Sys.time() - start_time # calculate difference
    print(diff) # print in nice format

However, since I'm using a big dataset (16,500 files, all with multiple lines) it took quite a long time. On my system the whole process took about an hour and a quarter. I thought to myself that there ought to be a language out there that could do this more quickly, so I went and taught myself some Python and asked a lot of questions here on SO.

Finally I came up with the following script (paste).

import os, pandas as pd, numpy as np, regex as re

from glob import glob
from datetime import datetime
from html import unescape

start_time = datetime.now()

# Create empty dataframe with correct column names
columnNames = ["fileName", "component", "precedingWord", "node", "leftContext", "sentence" ]
df = pd.DataFrame(data=np.zeros((0,len(columnNames))), columns=columnNames)

# Create correct path where to fetch files
subdir = "rawdata"
path = os.path.abspath(os.path.join(os.getcwd(), os.pardir, subdir))

# "Cache" regex
# See http://stackoverflow.com/q/452104/1150683
p_filename = re.compile(r"[./\\]")

p_sentence = re.compile(r"<sentence>(.*?)</sentence>")
p_typography = re.compile(r" (?:(?=[.,:;?!) ])|(?<=\( ))")
p_non_graph = re.compile(r"[^\x21-\x7E\s]")
p_quote = re.compile(r"\"")
p_ellipsis = re.compile(r"\.{3}(?=[^ ])")

p_last_word = re.compile(r"^.*\b(?<!-)(\w+(?:-\w+)*)[^\w]*$", re.U)

# Loop files in folder
for file in glob(path+"\\*.lst"):
    with open(file, encoding="utf-8") as f:
        [n, c] = p_filename.split(file.lower())[-3:-1]
        fn = ".".join([n, c])
        for line in f:
            s = p_sentence.search(unescape(line)).group(1)
            s = s.lower()
            s = p_typography.sub("", s)
            s = p_non_graph.sub("", s)
            s = p_quote.sub("'", s)
            s = p_ellipsis.sub("... ", s)

            if n in re.split(r"[ :?.,]", s):
                lc = re.split(r"(^| )" + n + "( |[!\",.:;?})\]])", s)[0]

                pw = p_last_word.sub("\\1", lc)

                df = df.append([dict(fileName=fn, component=c, 
                                   precedingWord=pw, node=n, 
                                   leftContext=lc, sentence=s)])
            continue

# Reset indices
df.reset_index(drop=True, inplace=True)

# Export dataset
df.to_csv("dataset/py-dataset.csv", sep="\t", encoding="utf-8")

# Let's make a frequency list
# Create new dataframe

# Define neuter and non_neuter
neuter = ["het"]
non_neuter = ["de"]

# Create crosstab
df.loc[df.precedingWord.isin(neuter), "gender"] = "neuter"
df.loc[df.precedingWord.isin(non_neuter), "gender"] = "non_neuter"
df.loc[df.precedingWord.isin(neuter + non_neuter)==0, "gender"] = "rest"

freqDf = pd.crosstab(df.node, df.gender)

freqDf.to_csv("dataset/py-frequencies.csv", sep="\t", encoding="utf-8")

# How long has the script been running?
time_difference = datetime.now() - start_time
print("Time difference of", time_difference)

After making sure that the output of both scripts is identical, I thought I'd put them to the test.

I am running on Windows 10 64 bit with a quad-core processor and 8 GB RAM. For R I'm using RGui 64 bit 3.2.2, and Python runs on version 3.4.3 (Anaconda) and is executed in Spyder. Note that I'm running Python in 32 bit because I'd like to use the nltk module in the future, and they discourage users from using 64 bit.

What I found was that R finished in approximately 55 minutes. But Python has been running for two hours straight already and I can see in the variable explorer that it's only at business.wr-p-p-g.lst (files are sorted alphabetically). It is waaaaayyyy slower!

So what I did was make a test case and see how both scripts perform with a much smaller dataset. I took around 100 files (instead of 16,500) and ran the script. Again, R was much faster. R finished in around 2 seconds, Python in 17!

Seeing that the goal of switching to Python was to make everything go more smoothly, I was confused. I had read that Python was fast (and R rather slow), so where did I go wrong? What is the problem? Is Python slower at reading files and lines, or at doing regexes? Or is R simply better equipped to deal with dataframes, so that it can't be beaten by pandas? Or is my code simply badly optimised, and should Python indeed be the victor?

My question is thus: why is Python slower than R in this case, and - if possible - how can we improve Python to shine?

Everyone who is willing to give either script a try can download the test data I used here. Please give me a heads-up when you have downloaded the files.

  • A quick scan suggests the fact you are opening each file in a loop in python: `with open(file, encoding="utf-8") as f` will not be as nice as the r equivalent `e <- do.call(rbind, lapply(files, function(x) {....` – jeremycg Aug 20 '15 at 14:52
  • Your `R` code is simply more optimised for the language. There are no `for` loops and you make heavy use of vectorised operations and built-in functions that are actually written in C/Fortran. Your Python code is simply highly inefficient. That is it. – Eli Korvigo Aug 20 '15 at 14:53
  • @jeremycg And is there any way to do something similar in Python? For instance, stitch all text files together somehow? – Bram Vanroy Aug 20 '15 at 15:05
  • @EliKorvigo can you suggest ways to optimise my Python? As I said I'm completely new to the language so any direction is appreciated. – Bram Vanroy Aug 20 '15 at 15:06
  • I'm not a code review expert but... this seems more fit for the code review site. – Dason Aug 20 '15 at 16:00
  • @Dason The question is asking why a code is slower than the other. This type of question is off-topic on Code Review since the intention is to obtain an explanation about the code and not a review. – Ismael Miguel Aug 20 '15 at 16:03
  • @jeremycg In case you're interested, I added a post on [CR to improve performance](http://codereview.stackexchange.com/questions/101648/speed-up-python-execution-time). – Bram Vanroy Aug 22 '15 at 12:54
  • @Dason As you proposed I added [a post on CR](http://codereview.stackexchange.com/questions/101648/speed-up-python-execution-time). – Bram Vanroy Aug 22 '15 at 12:55

1 Answer


The most horribly inefficient thing you do is calling the DataFrame.append method in a loop, i.e.

df = pandas.DataFrame(...)
for file in files:
    ...
    for line in file:
        ...
        df = df.append(...)

NumPy-backed data structures are designed with functional programming in mind, so this operation is not meant to be used in an iterative, imperative fashion: the call doesn't change your data frame in place, it creates a new one each time, resulting in an enormous increase in time and memory complexity (the loop becomes quadratic in the number of rows). If you really want to use data frames, accumulate your rows in a list and pass it to the DataFrame constructor, e.g.

pre_df = []
for file in files:
    ...
    for line in file:
        ...
        pre_df.append(processed_line)

df = pandas.DataFrame(pre_df, ...)
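
For instance, a minimal self-contained sketch of this pattern (the toy data and names stand in for the real processed lines):

import pandas as pd

column_names = ["node", "precedingWord", "sentence"]

rows = []
for i in range(3):                      # stands in for the loop over files and lines
    processed_line = dict(node="adapter",
                          precedingWord="de",
                          sentence="toy sentence %d" % i)
    rows.append(processed_line)         # cheap list append instead of DataFrame.append

# Build the dataframe once, after the loop
df = pd.DataFrame(rows, columns=column_names)
print(df)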

This is the easiest way, since it introduces minimal changes to the code you have. But the better (and computationally beautiful) way is to figure out how to generate your dataset lazily. This can be achieved by splitting your workflow into discrete functions (in a functional programming style) and composing them using lazy generator expressions and/or the imap and ifilter higher-order functions (plain map and filter in Python 3). Then you can use the resulting generator to build your dataframe, e.g.

df = pandas.DataFrame.from_records(processed_lines_generator, columns=column_names, ...)
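
A self-contained toy version of that idea could look like this (the generator and the sample data are invented for the example):

import pandas as pd

def processed_lines(raw_lines):
    """Lazily yield one record per line (toy processing only)."""
    for line in raw_lines:
        sentence = line.strip().lower()
        yield dict(node="adapter", precedingWord="de", sentence=sentence)

raw_lines = ["Een aanpassingseenheid ( adapter ) ,",
             "Het toestel ( adapter ) draagt zorg voor de overbrenging"]

# The dataframe is built in one go from the lazy generator
df = pd.DataFrame.from_records(processed_lines(raw_lines),
                               columns=["node", "precedingWord", "sentence"])
print(df)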

As for reading multiple files in one run, you might want to read this.
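
Purely as an illustration of that idea (not necessarily what the linked post describes), the standard-library fileinput module can present the lines of many files as a single stream; the path pattern below is a made-up placeholder:

import fileinput
from glob import glob

files = glob("rawdata/*.lst")   # placeholder pattern, mirroring the script's layout

with fileinput.input(files, openhook=fileinput.hook_encoded("utf-8")) as lines:
    for line in lines:
        # fileinput.filename() reports which file the current line came from
        print(fileinput.filename(), line.strip())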

P.S.

If you've got performance issues, you should profile your code before trying to optimise it.
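
For example, with the standard-library cProfile module; the function below is only a toy stand-in for the real dataset-building loop:

import cProfile

def slow_build():
    total = []
    for i in range(10000):
        total = total + [i]   # deliberately quadratic, like repeated DataFrame.append
    return total

# Sort the report by cumulative time so the bottleneck shows up at the top
cProfile.run("slow_build()", sort="cumtime")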

  • I will answer in detail once I've taken a look at laziness, and imap and ifilter (Apple doesn't patent those, does it? `;-)`) But what do you mean by "profile" in your last sentence? – Bram Vanroy Aug 20 '15 at 18:43
  • @BramVanroy I mean [code profiling](https://en.wikipedia.org/wiki/Profiling_(computer_programming)). Get a copy of PyCharm (there is a free edition), it has profiling tools built-in among many other goodies. You would have found the `DataFrame.append` bottleneck yourself had you profiled your code. – Eli Korvigo Aug 20 '15 at 19:00
  • I have been trying my luck with some of the expressions that you mentioned but no luck. I'll probably be adding some new questions to SO soon... – Bram Vanroy Aug 22 '15 at 12:42
  • When I tried using your second example, Spyder threw the error *AttributeError: 'DataFrame' object has no attribute 'precedingWord'*. I have decided to post on CodeReview. If you are still interested in helping me with this, [you can take a look here](http://codereview.stackexchange.com/questions/101648/speed-up-python-execution-time). In the meanwhile I'll accept your answer because you explained *why* my script is slow. – Bram Vanroy Aug 22 '15 at 12:52
  • @BramVanroy this still feels like a question for Stack Overflow, because your code is broken. CodeReview is for reviewing working code. Post your newer version in Python on SO after you're sure that available SO questions and answers don't address the problem. It's hard to tell why you're getting the error without seeing the code. – Eli Korvigo Aug 22 '15 at 13:12
  • I posted the original code on CR and that code works. So I suppose it is well suited for CR. – Bram Vanroy Aug 22 '15 at 13:17