NOTE: this question covers why the script is so slow. However, if you are more the kind of person who wants to improve something you can take a look atmy post on CodeReview which aims to improve the performance.
I am working on a project which crunches plain text files (.lst).
The name of the file names (fileName
) are important because I'll extract node
(e.g. abessijn) and component
(e.g. WR-P-E-A) from them into a dataframe. Examples:
abessijn.WR-P-E-A.lst
A-bom.WR-P-E-A.lst
acroniem.WR-P-E-C.lst
acroniem.WR-P-E-G.lst
adapter.WR-P-E-A.lst
adapter.WR-P-E-C.lst
adapter.WR-P-E-G.lst
Each file consists of one or more line. Each line consists of a sentence (inside <sentence>
tags). Example (abessijn.WR-P-E-A.lst)
/home/nobackup/SONAR/COMPACT/WR-P-E-A/WR-P-E-A0000364.data.ids.xml: <sentence>Vooral mijn abessijn ruikt heerlijk kruidig .. : ) )</sentence>
/home/nobackup/SONAR/COMPACT/WR-P-E-A/WR-P-E-A0000364.data.ids.xml: <sentence>Mijn abessijn denkt daar heel anders over .. : ) ) Maar mijn kinderen richt ik ook niet af , zit niet in mijn bloed .</sentence>
From each line I extract the sentence, do some small modifications to it, and call it sentence
. Up next is an element called leftContext
, which takes the first part of the split between node
(e.g. abessijn) and the sentence it came from. Finally, from leftContext
I get precedingWord, which is the word preceding node
in sentence
, or the right most word in leftContext
(with some limitations such as the option of a compound formed with a hyphen). Example:
ID | filename | node | component | precedingWord | leftContext | sentence
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 adapter.WR-P-P-F.lst adapter WR-P-P-F aanpassingseenheid Een aanpassingseenheid ( Een aanpassingseenheid ( adapter ) ,
2 adapter.WR-P-P-F.lst adapter WR-P-P-F toestel Het toestel ( Het toestel ( adapter ) draagt zorg voor de overbrenging van gegevens
3 adapter.WR-P-P-F.lst adapter WR-P-P-F de de aansluiting tussen de sensor en de de aansluiting tussen de sensor en de adapter ,
4 airbag.WS-U-E-A.lst airbag WS-U-E-A den ja voor den ja voor den airbag op te pompen eh :p
5 airbag.WS-U-E-A.lst airbag WS-U-E-A ne Dobby , als ze valt heeft ze dan wel al ne Dobby , als ze valt heeft ze dan wel al ne airbag hee
That dataframe is exported as dataset.csv.
After that, the intention of my project comes at hand: I create a frequency table that takes node
and precedingWord
into account. From a variable I define neuter
and non_neuter
, e.g (in Python)
neuter = ["het", "Het"]
non_neuter = ["de","De"]
and a rest category unspecified
. When precedingWord
is an item from the list, assign it to the variable. Example of a frequency table output:
node | neuter | nonNeuter | unspecified
-------------------------------------------------
A-bom 0 4 2
acroniem 3 0 2
act 3 2 1
The frequency list is exported as frequencies.csv.
I started out with R, considering that later on I'd do some statistical analyses on the frequencies. My current R script (also available as paste):
# ---
# STEP 0: Preparations
start_time <- Sys.time()
## 1. Set working directory in R
setwd("")
## 2. Load required library/libraries
library(dplyr)
library(mclm)
library(stringi)
## 3. Create directory where we'll save our dataset(s)
dir.create("../R/dataset", showWarnings = FALSE)
# ---
# STEP 1: Loop through files, get data from the filename
## 1. Create first dataframe, based on filename of all files
files <- list.files(pattern="*.lst", full.names=T, recursive=FALSE)
d <- data.frame(fileName = unname(sapply(files, basename)), stringsAsFactors = FALSE)
## 2. Create additional columns (word & component) based on filename
d$node <- sub("\\..+", "", d$fileName, perl=TRUE)
d$node <- tolower(d$node)
d$component <- gsub("^[^\\.]+\\.|\\.lst$", "", d$fileName, perl=TRUE)
# ---
# STEP 2: Loop through files again, but now also through its contents
# In other words: get the sentences
## 1. Create second set which is an rbind of multiple frames
## One two-column data.frame per file
## First column is fileName, second column is data from each file
e <- do.call(rbind, lapply(files, function(x) {
data.frame(fileName = x, sentence = readLines(x, encoding="UTF-8"), stringsAsFactors = FALSE)
}))
## 2. Clean fileName
e$fileName <- sub("^\\.\\/", "", e$fileName, perl=TRUE)
## 3. Get the sentence and clean
e$sentence <- gsub(".*?<sentence>(.*?)</sentence>", "\\1", e$sentence, perl=TRUE)
e$sentence <- tolower(e$sentence)
# Remove floating space before/after punctuation
e$sentence <- gsub("\\s(?:(?=[.,:;?!) ])|(?<=\\( ))", "\\1", e$sentence, perl=TRUE)
# Add space after triple dots ...
e$sentence <- gsub("\\.{3}(?=[^\\s])", "... ", e$sentence, perl=TRUE)
# Transform HTML entities into characters
# It is unfortunate that there's no easier way to do this
# E.g. Python provides the HTML package which can unescape (decode) HTML
# characters
e$sentence <- gsub("'", "'", e$sentence, perl=TRUE)
e$sentence <- gsub("&", "&", e$sentence, perl=TRUE)
# Avoid R from wrongly interpreting ", so replace by single quotes
e$sentence <- gsub(""|\"", "'", e$sentence, perl=TRUE)
# Get rid of some characters we can't use such as ³ and ¾
e$sentence <- gsub("[^[:graph:]\\s]", "", e$sentence, perl=TRUE)
# ---
# STEP 3:
# Create final dataframe
## 1. Merge d and e by common column name fileName
df <- merge(d, e, by="fileName", all=TRUE)
## 2. Make sure that only those sentences in which df$node is present in df$sentence are taken into account
matchFunction <- function(x, y) any(x == y)
matchedFrame <- with(df, mapply(matchFunction, node, stri_split_regex(sentence, "[ :?.,]")))
df <- df[matchedFrame, ]
## 3. Create leftContext based on the split of the word and the sentence
# Use paste0 to make sure we are looking for the node, not a compound
# node can only be preceded by a space, but can be followed by punctuation as well
contexts <- strsplit(df$sentence, paste0("(^| )", df$node, "( |[!\",.:;?})\\]])"), perl=TRUE)
df$leftContext <- sapply(contexts, `[`, 1)
## 4. Get the word preceding the node
df$precedingWord <- gsub("^.*\\b(?<!-)(\\w+(?:-\\w+)*)[^\\w]*$","\\1", df$leftContext, perl=TRUE)
## 5. Improve readability by sorting columns
df <- df[c("fileName", "component", "precedingWord", "node", "leftContext", "sentence")]
## 6. Write dataset to dataset dir
write.dataset(df,"../R/dataset/r-dataset.csv")
# ---
# STEP 4:
# Create dataset with frequencies
## 1. Define neuter and nonNeuter classes
neuter <- c("het")
non.neuter<- c("de")
## 2. Mutate df to fit into usable frame
freq <- mutate(df, gender = ifelse(!df$precedingWord %in% c(neuter, non.neuter), "unspecified",
ifelse(df$precedingWord %in% neuter, "neuter", "non_neuter")))
## 3. Transform into table, but still usable as data frame (i.e. matrix)
## Also add column name "node"
freqTable <- table(freq$node, freq$gender) %>%
as.data.frame.matrix %>%
mutate(node = row.names(.))
## 4. Small adjustements
freqTable <- freqTable[,c(4,1:3)]
## 5. Write dataset to dataset dir
write.dataset(freqTable,"../R/dataset/r-frequencies.csv")
diff <- Sys.time() - start_time # calculate difference
print(diff) # print in nice format
However, since I'm using a big dataset (16,500 files, all with multiple lines) it seemed to take quite long. On my system the whole process took about an hour and a quarter. I thought to myself that there ought to be a language out there that could do this more quickly, so I went and taught myself some Python and asked a lot of question here on SO.
Finally I came up with the following script (paste).
import os, pandas as pd, numpy as np, regex as re
from glob import glob
from datetime import datetime
from html import unescape
start_time = datetime.now()
# Create empty dataframe with correct column names
columnNames = ["fileName", "component", "precedingWord", "node", "leftContext", "sentence" ]
df = pd.DataFrame(data=np.zeros((0,len(columnNames))), columns=columnNames)
# Create correct path where to fetch files
subdir = "rawdata"
path = os.path.abspath(os.path.join(os.getcwd(), os.pardir, subdir))
# "Cache" regex
# See http://stackoverflow.com/q/452104/1150683
p_filename = re.compile(r"[./\\]")
p_sentence = re.compile(r"<sentence>(.*?)</sentence>")
p_typography = re.compile(r" (?:(?=[.,:;?!) ])|(?<=\( ))")
p_non_graph = re.compile(r"[^\x21-\x7E\s]")
p_quote = re.compile(r"\"")
p_ellipsis = re.compile(r"\.{3}(?=[^ ])")
p_last_word = re.compile(r"^.*\b(?<!-)(\w+(?:-\w+)*)[^\w]*$", re.U)
# Loop files in folder
for file in glob(path+"\\*.lst"):
with open(file, encoding="utf-8") as f:
[n, c] = p_filename.split(file.lower())[-3:-1]
fn = ".".join([n, c])
for line in f:
s = p_sentence.search(unescape(line)).group(1)
s = s.lower()
s = p_typography.sub("", s)
s = p_non_graph.sub("", s)
s = p_quote.sub("'", s)
s = p_ellipsis.sub("... ", s)
if n in re.split(r"[ :?.,]", s):
lc = re.split(r"(^| )" + n + "( |[!\",.:;?})\]])", s)[0]
pw = p_last_word.sub("\\1", lc)
df = df.append([dict(fileName=fn, component=c,
precedingWord=pw, node=n,
leftContext=lc, sentence=s)])
continue
# Reset indices
df.reset_index(drop=True, inplace=True)
# Export dataset
df.to_csv("dataset/py-dataset.csv", sep="\t", encoding="utf-8")
# Let's make a frequency list
# Create new dataframe
# Define neuter and non_neuter
neuter = ["het"]
non_neuter = ["de"]
# Create crosstab
df.loc[df.precedingWord.isin(neuter), "gender"] = "neuter"
df.loc[df.precedingWord.isin(non_neuter), "gender"] = "non_neuter"
df.loc[df.precedingWord.isin(neuter + non_neuter)==0, "gender"] = "rest"
freqDf = pd.crosstab(df.node, df.gender)
freqDf.to_csv("dataset/py-frequencies.csv", sep="\t", encoding="utf-8")
# How long has the script been running?
time_difference = datetime.now() - start_time
print("Time difference of", time_difference)
After making sure that the output of both scripts is identical, I thought I'd put them to the test.
I am running on Windows 10 64 bit with a quad-core processor and 8 GB Ram. For R I'm using RGui 64 bit 3.2.2 and Python runs on version 3.4.3 (Anaconda) and is executed in Spyder. Note that I'm running Python in 32 bit because I'd like to use the nltk module in the future and they discourage users to use 64 bit.
What I found was that R finished in approximately 55 minutes. But Python has been running for two hours straight already and I can see in the variable explorer that it's only at business.wr-p-p-g.lst
(files are sorted alphabetically). It is waaaaayyyy slower!
So what I did was make a test case and see how both scripts perform with a much smaller dataset. I took around 100 files (instead of 16,500) and ran the script. Again, R was much faster. R finished in around 2 seconds, Python in 17!
Seeing that the goal of Python was to make everything go more smoothly, I was confused. I read Python was fast (and R rather slow), so where did I go wrong? What is the problem? Is Python slower in reading files and lines, or in doing regexes? Or is R simply better equipped to dealing with dataframes and can't it be beaten by pandas? Or is my code simply badly optimised and should Python indeed be the victor?
My question is thus: why is Python slower than R in this case, and - if possible - how can we improve Python to shine?
Everyone who is willing to give either script a try can download the test data I used here. Please give me a heads-up when you downloaded the files.