I need to enrich my DataFrame in PySpark SQL with a language attribute that tells the language of the paper title in each row, so that I can filter out English papers only. I have tens of millions of papers, so I need to do this in parallel.
I have registered a UDF based on a Python library called langdetect (https://pypi.org/project/langdetect/), after installing the library on the cluster. I'm using the following code:
from langdetect import detect
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

def lang_detector(_s):
    # detect() raises an exception on empty or undetectable input,
    # so fall back to the string 'null' for those rows
    try:
        lan = detect(_s)
    except Exception:
        lan = 'null'
    return lan

detect2 = udf(lang_detector, StringType())

papers_abs_fos_en = papers_abs \
    .join(papersFos_L1, "PaperId") \
    .withColumn("Lang", detect2(col("PaperTitle"))) \
    .filter("Lang == 'en'") \
    .select("PaperId", "Rank", "PaperTitle", "RefCount", "CitCount", "FoSList")
It works, but it takes forever even on ca. 10M titles. I'm not sure whether the bottleneck is langdetect itself, Python UDFs in general, or something I'm doing wrong, but I'd be grateful for any suggestions!
Thanks a lot! Paolo