I am trying to map a CEF file to a data frame and ultimately to an output file, but I'm getting RuntimeError: dictionary changed size during iteration.
I've tried these solutions: 1, 2, 3, 4, etc. I'm not even sure where the dictionary the error refers to is (in the lambda?). I don't believe this is a duplicate of those questions, since I'm not explicitly using a dictionary anywhere in the code, so calling .keys() or .items() is not an option.
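For reference, a toy snippet (not my actual code, just my understanding of the error) that raises the same RuntimeError by adding a key to a dict while looping over it:

d = {'src': '10.0.0.1', 'dst': '2.1.2.2'}
for key in d:                 # iterating over the dict directly
    if key == 'src':
        d['spt'] = '1232'     # inserting a new key mid-loop raises
                              # RuntimeError: dictionary changed size during iteration

So I suspect something is mutating a dict while it is being iterated, but I can't see where in my code that would happen.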
I created a simple text file with the CEF access and security events example:
I then ran the code below:
import pyspark
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext

sc = SparkContext('local[2]', 'NetworkLog')
spark = SparkSession(sc)
target_data = sc.textFile('log.txt')
import re
def parse(str_input):
    ...
    return values
parsed = target_data.map(lambda line:parse(line))
df = parsed.map(lambda x: (x['rt'],x['dst'],x['dhost'],x['act'],x['suser'],x['requestClientApplication'],x['threat name'],x['DeviceSeverity'],x['riskscore'])).toDF(['source_time','ip','host_name','act','suser','requestClientApplication','threatname','DeviceSeverity','riskscore'])
*parser found here
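For context, my understanding of the linked parser is that it splits off the pipe-delimited CEF header and then collects the key=value pairs from the extension into a dict, roughly like this (a simplified sketch, not the exact linked code; keys containing spaces such as threat name would need extra handling):

import re

def parse(str_input):
    values = {}
    fields = str_input.split('|')            # CEF header fields are pipe-delimited
    if len(fields) > 7:
        extension = '|'.join(fields[7:])     # everything after the header is key=value pairs
        for key, value in re.findall(r'(\w+)=(.*?)(?=\s\w+=|$)', extension):
            values[key] = value.strip()
    return values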
This may be a separate question, but the code also sometimes breaks when values in parsed are missing/null/0.0.0, so I'd need a way to write null or 0.0.0 into the dataframe instead.
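One workaround I'm considering (just a sketch, assuming parse returns plain strings) is to build the rows with dict.get() and a default, so missing keys come through as a placeholder value instead of raising KeyError:

cef_keys = ['rt', 'dst', 'dhost', 'act', 'suser', 'requestClientApplication',
            'threat name', 'DeviceSeverity', 'riskscore']
col_names = ['source_time', 'ip', 'host_name', 'act', 'suser',
             'requestClientApplication', 'threatname', 'DeviceSeverity', 'riskscore']

# .get(k, '0.0.0') returns '0.0.0' when a key is missing instead of raising KeyError;
# None could be used instead if nulls are preferred in the dataframe
df = parsed.map(lambda x: tuple(x.get(k, '0.0.0') for k in cef_keys)).toDF(col_names)

That at least keeps the mapping from raising, but I'm not sure it's the idiomatic Spark way to handle nulls.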