I have Spark code that reads .mf4 files and writes the output as a .txt file. I am running this code in Databricks.

Even after increasing the executor memory on the Databricks cluster, I still have the issue. Could you please suggest a fix?

pip install asammdf

from pyspark import SparkContext
from pyspark.sql import SparkSession
from asammdf import MDF
import io
import os
import sys
from pyspark.sql.functions import col



os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable


spark = SparkSession \
    .builder \
    .appName("mdf4") \
    .getOrCreate()

sc = spark.sparkContext

def decodeBinary(val):
    # Wrap the raw bytes so asammdf can parse them like a file
    file_stream = io.BytesIO(val)
    mdf = MDF(file_stream)
    # whereis expects the channel name as a string
    location = mdf.whereis("test_1")
    return location
print("1")
input_hdfs_path_to_mdf4 = "dbfs:/FileStore/inputmfd4/"
channel_name = "test_1"
local_or_hdfs_output_path = "dbfs:/FileStore/outputmfd4/opp4.txt"
print("2")
raw_binary = sc.binaryFiles(input_hdfs_path_to_mdf4)
print("3")
decoded_binary = raw_binary.map(lambda r: r[1]).map(decodeBinary)
print("4")
decoded_binary.saveAsTextFile(local_or_hdfs_output_path)
print("5")
print(decoded_binary)

I am running this code in Databricks with a 5 GB .mf4 file as input. Small files run without issue, but with the 5 GB file I get:

Caused by: org.apache.spark.api.python.PythonException: asammdf.blocks.utils.MdfException: <_io.BytesIO object at 0x7efed84482c0> is not a valid ASAM MDF file: magic header is b'\xff\xd8\xff\xe1H\xe6Ex' (from command-2705692180399242, line 24; full traceback below)

Traceback (most recent call last)

1 Answer

Can you print the first 100 bytes of the file? It is most likely not a valid MDF file.
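A minimal way to do that, assuming the input files are reachable from the driver (on Databricks, dbfs:/ paths are mounted under /dbfs, and the file name below is a placeholder):

```python
def read_header(path: str, n: int = 100) -> bytes:
    """Return the first n bytes of a file for inspection."""
    with open(path, "rb") as f:
        return f.read(n)

# On Databricks the dbfs:/FileStore/inputmfd4/ folder is visible at
# /dbfs/FileStore/inputmfd4/, so each input file can be checked with
# plain Python I/O, e.g.:
# print(read_header("/dbfs/FileStore/inputmfd4/<your_file>.mf4"))
```

A finalized MDF file should start with the ASCII identifier "MDF"; anything else means the file is not what asammdf expects.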

Edit: see "iPhone JPG image has non-standard magic bytes ff d8 ff e1?"
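The bytes ff d8 ff mark a JPEG file (ff d8 ff e1 is the EXIF variant), so at least one file that sc.binaryFiles picked up from the input folder is an image, not an MDF file. A hedged sketch of guarding the decode step, assuming finalized MDF 3.x/4.x files begin with the ASCII identifier "MDF" in their ID block:

```python
def looks_like_mdf(data: bytes) -> bool:
    # Finalized MDF 3.x/4.x files open with an ID block whose first
    # bytes are the ASCII identifier "MDF" (padded to 8 bytes with spaces).
    return data[:3] == b"MDF"

def is_jpeg(data: bytes) -> bool:
    # JPEG files start with the SOI marker ff d8, followed by ff.
    return data[:3] == b"\xff\xd8\xff"
```

In the job from the question, such a check could be applied before the decode, e.g. `raw_binary.filter(lambda r: looks_like_mdf(r[1]))`, so stray non-MDF files in the folder are skipped instead of crashing the task.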

danielhrisca