The code at the end of this post works as expected and writes the 8 records shown below to stack_test1.csv.
!cat stack_test1.csv
rowId,UserId,Date,Class
4,1,2008-07-31T21:42:52.667,696
6,1,2008-07-31T22:08:08.620,301
7,2,2008-07-31T22:17:57.883,463
9,1,2008-07-31T23:40:59.743,1941
11,1,2008-07-31T23:55:37.967,1556
12,2,2008-07-31T23:56:41.303,332
13,1,2008-08-01T00:42:38.903,633
14,1,2008-08-01T00:59:11.177,437
Is there any way to read only the first few records from the source file, save them as CSV to file1.csv, and save the rest to file2.txt? I do not want to split the final file. I need to read only the first 3 or 4 lines from the source file because that file is very large (around 80 GB).
!wget https://testme162.s3.amazonaws.com/test1.xml
!echo '</posts>' > last.txt
!cat test1.xml last.txt > /root/test2.xml
from xml.etree.ElementTree import iterparse
#from cElementTree import iterparse
import pandas as pd
file_path = r"/root/test2.xml"
dict_list = []
for _, elem in iterparse(file_path, events=("end",)):
    if elem.tag == "row":
        dict_list.append({'rowId': elem.attrib['Id'],
                          'UserId': elem.attrib['PostTypeId'],
                          'Date': elem.attrib['CreationDate'],
                          'Class': elem.attrib['Score']})
        # dict_list.append(elem.attrib)  # ALTERNATIVELY, PARSE ALL ATTRIBUTES
        elem.clear()
df = pd.DataFrame(dict_list)
df.to_csv('stack_test1.csv', index=False)
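One possible direction for the "first few records only" part, as a minimal sketch rather than a tested solution for the 80 GB file: because iterparse reads the source incrementally, the loop can stop as soon as enough rows have been seen, so the rest of the file is never parsed. The function name write_first_rows and its parameters are hypothetical, and the sketch swaps pandas for the standard-library csv module so that rows are written as they are parsed. It does not address writing the remaining records to file2.txt.

```python
import csv
from xml.etree.ElementTree import iterparse


def write_first_rows(xml_path, csv_path, n_rows):
    """Write the first n_rows <row> records to csv_path; return the count written.

    Hypothetical helper: the attribute-to-column mapping is copied from the
    code in the question above.
    """
    fields = ["rowId", "UserId", "Date", "Class"]
    written = 0
    with open(xml_path, "rb") as src, open(csv_path, "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=fields)
        writer.writeheader()
        for _, elem in iterparse(src, events=("end",)):
            # Stop before touching any more of the source file.
            if written >= n_rows:
                break
            if elem.tag != "row":
                continue
            writer.writerow({
                "rowId": elem.attrib["Id"],
                "UserId": elem.attrib["PostTypeId"],
                "Date": elem.attrib["CreationDate"],
                "Class": elem.attrib["Score"],
            })
            elem.clear()  # release the element's data once it has been written
            written += 1
    return written
```

Called as write_first_rows("/root/test2.xml", "file1.csv", 4), this should produce a CSV with the same header as stack_test1.csv and at most 4 data rows. Since the loop exits long before the end of the input, it may also be possible to skip the step that appends the closing posts tag, although that is an assumption worth checking on the real file.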