pandas dataframe using stack overflow datafile

Question

The code below works as expected and produces the 8 records shown here.

!cat stack_test1.csv

rowId,UserId,Date,Class
4,1,2008-07-31T21:42:52.667,696
6,1,2008-07-31T22:08:08.620,301
7,2,2008-07-31T22:17:57.883,463
9,1,2008-07-31T23:40:59.743,1941
11,1,2008-07-31T23:55:37.967,1556
12,2,2008-07-31T23:56:41.303,332
13,1,2008-08-01T00:42:38.903,633
14,1,2008-08-01T00:59:11.177,437

Is there a way to read only the first few records from the source file, saving them as CSV in file1.csv and the rest in file2.txt? I do not want to split the final file. I need to read only the first 3 or 4 rows from the source file because it is very large (around 80 GB).

!wget https://testme162.s3.amazonaws.com/test1.xml
!echo '</posts>' > last.txt
!cat test1.xml last.txt > /root/test2.xml

from xml.etree.ElementTree import iterparse
#from cElementTree import iterparse
import pandas as pd

file_path = r"/root/test2.xml"
dict_list = []

for _, elem in iterparse(file_path, events=("end",)):
    if elem.tag == "row":
        dict_list.append({'rowId': elem.attrib['Id'],
                          'UserId': elem.attrib['PostTypeId'],
                          'Date': elem.attrib['CreationDate'],
                          'Class': elem.attrib['Score'] })

        # dict_list.append(elem.attrib)      # ALTERNATIVELY, PARSE ALL ATTRIBUTES

        elem.clear()

df = pd.DataFrame(dict_list)
df.to_csv('stack_test1.csv', index=False)
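
A minimal sketch of one way to read only the first few rows, assuming the same attribute names as the code above. The function name write_first_rows and its limit parameter are hypothetical, introduced only for illustration. It stops the iterparse loop as soon as enough rows have been written, so the rest of the large file is never read, and it writes each row straight to the CSV instead of collecting everything in a list first. It does not handle writing the remaining rows to file2.txt.

```python
import csv
from xml.etree.ElementTree import iterparse


def write_first_rows(xml_path, csv_path, limit):
    """Write at most `limit` <row> elements to csv_path; return the count written.

    Hypothetical helper: the column mapping is copied from the question's code.
    """
    fields = ["rowId", "UserId", "Date", "Class"]
    written = 0
    with open(xml_path, "rb") as source, open(csv_path, "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=fields)
        writer.writeheader()
        if limit < 1:
            return written
        for _, elem in iterparse(source, events=("end",)):
            if elem.tag != "row":
                continue
            writer.writerow({
                "rowId": elem.attrib["Id"],
                "UserId": elem.attrib["PostTypeId"],
                "Date": elem.attrib["CreationDate"],
                "Class": elem.attrib["Score"],
            })
            written += 1
            elem.clear()  # free the element once its attributes are saved
            if written >= limit:
                break  # stop parsing; the remainder of the file is not read
    return written
```

For example, write_first_rows("/root/test2.xml", "file1.csv", 4) would write a header plus the first 4 rows, and the result could then be loaded with pd.read_csv("file1.csv").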

You do have a lot of reputation and I would have assumed you did a check on SO. There are many examples. [1](https://stackoverflow.com/questions/52968877/read-xml-file-to-pandas-dataframe) and [2](https://stackoverflow.com/questions/50774222/python-extracting-xml-to-dataframe-pandas) and [3](https://stackoverflow.com/questions/28259301/how-to-convert-an-xml-file-to-nice-pandas-dataframe) — Joe Ferndz, Jan 18 '21 at 09:37
Here's an example of parsing a large XML file on [stack overflow](https://stackoverflow.com/questions/62578671/large-xml-file-parsing-in-python) — Joe Ferndz, Jan 18 '21 at 09:41
OK, I am using that code. How long will it take for an 81 GB file (assuming it completes)? https://stackoverflow.com/questions/62578671/large-xml-file-parsing-in-python — shantanuo, Jan 18 '21 at 09:52
I am not able to use dask as shown in this article. I get the error "method not found". https://www.pluralsight.com/tech-blog/data-processing-with-dask/ — shantanuo, Jan 18 '21 at 09:54
