
I have opened a large XML file and extracted the year from every line that contains a <year> tag. Each year currently ends up in its own one-element array, but I want to collect them all into a single array and then sort it.

Here is the code:

import numpy as np

with open('dblp-2020-04-01.xml', 'r', encoding="ISO-8859-1") as f:
    for i, line in enumerate(f):
        if "<year>" in line:
            data = line[6:10]
            data_list = np.array([data])
            print(data_list)

The current output is:

['2010']
['2002']
['1992']
['2002']
['1994']
  ...
user78910

1 Answer


You need to create the np.array once, outside of your for-loop, and append each date to it inside the loop:

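import numpy as np

with open('dblp-2020-04-01.xml', 'r', encoding="ISO-8859-1") as f:
    data_list = np.array([], dtype=str)  # created once, before the loop
    for line in f:
        if "<year>" in line:
            data = line[6:10]  # the four characters right after "<year>"
            data_list = np.append(data_list, data)  # np.append returns a new array
    print(data_list)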

The output will be a single array (NumPy prints arrays without commas):

['2010' '2002' '1992' '2002' '1994']

Finally, you can sort your array using numpy.sort():

print(np.sort(data_list))  # ascending order
# ['1992' '1994' '2002' '2002' '2010']
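
As a variation on the loop above, the years can be collected in a plain Python list and converted to an array once at the end, since np.append copies the whole array on every call. The sketch below is only illustrative: collect_years and YEAR_RE are hypothetical names, it assumes every year appears as <year>YYYY</year> on a single line, and it swaps the fixed line[6:10] slice for a regular expression so that leading whitespace does not matter. It also stores the years as integers rather than strings.

```python
import re

import numpy as np

# Assumption: a year always looks like <year>2010</year> on one line.
YEAR_RE = re.compile(r"<year>(\d{4})</year>")


def collect_years(lines):
    """Return a sorted integer array of every year found in `lines`."""
    years = []
    for line in lines:
        match = YEAR_RE.search(line)
        if match:
            years.append(int(match.group(1)))
    # Build the array once, then sort in ascending order.
    return np.sort(np.array(years, dtype=int))


# Small in-memory sample standing in for the real file.
sample = [
    "<year>2010</year>\n",
    "<title>Some title</title>\n",
    "  <year>2002</year>\n",
    "<year>1992</year>\n",
    "<year>2002</year>\n",
    "<year>1994</year>\n",
]
print(collect_years(sample))  # [1992 1994 2002 2002 2010]
```

For the real data, the open file object can be passed directly, for example collect_years(f) inside the with block, because a file iterates line by line.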

UPDATE

Given the scenario you describe in the comments, I would say the most efficient way to get a count per date from your XML data is to load the XML into a pandas DataFrame and then use

df.groupby('yourDatesColumn')['yourDatesColumn'].count()

or

df['yourDatesColumn'].value_counts()

in order to get the counts per date.

Alternatively, you can create a pandas.Series object for just your date column (in case you don't want to load all the data into a pandas DataFrame).
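
A minimal sketch of the Series approach, assuming the years have already been extracted into a list of strings. The sample values and the column name "year" are made up for illustration.

```python
import pandas as pd

# Assumption: these strings stand in for the years parsed from the XML file.
years = pd.Series(["2010", "2002", "1992", "2002", "1994"], name="year")

# Count occurrences of each year, then order the result by year.
counts = years.value_counts().sort_index()
print(counts)

# Equivalent count via groupby on a one-column DataFrame.
df = years.to_frame()
per_year = df.groupby("year")["year"].count()
print(per_year)
```

value_counts() orders its result by frequency by default, so sort_index() is added here to list the years chronologically. The groupby result is already ordered by year.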

Giorgos Myrianthous