When I put a file in the local directory (vagrant/flume/test.csv), Flume writes it to HDFS as /user/inputs/test.csv.1591560702234. I want to know why 1591560702234 is added to the name and how to remove it.

This is my flume.conf file:

# Flume agent config
a1.sources = r1
a1.sinks =  k2
a1.channels = c1

a1.channels.c1.type = file
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

a1.sources.r1.channels = c1
a1.sources.r1.type = spooldir
a1.sources.r1.basenameHeader = true
a1.sources.r1.spoolDir = /vagrant/flume

a1.sinks.k2.type = hdfs
a1.sinks.k2.channel = c1

a1.sinks.k2.hdfs.filePrefix = %{basename}
a1.sinks.k2.hdfs.writeFormat = Text
#a1.sinks.k2.hdfs.fileSuffix =
a1.sinks.k2.hdfs.fileType = DataStream
a1.sinks.k2.hdfs.path = /user/inputs/

a1.sinks.k2.rollInterval = 0
a1.sinks.k2.rollSize = 0
a1.sinks.k2.rollCount = 0
a1.sinks.k2.idleTimeout = 0

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k2.channel = c1
user4157124
yanis

1 Answer

Flume appends the time in milliseconds to the file name. Using the value from your example:

select from_unixtime(ceil(1591560702234 / 1000));
+----------------------+--+
|         time         |
+----------------------+--+
| 2020-06-07 22:11:43  |
+----------------------+--+
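
The same conversion can be checked outside Hive. Below is a minimal Python sketch that reads the number as milliseconds since the Unix epoch and prints it in UTC; the function name `flume_suffix_to_utc` is made up for illustration. Integer arithmetic is used so the millisecond part stays exact.

```python
from datetime import datetime, timedelta, timezone


def flume_suffix_to_utc(suffix_ms: int) -> datetime:
    """Interpret a numeric file-name suffix as milliseconds since the Unix epoch (UTC)."""
    # Split into whole seconds and leftover milliseconds to avoid float rounding.
    seconds, millis = divmod(suffix_ms, 1000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(milliseconds=millis)


print(flume_suffix_to_utc(1591560702234).isoformat())
# 2020-06-07T20:11:42.234000+00:00
```

The Hive result above is one second later because of `ceil`, and it shows 22:11 rather than 20:11, which would be consistent with a session time zone of UTC+2.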

I think it's not possible to remove the timestamp through Flume configuration.

But you could add a suffix with hdfs.fileSuffix. From the documentation:

hdfs.fileSuffix –   Suffix to append to file (eg .avro - NOTE: period is not automatically added)

You could also put more events in a single file by tuning some Flume HDFS sink properties. Please check:

  • hdfs.batchSize
  • hdfs.rollSize
  • hdfs.rollInterval
  • hdfs.rollCount

You could also merge the files of a directory with HDFS commands.

getmerge
Usage: hadoop fs -getmerge [-nl] <src> <localdst>

Takes a source directory and a destination file as input and concatenates files in src into the destination local file. Optionally -nl can be set to enable adding a newline character (LF) at the end of each file. -skip-empty-file can be used to avoid unwanted newline characters in case of empty files.

Examples:

hadoop fs -getmerge -nl /src /opt/output.txt
hadoop fs -getmerge -nl /src/file1.txt /src/file2.txt /output.txt
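
If the number really has to go, one workaround outside Flume is to rename the files after the sink has closed them. The following is only a sketch of that idea: `strip_counter` and `mv_command` are hypothetical helpers, and it assumes that the number is the last dot-separated run of digits before any configured hdfs.fileSuffix and that files still being written carry the default `.tmp` in-use suffix. It builds an `hdfs dfs -mv` command but does not run it.

```python
import re
from typing import List, Optional


def strip_counter(name: str, file_suffix: str = "", in_use_suffix: str = ".tmp") -> Optional[str]:
    """Return the name without the trailing numeric part, or None if the file should be skipped."""
    if name.endswith(in_use_suffix):
        return None  # assumed to be still open for writing
    if file_suffix and not name.endswith(file_suffix):
        return None  # does not match the suffix we were told to expect
    stem = name[: len(name) - len(file_suffix)]
    # Assumption: the added number is a dot followed only by digits at the end of the stem.
    cleaned, hits = re.subn(r"\.\d+$", "", stem)
    return cleaned + file_suffix if hits else None


def mv_command(directory: str, name: str, file_suffix: str = "") -> Optional[List[str]]:
    """Build (but do not execute) an `hdfs dfs -mv` command for one file."""
    target = strip_counter(name, file_suffix)
    if target is None:
        return None
    base = directory.rstrip("/")
    return ["hdfs", "dfs", "-mv", f"{base}/{name}", f"{base}/{target}"]


print(mv_command("/user/inputs/", "test.csv.1591560702234"))
```

The returned list could be passed to `subprocess.run(cmd, check=True)` from a scheduled job. If one source file was rolled into several HDFS files, the cleaned names would collide, so a real script would need to check for that first.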
Chema
  • Thanks for your reply, but when I add a suffix the result stays the same. For example, if I add .avro as a suffix, the file in HDFS will be test.csv.157757657.avro. The time in milliseconds keeps showing. – yanis Jun 08 '20 at 13:34
  • I've been doing some research and I changed my answer. Hope it can be helpful. – Chema Jun 09 '20 at 17:13
  • Thanks again, but I am looking for a Flume solution. I figured out that the inspector timestamp of the Hadoop sinks is the origin of this addition, and that someone has already asked this question here: [link](https://stackoverflow.com/questions/33820163/flume-hdfs-sink-remove-timestamp-from-filename) **And I believe that there is no solution to this**. – yanis Jun 09 '20 at 19:20
  • Well, as you have said, there is no Flume solution, but you could write a script that renames the files and merges directories, and run it from a task scheduler. I think that could be a possible solution. – Chema Jun 09 '20 at 20:22