Saturday, December 5, 2015

Reliable SFTP file upload without duplicates

SFTP is not the fanciest technology, but it is still used in quite a few areas. Some systems use it for exchanging messages. In this scenario one system generates a file and uploads it to an SFTP server. Another system scans the server for new files and processes them. Processed files are removed from the incoming folder. Simple, until you start thinking about error handling.

A file upload can easily be interrupted. In that case the remote system will pick up a partially uploaded file. If you're lucky, the remote system is smart enough to detect corrupted files and handle them accordingly, but what if it is not? The trick is to use a temporary file.

The temporary file is not picked up by the remote system. If the upload crashes, we can resume it or even restart the whole upload. Once we're sure that the temporary file was uploaded completely, it's time to rename or move it to its proper location. Now the remote system can pick it up.

Let's imagine that the remote system is even less reliable and does not recognize duplicate files. Now we need to make sure that we never rename the temporary file to its proper name twice. Unfortunately, the network can drop before we receive the answer from the rename operation, so we don't know whether the rename succeeded. We can verify the state by checking whether the temporary file still exists, and retry the rename only if it does.

It is important to persist the fact that the temp file was uploaded successfully. If the persistence is transactional, make sure that this record won't be rolled back.
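
The steps described above can be sketched in Java as follows. This is only a minimal sketch under assumptions: `RemoteFiles`, `StateStore`, `ReliableUploader` and the `tmp/` and `incoming/` folder names are hypothetical and do not belong to any real SFTP library, and the in-memory classes exist only to exercise the flow.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical abstraction over an SFTP client; not the API of any real library.
interface RemoteFiles {
    void upload(String path, byte[] content) throws IOException;
    boolean exists(String path) throws IOException;
    void rename(String from, String to) throws IOException;
}

// Upload progress. It must be persisted where a rollback cannot undo it.
enum UploadState { NEW, TEMP_UPLOADED, PUBLISHED }

// Hypothetical durable store for the upload state.
interface StateStore {
    UploadState load(String fileName);
    void save(String fileName, UploadState state);
}

final class ReliableUploader {
    private final RemoteFiles remote;
    private final StateStore states;

    ReliableUploader(RemoteFiles remote, StateStore states) {
        this.remote = remote;
        this.states = states;
    }

    // Safe to call again after any failure; returns the state that was reached.
    UploadState publish(String fileName, byte[] content) throws IOException {
        String temp = "tmp/" + fileName;        // assumed folder the remote system ignores
        String target = "incoming/" + fileName; // assumed folder the remote system scans
        UploadState state = states.load(fileName);

        if (state == UploadState.NEW) {
            remote.upload(temp, content);       // can be restarted as often as needed
            state = UploadState.TEMP_UPLOADED;
            states.save(fileName, state);
        }
        if (state == UploadState.TEMP_UPLOADED) {
            // If the temp file is gone, an earlier rename succeeded even though
            // its reply was lost, so the rename must not be repeated.
            if (remote.exists(temp)) {
                remote.rename(temp, target);
            }
            state = UploadState.PUBLISHED;
            states.save(fileName, state);
        }
        return state;
    }
}

// In-memory stand-ins, used only to try the flow without a server.
final class InMemoryRemote implements RemoteFiles {
    final Map<String, byte[]> files = new HashMap<>();
    boolean dropNextRenameReply; // simulates a network drop right after the rename
    int renameCount;

    public void upload(String path, byte[] content) { files.put(path, content); }
    public boolean exists(String path) { return files.containsKey(path); }
    public void rename(String from, String to) throws IOException {
        files.put(to, files.remove(from));
        renameCount++;
        if (dropNextRenameReply) {
            dropNextRenameReply = false;
            throw new IOException("connection lost before reply");
        }
    }
}

final class InMemoryStates implements StateStore {
    private final Map<String, UploadState> states = new HashMap<>();
    public UploadState load(String fileName) {
        return states.getOrDefault(fileName, UploadState.NEW);
    }
    public void save(String fileName, UploadState state) { states.put(fileName, state); }
}
```

Calling `publish` a second time after a lost rename reply finds no temp file, skips the rename and only records the final state, so the file reaches the incoming folder exactly once.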

Here is a flow that worked well for us:


After several months in production we have not had any issues with this solution.

Choosing a partner to develop an MVP for a startup

Probably the best option for a startup is when the founders can develop the MVP on their own, but what if they can’t? In that case, outsourcing companies are ready to offer some help. Here are some points that seem important when choosing such a development partner.

  • Maturity of the team - for a startup MVP, there is not much time to invest in forming a team. The forming and storming phases can easily take a month, and team productivity is reduced during that period. It’s ok for long-term cooperation, but better avoided when a quick result is important.
  • Team member profiles - nothing will happen without qualified team members. It’s ok to have different levels in the team, but the team lead has to guarantee the result. A team with 1 senior and the rest mid-level developers would be nice. Too many seniors spend too much time on theory. Juniors waste the seniors’ time. A team lead with previous startup experience is a big benefit, as they might know which technical compromises are ok in the short term and which will bite immediately.
  • Knowledge sharing inside the team - the team should have a plan to cover the unexpected absence of any team member. People get sick, go on vacation, change projects.
  • Experience with required technologies - make sure that team members have expertise in the core technologies that will be used in the project. It’s ok to have some minor innovation, but bigger research can take too much time and lead to mediocre results.
  • Dedication - a team with too many obligations from previous projects/clients can be distracted too often. It would be nice to know whether the team has to maintain previous projects and how much of its time that takes.
  • Schedule - as an MVP has quite strict deadlines, most outsourcing companies won’t be able to provide a well-established team that fast.
  • Flexibility - the contract should allow changes to requirements. I would prefer time-and-material to fixed price, as the value of a project done for a fixed price can be much lower. It would also be nice to know the team’s plans after the expected delivery date. The team may already have its next project scheduled, which makes it problematic to finish the started MVP and to resume work if the MVP succeeds.
  • UX capabilities - if the MVP includes UX, good cooperation with the designer is a big benefit. If the team has worked with this designer previously, they will know how to play nicely together; otherwise the two sides can pull the project in different directions.
  • Mobile experience - if the project requires mobile development, it would be ideal to get it from the same partner.
  • Billing and pricing models - you need to know which side activities of team members will be billed as project time.
  • Price - should be weighed against all the previously mentioned points, as it might turn out that the highest price per hour leads to the lowest total price.

Friday, August 28, 2015

Handling multi line files in Logstash

Logstash is a nice tool for processing your logs. I love it for its flexibility and variety of workflows, but this variety has downsides.

When you first try it out, everything seems to work fast, but the real work begins when you start processing huge amounts of logs from many files. At that point you will probably Google how to speed up Logstash and find suggestions to increase the number of workers to utilize the CPU. Great, now we have multithreading. Unfortunately, now we also have thread safety issues. Or at least one big issue with multi line logs.

There are actually two issues around multi line processing in Logstash: https://github.com/logstash-plugins/logstash-filter-multiline/issues/12 and https://github.com/logstash-plugins/logstash-input-file/issues/44. The first means that you cannot use the multi line filter as soon as you enable more than one worker. The second means that each file requires its own input configuration. In our system that means hundreds of files. This does not scale at all.

Those issues will probably be resolved at some point, but until then, the following setup can boost the performance of your Logstash.

The trick is to split processing into two phases: first join multi line entries, then parse them. This can be achieved by setting up 2 Logstash instances. The first takes input from files, processes it with the multi line filter and sends the result to Redis. The second takes input from Redis, applies the rest of the filters and sends output to Elasticsearch. Due to the issues mentioned above, the first instance is limited to 1 worker. The second can scale by adding more workers. Redis serves as a buffer.

In our case, the performance boost was more than threefold, from 60k to 200k log entries per minute, even without adding more workers to the second instance. We can also add more rules for parsing logs now. Unfortunately, it looks like multi line processing is still the bottleneck, and most probably we will have to introduce more multi line processing instances and split their responsibilities.

Sample Logstash configuration:
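
A minimal sketch of the two instances, assuming Logstash 1.5-era plugin options. File paths, host names, the Redis key and both patterns are placeholders chosen for illustration.

```
# Instance 1 (hypothetical file name shipper.conf): joins multi line entries.
# Keep it at a single filter worker because of the multiline issue above.
input {
  file {
    path => "/var/log/app/*.log"    # placeholder path
    type => "app"
  }
}
filter {
  multiline {
    pattern => "^\s"                # placeholder: indented lines continue the previous entry
    what => "previous"
  }
}
output {
  redis {
    host => "redis.example.local"   # placeholder host
    data_type => "list"
    key => "logstash"               # placeholder key, must match the second instance
  }
}

# Instance 2 (hypothetical file name indexer.conf): parses the joined entries.
# This one can run with more workers.
input {
  redis {
    host => "redis.example.local"
    data_type => "list"
    key => "logstash"
  }
}
filter {
  grok {
    # placeholder pattern, replace with the rules for your log format
    match => { "message" => "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:text}" }
  }
}
output {
  elasticsearch {
    host => "es.example.local"      # placeholder; later versions use "hosts" instead
    protocol => "http"
  }
}
```

The two blocks go into separate files, one per Logstash process. The Redis list acts as the buffer between them, so the second process can be restarted or scaled without touching the first.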


Integration tests in Spring Boot without external dependencies

Target - test all layers of a Spring Boot REST application in isolation from external components.

Complexity - a typical application uses a database and makes calls to remote services.

Solutions:

a) Use the spring-test support for integration tests. It starts the whole app for you on a random port. At the same time, all components of the app can be wired into the test for additional manipulation.

b) Use RestAssured to make calls to our application

c) Use RestTemplate to make remote calls and MockRestServiceServer from spring-test to mock them

d) Use an in-memory database. HSQLDB does the job of simulating the real one pretty well.

e) Use a separate Spring profile to tweak the app configuration


Here is a small example putting it all together:


Sunday, August 9, 2015

Hidden exceptions

This is a story about the consequences of eliminating checked exceptions in Groovy. All observations were made on a relatively big project with around 60 developers.

Observation 1 - catch them all :)

In some cases the catch was for Exception, in others for Throwable (as if you could recover from an OutOfMemoryError). As a result, non-recoverable exceptions are often treated as recoverable and vice versa. They are all logged at the same level and escape the bug tracking system.

Observation 2 - invent complex return types


If exceptions can't be used as part of a class API, developers start inventing complex return types that wrap the value together with a response status. This had several consequences:

1) Propagation of this status is a pain. Just imagine calling several methods and checking whether there was an issue after each call.




2) Many transaction definitions were broken. Transactions were handled by Spring and were supposed to be rolled back on an exception, but the method just returned an error code and threw nothing.
3) No stack traces. Often this was a result of "catch them all -> return error code". The only way to figure out the real cause is to debug.
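
To illustrate consequences 1 and 3, here is a minimal Java sketch that contrasts the two styles. All names (`Result`, `StatusStyle`, `ExceptionStyle`, `parse`, `validate`, `process`) are hypothetical and made up for this comparison; they are not taken from the project.

```java
// Hypothetical wrapper of the kind described above: a value plus a status.
final class Result<T> {
    final T value;
    final String error; // null means success

    private Result(T value, String error) {
        this.value = value;
        this.error = error;
    }

    static <T> Result<T> ok(T value) { return new Result<>(value, null); }
    static <T> Result<T> fail(String error) { return new Result<>(null, error); }
    boolean failed() { return error != null; }
}

// Status style: every caller has to check and forward the status by hand.
final class StatusStyle {
    static Result<Integer> parse(String raw) {
        try {
            return Result.ok(Integer.parseInt(raw));
        } catch (NumberFormatException e) {
            return Result.fail("not a number"); // the stack trace is lost here
        }
    }

    static Result<Integer> validate(int n) {
        if (n < 0) {
            return Result.fail("negative");
        }
        return Result.ok(n);
    }

    static Result<Integer> process(String raw) {
        Result<Integer> parsed = parse(raw);
        if (parsed.failed()) {
            return parsed;            // manual propagation after the first call
        }
        Result<Integer> valid = validate(parsed.value);
        if (valid.failed()) {
            return valid;             // and again after the second call
        }
        return Result.ok(valid.value * 2);
    }
}

// Exception style: failures propagate on their own and keep their stack trace.
final class ExceptionStyle {
    static int parse(String raw) {
        return Integer.parseInt(raw); // throws NumberFormatException on bad input
    }

    static int validate(int n) {
        if (n < 0) {
            throw new IllegalArgumentException("negative: " + n);
        }
        return n;
    }

    static int process(String raw) {
        return validate(parse(raw)) * 2;
    }
}
```

Both versions of `process` do the same work, but the status version needs a check after each call and only carries a message, while the exception version is a single line and reports where the failure happened.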

In conclusion, I would advise becoming friends with exceptions and trying to avoid cowboy-style languages. There are definitely areas where Groovy rocks, like scripting, but don't try to push it everywhere. Keep in mind that coding is easy and does not take much time. Debugging does.